daadaada / turingas

Assembler for NVIDIA Volta and Turing GPUs
MIT License
201 stars 40 forks source link

Read barrier index and Write barrier index should be 0~5 rather than 1~6 #8

Closed apuaaChen closed 3 years ago

apuaaChen commented 3 years ago

Hi. I find that there might be a small typo in the grammar.py related to the read & write barrier index.

At line 389, the original code is ctrl_re = r'(?P<ctrl>[0-9a-fA-F\-]{2}:[1-6\-]:[1-6\-]:[\-yY]:[0-9a-fA-F\-])', which suggests that the barriers are indexed by 1~6. However, I found that there is no corresponding Wait barrier mask for barrier index 6.

After checking some sass code returned by the cuobjdump, I observed that barrier index 0 is usually used. So maybe its better to replace line 389 with ctrl_re = r'(?P<ctrl>[0-9a-fA-F\-]{2}:[0-5\-]:[0-5\-]:[\-yY]:[0-9a-fA-F\-])'.

scott-gray commented 3 years ago

A bit of context: in maxas I remapped the barrier values to a 1 based index to make the control code notation a bit easier to read and work with.

daadaada commented 3 years ago

@scott-gray Thanks for the context :p @apuaaChen Does this solve your problem?

apuaaChen commented 3 years ago

@scott-gray Thanks for the context :p @apuaaChen Does this solve your problem?

Hi! Thank you for your reply. However, in the example there is

--:-:1:-:2     LDG.E tmp, [input_0];
02:-:-:-:2     STG.E [output_0], tmp;

It seems that barrier index 1 is associated with the second bit in the wait barrier mask. So if the index were 1-based, this would be a little counter-intuitive. By the way, I also tried

--:-:0:-:2     LDG.E tmp, [input_0];
01:-:-:-:2     STG.E [output_0], tmp;

after the modification, which provides the correct result.

daadaada commented 3 years ago

I think your suggestion makes sense. I have made the barrier index 0-based. Thanks for the feedback.

apuaaChen commented 3 years ago

I think your suggestion makes sense. I have made the barrier index 0-based. Thanks for the feedback.

Thank you!