aehrc / cvt2distilgpt2

Improving Chest X-Ray Report Generation by Leveraging Warm-Starting
GNU General Public License v3.0

some questions about the gpt tokenizer. #12

Closed douseful closed 1 year ago

douseful commented 1 year ago

https://github.com/aehrc/cvt2distilgpt2/blob/48aa7fd40fd23614ecb2bf63c4c639d3b418cb0b/tools/dataset/dataset.py#L89C2-L114C28


Could you please tell me why we have to manually add the start and end tokens to the report, and why, when slicing the attention mask, the first element is discarded (corresponding to the BOS) while the last element is kept? Also, why do we need to throw away the first element of the `decoder_input_ids`?

anicolson commented 1 year ago

Hi douseful,

The BOS token is added manually as the GPT2 tokeniser does not add it.
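For illustration, a minimal sketch of this (assuming the Hugging Face transformers tokeniser; the report string is made up):

```python
# The GPT2 tokeniser does not insert BOS/EOS itself, so they are
# concatenated onto the report string before tokenisation.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("distilgpt2")

report = "No acute cardiopulmonary abnormality."
print(tokenizer(report).input_ids)  # no BOS or EOS ids appear

wrapped = tokenizer.bos_token + report + tokenizer.eos_token
print(tokenizer(wrapped).input_ids)  # begins and ends with 50256 (BOS/EOS for GPT2)
```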

Notice that `[:-1]` corresponds to the last element being thrown away, not the first.

If we threw away the last element of the attention mask, we would be discarding padding. As teacher forcing shortens the input and output sequences by one token, we want to discard an element of the attention mask that corresponds to a token rather than to padding; hence, the first element is perfect. In the end, this does not matter, as we are using causal attention masking.
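Here is a minimal sketch of the slicing in question (illustrative token ids and variable names, not the exact code in dataset.py):

```python
import torch

# Toy tokenised report: [BOS, t1, t2, t3, EOS, PAD], one padding position.
input_ids      = torch.tensor([[50256, 2949, 11930, 13, 50256, 0]])
attention_mask = torch.tensor([[1, 1, 1, 1, 1, 0]])

decoder_input_ids      = input_ids[:, :-1]       # [:-1] drops the LAST element
label_ids              = input_ids[:, 1:]        # labels are the inputs shifted left by one
decoder_attention_mask = attention_mask[:, 1:]   # [1:] drops a leading 1 (a token element), keeping the padding 0

print(decoder_input_ids)       # tensor([[50256, 2949, 11930, 13, 50256]])
print(label_ids)               # tensor([[2949, 11930, 13, 50256, 0]])
print(decoder_attention_mask)  # tensor([[1, 1, 1, 1, 0]])
```

With causal attention, removing a 1 from the front rather than a 0 from the back makes no difference to the loss over the report tokens.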

Hope this helps, Aaron.