WenzhengZhang / EntQA

Pytorch implementation of EntQA paper
MIT License

Some issues in data preprocessing #10

Closed Lukeming-tsinghua closed 2 years ago

Lukeming-tsinghua commented 2 years ago

Hi, thanks for your amazing work on EntQA! I am currently adapting it to a custom dataset, and I found the following issues in the data preprocessing script that may raise errors or cause incorrect behavior.

https://github.com/WenzhengZhang/EntQA/blob/7b3cec51b23ecc8a3043fb005c7c2344b405e02f/preprocess_data.py#L238-L242

This char2token function forgets the blank (space) character when building the mapping between characters and tokens. To fix this, the increment should be char2token_list += [i] * (len(tok.replace("##", "")) + 1).
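For illustration, here is a minimal sketch of the mapping the proposed fix describes (the function name and token format are assumptions, not the repository's exact code): each token claims one mapping slot per character plus one extra slot for the separating space, so character offsets in the original whitespace-joined text line up with token indices.

```python
def char_to_token(tokens):
    # Hypothetical sketch of the proposed fix: each token covers its own
    # characters plus the one blank character that separates it from the
    # next token, hence the "+ 1" in the issue's suggested change.
    mapping = []
    for i, tok in enumerate(tokens):
        # Strip WordPiece "##" continuation markers before counting characters.
        mapping += [i] * (len(tok.replace("##", "")) + 1)
    return mapping
```

With tokens ["the", "cat"], each word maps 3 characters plus 1 space to its token index, so the mapping has 8 entries covering "the cat" plus the trailing separator.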

https://github.com/WenzhengZhang/EntQA/blob/7b3cec51b23ecc8a3043fb005c7c2344b405e02f/preprocess_data.py#L268-L269

This call appears to pass two invalid parameters, max_ent_length and pad_to_max_ent_length. I suspect they should be max_length=args.max_length and padding='max_length'.
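For reference, padding='max_length' in a Hugging Face tokenizer pads every sequence up to max_length (and, with truncation enabled, trims longer ones down to it). A minimal sketch of that behavior, with an illustrative helper name and pad id (not the library's internals):

```python
def pad_to_max_length(ids, max_length, pad_id=0):
    # Mimics what padding='max_length' (plus truncation=True) produces:
    # truncate the id sequence to max_length, then right-pad with pad_id
    # so every output has exactly max_length entries.
    ids = ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))
```

This is why the keyword must be padding='max_length' rather than a nonexistent pad_to_max_ent_length flag: the padding strategy and the length bound are separate tokenizer arguments.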

Hope this helps future developers.