Hi, thanks for your amazing work on EntQA! I am currently adapting it to a custom dataset, and I found the following issues in the data preprocessing script that may raise errors or cause incorrect behavior.
https://github.com/WenzhengZhang/EntQA/blob/7b3cec51b23ecc8a3043fb005c7c2344b405e02f/preprocess_data.py#L238-L242
This `char2token` function forgets the blank (space) character when building the mapping between characters and tokens. To fix this, the increment should be `char2token_list += [i] * (len(tok.replace("##", "")) + 1)`.
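To make the missing-blank problem concrete, here is a toy sketch (assuming whole-word tokens; the function body is a hypothetical minimal reimplementation for illustration, not the repo's exact code):

```python
def char2token(tokens):
    # Hypothetical minimal version of the mapping. For each token,
    # emit its character count plus one extra slot for the blank
    # (space) that follows it -- the "+ 1" from the fix above.
    char2token_list = []
    for i, tok in enumerate(tokens):
        # "##" marks a WordPiece continuation piece and contributes
        # no characters of its own
        char2token_list += [i] * (len(tok.replace("##", "")) + 1)
    return char2token_list

tokens = ["united", "states", "of", "america"]
text = " ".join(tokens)
mapping = char2token(tokens)

# Every character, including the spaces between words, now maps to a
# token; without the "+ 1" the mapping runs out before the text ends.
print(len(text), len(mapping))  # mapping covers text (plus one trailing slot)
print(mapping[6])               # the space after "united" maps to token 0
```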
https://github.com/WenzhengZhang/EntQA/blob/7b3cec51b23ecc8a3043fb005c7c2344b405e02f/preprocess_data.py#L268-L269
This call looks like it has two wrong parameter names, `max_ent_length` and `pad_to_max_ent_length`. I guess they should be `max_length=args.max_length` and `padding='max_length'`.
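For what it's worth, `padding='max_length'` asks a Hugging Face tokenizer to pad every sequence out to exactly `max_length`; a toy stand-in of that behavior, with made-up token ids and pad id:

```python
def pad_to_max_length(token_ids, max_length, pad_id=0):
    # Toy stand-in for what padding='max_length' requests from a
    # Hugging Face tokenizer: every output is exactly max_length long.
    ids = token_ids[:max_length]                     # clip if too long
    return ids + [pad_id] * (max_length - len(ids))  # pad if too short

print(pad_to_max_length([101, 2023, 2003, 102], 8))
# -> [101, 2023, 2003, 102, 0, 0, 0, 0]
```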
Hope this helps future developers.