CAMeL-Lab / camel_tools

A suite of Arabic natural language processing tools developed by the CAMeL Lab at New York University Abu Dhabi.
MIT License

Long text clipped when disambiguated by BERT #145

Open ahmadabousetta opened 5 months ago

ahmadabousetta commented 5 months ago

https://github.com/CAMeL-Lab/camel_tools/blob/b496501590ee0753eeb3686037fffeb12f4c80d2/camel_tools/disambig/bert/unfactored.py#L177

The referenced line assumes that each new batch starts a new sentence. That is fine when predicting a list of short sentences. However, if we pass a single very long text, the dataloader splits it into several batches, and since the input is only one sentence, only the predictions for the first batch are returned. In my case, I got back only 13309 of 16949 tokens.

This issue should be fixed with care, because the same function is also called to predict a list of sentences.
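To illustrate the kind of fix I have in mind, here is a minimal, library-independent sketch. It is not the camel_tools code: `split_into_segments`, `batched`, `predict_sentences`, and the `predict_segment` callback are hypothetical names, and the sizes are toy values. The idea is to tag every segment with the index of the sentence it came from and collate predictions by that index, so the result does not depend on where batch boundaries fall.

```python
from typing import Callable, Iterator, List, Tuple

Segment = Tuple[int, List[int]]  # (sentence index, tokens of one segment)


def split_into_segments(sentences: List[List[int]], max_len: int) -> List[Segment]:
    """Cut each sentence into pieces of at most max_len tokens,
    remembering which sentence every piece belongs to."""
    segments = []
    for sent_idx, sent in enumerate(sentences):
        for start in range(0, len(sent), max_len):
            segments.append((sent_idx, sent[start:start + max_len]))
    return segments


def batched(items: List[Segment], batch_size: int) -> Iterator[List[Segment]]:
    """Yield consecutive batches, standing in for a dataloader."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


def predict_sentences(
    sentences: List[List[int]],
    predict_segment: Callable[[List[int]], List[int]],
    max_len: int = 4,
    batch_size: int = 2,
) -> List[List[int]]:
    """Return one prediction list per input sentence, whatever its length."""
    outputs: List[List[int]] = [[] for _ in sentences]
    for batch in batched(split_into_segments(sentences, max_len), batch_size):
        for sent_idx, segment in batch:
            # Collate by sentence index, not by batch position, so a long
            # sentence spanning several batches is reassembled in full.
            outputs[sent_idx].extend(predict_segment(segment))
    return outputs


# Toy "model": doubles every token id (assumption, for illustration only).
demo = predict_sentences([list(range(11)), [100, 101]], lambda seg: [t * 2 for t in seg])
print([len(preds) for preds in demo])
```

With this layout, a single long input and a list of short sentences go through the same path. Note that the sketch predicts each segment independently, so it says nothing about accuracy near segment boundaries.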

owo commented 2 months ago

Agreed, we shouldn't be truncating output regardless of how long the input is. We'll look into a good way of doing this without losing accuracy.