allenai / pawls

Software that makes labeling PDFs easy.
https://pawls.apps.allenai.org
Apache License 2.0
380 stars 74 forks

BUG: duplicated tokens should not be allowed in pdf_structure tokens list #186

Open CarloNicolini opened 2 years ago

CarloNicolini commented 2 years ago

In the pdfplumber preprocessing pipeline I've found that duplicated tokens can occur. Specifically, in obtain_word_tokens in pdfplumber.py, one should call .drop_duplicates before converting the result to a list:

word_tokens = df.apply(self.convert_to_pagetoken, axis=1).drop_duplicates(keep="first").tolist()
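For reference, the same keep-first behavior can be sketched without pandas. This assumes each token is a plain dict with `text`, `x`, `y`, `width`, and `height` keys; that shape and the helper name `drop_duplicate_tokens` are assumptions for illustration, not taken from the pawls code. Dicts are not hashable, so the sketch compares tokens through a tuple key.

```python
from typing import Dict, List

# Hypothetical token shape: {"text", "x", "y", "width", "height"}.
TOKEN_KEYS = ("text", "x", "y", "width", "height")


def drop_duplicate_tokens(tokens: List[Dict]) -> List[Dict]:
    """Remove exact duplicate tokens, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for token in tokens:
        # Build a hashable key from the text and the bounding box.
        key = tuple(token.get(k) for k in TOKEN_KEYS)
        if key in seen:
            continue
        seen.add(key)
        unique.append(token)
    return unique


tokens = [
    {"text": "PAWLS", "x": 10.0, "y": 20.0, "width": 30.0, "height": 8.0},
    {"text": "PAWLS", "x": 10.0, "y": 20.0, "width": 30.0, "height": 8.0},
    {"text": "PAWLS", "x": 50.0, "y": 20.0, "width": 30.0, "height": 8.0},
]
deduplicated = drop_duplicate_tokens(tokens)
```

Keying on the text plus the bounding box removes only exact repeats, so the same word appearing at a different position on the page is kept.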

In some cases the tokens in a PAWLS pdf structure appear duplicated, which can throw off lookups that index into the token list from the annotation file.
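A quick way to check whether a page is affected is to list the positions of repeated tokens. The sketch below assumes a page's tokens are available as a list of plain dicts; the token shape and the helper name `duplicate_token_indices` are hypothetical.

```python
from typing import Dict, List


def duplicate_token_indices(tokens: List[Dict]) -> List[int]:
    """Return the indices of tokens that exactly repeat an earlier token."""
    seen = set()
    duplicates = []
    for index, token in enumerate(tokens):
        # Sort the items so the key does not depend on dict insertion order.
        key = tuple(sorted(token.items()))
        if key in seen:
            duplicates.append(index)
        else:
            seen.add(key)
    return duplicates


# Hypothetical page tokens: the first token is extracted a second time.
page_tokens = [
    {"text": "easy", "x": 1.0, "y": 2.0, "width": 3.0, "height": 4.0},
    {"text": "PDF", "x": 5.0, "y": 2.0, "width": 3.0, "height": 4.0},
    {"text": "easy", "x": 1.0, "y": 2.0, "width": 3.0, "height": 4.0},
]
repeated = duplicate_token_indices(page_tokens)
```

Every token after a reported index sits one position later than it would in a deduplicated list, so an index stored against one version of the list will not match the other.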