BUG: duplicated tokens should not be allowed in pdf_structure tokens list

In the pdfplumber preprocessing pipeline I've found that duplicated tokens can occur. Specifically, in obtain_word_tokens in pdfplumber.py, a .drop_duplicates() call should be added before the result is converted to a list:

word_tokens = df.apply(self.convert_to_pagetoken, axis=1).drop_duplicates(keep="first").tolist()

In some cases the tokens in a PAWLS PDF structure appear duplicated, which throws off indexing into the tokens list from the annotation file.
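To make the proposal concrete, here is a minimal, self-contained sketch of the deduplication on a toy DataFrame. The `Token` class, the `to_token` helper, and the column names (`text`, `x`, `y`, `width`, `height`) are hypothetical stand-ins, not the actual PAWLS code. One caveat I'm fairly sure of: `Series.drop_duplicates` relies on hashing, so option A only works if the token objects are hashable. Deduplicating the DataFrame rows before converting them (option B) avoids that requirement.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)  # frozen makes instances hashable, which option A needs
class Token:
    """Hypothetical stand-in for the page token type used by the preprocessor."""
    text: str
    x: float
    y: float
    width: float
    height: float


def to_token(row: pd.Series) -> Token:
    """Hypothetical stand-in for convert_to_pagetoken."""
    return Token(row["text"], row["x"], row["y"], row["width"], row["height"])


# Toy word table with one exact duplicate row (assumed column names).
df = pd.DataFrame(
    [
        {"text": "Hello", "x": 10.0, "y": 20.0, "width": 30.0, "height": 8.0},
        {"text": "Hello", "x": 10.0, "y": 20.0, "width": 30.0, "height": 8.0},
        {"text": "world", "x": 45.0, "y": 20.0, "width": 28.0, "height": 8.0},
    ]
)

# Option A: convert rows to tokens, then drop duplicate token objects.
tokens_a = df.apply(to_token, axis=1).drop_duplicates(keep="first").tolist()

# Option B: drop duplicate rows first, then convert; no hashing of tokens needed.
tokens_b = df.drop_duplicates(keep="first").apply(to_token, axis=1).tolist()

print([t.text for t in tokens_b])
```

Both options keep the first occurrence and preserve the original order, so the positions of the remaining tokens stay stable for anything that indexes into the list afterwards.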

allenai/pawls, issue #186