minump closed 8 months ago
Pushed code to remove ref_spans and cite_spans, so the output text file (.txt) no longer contains them. As a result, the text content will differ from the original PDF text. The model might perform better on the cleaned text. The tei.xml file will also need to be modified to remove the reference tags (ref). During matching, the "_predicted.csv" file will contain the cleaned text, which will be matched against sentences from the cleaned tei.xml file.
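For the tei.xml side, a minimal sketch of stripping the ref tags could look like the following. It assumes the file uses the standard TEI namespace and that citations are marked with `ref` elements. The function name `remove_ref_tags` is hypothetical and is not the code that was pushed. The text that follows each removed `ref` is kept so the sentence stays intact.

```python
import xml.etree.ElementTree as ET

# Assumption: the tei.xml uses the standard TEI namespace.
TEI_NS = "http://www.tei-c.org/ns/1.0"


def remove_ref_tags(root):
    """Remove every <ref> element in place, keeping the text that follows it."""
    ref_tag = f"{{{TEI_NS}}}ref"
    # Snapshot all elements first so removal does not disturb iteration.
    for parent in list(root.iter()):
        prev = None  # last sibling that was kept
        for child in list(parent):
            if child.tag == ref_tag:
                # Re-attach the text after the <ref> to the preceding node.
                tail = child.tail or ""
                if prev is None:
                    parent.text = (parent.text or "") + tail
                else:
                    prev.tail = (prev.tail or "") + tail
                parent.remove(child)
            else:
                prev = child
    return root


# Hypothetical sample, not taken from the real data.
sample = (
    f'<TEI xmlns="{TEI_NS}"><p>Models work well '
    '<ref type="bibr">[1]</ref> in practice '
    '<ref type="figure">2</ref>.</p></TEI>'
)
sample_root = remove_ref_tags(ET.fromstring(sample))
print(ET.tostring(sample_root, encoding="unicode"))
```

Removing a `ref` can leave doubled spaces behind, so whitespace may need to be normalized the same way on both sides before sentences are matched.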
Closing as this is not needed now.
The reference numbers (citations) are present in the .txt file, where they appear as part of a sentence. The JSON file looks like the below:
Here the "text" field includes the text covered by "cite_spans" and "ref_spans". These spans need to be removed from the "text" field.
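A minimal sketch of the removal, assuming each paragraph entry in the JSON carries a "text" string plus "cite_spans" and "ref_spans" lists whose items have "start" and "end" character offsets into that text, and that the spans do not overlap. The function name `strip_spans` and the sample paragraph are hypothetical.

```python
import re


def strip_spans(paragraph):
    """Return the paragraph text with cite_spans and ref_spans cut out."""
    text = paragraph["text"]
    spans = paragraph.get("cite_spans", []) + paragraph.get("ref_spans", [])
    # Cut from the end of the text backwards so earlier offsets stay valid.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        text = text[:span["start"]] + text[span["end"]:]
    # Collapse the whitespace left behind by the removed spans.
    text = " ".join(text.split())
    # Drop a space that now sits directly before punctuation.
    return re.sub(r"\s+([.,;:])", r"\1", text)


# Hypothetical sample; offsets are computed rather than hard-coded.
raw = "Deep models work well [1]. More data helps."
start = raw.index("[1]")
sample_paragraph = {
    "text": raw,
    "cite_spans": [{"start": start, "end": start + len("[1]")}],
    "ref_spans": [],
}
print(strip_spans(sample_paragraph))
```

Processing the spans in reverse order avoids recomputing offsets after each cut, which is the main thing that tends to go wrong when spans are removed front to back.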