clulab / reach

Reach Biomedical Information Extraction

Updated Lucene Indexers #763

Closed enoriega closed 2 years ago

enoriega commented 2 years ago

The file structure of PMC OA changed on Dec 21st, 2021: the XML files are now split across different directories. This PR updates the NxmlIndexer class to handle the new structure. It also adds a new class, PubmedAbstractIndexer, which creates an index of PubMed abstracts akin to the full-text OA index.
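
To illustrate the kind of change involved, here is a minimal sketch of collecting article files when they are spread over nested directories instead of sitting in one flat folder. This is not the code from this PR and not the NxmlIndexer implementation: the function name `find_article_files`, the `.nxml` suffix default, and the example path are assumptions made for illustration only.

```python
from pathlib import Path
from typing import List


def find_article_files(root: Path, suffix: str = ".nxml") -> List[Path]:
    """Return every file under `root` whose name ends with `suffix`.

    Hypothetical helper: the suffix default is an assumption, not a
    detail taken from the PMC OA layout or from the PR.
    """
    # rglob descends into all subdirectories, so the result does not
    # depend on how the files are grouped into directories.
    matches = (p for p in root.rglob("*" + suffix) if p.is_file())
    # Sort so that repeated runs visit the files in a stable order.
    return sorted(matches)


if __name__ == "__main__":
    # Hypothetical location of a local copy of the files.
    for path in find_article_files(Path("papers")):
        print(path)
```

Walking the tree recursively, instead of listing a single directory, means the indexer would not need to know the exact directory names, only the root under which the files live.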

MihaiSurdeanu commented 2 years ago

Nice, thank you @enoriega!

For future runs of this, can you please include a README in this directory to explain what steps need to happen to index: (a) all abstracts, and (b) all OA papers. Thanks!

@kwalcock: if this looks OK to you, can you please merge?

enoriega commented 2 years ago

@kwalcock sorry for the late reply. That print statement was already there, and I found it useful when running the indexer. Thanks for reviewing!