NIHOPA / NLPre

Python library for Natural Language Preprocessing (NLPre)

Reference tagger and remover #79

Closed thoppe closed 7 years ago

thoppe commented 7 years ago

Longer biomedical texts often include reference markers that are concatenated with the regular text. This module aims to either remove or partition out those references. For example, the first line below should become the second:

...  key feature in Drosophila3-5 and elegans(7).

...  key feature in Drosophila and elegans.

Add more examples as comments to this issue as they are identified.
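
As a rough illustration of the cleanup described in the example above, here is a minimal regex sketch. It is not NLPre's implementation; the helper name `strip_fused_references` and both patterns are assumptions made for this example. It only handles numeric markers fused to the end of a word, and it will produce false positives on terms that legitimately end in digits after lowercase letters (e.g. "interleukin2").

```python
import re

# Hypothetical helper, not part of NLPre.
# "(7)" or "(3, 4)" written directly after a letter, with no space in between.
_PAREN_REF = re.compile(r"(?<=[A-Za-z])\(\d+(?:\s*[-,]\s*\d+)*\)")

# Bare digits or ranges such as "3-5" directly after two lowercase letters,
# followed by whitespace, sentence punctuation, or the end of the text.
_FUSED_REF = re.compile(r"(?<=[a-z]{2})\d+(?:[-,]\d+)*(?=[\s.;:]|,\s|$)")


def strip_fused_references(text):
    """Remove numeric citation markers that are glued to the preceding word."""
    text = _PAREN_REF.sub("", text)
    return _FUSED_REF.sub("", text)


print(strip_fused_references("key feature in Drosophila3-5 and elegans(7)."))
```

Requiring two lowercase letters before the digits is a deliberate heuristic so that short gene-style tokens such as "p53" are left alone; a real module would likely need a dictionary or tokenizer to make that decision reliably.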

thoppe commented 7 years ago

Here's a line copied directly from Wikipedia; it should be a simple grammar fix:

import nlpre
P = nlpre.seperate_reference()
text = '''There are at least eight distinct types of modifications found on histones (see the legend box on the top left of the figure). Enzymes have been identified for acetylation,[2] methylation,[3] demethylation,[4] phosphorylation,[5] ubiquitination,[6] sumoylation,[7] ADP-ribosylation,[8] deimination,[9][10] and proline isomerization.[11]'''
# Apply the parser to the text and print the result
print(P(text))

Which gives

There are at least eight distinct types of modifications found on histones (see the legend box on the top left of the figure) . Enzymes have been identified for acetylation,[2] methylation,[3] demethylation,[4] phosphorylation,[5] ubiquitination,[6] sumoylation,[7] ADP-ribosylation,[8] deimination,[9][10] and proline isomerization .

i.e., none of the references have been removed.
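
For the Wikipedia-style case above, the expected behavior could be sketched as follows. This is an assumption-laden illustration rather than the library's grammar: `strip_bracket_references` is a hypothetical name, and the pattern simply deletes any run of bracketed integers, so it would also remove things like array indices (`x[0]`) in code-like text.

```python
import re

# Hypothetical helper, not part of NLPre.
# Matches one or more adjacent bracketed numeric citations:
# "[2]", "[9][10]", "[3, 4]", "[5-7]".
_BRACKET_REF = re.compile(r"(?:\[\d+(?:\s*[-,]\s*\d+)*\])+")


def strip_bracket_references(text):
    """Remove bracketed numeric citation markers from the text."""
    return _BRACKET_REF.sub("", text)


sample = (
    "acetylation,[2] methylation,[3] deimination,[9][10] "
    "and proline isomerization.[11]"
)
print(strip_bracket_references(sample))
```

On the sample string this leaves the commas and the final period in place and drops only the bracketed numbers.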