NIHOPA / NLPre

Python library for Natural Language Preprocessing (NLPre)

replace_acronyms should conjoin the original tokens as well #49

Closed thoppe closed 7 years ago

thoppe commented 7 years ago

Currently, in replace_acronyms, the text

non-Hodgkin lymphoma (NHL)

gets turned into

non-Hodgkin lymphoma ( non_Hodgkin_lymphoma )

but it would be ideal to turn it into

non-Hodgkin_lymphoma ( non_Hodgkin_lymphoma )

Otherwise, downstream parsers (like replace_from_dictionary) can mangle the unjoined phrase.
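The requested behavior can be sketched roughly as follows. This is only an illustration, not NLPre's implementation: `conjoin_acronym_phrase` is a hypothetical helper, and it assumes the phrase and its acronym have already been identified. It also does not reproduce the padding spaces inside the parentheses shown above.

```python
import re


def conjoin_acronym_phrase(text, phrase, acronym):
    """Hypothetical helper (not part of NLPre).

    Replaces the acronym with an underscore-joined token and also
    conjoins every occurrence of the original phrase in the document.
    """
    # Token that stands in for the acronym: spaces and hyphens become "_"
    acronym_token = re.sub(r"[\s\-]+", "_", phrase)
    # Token for the original phrase: only whitespace becomes "_"
    phrase_token = "_".join(phrase.split())

    # Replace the acronym wherever it appears as a whole word
    text = re.sub(rf"\b{re.escape(acronym)}\b", acronym_token, text)
    # Conjoin the original tokens of the phrase throughout the document
    text = re.sub(re.escape(phrase), phrase_token, text)
    return text


print(conjoin_acronym_phrase(
    "non-Hodgkin lymphoma (NHL)", "non-Hodgkin lymphoma", "NHL"
))
# prints: non-Hodgkin_lymphoma (non_Hodgkin_lymphoma)
```

Because the phrase is conjoined into a single token, a later dictionary-based replacement step would no longer see its individual words.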

HarryBaker commented 7 years ago

Should it conjoin all instances of all phrases that have acronyms associated with them? So if it runs into non-Hodgkin lymphoma anywhere else in the corpus, it would conjoin it?

thoppe commented 7 years ago

Not to the whole corpus, but yes to the whole document being passed in.

HarryBaker commented 7 years ago

Right, but within a single document, should it recognize phrases that are associated with acronyms in other documents, based on the corpus-wide counter?

So:

Doc1 = "The Environmental Protection Agency (EPA) protects trees. non-Hodgkin lymphoma is bad."
Doc2 = "My uncle has non-Hodgkin lymphoma (NHL)."

becomes

Doc1 = "The Environmental_Protection_Agency (Environmental_Protection_Agency) protects trees. non-Hodgkin_lymphoma is bad."
Doc2 = "My uncle has non-Hodgkin_lymphoma (non-Hodgkin_lymphoma)."

Should it recognize that "non-Hodgkin lymphoma" in Doc1 is associated with an acronym defined in another document, and thus should be conjoined into a single token? I think we have to, or else these phrases will be tokenized as something else during replace_from_dictionary.
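To make the corpus-wide option concrete, here is a minimal sketch. It assumes the phrases associated with acronyms have already been collected across all documents into a set; `known_phrases` and `conjoin_known_phrases` are hypothetical names, not NLPre's API. Acronym replacement itself is left out so that only the phrase-conjoining step is shown.

```python
import re


def conjoin_known_phrases(doc, phrases):
    """Hypothetical helper (not part of NLPre).

    Conjoins every occurrence of each known phrase in a single document,
    whether or not that document defines the acronym itself.
    """
    # Handle longer phrases first so they are not split by shorter ones
    for phrase in sorted(phrases, key=len, reverse=True):
        token = "_".join(phrase.split())
        doc = re.sub(re.escape(phrase), token, doc)
    return doc


# Assumed result of a corpus-wide pass over Doc1 and Doc2
known_phrases = {"Environmental Protection Agency", "non-Hodgkin lymphoma"}

doc1 = ("The Environmental Protection Agency (EPA) protects trees. "
        "non-Hodgkin lymphoma is bad.")
doc2 = "My uncle has non-Hodgkin lymphoma (NHL)."

out1 = conjoin_known_phrases(doc1, known_phrases)
out2 = conjoin_known_phrases(doc2, known_phrases)
print(out1)
print(out2)
```

Under this approach, "non-Hodgkin lymphoma" in Doc1 is conjoined even though the acronym NHL is only defined in Doc2. The trade-off is that each document's output then depends on the rest of the corpus.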