thoppe closed this issue 7 years ago
Should it conjoin all instances of all phrases that have acronyms associated with them? So if it runs into non-Hodgkin lymphoma anywhere else in the corpus, would it conjoin it?
Not to the whole corpus, but yes to the whole document being passed in.
Right, but within a single document, should it recognize phrases that are associated with acronyms in other documents, based on the corpus-wide counter?
So:
Doc1 = "The Environmental Protection Agency (EPA) protects trees. non-Hodgkin lymphoma is bad."
Doc2 = "My uncle has non-Hodgkin lymphoma (NHL)."
becomes
Doc1 = "The Environmental_Protection_Agency (Environmental_Protection_Agency) protects trees. non-Hodgkin_lymphoma is bad."
Doc2 = "My uncle has non-Hodgkin_lymphoma (non-Hodgkin_lymphoma)."
Should it recognize that "non-Hodgkin lymphoma" in Doc1 is associated with a phrase, and thus should be tokenized? I think we have to, or else these phrases will be tokenized as something else during replace_from_dictionary.
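To make the two options concrete, here is a minimal sketch of the difference between per-document and corpus-wide replacement. It is not the project's actual replace_acronyms code: the helper names find_acronym_phrases and replace_phrases, the regex, and the initials-matching rule are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Assumption: an acronym is 2+ capital letters inside parentheses.
ACRONYM_RE = re.compile(r"\(([A-Z]{2,})\)")


def find_acronym_phrases(doc):
    """Return (phrase, acronym) pairs found in a single document.

    Hypothetical matching rule: walk backwards from the parenthesis and
    accept the shortest run of words whose initials (hyphenated words
    contribute one initial per part) spell the acronym.
    """
    pairs = []
    for match in ACRONYM_RE.finditer(doc):
        acronym = match.group(1)
        words = doc[:match.start()].split()
        for n in range(1, len(words) + 1):
            candidate = words[-n:]
            initials = "".join(
                part[0] for w in candidate for part in w.split("-") if part
            )
            if len(initials) >= len(acronym):
                if initials.upper() == acronym:
                    pairs.append((" ".join(candidate), acronym))
                break
    return pairs


def replace_phrases(doc, pairs):
    """Conjoin each known phrase with underscores and expand its acronym."""
    for phrase, acronym in pairs:
        token = phrase.replace(" ", "_")
        doc = doc.replace(phrase, token)
        doc = re.sub(r"\b%s\b" % re.escape(acronym), token, doc)
    return doc


doc1 = ("The Environmental Protection Agency (EPA) protects trees. "
        "non-Hodgkin lymphoma is bad.")
doc2 = "My uncle has non-Hodgkin lymphoma (NHL)."

# Per-document: Doc1 only knows about the pairs it defines itself.
local_out = replace_phrases(doc1, find_acronym_phrases(doc1))

# Corpus-wide: count pairs over all documents first, then apply to each.
counter = Counter()
for doc in (doc1, doc2):
    counter.update(find_acronym_phrases(doc))
corpus_out = [replace_phrases(doc, counter) for doc in (doc1, doc2)]
```

With the per-document pairs, "non-Hodgkin lymphoma" in Doc1 stays as two separate words; with the corpus-wide counter it is conjoined, which is the behavior being asked about here.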
Currently in replace_acronyms
Gets turned into
but it would be ideal to turn it into
otherwise downstream parsers (like replace_from_dict) can mangle this.