AmyOlex closed this issue 6 years ago
I fixed this by rewriting the getWhitespaceTokens() method in utils.py. It now also performs sentence tokenization to identify the last word of each sentence. When the temporal expression phrase extractor finds that a token is the last one in its sentence, it ends the current temporal phrase and starts a new one. This eliminates false positives like the one in the example provided in this issue.
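The idea described above can be sketched roughly as follows. This is only an illustration, not the actual code in utils.py: the function names `whitespace_tokens` and `temporal_phrases`, the regex-based sentence-end check, and the toy `is_temporal` predicate are all assumptions made for the example, and the text after "2" in the sample string is invented.

```python
import re

# Assumption: a token ends a sentence if it ends in . ! or ?,
# optionally followed by closing quotes or brackets.
SENTENCE_END = re.compile(r"""[.!?]["')\]]*$""")

def whitespace_tokens(text):
    """Return (token, start, end, ends_sentence) for each whitespace-delimited token."""
    matches = list(re.finditer(r"\S+", text))
    tokens = []
    for i, m in enumerate(matches):
        is_last = i == len(matches) - 1
        ends_sentence = is_last or bool(SENTENCE_END.search(m.group()))
        tokens.append((m.group(), m.start(), m.end(), ends_sentence))
    return tokens

def temporal_phrases(tokens, is_temporal):
    """Group consecutive temporal tokens, never letting a phrase cross a sentence end."""
    phrases, current = [], []
    for tok, _start, _end, ends_sentence in tokens:
        if is_temporal(tok):
            current.append(tok)
        elif current:
            # A non-temporal token closes the phrase in progress.
            phrases.append(" ".join(current))
            current = []
        if ends_sentence and current:
            # The last token of a sentence also closes the phrase.
            phrases.append(" ".join(current))
            current = []
    return phrases

def is_temporal(tok):
    # Toy stand-in for the real temporal-token check.
    core = tok.strip(".,")
    return core.isdigit() or core.lower() == "december"

# Hypothetical sample: the words after "2" are made up for illustration.
sample = 'my notes from December.\n2 tablets daily'
tokens = whitespace_tokens(sample)
phrases = temporal_phrases(tokens, is_temporal)
print(phrases)
```

With the sentence-end flag in place, "December." and "2" come out as two separate phrases instead of being merged into one.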
File ID051_clinic_148 has the following text: "my notes from December.
Here "December" and the "2" that begins the next line are separated by a newline, but the program does not seem to recognize the break. I need to review the newline handling in the temporal phrase extraction algorithm to figure out what is going on.
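One way to check whether the newline survives tokenization is to look at the raw text between consecutive tokens. The snippet below is a minimal diagnostic sketch under that assumption. The helper name `tokens_with_newline_flag` is hypothetical and not part of the project, and the words after "2" in the sample are invented.

```python
import re

def tokens_with_newline_flag(text):
    """Yield (token, preceded_by_newline) for each whitespace-delimited token."""
    prev_end = 0
    for m in re.finditer(r"\S+", text):
        gap = text[prev_end:m.start()]  # whitespace between previous token and this one
        yield m.group(), "\n" in gap
        prev_end = m.end()

# Hypothetical sample: the words after "2" are made up for illustration.
sample = "my notes from December.\n2 tablets daily"
flags = list(tokens_with_newline_flag(sample))
for token, after_newline in flags:
    print(repr(token), after_newline)
```

If the flag for "2" comes out false on the real file, the newline is probably being lost before the phrase extractor sees the tokens.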