AmyOlex / Chrono

Parsing time normalizations from text.
GNU General Public License v3.0

[Bug] Cleaning Text #29

Closed: maffeyl closed this issue 6 years ago

maffeyl commented 6 years ago

In the THYME training set, doc0056_CLIN contains the token "2006.", which can't be converted to int because of the trailing period. Other punctuation is stripped on line 2489 of TimePhrase_to_Chrono, in hasYear, before the text is passed back to create the year entity. I'm going to try stripping the period as well in my own branch and see how it goes.

maffeyl commented 6 years ago

Stripping periods in hasYear does solve this particular issue, but I'm not sure whether it will reduce our ability to identify years. I can't think of how it would, but more rigorous thought is warranted.
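For reference, here is a minimal sketch of the strip-then-convert idea. It is not the actual hasYear code: year_token_to_int is a hypothetical helper, and it assumes punctuation only needs to be removed from the ends of the token.

```python
import string


def year_token_to_int(token):
    """Hypothetical helper (not part of Chrono): strip leading and
    trailing punctuation from a candidate year token, then convert."""
    cleaned = token.strip(string.punctuation)
    return int(cleaned)  # still raises ValueError if non-digits remain


print(year_token_to_int("2006."))

# Stripping the ends does not help when junk sits inside the token.
try:
    year_token_to_int("2012**note")
except ValueError as err:
    print("still fails:", err)
```

Note that this handles a trailing period or quote, but a token with text attached after the year still raises ValueError, so stripping alone is probably not a complete fix.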

maffeyl commented 6 years ago

Also got this error on doc0178_CLIN: ValueError: invalid literal for int() with base 10: '2012**note'

AmyOlex commented 6 years ago

Also got this error for file ID004_clinic_012: ValueError: invalid literal for int() with base 10: '2010"'. And again for files ID004_path_011 and ID181_clinic_529... actually, all of them have this error! The offending value is the doc time in the metadata line. This needs to be fixed in our code; working on that now.

AmyOlex commented 6 years ago

I think I fixed this error by using .group(0) in the hasYear() method, so that it returns the matched string instead of the original string.
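A rough sketch of that approach, for anyone following along. The regex and the function name extract_year are assumptions for illustration; the actual pattern used in hasYear() is not shown in this thread.

```python
import re

# Assumed pattern: a four-digit year from 1900 to 2099 that is not
# part of a longer run of digits. Chrono's real pattern may differ.
YEAR_PATTERN = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")


def extract_year(text):
    """Hypothetical stand-in for the year lookup: return the matched
    year as an int, or None if the text contains no year."""
    match = YEAR_PATTERN.search(text)
    if match is None:
        return None
    # group(0) is only the matched digits, so attached junk such as
    # '.', '"' or '**note' never reaches int().
    return int(match.group(0))


for token in ["2006.", "2012**note", '2010"']:
    print(token, extract_year(token))
```

Converting only the matched substring means the cleanup no longer depends on listing every punctuation character that might be attached to the year.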