Closed by AmyOlex 6 years ago
Wow, ok, I figured out why we are not getting the dates in the metadata line. I don’t think this method was ever working correctly! In the hasYear() method we were using re to identify the substring that matched a formatted date pattern. The problem was on line 2510 of TimePhrase_to_Chrono, where we were taking the matched string (which was correct) and splitting it by the formatted date punctuation:

split_result = re.split("/-:", result)

The issue was the pattern "/-:". Without square brackets, re treats it as the literal three-character sequence /-: rather than as a set of separators, so the date was never split. I tested it out, and the order inside the brackets matters too: in '[/-:]' the hyphen defines a range from / to :, which also covers the digits, so the hyphen has to go last, as in '[/:-]'. This changed the line to:

split_result = re.split('[/:-]', result)

And now everything is working! I need to go through the code to identify where else we are still doing this. I bet it is affecting the month-of-year and day-of-month entities as well.
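For anyone who wants to see the difference between the three patterns, here is a small standalone sketch. It is not the Chrono code itself; the variable names and the sample date string are just for illustration.

```python
import re

result = "04/01/2010"  # sample formatted date, for illustration only

# No brackets: the pattern is the literal sequence "/-:", which never
# occurs in the date, so the string comes back unsplit.
no_brackets = re.split("/-:", result)

# Hyphen between "/" and ":" is a range (ASCII 47 to 58), which covers
# the digits 0-9 as well, so every character is treated as a separator.
range_class = re.split("[/-:]", result)

# Hyphen last: a plain character class containing "/", ":" and "-".
fixed = re.split("[/:-]", result)

print(no_brackets)  # ['04/01/2010']
print(range_class)  # only empty strings
print(fixed)        # ['04', '01', '2010']
```

Escaping the hyphen ('[/\-:]') or putting it first ('[-/:]') should behave the same as putting it last.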
Nope, it was not an issue in the month and day methods because I had already rewritten those methods another way.
Ack! It was still returning the wrong span, but I fixed it. When getting the relative span of the substring, it was using the matched string and not the full text string that was being iterated over in the for loop. I changed it so that it gets the relative span with respect to the full text string and not just the matched string.
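A rough sketch of the span problem, assuming a simplified date pattern and a made-up token (DATE_RE and the variable names here are hypothetical, not the ones in TimePhrase_to_Chrono):

```python
import re

# Hypothetical, simplified formatted-date pattern for illustration.
DATE_RE = re.compile(r"\d{1,2}[/:-]\d{1,2}[/:-]\d{4}")

text = 'rev_date="04/01/2010"'   # the full text being iterated over
match = DATE_RE.search(text)
matched = match.group(0)          # '04/01/2010'

# Wrong: offsets measured inside the matched string always start at 0.
wrong_span = DATE_RE.search(matched).span()

# Right: offsets measured against the full text string.
right_span = match.span()

print(wrong_span)                           # (0, 10)
print(right_span)                           # (10, 20)
print(text[right_span[0]:right_span[1]])    # 04/01/2010
```

To get document-level offsets, the start offset of the text within the document would still need to be added to the right-hand span.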
We are now not hitting the formatted dates in the header line “[meta rev_date="04/01/2010" start_date="04/01/2010" rev="0004"]”. We were not getting the second date anyway, but now we are not getting either.
The reason we are not getting both dates is that the entire line is considered a single temporal phrase: “rev_date="11/18/2010" start_date="11/18/2010" rev="0002"]”. Thus, we are only getting the first match and not the second. The string “rev_date="11/18/2010"” is a single token that is marked as temporal because it has a date in it. The same goes for the start_date token, and since the two are consecutive they end up in the same phrase. We could add something to replace all equal signs with a space, but this might mess up other areas. The rev="0002" token comes up because it is a 4-digit number. I might not worry about the second date right now, but we do need to capture at least the first one.
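One possible way to pick up both dates without touching the equal signs would be to collect every match in the phrase instead of only the first. This is only a sketch of the idea, not what Chrono currently does; DATE_RE and find_all_dates are hypothetical names and the pattern is simplified.

```python
import re

# Hypothetical, simplified formatted-date pattern for illustration.
DATE_RE = re.compile(r"\d{1,2}[/:-]\d{1,2}[/:-]\d{4}")

def find_all_dates(phrase):
    """Return (matched_text, (start, end)) for every date in the phrase.

    Spans are relative to the full phrase, not to the matched string.
    """
    return [(m.group(0), m.span()) for m in DATE_RE.finditer(phrase)]

phrase = 'rev_date="11/18/2010" start_date="11/18/2010" rev="0002"]'
dates = find_all_dates(phrase)

for matched_text, (start, end) in dates:
    print(matched_text, start, end)
```

With this pattern the 4-digit rev value is not picked up as a date, because it has no separators.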