Closed cccntu closed 3 years ago
Another point for discussion:
During parsing there are a few exceptions that I capture with try, but I am not sure whether there could be other exceptions.
Maybe it'd be worthwhile to use a more complex regex.
(The regex originally comes from @tianjianjiang, added as reviewer for discussion.)
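For reference, here is a minimal sketch of the kind of try wrapper under discussion. It is not the code from this PR: the function name parse_date_or_none is hypothetical, and the use of fuzzy=True is an assumption. It catches only the two exception types that dateutil's parser documents, ValueError (which covers its ParserError) and OverflowError.

```python
from datetime import datetime
from typing import Optional

from dateutil.parser import parse


def parse_date_or_none(text: str) -> Optional[datetime]:
    """Hypothetical helper: return a parsed datetime, or None on failure."""
    try:
        # fuzzy=True (an assumption here) lets dateutil skip unknown tokens.
        return parse(text, fuzzy=True)
    except (ValueError, OverflowError):
        # ValueError covers dateutil's ParserError for unparseable input;
        # OverflowError is raised when a numeric field is out of range.
        return None
```

Catching a narrow tuple rather than a bare except means any undocumented exception still surfaces, which is one way to find out whether more cases exist.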
Regarding regex vs dateutil, I'd definitely go with the latter if it is reasonably fast and the regex is not able to detect all the dates that you've listed.
A few notes for discussion:
dateutil inside this repo, so there is no risk of messing up the dependency with pip install -e
import dateutil and relative import in dateutil, but that doesn't seem to be an issue, so I left it that way.
Alternative solution: regex
I tried parsing with a simple regex and compared the results with this version. Here are some of the dates the regex did not parse.
That's ~16% fewer dates. The URLs are from this dataset: https://huggingface.co/datasets/bs-modeling-metadata/c4_newslike_url_only
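The comparison could be reproduced roughly as follows. This is only a sketch under assumptions: the pattern DATE_IN_URL is a hypothetical "simple regex" (not the one from this thread) that matches YYYY/MM/DD or YYYY-MM-DD, and regex_date and missed_by are illustrative names. missed_by lists the URLs where a baseline extractor finds a date but the candidate does not.

```python
import re
from datetime import date
from typing import Callable, Iterable, List, Optional

# Hypothetical simple pattern: YYYY/MM/DD or YYYY-MM-DD, not glued to digits.
DATE_IN_URL = re.compile(r"(?<!\d)(\d{4})[/-](\d{1,2})[/-](\d{1,2})(?!\d)")

Extractor = Callable[[str], Optional[date]]


def regex_date(url: str) -> Optional[date]:
    """Return the first date-like match in the URL, or None."""
    match = DATE_IN_URL.search(url)
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits matched the pattern but are not a real calendar date.
        return None


def missed_by(urls: Iterable[str], baseline: Extractor,
              candidate: Extractor) -> List[str]:
    """URLs where baseline finds a date but candidate finds none."""
    return [u for u in urls
            if baseline(u) is not None and candidate(u) is None]
```

Dividing len(missed_by(...)) by the number of URLs the baseline parses gives the relative shortfall of the candidate.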