bigscience-workshop / metadata

Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.
Apache License 2.0

Add timestamp processor #50

Closed cccntu closed 3 years ago

cccntu commented 3 years ago

A few notes for discussion:

Alternative solution: regex

I tried parsing the URLs with a simple regex and compared its results with this version. The regex was:

r"(20[0-2][0-9]|19\d\d)[/\-](0?[1-9]|10|11|12)[/\-](0?[1-9]|[12]\d|3[01])"

Here are some of the URLs whose dates the regex did not parse:
/news/2018/sep/24/community-calendar-sept-25-2018/
/entertainment/lehigh-valley-music/mc-ent-elvis-costello-review-sands-bethlehem-event-center-20181103-story.html
/news/2017/jan/01/births-announcements-part-ii-january-1-2016/
/articles/view/20180624/gozo/organ-masterclasses.682695
/nation-world/hc-wp-trump-racist-ad-20181102-story.html
/16-Apr-2019/sc-adjourns-hearing-of-orange-line-metro-train-case-till-friday
/news/20130731/project-gets-new-life-planned-condo-development-would-be-first-in-six-years-in-southeast-volusia
/news/20070510/mccheadlocal-sports-briefsmcchead/1
/stories/2018/may/29/harvard-study-estimates-thousands-died-in-puerto-r/
/sports/baseball/cubs/ct-kyle-schwarber-home-run-ball-wrigley-20160411-story.html
/environment/climate-change/huge-blow-backtoback-bleaching-covers-twothirds-of-the-great-barrier-reef-20170406-gvewah.html
/sports/20190103/young-beaver-girls-team-edged-by-avonworth-in-key-section-clash
/sports/report/serie-a-beaten-inters-lead-cut-juventus-napoli-win-easy/20151221.htm
/news/2017/apr/18/stringers-its-really-university-city-after-burner/
/sport/motorsport/stars-driven-around-the-bend-before-whincup-claims-pole-position-20180825-p4zzrc.html
/news/datablog/2011/oct/12/afghanistan-nato-kill-capture-raids-isaf-petraeus
/news/20071225/mccheadlocal-news-briefsmcchead/1
/business/realestate/hot-property/la-fi-hotprop-ellen-pompeo-hollywood-20171220-story.html
/sports/baseball/yankees/ny-sports-dallas-keuchel-yankees-free-agency-20181111-story.html
/news/politics/national/story/2015/may/24/fleischmann-already-pulling-big-dough/306023/
/national/western-australia/measles-alert-issued-for-perth-universities-public-transport-routes-20190201-p50v7t.html
/recap-11-05-2018-26629.html
/20121003/cleveland-county-building-permits/310039786
/archives/la-xpm-2010-jun-10-la-sp-0611-usc-ncaa-appeal-20100611-story.html
/archives/la-xpm-1998-sep-01-me-18472-story.html

That's ~16% fewer dates than this version finds. The URLs are from this dataset: https://huggingface.co/datasets/bs-modeling-metadata/c4_newslike_url_only
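To make the comparison concrete, here is a minimal sketch that runs the regex quoted above against four of the listed paths, one per pattern family. The helper name `find_date` and the final path `/2018/09/05/example-story/` are made up for illustration and are not part of the PR or the dataset.

```python
import re

# The simple regex quoted above: year, month and day separated by "/" or "-".
SIMPLE_DATE = re.compile(
    r"(20[0-2][0-9]|19\d\d)[/\-](0?[1-9]|10|11|12)[/\-](0?[1-9]|[12]\d|3[01])"
)

# Paths copied from the list above, one per pattern family.
missed = [
    "/news/2018/sep/24/community-calendar-sept-25-2018/",  # month as a name
    "/articles/view/20180624/gozo/organ-masterclasses.682695",  # compact YYYYMMDD
    "/16-Apr-2019/sc-adjourns-hearing-of-orange-line-metro-train-case-till-friday",  # day first
    "/recap-11-05-2018-26629.html",  # year last
]

# Hypothetical path (not from the dataset) in the one layout the regex expects.
matched = "/2018/09/05/example-story/"


def find_date(path):
    """Return the (year, month, day) strings, or None if the regex finds nothing."""
    m = SIMPLE_DATE.search(path)
    return m.groups() if m else None


for path in missed + [matched]:
    print(path, "->", find_date(path))
```

The four listed paths fall into families the pattern cannot reach: month names, compact YYYYMMDD, day-first, and year-last layouts.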

cccntu commented 3 years ago

Another point for discussion:

During parsing, a few exceptions come up that I catch with try, but I am not sure whether other exceptions could still occur. Maybe it'd be worthwhile to use a more complex regex.

(The regex originally comes from @tianjianjiang, added as reviewer for discussion.)
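One kind of exception can be shown with the standard library alone: a string can match the date pattern without being a real calendar date. The sketch below is an assumption about the general approach, not the code in this PR; `extract_date` is a hypothetical helper. It also reorders the day alternatives, because with `0?[1-9]` listed first the quoted regex stops after the first digit of a day such as 31.

```python
import re
from datetime import date

# Variant of the regex from this thread. The day alternatives are reordered
# (an adjustment made for this sketch) so two-digit days are tried first.
DATE_RE = re.compile(
    r"(20[0-2][0-9]|19\d\d)[/\-](0?[1-9]|1[0-2])[/\-]([12]\d|3[01]|0?[1-9])"
)


def extract_date(path):
    """Return a datetime.date for the first date-like match, else None."""
    m = DATE_RE.search(path)
    if m is None:
        return None
    year, month, day = (int(g) for g in m.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # The pattern matched, but the calendar rejects it (e.g. February 31).
        return None


print(extract_date("/2018/09/24/example-story/"))
print(extract_date("/2018/02/31/example-story/"))
```

Catching only `ValueError` keeps unexpected failures visible instead of silently swallowing them, which is one way to find out whether other exceptions occur on the full dataset.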

timoschick commented 3 years ago

Regarding regex vs dateutil, I'd definitely go with the latter if it is reasonably fast and the regex cannot detect all the dates that you've listed.
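Whether a parser is "reasonably fast" can be checked with a small timing helper. The sketch below is only an illustration: `seconds_per_url` is a hypothetical name, and it accepts any callable, so a dateutil-based function could be passed in place of the regex used here.

```python
import re
import time

# The simple regex from this thread, used here only as something to time.
SIMPLE_DATE = re.compile(
    r"(20[0-2][0-9]|19\d\d)[/\-](0?[1-9]|10|11|12)[/\-](0?[1-9]|[12]\d|3[01])"
)


def seconds_per_url(parse, paths, repeats=1000):
    """Average wall-clock seconds that parse(path) takes per URL path."""
    start = time.perf_counter()
    for _ in range(repeats):
        for path in paths:
            parse(path)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(paths))


# Two paths from the list earlier in this thread.
sample = [
    "/news/2018/sep/24/community-calendar-sept-25-2018/",
    "/20121003/cleveland-county-building-permits/310039786",
]

print("regex:", seconds_per_url(SIMPLE_DATE.search, sample), "s per URL")
```

Multiplying the per-URL figure by the number of URLs in the dataset gives a rough estimate of total preprocessing time for each candidate parser.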