bigscience-workshop / metadata

Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.
Apache License 2.0

Updates #164

Closed cccntu closed 1 year ago

cccntu commented 1 year ago

@tianjianjiang I'm not sure what the status of the timestamp metadata is. Do we need to do something extra after json.load()? Or is it already solved in the master branch, so that we just need a merge?

tianjianjiang commented 1 year ago

@cccntu Do you mean the fix for Unix epoch timestamps in milliseconds? That is already merged, although the conversion still happens on the fly (after json.load(), if that's what you mean).
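For readers following along, an on-the-fly conversion of this kind could look roughly like the sketch below. It is not the code that was merged: the function name `normalize_timestamp`, the field name `timestamp`, and the cutoff `MS_THRESHOLD` are assumptions made for illustration only.

```python
import json
from datetime import datetime, timezone

# Assumed cutoff: 10**11 seconds would be around the year 5138, so any
# numeric value at least this large is treated as milliseconds.
# Caveat: millisecond values before March 1973 fall under the cutoff.
MS_THRESHOLD = 10**11


def normalize_timestamp(value):
    """Return a UTC datetime for an epoch value in s or ms, else None."""
    if isinstance(value, str):
        try:
            value = float(value)  # numeric strings such as "1609459200"
        except ValueError:
            return None  # non-numeric strings are left to a date parser
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    seconds = value / 1000 if abs(value) >= MS_THRESHOLD else value
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Hypothetical record; the real metadata schema may differ.
record = json.loads('{"timestamp": 1609459200000}')
parsed = normalize_timestamp(record["timestamp"])
```

The same function accepts a value in seconds unchanged, so records written before and after such a fix would end up with the same representation.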

There were minor issues I mentioned a while back (three meetings ago, as far as I can recall), but early this month I realized that those issues were rare (see the footnote) and that we probably wouldn't want to change the content of the dataset again. Therefore, the only thing that might still be worth doing, though it is not mandatory at all, is to ensure that timestamps and other values physically use consistent data types across the jsonl.gz files, so that no special handling is needed after json.load().
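One way to check whether the data types are in fact consistent is to scan the files and record which JSON types occur for each top-level key. The following is only a sketch under assumptions: the helper names `collect_field_types` and `inconsistent_fields` are hypothetical, and it assumes each line of a jsonl.gz file is a single JSON object.

```python
import gzip
import json
from collections import defaultdict


def collect_field_types(paths):
    """Map each top-level key to the set of Python type names seen for it."""
    seen = defaultdict(set)
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines
                for key, value in json.loads(line).items():
                    seen[key].add(type(value).__name__)
    return dict(seen)


def inconsistent_fields(field_types):
    """Keep only the keys that appear with more than one type."""
    return {key: types for key, types in field_types.items() if len(types) > 1}
```

Running `inconsistent_fields(collect_field_types(paths))` over a list of file paths would list the keys, if any, that still need special treatment after loading.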


Rare timestamp issues

First, I was wrong about the default month/day, because I had missed that @cccntu's version of dateutil has an additional flag to turn that off. Wrong timestamps are therefore extremely rare. I haven't done a thorough check, but one of my heuristics indicates that, from pq00-* to pq01-017, only two timestamps are inaccurate, and both are understandable because their URL paths have unusual traits, as shown below.