Closed cccntu closed 1 year ago
@tianjianjiang I'm not sure what's the status of the timestamp metadata. Do we need to do something extra after json.load? Or is it already solved in the master branch and we just need a merge?
@cccntu Do you mean the fix for Unix Epoch in ms? That is already merged, although the processing is still done on the fly (after json.load(), if that's what you mean).
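To make "on the fly" concrete, here is a minimal sketch of the kind of step that could run after json.load(). It is not the code merged in the repository: the helper name `to_datetime` is hypothetical, and it assumes a timestamp arrives either as Unix epoch milliseconds (a number) or as an ISO 8601 string.

```python
from datetime import datetime, timezone


def to_datetime(value):
    """Normalize a timestamp value to a timezone-aware UTC datetime.

    Hypothetical helper: assumes numbers are Unix epoch milliseconds
    and strings are ISO 8601 (naive strings are treated as UTC).
    """
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, bool):
        raise TypeError("boolean is not a valid timestamp")
    if isinstance(value, (int, float)):
        # Epoch milliseconds -> seconds.
        return datetime.fromtimestamp(value / 1000, tz=timezone.utc)
    if isinstance(value, str):
        parsed = datetime.fromisoformat(value)
        if parsed.tzinfo is None:
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc)
    raise TypeError(f"unsupported timestamp type: {type(value).__name__}")
```

With this, both `to_datetime(1609459200000)` and `to_datetime("2021-01-01T00:00:00")` resolve to the same instant, which is the special treatment that consistent on-disk types would make unnecessary.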
There were minor issues I mentioned a while back (three meetings ago, as far as I can recall), but then (early this month) I realized that those issues were rare (see the footnote) and we probably wouldn't want to change the content of the dataset again. Therefore, the only thing that might still be worth doing, though it is not mandatory at all, is to ensure that timestamps and other values are physically stored with consistent data types across the jsonl.gz files, so that no special treatment is needed after json.load().
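A rough way to check whether the data types really are consistent could look like the sketch below. It assumes each line of a jsonl.gz file is a JSON object and only inspects top-level fields; the function names are hypothetical and not part of the repository.

```python
import gzip
import json
from collections import defaultdict
from pathlib import Path


def collect_field_types(paths):
    """Map each top-level field name to the set of Python type names seen."""
    seen = defaultdict(set)
    for path in paths:
        # "rt" decompresses and decodes to text in one step.
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                if not line.strip():
                    continue  # skip blank lines
                record = json.loads(line)
                for field, value in record.items():
                    seen[field].add(type(value).__name__)
    return dict(seen)


def inconsistent_fields(paths):
    """Return only the fields that appear with more than one type."""
    return {
        field: types
        for field, types in collect_field_types(paths).items()
        if len(types) > 1
    }
```

Calling `inconsistent_fields(sorted(Path("data").glob("*.jsonl.gz")))` (the `data` directory is a placeholder) would list, for example, a timestamp field stored as `int` in one file and `str` in another.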
Rare issues of timestamp
First, I was wrong about the default month/day, because I missed that @cccntu's version of dateutil has an additional flag to turn that off.
Therefore, wrong timestamps are extremely rare. I haven't done a thorough check, but one of my heuristics indicates that, from pq00-* to pq01-017, only two timestamps are inaccurate. Both errors are understandable, because the URL paths have unusual traits, as shown below.
les+technoperes+tome+1+la+pre+ecole+techno+de+alexandro+jodorowsky+15+avril+1998
mercruiser+service+manual+09+mercury+marines+gm+v+8+cylinder+1987+1988
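As an illustration only, the sketch below pulls year-like tokens out of such URL paths. It is not the heuristic mentioned above, which is not described here; the function name and the 19xx/20xx pattern are assumptions. It just shows that both paths carry date-like tokens a parser could pick up.

```python
import re

# Four-digit tokens starting with 19 or 20, not embedded in longer numbers.
YEAR_PATTERN = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")


def year_tokens(url_path):
    """Return the year-like tokens found in a '+'-separated URL path."""
    return YEAR_PATTERN.findall(url_path.replace("+", " "))
```

For the two paths above, this yields one candidate year for the first (1998, preceded by "15 avril") and two for the second (1987 and 1988).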