Anyone interested in this PR is welcome to review it, even while it is still a draft. Thank you!
Background
OpenWebText mainly serves as a baseline comparable to GPT-2 for HTML. In other words, if we don't need such a benchmark, any part of C4 with HTML would do.
Either way, we need the HTML from the WARC files of Common Crawl (CC).
Status
No actual changes yet, just a plan and some caveats.
[ ] Get every CC record as fast as possible
[ ] Parse OpenWebText URL filenames for their year and month
[ ] Use the year-month as from_ts and to in order to narrow down the CC indexes (see the 1st snippet below)
Otherwise, the CC Indexer API looks for the most recent records. That is not only a waste of time but also confusing when same-URL records have updated or even missing (page moved) contents.
[ ] Run queries in parallel (but not so many as to violate CC's policy)
[ ] Extract HTML from each CC record's WARC on Jean Zay
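The filename-to-timestamp steps in the list above could be sketched as follows. This is only a sketch under assumptions: the filename pattern (`RS_2018-08.bz2.deduped.txt`) and the helper name `month_range_from_filename` are illustrative and not taken from this PR, and the timestamps use the 14-digit `YYYYMMDDhhmmss` form commonly accepted by CDX indexes.

```python
import calendar
import re

# Assumption: each OpenWebText URL file embeds its month as YYYY-MM,
# e.g. "RS_2018-08.bz2.deduped.txt". Adjust the pattern if the names differ.
_YEAR_MONTH = re.compile(r"(\d{4})-(\d{2})")


def month_range_from_filename(filename):
    """Return (from_ts, to) covering the whole month named in `filename`."""
    match = _YEAR_MONTH.search(filename)
    if match is None:
        raise ValueError(f"no YYYY-MM found in {filename!r}")
    year, month = int(match.group(1)), int(match.group(2))
    if not 1 <= month <= 12:
        raise ValueError(f"invalid month in {filename!r}")
    # monthrange returns (weekday of the 1st, number of days in the month).
    last_day = calendar.monthrange(year, month)[1]
    from_ts = f"{year:04d}{month:02d}01000000"
    to = f"{year:04d}{month:02d}{last_day:02d}235959"
    return from_ts, to


if __name__ == "__main__":
    print(month_range_from_filename("RS_2018-08.bz2.deduped.txt"))
```

The returned pair can then be passed as the `from_ts` and `to` values of the index query, so that only captures from that month are considered.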
The library used is cdx_toolkit, because the other well-known choices are either inconvenient (a command-line tool rather than a library) or insufficient (e.g., michaelharms/comcrawl is usually faster but has no filter for 301 responses).
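For the parallel-query item in the task list, one possible way to cap the load is a small thread pool combined with a minimum interval between request starts. The sketch below uses only the standard library. The names `MinIntervalLimiter` and `run_throttled` are hypothetical, and the default values for `max_workers` and `min_interval` are placeholders rather than CC's actual limits.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class MinIntervalLimiter:
    """Space out calls so that starts are at least `min_interval` seconds apart."""

    def __init__(self, min_interval):
        self._min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        # Reserve the next free start slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            start = max(now, self._next_allowed)
            self._next_allowed = start + self._min_interval
        delay = start - now
        if delay > 0:
            time.sleep(delay)


def run_throttled(query, urls, max_workers=4, min_interval=1.0):
    """Apply `query` to every URL in parallel; results keep the input order."""
    limiter = MinIntervalLimiter(min_interval)

    def task(url):
        limiter.wait()
        return query(url)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, urls))
```

Here `query` stands for whatever function performs a single index lookup. Both knobs should be set from CC's published guidance rather than from these placeholder values.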
HTML of https://www.defense.gov/News/News-Releases/News-Release-View/Article/621692/dod-identifies-air-force-casualties/
```html
DOD Identifies Air Force Casualties > U.S. DEPARTMENT OF DEFENSE > News Release View
```
related: #58 #59 #60
[ ] Use warcio to extract HTML
Tentative code snippets
Test case: https://www.defense.gov/News/News-Releases/News-Release-View/Article/621692/dod-identifies-air-force-casualties/ (based on the corresponding file in the OpenWebText URL list, we already know that this URL is from 2018-08).
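A month-restricted lookup for this test case might look roughly like the sketch below. It is not the PR's actual snippet. It assumes that cdx_toolkit's `CDXFetcher(source="cc")` and its `iter()` method accept `from_ts`, `to`, `filter`, and `limit`, and that captures support dict-style field access. The helper names `month_query_kwargs` and `find_captures` are made up for illustration.

```python
TEST_URL = (
    "https://www.defense.gov/News/News-Releases/News-Release-View/"
    "Article/621692/dod-identifies-air-force-casualties/"
)


def month_query_kwargs(from_ts="20180801000000", to="20180831235959", limit=5):
    """Query arguments restricted to 2018-08 that keep only HTTP 200 captures."""
    # "=status:200" is an exact-match filter; it also excludes 301 redirects.
    return {"from_ts": from_ts, "to": to, "filter": "=status:200", "limit": limit}


def find_captures(url=TEST_URL, **overrides):
    """Return (timestamp, status, url) for each matching CC capture.

    Requires network access and the third-party cdx_toolkit package.
    """
    # Imported lazily so the helper above works without the package installed.
    import cdx_toolkit

    cdx = cdx_toolkit.CDXFetcher(source="cc")
    kwargs = month_query_kwargs(**overrides)
    # Assumption: each capture allows dict-style access to its index fields.
    return [
        (obj["timestamp"], obj["status"], obj["url"])
        for obj in cdx.iter(url, **kwargs)
    ]
```

Calling `find_captures()` queries the CC index over the network, so it is best tried on a single URL before any parallel run.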
The expected outcome in this case
HTML of https://www.defense.gov/News/News-Releases/News-Release-View/Article/621692/dod-identifies-air-force-casualties/
```html