CurationCorp / curation-corpus

Code for obtaining the Curation Corpus abstractive text summarisation dataset
Creative Commons Attribution 4.0 International

Web Archive Links #1

Closed: trtm closed this issue 4 years ago

trtm commented 4 years ago

One way to keep the dataset consistent over time would be to check for an existing snapshot of each URL (or create a new one) in the Internet Archive's Wayback Machine (https://web.archive.org/).

tomjennings100 commented 4 years ago

Hey trtm, thanks for your note.

If a request fails during the scrape, we do fall back to the Wayback Machine (https://github.com/CurationCorp/curation-corpus/blob/master/web_scraper.py#L17). I'm reluctant to use it for every request because I don't want to hammer them.
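For readers following along, here is a minimal sketch of that "live site first, archive only on failure" pattern. It is not the code in web_scraper.py: the names `wayback_url`, `http_get`, and `fetch_with_fallback` are hypothetical, and it assumes that `https://web.archive.org/web/<url>` resolves to the most recent snapshot of `<url>`.

```python
import urllib.request
from typing import Callable, Optional

# Assumption: prefixing a URL like this resolves to its latest snapshot.
WAYBACK_PREFIX = "https://web.archive.org/web/"


def wayback_url(url: str) -> str:
    """Build the Wayback Machine address for the given URL."""
    return WAYBACK_PREFIX + url


def http_get(url: str, timeout: float = 10.0) -> str:
    """Fetch a page body as text; raises an OSError subclass on failure."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def fetch_with_fallback(
    url: str, get: Callable[[str], str] = http_get
) -> Optional[str]:
    """Try the live URL first; contact the archive only if that fails.

    Returns None when both the live site and the archive fail.
    """
    try:
        return get(url)
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        pass
    try:
        return get(wayback_url(url))
    except OSError:
        return None
```

Because the archive is contacted only after the live request has failed, pages that are still online generate no Wayback Machine traffic at all. The `get` parameter is there so the fetcher can be swapped out, for example with a stub when testing offline.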