bitextor / warc2text

Extracts plain text, language identification and more metadata from WARC records
MIT License

Shunt robots.txt responses to separate warc #41

Closed by jelmervdl 1 year ago

jelmervdl commented 1 year ago

Write robots.txt responses to a separate WARC, similar to how we treat PDFs.

Include 404s and other responses too. We want to know whether a request was made at all, since a request for robots.txt that comes back with nothing is interpreted as crawling being okay.
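To make the idea concrete, here is a rough Python sketch of the routing decision, not warc2text's actual implementation (which is C++). The function names `is_robots_txt` and `partition`, and the `(url, status)` tuple shape, are made up for illustration. The point is that the HTTP status is deliberately ignored when deciding whether a record goes to the robots.txt output.

```python
from urllib.parse import urlsplit


def is_robots_txt(url: str) -> bool:
    """Return True if the URL path is exactly /robots.txt (query string ignored)."""
    return urlsplit(url).path == "/robots.txt"


def partition(records):
    """Split (url, status) pairs into robots.txt responses and everything else.

    Hypothetical helper: the status code is carried along but never used to
    filter, so a 404 for robots.txt still lands in the robots output.
    """
    robots, rest = [], []
    for url, status in records:
        if is_robots_txt(url):
            robots.append((url, status))
        else:
            rest.append((url, status))
    return robots, rest


if __name__ == "__main__":
    sample = [
        ("https://example.com/robots.txt", 404),
        ("https://example.com/index.html", 200),
        ("https://example.org/robots.txt", 200),
    ]
    robots, rest = partition(sample)
    print("robots:", robots)
    print("rest:", rest)
```

Keeping the 404 in the robots output means downstream tooling can tell "robots.txt was requested and was not there" apart from "robots.txt was never requested".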