bitextor / warc2text

Extracts plain text, language identification and more metadata from WARC records
MIT License

Don't parse text/plain as HTML #31

Closed jelmervdl closed 3 years ago

jelmervdl commented 3 years ago

PDF processing generates these plain text documents. But I noticed that when I feed warc2text WARCs containing only text/plain documents, parsing fails and the documents are dropped.

This change does assume that there are generally no HTML documents (or anything else that goes through processHTML) served as text/plain; there's no heuristic for determining whether something is actually plain text or not.
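
To make the assumption concrete, here is a minimal sketch of the kind of Content-Type check this relies on. It is not the code from this change: `mediaType` and `isPlainText` are hypothetical names, and the sketch only assumes that the record exposes its HTTP Content-Type header value as a string.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not part of warc2text): returns the media type of a
// Content-Type header value, lowercased and without parameters such as
// "; charset=utf-8".
std::string mediaType(const std::string& contentType) {
    // Keep only the part before the first ';' (the whole string if none).
    std::string type = contentType.substr(0, contentType.find(';'));

    // Trim leading and trailing whitespace.
    auto notSpace = [](unsigned char c) { return !std::isspace(c); };
    type.erase(type.begin(), std::find_if(type.begin(), type.end(), notSpace));
    type.erase(std::find_if(type.rbegin(), type.rend(), notSpace).base(), type.end());

    // Media types are case-insensitive, so normalise to lowercase.
    std::transform(type.begin(), type.end(), type.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return type;
}

// Hypothetical predicate: true if the payload should skip HTML parsing and be
// used as text directly. It trusts the header only; there is no content sniffing.
bool isPlainText(const std::string& contentType) {
    return mediaType(contentType) == "text/plain";
}
```

A record for which `isPlainText` returns true would bypass the HTML path and have its payload used as the extracted text. An HTML page mislabelled as text/plain would then keep its markup, which is the trade-off described above.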

jelmervdl commented 3 years ago

I noticed that this implementation is not catching all text/plain records because of the zip detection. If the URL matches any of the extensions mentioned in https://github.com/bitextor/warc2text/blob/master/src/record.cc#L110 it will try to interpret the contents as a zip file, fail, and skip the record.
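
One possible way around this, sketched under assumptions rather than taken from the actual code, is to let the content type take precedence over the URL extension. `Handler`, `endsWith` and `chooseHandler` are hypothetical names, and the extension list is passed in as a parameter because the real list lives in record.cc and is not reproduced here.

```cpp
#include <string>
#include <vector>

// Hypothetical set of processing paths for a record.
enum class Handler { PlainText, Zip, Html };

// True if `s` ends with `suffix`.
bool endsWith(const std::string& s, const std::string& suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Decide how to process a record. `mediaType` is assumed to be already
// lowercased and stripped of parameters. The content type is checked first,
// so a text/plain record is never handed to the zip reader just because its
// URL happens to end in one of `zipExtensions`.
Handler chooseHandler(const std::string& mediaType,
                      const std::string& url,
                      const std::vector<std::string>& zipExtensions) {
    if (mediaType == "text/plain") return Handler::PlainText;
    for (const auto& ext : zipExtensions) {
        if (endsWith(url, ext)) return Handler::Zip;
    }
    return Handler::Html;
}
```

With this ordering, a text/plain record whose URL matches one of the listed extensions is still treated as plain text instead of being skipped after a failed zip read.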