jelmervdl closed 3 years ago
I noticed that this implementation is not catching all text/plain records because of the zip detection. If the URL matches any of the extensions mentioned in https://github.com/bitextor/warc2text/blob/master/src/record.cc#L110 it will try to interpret the contents as a zip file, fail, and skip the record.
PDF processing generates these plain text documents. But I noticed that when I feed in WARCs containing just text/plain documents, parsing fails and the documents are dropped.
This change does assume that there are generally no html documents (or anything else that goes through processHTML) served with text/plain; there's no heuristic for determining whether something is plain text or not.
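
To make the intended ordering concrete, here is a minimal sketch of the idea, not the actual code in record.cc. It assumes the fix amounts to checking the declared content type before the URL extension check. All names (Route, kZipLikeExtensions, mediaType, hasZipLikeExtension, chooseRoute) and the extension list are hypothetical stand-ins.

```cpp
#include <algorithm>
#include <array>
#include <cctype>
#include <string>
#include <string_view>

// Hypothetical routing outcome; these names are illustrative, not warc2text's.
enum class Route { PlainText, Zip, Html };

// Placeholder list standing in for the extensions checked in record.cc.
constexpr std::array<std::string_view, 3> kZipLikeExtensions = {".zip", ".odt", ".docx"};

// ASCII-lowercase a copy of the input.
std::string toLower(std::string_view s) {
    std::string out(s);
    std::transform(out.begin(), out.end(), out.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return out;
}

// Reduce "Text/Plain; charset=utf-8" to "text/plain".
std::string mediaType(std::string_view contentType) {
    std::string_view head = contentType.substr(0, contentType.find(';'));
    while (!head.empty() && std::isspace(static_cast<unsigned char>(head.front())))
        head.remove_prefix(1);
    while (!head.empty() && std::isspace(static_cast<unsigned char>(head.back())))
        head.remove_suffix(1);
    return toLower(head);
}

// True if the URL ends in one of the placeholder extensions (case-insensitive).
// Query strings and fragments are deliberately not handled in this sketch.
bool hasZipLikeExtension(std::string_view url) {
    const std::string lower = toLower(url);
    const std::string_view view(lower);
    return std::any_of(kZipLikeExtensions.begin(), kZipLikeExtensions.end(),
                       [&](std::string_view ext) {
                           return view.size() >= ext.size() &&
                                  view.substr(view.size() - ext.size()) == ext;
                       });
}

// Trust a declared text/plain first, so such records never reach zip parsing.
Route chooseRoute(std::string_view contentType, std::string_view url) {
    if (mediaType(contentType) == "text/plain") return Route::PlainText;
    if (hasZipLikeExtension(url)) return Route::Zip;
    return Route::Html;
}
```

With this ordering, a text/plain record whose URL ends in .docx is kept as plain text instead of being handed to the zip reader and dropped. The trade-off is the one noted above: anything mislabelled as text/plain is taken at face value.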