bitextor / warc2text

Extracts plain text, language identification and more metadata from WARC records
MIT License

Add the detected encoding to the metainformation #58

Open nvanva opened 6 months ago

nvanva commented 6 months ago

It would be nice to save the detected original encoding of each document in the output metadata. This could be useful in later pre-processing steps, e.g. for language identification (langid).
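To make the request concrete, here is a minimal sketch of what recording the encoding next to the extracted text could look like. It is not how warc2text works internally: the names `detect_encoding` and `to_record`, the `encoding` field, and the very simple detection strategy (BOM check, then strict UTF-8, then a fallback) are all assumptions for illustration. A real implementation would presumably reuse whatever detector the tool already runs.

```python
import codecs

# Byte-order marks checked before anything else (illustrative, not exhaustive;
# UTF-32 is deliberately ignored in this sketch).
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_encoding(raw: bytes, fallback: str = "latin-1") -> str:
    """Guess the encoding of raw bytes (hypothetical helper).

    Strategy: BOM first, then strict UTF-8, otherwise the fallback.
    """
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    try:
        raw.decode("utf-8")  # strict decoding raises on invalid UTF-8
        return "utf-8"
    except UnicodeDecodeError:
        return fallback


def to_record(url: str, raw: bytes) -> dict:
    """Build a per-document record that keeps the detected encoding.

    The field names are assumptions, not the tool's actual output schema.
    """
    encoding = detect_encoding(raw)
    text = raw.decode(encoding, errors="replace")
    return {"url": url, "encoding": encoding, "text": text}


if __name__ == "__main__":
    record = to_record("http://example.com/", "héllo".encode("latin-1"))
    print(record["encoding"], record["text"])
```

With the encoding stored per document, a downstream langid step could, for example, treat documents that fell back to a legacy encoding differently from clean UTF-8 ones.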