bitextor / warc2text

Extracts plain text, language identification and more metadata from WARC records
MIT License

Filter out documents that contain tags that could indicate machine translation #4

Closed zuny26 closed 3 years ago

zuny26 commented 3 years ago

Hi @jelmervdl, we are working on filtering functionality to throw out machine-translated documents. Do you have any WARCs we can use for testing?

jelmervdl commented 3 years ago

I hit an issue when going from tmx -> urls -> shards -> text -> warcs (looking at it… no wonder, with that many in-between steps…). I'm now going directly from urls -> text -> warcs, but searching through all the urls is a bit slow because they are split over so many files.
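A minimal sketch of one way to speed up that kind of URL search, under stated assumptions: the URLs taken from the tmx fit in memory as a set, and each URL file is gzip-compressed with one URL per line. The function name `find_url_lines` and the file layout are hypothetical and are not part of warc2text. Each file is read exactly once, and every line costs a single set lookup.

```python
import gzip
from pathlib import Path
from typing import Dict, Iterable, List, Set


def find_url_lines(wanted: Set[str], url_files: Iterable[Path]) -> Dict[Path, List[int]]:
    """Return, per file, the 0-based line numbers whose URL is in `wanted`.

    Assumes each file is gzip-compressed text with one URL per line
    (a hypothetical layout, used here only for illustration).
    """
    hits: Dict[Path, List[int]] = {}
    for path in url_files:
        # Stream the file line by line so large files are never fully loaded.
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
            matches = [i for i, line in enumerate(fh) if line.rstrip("\n") in wanted]
        # Only record files that contain at least one wanted URL.
        if matches:
            hits[path] = matches
    return hits
```

The returned line numbers can then be used to pull the matching entries from any other file that is line-aligned with the URL file, if such files exist in the layout being used.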

Here is a small warc with only Icelandic that made it into the tmx: is-partial.warc.gz. It's a very small subsection, but at least it's something to test with.

I also have a list of all warcs that contain some Icelandic that made it into the tmx. I'm going through those to extract the records that belong to the tmx urls. The attached warc is a tiny start of that output.

I can also give you some original warcs that contain at least some Icelandic somewhere in them, but they're 1GB each and there are at least 10k of them (maybe up to about 70k, I expect…). Let me know and I'll copy some over to Valhalla.

zuny26 commented 3 years ago

Thanks, the example you attached will do for now!