Filter out documents that contain tags that could indicate machine translation

zuny26 commented 3 years ago

Each filter has the following format: tag <tab> attribute <tab> value <tab> value ... For example, the filter meta name translation-stats will filter out documents that contain <meta name="translation-stats" ... >. The filter checks that the value of the attribute contains the specified string, not that the two are equal. So the filter meta name translation would also eliminate a document with <meta name="translation-stats" ...>.
Documents that contain filtered tags are not written to the output, and a message with the URL of the filtered record is shown on the console.
Use --tag-filter to pass the filters file.
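
The matching rule above can be sketched in Python for illustration. This is not the code in this PR: the names parse_filters, TagFilter and is_filtered are hypothetical, and the sketch assumes one filter per line, case-sensitive value matching, and that blank or incomplete lines are skipped.

```python
from html.parser import HTMLParser


def parse_filters(text):
    """Parse filters of the form: tag <tab> attribute <tab> value [<tab> value ...].

    Assumption: one filter per line; lines with fewer than three fields are skipped.
    """
    filters = {}
    for line in text.splitlines():
        fields = line.split("\t")
        values = [v for v in fields[2:] if v]  # drop empty values from stray tabs
        if len(fields) < 3 or not values:
            continue
        key = (fields[0].lower(), fields[1].lower())
        filters.setdefault(key, []).extend(values)
    return filters


class TagFilter(HTMLParser):
    """Sets `matched` when a start tag has an attribute containing a filter value."""

    def __init__(self, filters):
        super().__init__()
        self.filters = filters
        self.matched = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        for name, value in attrs:
            needles = self.filters.get((tag, name), [])
            # Substring test, not equality: "translation" matches "translation-stats".
            if value is not None and any(n in value for n in needles):
                self.matched = True


def is_filtered(html, filters):
    """Return True if the document should be dropped under the given filters."""
    parser = TagFilter(filters)
    parser.feed(html)
    return parser.matched
```

With this sketch, both the filter meta name translation-stats and the shorter meta name translation flag a page containing <meta name="translation-stats" ...>, while a page whose only meta tag is <meta name="description" ...> passes through.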

zuny26 commented 3 years ago

Hi @jelmervdl, we are working on filtering functionality to throw out machine-translated documents. Do you have any WARCs we can use for testing?

jelmervdl commented 3 years ago

I hit an issue when going from tmx -> urls -> shards -> text -> warcs (looking at it… no wonder with that many in-between steps…). I'm just directly going from urls -> text -> warcs now but it's a bit slow searching through all urls split over so many files.

Here is a small warc with only Icelandic that made it into the tmx: is-partial.warc.gz. It's a very small subsection, but at least it's something to test with.

I also have a list of all warcs that contain some Icelandic that made it into the tmx. I'm going through those to extract the records that belong to the tmx urls. The attached warc is a tiny start of that output.

I can also give you some original warcs that contain at least some Icelandic somewhere in them, but they're 1GB each and there are at least 10k of them (maybe up to about 70k I expect…). But let me know and I'll copy some over to Valhalla.

zuny26 commented 3 years ago

Thanks, the example you attached will do for now!

bitextor / warc2text

Filter out documents that contain tags that could indicate machine translation #4