Closed zuny26 closed 3 years ago
Hi @jelmervdl, we are working on filtering functionality to throw out machine-translated documents. Do you have any WARCs we can use for testing?
I hit an issue when going from tmx -> urls -> shards -> text -> warcs (looking at it… no wonder with that many in-between steps…). I'm just directly going from urls -> text -> warcs now but it's a bit slow searching through all urls split over so many files.
Here is a small warc with only Icelandic that made it into the tmx: is-partial.warc.gz. It's a very small subsection, but at least something to test with.
I also have a list of all warcs that contain some Icelandic that made it into the tmx. I'm going through those to extract the records that belong to the tmx urls. The attached warc is a tiny start of that output.
I can also give you some original warcs that contain at least some Icelandic somewhere in them, but they're 1GB each and there are at least 10k of them (maybe up to about 70k, I expect…). But let me know and I'll copy some over to Valhalla.
Thanks, the example you attached will do for now!
tag <tab> attribute <tab> value <tab> value ...
For example, `meta name translation-stats` (tab-separated) will filter out documents that contain `<meta name="translation-stats" ... >`. The filter functionality checks that the value of the `name` attribute contains the specified string, not that they are equal. So, the filter `meta name translation` would also eliminate a document with `<meta name="translation-stats" ...>`.

Use `--tag-filter` to pass the filters file.
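
To make the matching rule concrete, here is a rough Python sketch of the semantics described above. It is not the actual implementation; the names `parse_filters`, `TagFilterMatcher` and `should_filter` are made up for illustration, and it assumes tag and attribute names are compared case-insensitively while values are matched as case-sensitive substrings.

```python
from html.parser import HTMLParser


def parse_filters(lines):
    """Parse 'tag<TAB>attribute<TAB>value<TAB>value...' lines (hypothetical helper)."""
    filters = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        tag, attr, *values = fields
        filters.append((tag.lower(), attr.lower(), values))
    return filters


class TagFilterMatcher(HTMLParser):
    """Sets `matched` when any start tag satisfies one of the filters."""

    def __init__(self, filters):
        super().__init__()
        self.filters = filters
        self.matched = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        for name, value in attrs:
            if value is None:
                continue
            for f_tag, f_attr, f_values in self.filters:
                # Substring check: the attribute value only has to contain
                # one of the filter values, not equal it.
                if tag == f_tag and name == f_attr and any(v in value for v in f_values):
                    self.matched = True


def should_filter(html, filters):
    """Return True if the document should be thrown out."""
    matcher = TagFilterMatcher(filters)
    matcher.feed(html)
    matcher.close()
    return matcher.matched
```

With the filter line `meta<TAB>name<TAB>translation`, a document containing `<meta name="translation-stats" ...>` is filtered out, while one that only has `<meta name="description" ...>` is kept.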