WET extractor: add identified natural language of text content

commoncrawl / ia-web-commons

Web archiving utility library

Apache License 2.0

WET extractor: add identified natural language of text content #22

Closed sebastian-nagel closed 4 years ago

sebastian-nagel commented 4 years ago

(see discussion in Common Crawl group)

The WARC metadata records "attached" to the response records contain the language of a web page as identified by CLD2 (since August 2018). The content language(s) could be exposed as WARC-Identified-Content-Language in the WET headers, as a simple comma-separated list of ISO-639-3 language codes, the same as those given in the URL indexes.

sebastian-nagel commented 4 years ago

Implemented for CC-MAIN-2020-24 - the WET files now contain the detected language(s) in the WARC header "WARC-Identified-Content-Language":

...
WARC-Identified-Content-Language: fra
Content-Type: text/plain
Content-Length: 6110

Propositions et discussions
...

...
WARC-Identified-Content-Language: isl,eng
Content-Type: text/plain
Content-Length: 10494

Bananabrauð með Nutella – Ljúfmeti og lekkerheit
...
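The header shown above can be read with a few lines of Python. This is only a minimal sketch: it assumes the header block of a single WET record is already available as a string, and the function name `parse_identified_languages` is made up for illustration. It is not part of ia-web-commons.

```python
def parse_identified_languages(header_block):
    """Return the language codes listed in WARC-Identified-Content-Language.

    `header_block` is assumed to be the header lines of one WET record,
    one "Name: value" pair per line. Returns an empty list if the header
    is absent.
    """
    for line in header_block.splitlines():
        name, sep, value = line.partition(":")
        # Header names are compared case-insensitively.
        if sep and name.strip().lower() == "warc-identified-content-language":
            # The value is a comma-separated list of ISO-639-3 codes.
            return [code.strip() for code in value.split(",") if code.strip()]
    return []


example = """WARC-Identified-Content-Language: isl,eng
Content-Type: text/plain
Content-Length: 10494"""

print(parse_identified_languages(example))  # prints ['isl', 'eng']
```

Records without the header yield an empty list, so callers can treat "no language identified" and "header not present" the same way if that suits their use case.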

kargaranamir commented 2 months ago

Hmm, some of the pages also have metadata that shows what language they are written in. I wonder why Common Crawl did not consider including that as well?

sebastian-nagel commented 2 months ago

Hi @kargaranamir, metadata from the HTML or the HTTP header, as well as the top-level domain, are passed as additional hints to CLD2 to improve the accuracy of the identifier. The metadata might be wrong, or there may even be contradictory metadata in the HTML and the HTTP header.
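To illustrate how HTML and HTTP metadata can disagree, here is a rough sketch using only the Python standard library. It is not how Common Crawl passes hints to CLD2; the names `LangHintParser`, `collect_hints` and `hints_contradict` are hypothetical, and treating any two different primary language subtags as a "contradiction" is a simplifying assumption (a genuinely multilingual page would also trigger it).

```python
from html.parser import HTMLParser


class LangHintParser(HTMLParser):
    """Collect language declarations from <html lang=...> and
    <meta http-equiv="content-language" content=...>."""

    def __init__(self):
        super().__init__()
        self.hints = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attribute names arrive lowercased
        if tag == "html" and attrs.get("lang"):
            self.hints.append(attrs["lang"])
        elif tag == "meta" and (attrs.get("http-equiv") or "").lower() == "content-language":
            if attrs.get("content"):
                self.hints.append(attrs["content"])


def primary_subtag(tag):
    """Reduce a tag such as 'en-US' or 'pt_BR' to its primary subtag ('en', 'pt')."""
    return tag.strip().lower().replace("_", "-").split("-")[0]


def collect_hints(html, http_content_language=None):
    """Return the set of primary language subtags declared in the HTML
    and, if given, in the HTTP Content-Language header value."""
    parser = LangHintParser()
    parser.feed(html)
    declared = list(parser.hints)
    if http_content_language:
        declared.append(http_content_language)
    codes = set()
    for value in declared:
        for tag in value.split(","):  # values may list several languages
            if tag.strip():
                codes.add(primary_subtag(tag))
    return codes


def hints_contradict(html, http_content_language=None):
    """True if the declarations name more than one distinct language."""
    return len(collect_hints(html, http_content_language)) > 1


page = '<html lang="en-US"><head><meta http-equiv="Content-Language" content="de"></head></html>'
print(hints_contradict(page, "fr"))  # prints True
```

A check like this only reports that the declarations differ; it cannot say which one, if any, matches the actual text.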

kargaranamir commented 2 months ago

@sebastian-nagel Thanks for your answer. True, but CLD2 could also be wrong, as it only supports 160 languages. I did not know that other signals are also used as hints. If that's the case, then it's better. I was proposing this because the languages I am interested in are not among those 160 languages.

sebastian-nagel commented 2 months ago

Hi @kargaranamir, we are working on an improved language identifier, but it's too early to predict when it will be ready to land or in which format we would release the annotations. I'd suggest continuing the discussion either on our Google group or on Discord; you'll find the links on the Common Crawl homepage. The question of how to use HTML/HTTP metadata is a challenging one: it may lead to better results, especially for low-resource languages or when distinguishing between closely related languages. However, metadata is notoriously noisy and often simply wrong. If you have any pointers to "real-world" metadata for low-resource languages, please share them; it's highly appreciated. Thanks!

kargaranamir commented 2 months ago

Thanks @sebastian-nagel for your answer. I just joined the Discord. I will continue the discussion there.