Open isspek opened 4 years ago
For the first point, we could use something like the solution proposed here https://stackoverflow.com/questions/45108293/find-country-from-full-domain-name to geolocate the site's IP address, because the TLD suffix alone isn't enough: .com is not specific to any region, and not all websites have a URL path indicating the region/country/language. But it wouldn't work for websites hosted in other countries (we can never know). In my opinion the best option remains content-based (full text / title).
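To make the TLD / IP idea concrete, here is a rough sketch, not a tested part of the repo. The names `cctld_hint` and `guess_country` are hypothetical, and `geolocate` stands in for whatever IP-to-country lookup we would plug in (no particular service is assumed). The two-letter suffix is only a hint (e.g. `.uk` is not an ISO country code), and the IP fallback reports where the site is hosted, which is the limitation mentioned above.

```python
import socket
from typing import Callable, Optional
from urllib.parse import urlparse


def cctld_hint(url: str) -> Optional[str]:
    """Return the two-letter TLD suffix in upper case, or None for generic TLDs."""
    host = urlparse(url).hostname or ""
    suffix = host.rsplit(".", 1)[-1].lower()
    # Country-code TLDs are two letters; .com, .org, raw IPs etc. fall through.
    if len(suffix) == 2 and suffix.isalpha():
        return suffix.upper()
    return None


def guess_country(
    url: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
    geolocate: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Try the TLD first, then fall back to geolocating the host's IP.

    `geolocate` is a hypothetical callable (IP string -> country code);
    without it, generic-TLD sites simply yield None.
    """
    hint = cctld_hint(url)
    if hint is not None:
        return hint
    host = urlparse(url).hostname
    if geolocate is None or not host:
        return None
    try:
        ip = resolve(host)
    except OSError:  # DNS failure (socket.gaierror is a subclass of OSError)
        return None
    return geolocate(ip)
```

Passing `resolve` and `geolocate` as arguments keeps the sketch testable offline and leaves the choice of geolocation source open.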
At the moment the experiments/extract_articles_from_urls.py script uses Goose3, which may fail on the Turkish / German samples because of the <article> element.
Let's wait for the ESI API, which lets us send URLs and get the analysis back (announced in Madrid): it has fields for the full text and for the language, and it also does automatic translation into English. They surely have more resources than Goose.
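In the meantime, if Goose3 returns empty text for some samples, a stopgap could be a plain fallback that does not depend on an `<article>` element at all. The sketch below uses only the standard library; `extract_title_and_text` is a hypothetical helper, not what Goose3 does internally, and it assumes the body text sits in `<p>` tags, which will not hold for every site.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class TitleAndParagraphs(HTMLParser):
    """Collect the <title> text and the text of every <p> element."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.paragraphs: List[str] = []
        self._in_title = False
        self._in_p = False
        self._buf: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self.flush()  # an unclosed <p> ends where the next one starts
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self.flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self._buf.append(data)

    def flush(self) -> None:
        # Join the buffered chunks and normalise whitespace.
        text = " ".join("".join(self._buf).split())
        if text:
            self.paragraphs.append(text)
        self._buf = []


def extract_title_and_text(html: str) -> Tuple[str, str]:
    """Return (title, full text), one paragraph per line."""
    parser = TitleAndParagraphs()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.title.strip(), "\n".join(parser.paragraphs)
```

The title and text obtained this way could then feed the content-based approach, at least until the ESI API is available.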