MartinoMensio / misinfome


Some errors and suggestions #2

Open isspek opened 4 years ago

isspek commented 4 years ago

Error samples

MartinoMensio commented 4 years ago

For the first point, we could use something like the solution proposed here https://stackoverflow.com/questions/45108293/find-country-from-full-domain-name to geolocate the IP address that the domain resolves to. The TLD suffix alone isn't enough: .com, for example, is not specific to any region, and not all websites have a URL path indicating the region/country/language. IP geolocation wouldn't work either for websites hosted in a different country from the one they target, and we can never know when that is the case. In my opinion the best option remains a content-based one (full text / title).
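To make the limitation of the TLD approach concrete, here is a minimal sketch using only the Python standard library. The function names and the tiny `CCTLD_TO_COUNTRY` table are hypothetical and exist only for illustration; they are not part of this repository. The last helper resolves the hostname to an IP address, which would be the input to a geolocation lookup (that lookup needs an external GeoIP database or service and is not shown).

```python
import socket
from typing import Optional
from urllib.parse import urlparse

# Hypothetical, deliberately tiny table: a real one would cover all ccTLDs.
CCTLD_TO_COUNTRY = {"tr": "TR", "de": "DE", "it": "IT", "es": "ES", "uk": "GB"}


def hostname_of(url: str) -> Optional[str]:
    """Return the lowercased hostname of a URL, or None if there is none."""
    # urlparse only fills in the host when a scheme or leading '//' is present.
    parsed = urlparse(url if "://" in url else "//" + url)
    return parsed.hostname


def country_hint_from_tld(url: str) -> Optional[str]:
    """Guess a country from the last domain label; None for generic TLDs."""
    host = hostname_of(url)
    if not host or "." not in host:
        return None
    suffix = host.rsplit(".", 1)[-1]
    return CCTLD_TO_COUNTRY.get(suffix)


def resolve_ip(url: str) -> Optional[str]:
    """Resolve the URL's hostname to an IPv4 address (needs network access)."""
    host = hostname_of(url)
    if not host:
        return None
    try:
        return socket.gethostbyname(host)
    except OSError:  # covers DNS failures (socket.gaierror)
        return None


if __name__ == "__main__":
    for sample in ("https://www.example.com.tr/haber", "https://example.com/news"):
        print(sample, "->", country_hint_from_tld(sample))
```

A `.com` URL gives no hint at all, and even a resolved IP only says where the server is hosted, which is why a content-based check still looks like the more reliable signal.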

At the moment the experiments/extract_articles_from_urls.py script uses Goose3, which may fail on the Turkish / German samples because of:

Let's wait for the ESI API, which lets us send URLs and get back the analysis (announced in Madrid): it has fields for the full text and for the language, and it also does automatic translation into English. They surely have more resources than Goose.