Data collection for different languages

deepset-ai / COVID-QA

API & Webapp to answer questions about COVID-19. Using NLP (Question Answering) and trusted data sources.

Apache License 2.0

343 stars, 119 forks

Data collection for different languages #2

Open andra-pumnea opened 4 years ago

andra-pumnea commented 4 years ago

Find official data sources for FAQ about COVID-19 in different languages and scrape them.

borhenryk commented 4 years ago

I already have a script for the RKI FAQ. I will share it later!

bogdankostic commented 4 years ago

If someone needs a starting point, I already wrote scrapers for WHO and some pages of CDC: https://github.com/deepset-ai/COVID-QA/tree/master/data/scrapers

andra-pumnea commented 4 years ago

I will do some scraping for Romanian

stedomedo commented 4 years ago

I'll add Italian

tkh42 commented 4 years ago

I will look into some more German pages.

borhenryk commented 4 years ago

@tkh42 Let me know which ones, so we don't duplicate work. This one would probably make sense: https://www.infektionsschutz.de/coronavirus/faqs-coronaviruscovid-19.html

tkh42 commented 4 years ago

@HenrykBorzymowski OK. Yes, I have thought about doing that one too. I think I will start with https://www.bmas.de/DE/Presse/Meldungen/2020/corona-virus-arbeitsrechtliche-auswirkungen.html

Timoeller commented 4 years ago

Perfect, people, this is taking off rather quickly :D I can invite you to our Slack crawler group if you tell me your wirvsvirus Slack names.

I would also suggest that you create small issues stating which website you want to work on, so we do not duplicate work or write the same crawler twice. State the website in the title so GitHub can find related issues very easily! Thanks

borhenryk commented 4 years ago

Here is a Google spreadsheet in which we can track which pages we already have a scraper for, etc. Please fill it in and change it if necessary: https://docs.google.com/spreadsheets/d/1er-7sDvgMZ484FRhPL7X6rl1fgRIRtA7fJfj-gLp3jg/edit?usp=sharing

Timoeller commented 4 years ago

@tkh42 Can I somehow help or motivate you to create scrapers for German sites? :D

We have already started the labeling process and need more questions!

tkh42 commented 4 years ago

@Timoeller I am finished with the BMAS one. I will create the pull request and continue with the next one. :)

stedomedo commented 4 years ago

One way to "easily" get multilingual data is to machine-translate it: pip install googletrans, then use Translator(service_urls=["translate.google.com/gen204"]). This uses older Google Translate versions with worse quality than production, but it's free. The lower-quality translations would only be used in the background, though, not shown to the user.

A workflow like this could then work for the user: type a query in Spanish -> QA system detects the Spanish query -> QA system matches it against original Spanish and/or translated-from-English questions/answers -> QA system shows the answers in their original language, with an option to web-translate them with Google

This would be easier than real-time translation and/or getting sufficient data in many languages.
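
To make the workflow above concrete, here is a rough sketch using only the Python standard library. Everything in it is an assumption for illustration: FaqEntry, detect_language, best_match, the toy stopword lists, and the placeholder entries are hypothetical and not part of the COVID-QA code base. The stopword counting stands in for a real language detector, and the word-overlap score stands in for the actual QA matching.

```python
import re
from dataclasses import dataclass


@dataclass
class FaqEntry:
    lang: str         # language of `question` (ISO 639-1 code)
    question: str     # original or offline machine-translated question
    answer: str       # answer text, kept in the source's original language
    answer_lang: str  # language the answer was scraped in
    translated: bool  # True if `question` was machine-translated


# Toy stopword lists (assumption); a real system would use a proper detector.
STOPWORDS = {
    "en": {"the", "is", "what", "how", "are", "of"},
    "es": {"el", "la", "es", "que", "como", "los", "de"},
    "de": {"der", "die", "das", "ist", "wie", "was"},
}


def tokens(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def detect_language(query, default="en"):
    """Pick the language whose stopwords overlap most with the query."""
    words = tokens(query)
    scores = {lang: len(words & sw) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


def best_match(query, entries):
    """Return the entry in the query's language with the highest word overlap."""
    lang = detect_language(query)
    q = tokens(query)

    def score(entry):
        t = tokens(entry.question)
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    candidates = [e for e in entries if e.lang == lang]
    return max(candidates, key=score, default=None)


# Placeholder entries; the answer strings are stand-ins, not real FAQ content.
EXAMPLE_ENTRIES = [
    FaqEntry("en", "What are the symptoms of the coronavirus?",
             "(answer scraped from an English source)", "en", False),
    FaqEntry("es", "¿Cuáles son los síntomas del coronavirus?",
             "(answer scraped from an English source)", "en", True),
    FaqEntry("es", "¿Cómo se transmite el coronavirus?",
             "(answer scraped from a Spanish source)", "es", False),
]

match = best_match("¿Cuáles son los síntomas?", EXAMPLE_ENTRIES)
```

Here the Spanish query is matched against a question that was translated from English offline, and the entry still carries the answer in its original language (answer_lang), so the UI could show that answer and offer a web-translation link next to it.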

stedomedo commented 4 years ago

Multilingual resources can also easily be found using Linguee, by checking the sources of the sentences it finds for a language pair, e.g. for DE: https://www.linguee.com/english-german/search?source=auto&query=coronavirus