CdC-SI / eak-copilot

The official repository of the EAK-Copilot project as part of the Innovation Fellowship 2024.
https://cdc-si.github.io/eak-copilot/
GNU General Public License v3.0

add scraping of pdf/html in specific language #255

Open K-Schubert opened 1 week ago

K-Schubert commented 1 week ago

Description

HTML

Scraping HTML from https://eak.admin.ch and https://zas.admin.ch can already be done in all 3 official languages (de, fr, it) by specifying the sitemap for the appropriate language in the Swagger API interface (e.g. sequentially execute https://eak.admin.ch/eak/de/home.sitemap.xml, then https://eak.admin.ch/eak/fr/home.sitemap.xml).

Need to add functionality to scrape either all languages at once or a single language specified by its language code (de, fr, it).
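A minimal sketch of what the language selection could look like for the sitemap case, assuming the URL pattern `<base>/<site>/<lang>/home.sitemap.xml` seen in the eak.admin.ch examples above. The names `resolve_languages` and `sitemap_urls` are hypothetical and not taken from the repository, and whether zas.admin.ch follows the same path pattern is an assumption that would need checking.

```python
from typing import List, Optional

# The three official languages mentioned in this issue.
SUPPORTED_LANGUAGES = ("de", "fr", "it")


def resolve_languages(language: Optional[str] = None) -> List[str]:
    """Return [language] if one is given, otherwise all supported languages."""
    if language is None:
        return list(SUPPORTED_LANGUAGES)
    code = language.strip().lower()
    if code not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"Unsupported language {language!r}; expected one of {SUPPORTED_LANGUAGES}"
        )
    return [code]


def sitemap_urls(base_url: str, site: str, language: Optional[str] = None) -> List[str]:
    """Build one sitemap URL per selected language.

    Assumes the pattern <base_url>/<site>/<lang>/home.sitemap.xml (hypothetical
    generalisation of the eak.admin.ch URLs quoted in the issue).
    """
    base = base_url.rstrip("/")
    return [
        f"{base}/{site}/{lang}/home.sitemap.xml"
        for lang in resolve_languages(language)
    ]
```

Calling `sitemap_urls("https://eak.admin.ch", "eak")` would yield the de, fr and it sitemaps in that order, while passing `language="fr"` restricts the result to the French sitemap. An endpoint could then loop over the returned URLs instead of requiring one manual call per language.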

PDFs

For memento PDFs from https://www.ahv-iv.ch/, we currently scrape all PDFs in all 3 official languages.

Need to add functionality to scrape either all languages at once or a single language specified by its language code (de, fr, it).
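For the PDF case, one possible approach is to tag each discovered memento link with its language and filter before downloading. The sketch below assumes the scraper can already determine the language of each link; `PdfLink` and `filter_by_language` are hypothetical names, and how the language is detected on ahv-iv.ch is deliberately left out because it is not described in this issue.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

# The three official languages mentioned in this issue.
SUPPORTED_LANGUAGES = ("de", "fr", "it")


@dataclass(frozen=True)
class PdfLink:
    """A discovered PDF link together with its (already detected) language."""
    url: str
    language: str


def filter_by_language(
    links: Sequence[PdfLink], language: Optional[str] = None
) -> List[PdfLink]:
    """Keep all links when language is None, else only the requested language."""
    if language is None:
        return list(links)
    code = language.strip().lower()
    if code not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"Unsupported language {language!r}; expected one of {SUPPORTED_LANGUAGES}"
        )
    return [link for link in links if link.language == code]
```

Keeping `language=None` as "all languages" preserves the current behaviour as the default, so existing callers would not need to change.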