FHNW-IVGI / Geoharvester

NDGI Project Geoharvester

[NLP] Language detection, Keyword translation #33

Closed FStriewski closed 2 weeks ago

FStriewski commented 1 year ago

tbd

MinaKarimi commented 1 year ago

Several reputable translation services are available:

Google Translate: A widely used online translation service that covers many languages. Translations are quick and accessible, but may not always capture the full context or nuances of the text.

DeepL Translator: Another popular online translation service, recognized for accurate and natural-sounding translations. It is based on neural networks.

Microsoft Translator: A translation platform that offers language translation APIs as well as a web-based translation tool. It supports a wide range of languages.

MinaKarimi commented 1 year ago

Libraries for Google Translate:

translators:
Pros: Supports multiple translation services, including Google Translate, Microsoft Translator, Yandex.Translate, and more. Provides a unified interface for accessing them. Can be used with or without API keys for some services. Lets you choose a specific translation service per request.
Cons: Uses web scraping to access the translation services, which may be less reliable and may break when a service changes or becomes unavailable. Limited documentation and community support compared to more established libraries.

googletrans:
Pros: Specifically designed for Google Translate, which is a widely used translation service. Supports a wide range of languages. Works reliably as long as usage stays within the limits of the service. More established and widely adopted, with active community support.
Cons: Requires an internet connection. It is an unofficial client for the Google Translate web service rather than the official Cloud Translation API, so it is bound by the capabilities and restrictions of that service, may hit usage limits, and may break when Google changes the endpoint.

MinaKarimi commented 1 year ago

Libraries for DeepL:

deep_translator:
Pros: Provides a Pythonic interface to the DeepL API, making it easy to use and integrate into Python projects. Supports translation between the languages offered by DeepL. Allows customization of translation options, such as specifying the target language and adjusting text readability.
Cons: The library may not have extensive community support or a large user base compared to more established libraries. The availability of updates and maintenance may vary.

deepl-python:
Pros: Offers a Python wrapper for the DeepL API, providing convenient access to translation functionality. Allows translation of text and language detection using DeepL. Can retrieve usage information, such as character and quota usage.
Cons: The library may have limited documentation or examples available. It might not be actively maintained or have a large community following.
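To make the deepl-python points above more concrete (text translation, detected source language, usage information), here is a minimal sketch. The helper name `translate_and_report` is hypothetical. It accepts any object that behaves like `deepl.Translator`, on the assumption that `translate_text` and `get_usage` work as documented for that library, so API key handling is left out.

```python
def translate_and_report(translator, text, target_lang="EN-GB"):
    """Translate `text` and return the translation plus some metadata.

    `translator` is assumed to behave like `deepl.Translator` from the
    deepl-python package (it must offer `translate_text` and `get_usage`).
    """
    result = translator.translate_text(text, target_lang=target_lang)
    usage = translator.get_usage()
    report = {
        "text": result.text,
        "detected_source_lang": result.detected_source_lang,
        "limit_reached": usage.any_limit_reached,
    }
    # Character counts are only added when the account reports them.
    if usage.character.valid:
        report["characters_used"] = usage.character.count
        report["characters_limit"] = usage.character.limit
    return report
```

With the real library this would be called roughly as `translate_and_report(deepl.Translator(auth_key), "Diese Karte zeigt ...")`, where `auth_key` is a DeepL API key.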

MinaKarimi commented 1 year ago

Consider the first row of our data. Its translations to English are shown below:

Original text (German): Diese Karte zeigt die Werkleitungen der Abwasserentsorgung an. Es wird die Strassenentwässerung und Liegenschaftsentwässerung (teilweise mit Hausanschluss) sowie das Kanalisationsnetz dargestellt. Darin ersichtlich sind unter anderem Leitungen mit Schmutzabwasser, Mischabwasser, Regenabwasser, Meteorwasser sowie Versickerungsanlagen, Kontrollschächte, Einlaufschächte und Ölabscheider. WMS Service Geoportal - Kanton Appenzell Innerrhoden

DeepL: This map shows the wastewater disposal pipelines. It shows the street drainage and property drainage (partly with house connection) as well as the sewerage network. Among other things, pipes with wastewater, combined wastewater, rainwater, meteoric water, as well as infiltration facilities, inspection chambers, inlet chambers and oil separators are shown. WMS Service Geoportal - Canton Appenzell Innerrhoden. Translated with www.DeepL.com/Translator (free version)

Google Translate: This map shows the works lines of the sewage disposal. The street drainage and property drainage (partially with house connection) as well as the sewage system are shown. It shows, among other things, lines with dirty sewage, mixed sewage, rain sewage, meteoric water as well as infiltration systems, inspection shafts, inlet shafts and oil separators. WMS Service Geoportal - Canton Appenzell Innerrhoden

I suggest DeepL because of its higher accuracy. DeepL limits the number of requests, so if we expect to exceed that limit, we can switch to Google Translate.
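The "DeepL first, Google Translate if the limit is exceeded" idea could be sketched as an ordered list of backends. Everything named here (`QuotaExceeded`, `translate_with_fallback`, the backend callables) is hypothetical and not taken from the Geoharvester code. Each backend is assumed to be a plain function that either returns a translation or raises `QuotaExceeded`.

```python
class QuotaExceeded(Exception):
    """Raised by a backend when its request or character limit is hit."""


def translate_with_fallback(text, backends):
    """Try each (name, translate_fn) pair in order.

    Returns (name, translation) from the first backend that succeeds and
    raises RuntimeError if every backend reports an exhausted quota.
    """
    last_error = None
    for name, translate_fn in backends:
        try:
            return name, translate_fn(text)
        except QuotaExceeded as err:
            last_error = err  # this backend is exhausted, try the next one
    raise RuntimeError("all translation backends are exhausted") from last_error
```

A DeepL wrapper built on deepl-python could, for example, catch `deepl.QuotaExceededException` and re-raise it as `QuotaExceeded`, so that the Google Translate backend is only used once the DeepL quota is used up.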

FStriewski commented 1 year ago

If we use API keys / limits from external services, then we probably have a dependency on https://github.com/FHNW-IVGI/Geoharvester/issues/39, so that translations are only run on new (or updated) datasets and not on the full list of 23k datasets.
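One possible way to translate only new or updated datasets is to store a fingerprint of each source text next to its translation and to skip entries whose fingerprint has not changed. The sketch below is an assumption about how this could look, not the project's actual pipeline. The names `fingerprint`, `translate_changed`, and the cache layout are made up for illustration.

```python
import hashlib


def fingerprint(text):
    """Stable hash of the source text, used to detect changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def translate_changed(datasets, cache, translate_fn):
    """Translate only datasets whose text is new or has changed.

    datasets:     dict mapping dataset id -> source text
    cache:        dict mapping dataset id -> {"hash": ..., "translation": ...},
                  updated in place
    translate_fn: callable taking a text and returning its translation
    Returns the ids that were actually sent for translation.
    """
    translated_ids = []
    for dataset_id, text in datasets.items():
        digest = fingerprint(text)
        entry = cache.get(dataset_id)
        if entry is not None and entry["hash"] == digest:
            continue  # unchanged since the last run, keep the cached translation
        cache[dataset_id] = {"hash": digest, "translation": translate_fn(text)}
        translated_ids.append(dataset_id)
    return translated_ids
```

If the cache is persisted between runs (for example as a JSON file), a full harvest of all datasets would only send the changed entries to the translation service.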

davidoesch commented 1 year ago

The Swiss government has a DeepL account, so there must be API access. If you use it sparingly (only translating new entries), I'm sure this is the best solution.

FStriewski commented 1 month ago

This is implemented in our preprocessing pipeline, but it causes ongoing issues with GitHub and with processing times. We will refine this feature or move back to manual processing.

FStriewski commented 2 weeks ago

This is implemented (translation to all 4 languages). I will create a follow-up issue if there is a need for refinement.