adamreichold / umwelt-info

umwelt.info metadata index
https://umwelt.info
GNU Affero General Public License v3.0

Map all geographic locations to a well-known catalog #78

Open · jakobdeller opened this issue 2 years ago

jakobdeller commented 2 years ago

The Wasser-DE portal uses its own list of REGION_IDs, which is not particularly helpful in the long run. I would like to map the names given in REGION_NAMES to an established catalog such as www.geonames.org, as suggested by DCAT-AP (https://www.dcat-ap.de/def/dcatde/2.0/implRules/#angaben-zur-geografischen-abdeckung).

adamreichold commented 2 years ago

Can geonames.org be downloaded as a whole, or should we aim for an integration with a cache and HTTP requests to the online service, like we did for SNS in #70?

Would we perform the resolution during harvesting or when the dataset is actually requested? I think doing it during harvesting might be necessary to make the results indexable, but it might also be less efficient because we would need to resolve all names, not just those actually requested. I guess it depends on how effective the caching will be...

adamreichold commented 2 years ago

It seems like there is a web service available, but one needs to register for a user account. However, there are also data dumps available, with a worldwide file weighing in at less than 400 MB.

In this case, the web service would mainly help us to take care of the full text search over the various properties. But insofar as we are building a search engine, we should be able to handle this ourselves and have a fully offline solution.

I will look into preprocessing the data dump into a format which allows us to efficiently map GeoNames IDs to region names and the other way around. I suspect that this will take the form of a Tantivy-based index prepared by a separate program...
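
For illustration, a minimal sketch of what such a preprocessing program could look like, assuming the tab-separated allCountries.txt dump from download.geonames.org and a recent Tantivy version; the schema field names, the memory budget, and the geonames-index output directory are placeholders, not decisions made in this issue:

```rust
use std::{
    fs::{self, File},
    io::{BufRead, BufReader},
};

use tantivy::{
    doc,
    schema::{Schema, INDEXED, STORED, TEXT},
    Index,
};

fn main() -> tantivy::Result<()> {
    let mut builder = Schema::builder();
    // The GeoNames ID lets us map a search hit back to a stable identifier.
    let id = builder.add_u64_field("geonameid", INDEXED | STORED);
    // Primary and alternate names are indexed for full text search.
    let name = builder.add_text_field("name", TEXT | STORED);
    let alternate_names = builder.add_text_field("alternate_names", TEXT);
    let schema = builder.build();

    fs::create_dir_all("geonames-index")?;
    let index = Index::create_in_dir("geonames-index", schema)?;
    let mut writer = index.writer(128 << 20)?;

    for line in BufReader::new(File::open("allCountries.txt")?).lines() {
        let line = line?;
        // The dump is tab-separated: geonameid, name, asciiname, alternatenames, ...
        let mut columns = line.split('\t');
        let geonameid: u64 = columns.next().unwrap().parse().expect("numeric GeoNames ID");
        let primary_name = columns.next().unwrap();
        let _ascii_name = columns.next().unwrap();
        let alternates = columns.next().unwrap_or_default();

        writer.add_document(doc!(
            id => geonameid,
            name => primary_name,
            alternate_names => alternates.replace(',', " "),
        ))?;
    }

    writer.commit()?;
    Ok(())
}
```

Mapping back from a GeoNames ID to a name could then be served by the same index via the stored geonameid and name fields.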

jakobdeller commented 2 years ago

I think we would need to do this before indexing: we probably want to avoid doing these "fixes" in harvesting, so that the infrastructure works independently of any particular harvester. But we certainly need the clean locations before indexing. There are also up-to-date lists of location names provided by the Statistisches Bundesamt; I'll need to see which one we should use.

adamreichold commented 2 years ago

> we probably want to avoid doing these "fixes" in harvesting, so that the infrastructure works independently of any particular harvester.

This is not necessarily a contradiction: my plan is to implement a generic infrastructure along the lines of "here is some name, turn it into a Region by trying to resolve it using GeoNames, or take it as-is if that is not possible". Every harvester would then call a generic function for all fields of type Region, as in the sketch below.
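
A rough sketch of that generic function; the Region variants, the GeoNames handle, and its lookup method are hypothetical names chosen here for illustration, not the project's actual types:

```rust
/// Hypothetical handle onto the prepared GeoNames index.
pub struct GeoNames;

impl GeoNames {
    /// Look up a name, returning its GeoNames ID if it resolves unambiguously,
    /// e.g. by querying the Tantivy-based index prepared above.
    pub fn lookup(&self, _name: &str) -> Option<u64> {
        // Placeholder: the real implementation would search the index.
        None
    }
}

/// A harvested geographic reference, resolved against GeoNames where possible.
pub enum Region {
    /// The name could be mapped onto a GeoNames entry.
    GeoName { id: u64, name: String },
    /// Fallback: the harvested name is kept as-is.
    Unresolved(String),
}

impl Region {
    /// The generic function every harvester would call for fields of type `Region`.
    pub fn resolve(geo_names: &GeoNames, name: &str) -> Self {
        match geo_names.lookup(name) {
            Some(id) => Region::GeoName {
                id,
                name: name.to_owned(),
            },
            None => Region::Unresolved(name.to_owned()),
        }
    }
}
```

A harvester like the Wasser-DE one would then just call Region::resolve for every entry of REGION_NAMES and would not need to know how the resolution is implemented.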