EnvironmentOntology / gaz

An open source gazetteer constructed on ontological principles

Identify Wikidata reconciliation strategies for GAZ. #43

Open andrawaag opened 1 year ago

andrawaag commented 1 year ago

GAZ does not seem to have many mappings to external identifiers (if any at all). This makes aligning it with Wikidata particularly challenging.

To get all terms in GAZ covered in Wikidata, we would probably need to apply different strategies to see whether a term is already covered or not.

In the case where the label used in Wikidata exactly matches the term in GAZ, OpenRefine can be our friend. I used this tool (offered in, for example, PAWS) to align GAZ countries with Wikidata.

I then continued with the GAZ terms on Suriname. So far, all of these terms do exist in Wikidata, but most have a different spelling. I will try to add all GAZ terms for that country manually.

So far, two strategies have been applied:

  1. Where the terms match exactly in Wikidata, we can rely on OpenRefine
  2. Where the terms exist, but with differences in spelling, manual curation by a curator with local knowledge is required
  3. .......
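
To make the split between the two strategies concrete, here is a minimal sketch of how GAZ labels could be triaged before reconciliation. It is only an illustration and not what was actually run: the function names `normalize` and `triage`, the 0.8 similarity cutoff, and the example labels are all assumptions, and it uses only the Python standard library rather than OpenRefine.

```python
import difflib
import unicodedata


def normalize(label: str) -> str:
    """Lowercase, strip accents, and collapse whitespace (illustrative rule)."""
    decomposed = unicodedata.normalize("NFKD", label)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.casefold().split())


def triage(gaz_label, wikidata_labels, cutoff=0.8):
    """Sort a GAZ label into 'exact', 'variant', or 'missing'.

    'exact'   -> strategy 1 (automatic reconciliation is likely enough)
    'variant' -> strategy 2 (candidates for a curator to review)
    'missing' -> no plausible candidate among the given labels
    The cutoff is an assumed value, not a tuned one.
    """
    exact = [w for w in wikidata_labels if w == gaz_label]
    if exact:
        return "exact", exact

    # Group the original labels by their normalized form.
    by_norm = {}
    for w in wikidata_labels:
        by_norm.setdefault(normalize(w), []).append(w)

    close = difflib.get_close_matches(
        normalize(gaz_label), list(by_norm), n=5, cutoff=cutoff
    )
    variants = [w for key in close for w in by_norm[key]]
    if variants:
        return "variant", variants
    return "missing", []


# Hypothetical labels, for illustration only.
labels = ["Paramaribo", "Nieuw-Nickerie"]
print(triage("Paramaribo", labels))
print(triage("Nieuw Nickerie", labels))
```

Anything that lands in the "variant" bucket would still need a curator with local knowledge to confirm it; the sketch only narrows down what has to be reviewed by hand.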

cmungall commented 1 year ago

I started a repo with some plans in it here:

https://github.com/INCATools/environments2wikidata

there are so many terms that manual curation will be hard. But we can use ontology axioms to aid in disambiguation...

lots of old code, I will try and update...

andrawaag commented 1 year ago

Today I tried to add as many GAZ identifiers as possible to Wikidata items on Suriname (see: https://w.wiki/6CVW).

This was mainly a manual curation step, where I searched for the names in Wikidata and added the respective GAZ identifiers.

lschriml commented 1 year ago

To edit GAZ, make a pull request against the GAZ_countries.owl file. As the full GAZ file is quite large, we are no longer editing the full file.

To edit: first, I would check the gaz.owl file to confirm that the locations you want are not already in it. I would recommend using a new ID space:

GAZ:$sequence(8,33333333,44444444), as this will not conflict with namespaces that are already in use.
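
As a rough illustration of what that ID rule means in practice, here is a small sketch that picks the next free ID in the range. It assumes the rule reads as "8 digits, from 33333333 to 44444444", and that the IDs already in use have been collected into a list beforehand; the function name `next_gaz_id` is hypothetical and not part of any GAZ tooling.

```python
def next_gaz_id(used_ids, start=33333333, stop=44444444, width=8):
    """Return the first unused 'GAZ:' ID in the assumed new ID range.

    used_ids: iterable of IDs already present, e.g. "GAZ:33333333".
    The start, stop, and width defaults mirror the numbers in the
    $sequence rule above; that reading of the rule is an assumption.
    """
    used = set(used_ids)
    for n in range(start, stop + 1):
        candidate = f"GAZ:{n:0{width}d}"
        if candidate not in used:
            return candidate
    raise ValueError("ID range exhausted")


# Example with made-up existing IDs.
print(next_gaz_id(["GAZ:33333333", "GAZ:33333334"]))
```

The list of used IDs would need to include IDs from both gaz.owl and GAZ_countries.owl for the check to be meaningful.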

Cheers, Lynn