VertNet / Darwin-Core-Engine

VertNet software

Thesaurus in Fusion Tables #54

Open tucotuco opened 13 years ago

tucotuco commented 13 years ago

Design and implement controlled vocabulary management that can be invoked through a command-line switch in the bulkloader. Invoking the switch would look up verbatim values in the thesaurus and replace them with standard values. It might be universal (--lookup-all) or take a list of terms to look up (--lookup cn,g,bor).
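A minimal sketch of how the two proposed switches might be parsed, assuming a standalone argparse wrapper rather than the bulkloader's actual option handling. The function name parse_lookup_options is hypothetical, and the flags exist only as proposed in this issue:

```python
import argparse


def parse_lookup_options(argv):
    """Parse the proposed thesaurus switches (hypothetical, not a real bulkloader API).

    Returns (lookup_all, terms), where terms is the list passed to --lookup.
    """
    parser = argparse.ArgumentParser(description="Proposed thesaurus switches")
    # The two switches are alternatives, so only one may be given.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--lookup-all", action="store_true",
                       help="look up every term that has a vocabulary")
    group.add_argument("--lookup", default="",
                       help="comma-separated list of terms to look up")
    args = parser.parse_args(argv)
    # Split the comma-separated list and drop empty items.
    terms = [t.strip() for t in args.lookup.split(",") if t.strip()]
    return args.lookup_all, terms
```

For example, parse_lookup_options(["--lookup", "cn,g,bor"]) yields the three term abbreviations, and parse_lookup_options(["--lookup-all"]) yields an empty term list with the universal flag set.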

One Fusion Table Thesaurus could manage all simple vocabularies. Thoughts on design can be found in [https://github.com/VertNet/Darwin-Core-Engine/wiki/Thesaurus].

eightysteele commented 13 years ago

Right, so Darwin Core controlled vocabularies (e.g., for country) would be managed in a Fusion Table. Entries for the country vocabulary might look like this:

vocab, name, synonyms
country, China, "china,ch,中国"
country, Argentina, "ar,argentina"
country, United States, "united states,us,united states of america,u.s.a"

Then we load the Fusion Table into our remote App Engine cache with an entry for each synonym with keys like this:

cv-name-synonym

Here cv means controlled vocabulary, name is the Darwin Core term name (country in the example above), and synonym is a single synonym. So cache entries for Argentina would be:

cv-country-ar=Argentina
cv-country-argentina=Argentina
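The expansion from thesaurus rows to cache entries could look roughly like this. It is only a sketch: the function name cache_entries is hypothetical, rows are assumed to arrive as (vocab, name, synonyms) tuples already read from the Fusion Table, and lowercasing the synonym is an assumption not stated above:

```python
def cache_entries(rows):
    """Expand (vocab, name, synonyms) rows into cv-<vocab>-<synonym> entries.

    The value stored under each key is the standard name for that synonym.
    """
    entries = {}
    for vocab, name, synonyms in rows:
        for synonym in synonyms.split(","):
            # Assumption: synonyms are normalized by trimming and lowercasing.
            synonym = synonym.strip().lower()
            if synonym:
                entries["cv-%s-%s" % (vocab.strip(), synonym)] = name.strip()
    return entries
```

Given the Argentina row from the table, this produces exactly the two entries listed above.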

The bulkloader would keep a local cache and, on a miss, fetch the entry from the remote cache over HTTP. We would also have a public Fusion Table where the bulkloader can push unknown synonyms. For example, if the country field in a record is foo and it is not found in the remote cache, the bulkloader would push it to that Fusion Table so that we could review it and add more synonyms to the authority. That would probably be a user opt-in feature.
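The lookup flow described here might be sketched as follows. Everything in it is illustrative: the Thesaurus class is hypothetical, fetch_remote stands in for the HTTP call to the remote App Engine cache, and report_unknown stands in for the opt-in push to the public Fusion Table. Both are passed in as plain callables so the sketch does not assume any particular endpoint or API:

```python
class Thesaurus(object):
    """Local cache in front of a remote cache, with optional unknown reporting."""

    def __init__(self, fetch_remote, report_unknown=None):
        self.local = {}                       # local cache: key -> standard value
        self.fetch_remote = fetch_remote      # callable(key) -> value or None
        self.report_unknown = report_unknown  # callable(vocab, verbatim), opt-in

    def lookup(self, vocab, verbatim):
        # Assumption: keys use the trimmed, lowercased verbatim value.
        key = "cv-%s-%s" % (vocab, verbatim.strip().lower())
        if key in self.local:
            return self.local[key]
        standard = self.fetch_remote(key)     # local miss: ask the remote cache
        if standard is None:
            # Unknown synonym: report it only if the user opted in,
            # and leave the verbatim value unchanged.
            if self.report_unknown is not None:
                self.report_unknown(vocab, verbatim)
            return verbatim
        self.local[key] = standard
        return standard
```

Leaving report_unknown as None models a user who has not opted in. This sketch does not cache misses, so a repeated unknown value would be fetched and reported again; a real implementation would probably want to remember those too.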