Open tucotuco opened 13 years ago
Right, so Darwin Core controlled vocabularies (e.g., for country) would be managed in a Fusion Table. Entries for the country vocabulary might look like this:
vocab, name, synonyms
country, China, "china,ch,中国"
country, Argentina, "ar,argentina"
country, United States, "united states,us,united states of america,u.s.a"
Then we load the Fusion Table into our remote App Engine cache with an entry for each synonym with keys like this:
cv-name-synonym
Where cv means controlled vocabulary, name is the Darwin Core name, and synonym is a single synonym. So cache entries for Argentina would be:
cv-country-ar=Argentina
cv-country-argentina=Argentina
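The expansion from table rows to one cache entry per synonym could be sketched roughly like this. It is only an illustration: the function name cache_entries and the SAMPLE text are hypothetical, the rows are read as CSV text rather than from a real Fusion Table, and lowercasing each synonym before building the key is an assumption, not something decided above.

```python
import csv
import io

# Sample rows copied from the example table above (CSV text, not a live Fusion Table).
SAMPLE = '''vocab, name, synonyms
country, China, "china,ch,中国"
country, Argentina, "ar,argentina"
country, United States, "united states,us,united states of america,u.s.a"
'''


def cache_entries(table_text):
    """Expand each thesaurus row into one cv-name-synonym cache entry per synonym."""
    # skipinitialspace lets the parser handle the space after each comma.
    reader = csv.DictReader(io.StringIO(table_text), skipinitialspace=True)
    entries = {}
    for row in reader:
        for synonym in row["synonyms"].split(","):
            synonym = synonym.strip().lower()  # assumption: keys are lowercased
            if synonym:
                key = "cv-%s-%s" % (row["vocab"], synonym)
                entries[key] = row["name"]
    return entries


if __name__ == "__main__":
    for key, name in sorted(cache_entries(SAMPLE).items()):
        print("%s=%s" % (key, name))
```

Keeping one flat key per synonym means a lookup is a single cache get, at the cost of more entries per vocabulary.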
The bulkloader would have a local cache, and on a miss would grab it from the remote cache over HTTP. Then we have a public Fusion Table where the bulkloader can push unknown synonyms. For example, if the country field in a record is foo, and it's not found in the remote cache, then the bulkloader would push it to the Fusion Table so that we could review and add more synonyms to the authority. That would be a user opt-in feature probably.
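A minimal sketch of that lookup flow, under stated assumptions: the class VocabularyLookup and the callables fetch_remote and report_unknown are hypothetical stand-ins for the HTTP call to the remote cache and the push to the public Fusion Table, neither of which is specified here. Returning the verbatim value unchanged on a miss and lowercasing the value for the key are also assumptions.

```python
class VocabularyLookup:
    """Local cache in front of a remote cache, with optional unknown-synonym reporting."""

    def __init__(self, fetch_remote, report_unknown=None):
        self.local = {}
        # Hypothetical callable(key) -> standard name, or None when the key is unknown.
        self.fetch_remote = fetch_remote
        # Hypothetical callable(vocab, verbatim); left as None unless the user opts in.
        self.report_unknown = report_unknown

    def lookup(self, vocab, verbatim):
        key = "cv-%s-%s" % (vocab, verbatim.strip().lower())
        if key in self.local:
            return self.local[key]  # local hit, no remote call
        name = self.fetch_remote(key)  # local miss, ask the remote cache
        if name is None:
            if self.report_unknown is not None:
                self.report_unknown(vocab, verbatim)  # queue for review
            return verbatim  # assumption: leave unknown values as they are
        self.local[key] = name
        return name


if __name__ == "__main__":
    remote = {"cv-country-ar": "Argentina"}
    unknown = []
    lookup = VocabularyLookup(remote.get, lambda v, s: unknown.append((v, s)))
    print(lookup.lookup("country", "ar"))
    print(lookup.lookup("country", "foo"))
    print(unknown)
```

Passing the remote fetch and the reporting hook in as callables keeps the opt-in decision outside the lookup logic and makes the flow easy to test without network access.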
Design and implement controlled vocabulary management that can be invoked by a command-line switch in the bulkloader. Invoking the switch would look up verbatim values in the thesaurus and replace them with standard values. The switch might be universal (--lookup-all) or take a list of terms to look up (--lookup cn,g,bor).
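One way the two switches might be parsed, sketched with Python's standard argparse module. This is not the bulkloader's actual option handling: the function names parse_lookup_args and standardize are hypothetical, and making the two switches mutually exclusive is an assumption. The lookup argument is any callable taking a term and a verbatim value and returning the standard value.

```python
import argparse


def parse_lookup_args(argv):
    """Return (lookup_all, terms) parsed from the proposed switches."""
    parser = argparse.ArgumentParser(prog="bulkloader-sketch")
    group = parser.add_mutually_exclusive_group()  # assumption: one or the other
    group.add_argument("--lookup-all", action="store_true",
                       help="look up every term in the thesaurus")
    group.add_argument("--lookup", default="", metavar="TERMS",
                       help="comma-separated list of terms to look up")
    args = parser.parse_args(argv)
    terms = [t.strip() for t in args.lookup.split(",") if t.strip()]
    return args.lookup_all, terms


def standardize(record, lookup_all, terms, lookup):
    """Replace verbatim values with standard values for the selected terms."""
    selected = set(record) if lookup_all else set(terms)
    return {term: (lookup(term, value) if term in selected else value)
            for term, value in record.items()}


if __name__ == "__main__":
    lookup_all, terms = parse_lookup_args(["--lookup", "cn,g,bor"])
    print(lookup_all, terms)
```

With neither switch given, terms is empty and standardize returns the record unchanged, so the feature stays off by default.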
One Fusion Table Thesaurus could manage all simple vocabularies. Thoughts on design can be found in [https://github.com/VertNet/Darwin-Core-Engine/wiki/Thesaurus].