BirdsCanada / NatureCountsAPI


Metadata updates #10

Closed by steffilazerte 5 years ago

steffilazerte commented 5 years ago

I've been thinking a bit about metadata updates, and I realized that there are probably two types of metadata: those that update relatively frequently, like the collections API entry point (which I assume is updated as new collections are added?), and those that update relatively infrequently, like the species taxonomy, state/prov codes, etc.

Because I use the API version to update the local metadata, I don't want to store any metadata locally that, if changed, would not result in an update to the API version.
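For concreteness, here's a minimal sketch of the version check I have in mind. The endpoint URL and response field are hypothetical, not the real API:

```r
# Hypothetical sketch: skip the metadata download unless the API reports
# a new metadata version. URL and response field are assumptions.
library(jsonlite)

update_metadata_if_stale <- function(version_file = "metadata_version.txt",
                                     version_url = "https://example.org/api/version") {
  remote <- fromJSON(version_url)$metadata_version  # assumed response field
  local <- if (file.exists(version_file)) readLines(version_file)[1] else NA_character_

  if (is.na(local) || !identical(local, remote)) {
    # ...re-download all versioned metadata tables here...
    writeLines(remote, version_file)
    message("Metadata updated to version ", remote)
  } else {
    message("Metadata already current (version ", local, ")")
  }
}
```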

So, my question is, do we have a list of metadata that would create a change in API version?

denislepage commented 5 years ago

I would want all metadata tables to be available locally in all cases.

The annual species update is the most likely reason I see for generating a new metadata version. That is a more consequential change: not only will the metadata table change, but we will also update the data to match any taxonomic changes. The same number may switch from representing a subspecies to being a full species, new numbers will be introduced, numbers will become obsolete, etc.

Most of the other changes won’t be consequential or substantial enough to warrant a change in version.

Changes to the metadata that are strictly incremental (e.g. only new records being added) won’t trigger a new metadata version. E.g. adding new collections.

Likewise, changes that do not directly affect a primary key should also not trigger a new metadata version number, as long as they don't represent a fundamental change in meaning. Tweaking the name of a collection, or even changing the project a collection belongs to, shouldn't be enough of a concern. Changes like these will likely clarify the metadata, but won't really break the relational integrity.

We’ll want to minimize those occurrences (and they will be rare), but even if there were minor changes that break the integrity of the data on the user side, I wouldn’t be concerned enough to trigger a new version. Say a collection was assigned a new code or deleted, and people have data in their file that no longer matches an entry in the metadata table. Oh well. ¯\\_(ツ)_/¯

You could possibly minimize this type of minor problem by ensuring that your metadata updates do not delete records in the metadata tables, but rather only do updates and inserts based on the table key. The local metadata tables may thus drift a bit out of sync and include a few records that have since been removed on the server’s end, but at least they will likely match the data that people have.
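As a sketch of that update-and-insert (upsert) approach, assuming an illustrative `collections` table with made-up codes, and RSQLite's bundled SQLite (>= 3.24 for `ON CONFLICT ... DO UPDATE`):

```r
library(DBI)
library(RSQLite)

upsert_collections <- function(con, collections) {
  dbExecute(con, "CREATE TABLE IF NOT EXISTS collections (
                    collection TEXT PRIMARY KEY,
                    name       TEXT)")
  # Insert new rows; update existing rows in place; never delete,
  # so local records removed on the server are left untouched.
  dbExecute(con,
    "INSERT INTO collections (collection, name)
     VALUES (:collection, :name)
     ON CONFLICT (collection) DO UPDATE SET name = excluded.name",
    params = list(collection = collections$collection,
                  name = collections$name))
}

con <- dbConnect(SQLite(), ":memory:")
upsert_collections(con, data.frame(collection = c("ABT", "EBIRD"),
                                   name = c("Alberta Bird Atlas", "eBird")))
dbDisconnect(con)
```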

Hopefully that makes sense.

steffilazerte commented 5 years ago

That is a fantastic emoticon!

Ahem, anyway. I can think of three versions of metadata:

  1. Remote API
  2. Local R package (Can be updated with the nc_metadata() function)
  3. Local user SQLite tables (Currently cannot be updated; if they're out of date, users need to start a new database. I agree: "Oh well!")

If a user specifies that the downloaded observations should be stored in a SQLite database, the package will also store all the metadata tables locally in that database. So users with SQLite databases will always have access to that local metadata.
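Something like this, assuming the metadata tables have already been fetched from the API (function and object names here are just illustrative):

```r
library(DBI)
library(RSQLite)

# Copy a named list of metadata data frames into the user's database
# so SQLite users always have a local copy.
store_metadata <- function(db_path, meta_tables) {
  con <- dbConnect(SQLite(), db_path)
  on.exit(dbDisconnect(con))
  for (nm in names(meta_tables)) {
    dbWriteTable(con, nm, meta_tables[[nm]], overwrite = TRUE)
  }
}

# e.g. store_metadata("naturecounts.sqlite",
#                     list(species = species_df, collections = collections_df))
```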

What I'm considering is the presence of (2), the metadata stored in the R package. In this case, I think I'm only going to store data that is large and/or required by the search functions. All other metadata will be downloaded when a new SQLite database is created, or when the user specifically requests it with a particular function (sketched below).
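For the on-request path, something like this hypothetical helper (the endpoint and response format are assumptions, not the real API):

```r
library(jsonlite)

# Fetch a single metadata table by name when the user asks for it.
nc_metadata_table <- function(name,
                              base_url = "https://example.org/api/metadata/") {
  fromJSON(paste0(base_url, name))  # assumed to return a JSON array of records
}

# e.g. statprov <- nc_metadata_table("statprov")
```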

Sound reasonable?

denislepage commented 5 years ago

Yes, I think that sounds good.

Sorry for conflating the local R metadata and the SQLite metadata again.

steffilazerte commented 5 years ago

No problem, I feel like I do it all the time!