ODM2 / YODA-File

The YAML Observation Data Archive & exchange (YODA) File Format
BSD 3-Clause "New" or "Revised" License

UUIDS in YODA Files #11

Open horsburgh opened 9 years ago

horsburgh commented 9 years ago

Should the Excel templates generate GUIDs for use as UUIDs in the YODA files that they generate? If yes, it would be much easier to determine whether certain objects already exist in an ODM2 database instance when trying to load data (e.g., SamplingFeatures, Results, Datasets).

SRGDamia1 commented 9 years ago

I don't know of a way to ensure that the same sampling feature, typed into two different Excel templates, would end up with the same UUID.

klehnert55 commented 9 years ago

This is exactly the reason to use IGSNs, not UUIDs.


Dr. Kerstin Lehnert Director, Integrated Earth Data Applications Director, EarthChem President, IGSN e.V.

Lamont-Doherty Earth Observatory Columbia University Palisades, NY, 10964 (845) 365-8506 http://www.iedadata.org http://www.earthchem.org http://www.igsn.org

SRGDamia1 commented 9 years ago

Both the specimen and time series templates allow the user to input IGSNs.

klehnert55 commented 9 years ago

I think that the only way to ensure that the same sampling feature in two different spreadsheets receives the same UUID is to check whether both entries have the same IGSN.
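One way to sketch this idea is a name-based (version 5) UUID, which is derived deterministically from a namespace and a name, so the same IGSN always yields the same UUID. The following is a minimal Python sketch, not something the templates do: the namespace, the function name `uuid_from_igsn`, the IGSN value, and the normalization step (trimming spaces, upper-casing) are all assumptions for illustration.

```python
import uuid

# Hypothetical namespace for IGSN-derived UUIDs. Any fixed UUID would work,
# as long as every template uses the same one.
IGSN_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "http://www.igsn.org")


def uuid_from_igsn(igsn: str) -> uuid.UUID:
    """Derive a repeatable UUID from an IGSN.

    Assumption: IGSNs that differ only in case or surrounding
    whitespace refer to the same sampling feature.
    """
    return uuid.uuid5(IGSN_NAMESPACE, igsn.strip().upper())


# Made-up IGSN typed slightly differently into two spreadsheets.
first = uuid_from_igsn("ABC000001")
second = uuid_from_igsn(" abc000001 ")
print(first == second)
```

This only helps for sampling features that actually have an IGSN; anything without one would still need a UUID from some other source.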


horsburgh commented 9 years ago

Maybe we should just define the UUID fields in ODM2 to accept unique strings (including GUIDs and IGSNs). Not everyone who uses ODM2 will use IGSNs. Plus, we have UUID fields for Results and Datasets as well as SamplingFeatures that we need to handle. But we need a way for centralized repositories to determine whether the SamplingFeature, Result, and/or Dataset in one YODA file is the same as or different from the one in another YODA file.
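As a rough illustration of that repository-side check, the sketch below compares the identifier strings of one object type across two YODA files that have already been parsed into Python dictionaries. The key names (`SamplingFeatures`, `UUID`) and the function name are hypothetical and are not taken from the YODA schema; treating identifiers as case-insensitive strings is also an assumption.

```python
def find_shared_identifiers(file_a: dict, file_b: dict, object_type: str) -> set:
    """Return the identifiers of `object_type` that appear in both files.

    Identifiers are compared as trimmed, lower-cased strings, so a GUID or
    any other unique string can be used in the same field.
    """
    ids_a = {obj["UUID"].strip().lower() for obj in file_a.get(object_type, [])}
    ids_b = {obj["UUID"].strip().lower() for obj in file_b.get(object_type, [])}
    return ids_a & ids_b


# Two made-up parsed files that share one sampling feature.
yoda_1 = {"SamplingFeatures": [{"UUID": "A1B2C3D4-0000-4000-8000-000000000001"},
                               {"UUID": "A1B2C3D4-0000-4000-8000-000000000002"}]}
yoda_2 = {"SamplingFeatures": [{"UUID": "a1b2c3d4-0000-4000-8000-000000000001"}]}

shared = find_shared_identifiers(yoda_1, yoda_2, "SamplingFeatures")
print(shared)
```

Objects whose identifiers are not shared are either genuinely different or the same object entered with a new identifier, which is the case this check cannot detect on its own.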

SRGDamia1 commented 9 years ago

The Excel template accepts user-entered UUIDs for every field where they exist. It also accepts IGSNs (for all types of sampling features), ORCIDs (for people), and DOIs (for citations only) as separate external identifiers and writes those out following the template of the ExternalIdentifiers schema (completely separate from any UUIDs).

SRGDamia1 commented 9 years ago

Excel can generate a GUID by itself. I just don't think there would be any way of keeping the same GUID between Excel worksheets if the same data were submitted again. The new worksheet would assign new GUIDs unless the user of the Excel template finds the original GUID in their Excel-generated YAML and re-enters it. I rather suspect that anyone with enough savvy to dig the Excel-generated GUID out of their YAML and put it back into their database would most likely have a much better way of assigning a GUID in the first place.

horsburgh commented 9 years ago

I like the approach of letting them enter IGSNs and using the ExternalIdentifiers capability of ODM2/YODA to represent the linkage. For the Excel templates, I think it would be great if you left the UUID fields as text fields into which they could paste a UUID if one already exists. Then, add a button with a function that would generate new GUIDs as UUIDs for anything in the template that needs a new UUID. The UUID fields should be required - as they are in ODM2.

This would really simplify things on the receiving end of this - e.g., someone managing an ODM2 database who is trying to figure out how to load data from a YODA file. Yes, it creates some work for a data manager to re-enter UUIDs if they are using multiple YODA files to represent the same objects (e.g., SamplingFeatures, etc.)
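The behavior proposed for the button (keep any UUID that was pasted in, generate one only where the field is blank) can be sketched in a few lines. This is Python rather than Excel VBA, and the row layout, the `UUID` field name, and the function name are assumptions for illustration only.

```python
import uuid


def fill_missing_uuids(rows, field="UUID"):
    """Assign a new random (version 4) UUID to every row whose UUID field
    is missing or blank; values that are already present are left alone."""
    for row in rows:
        if not (row.get(field) or "").strip():
            row[field] = str(uuid.uuid4())
    return rows


# Made-up template rows: one pasted UUID, one blank cell, one missing field.
rows = [
    {"SamplingFeatureCode": "SITE-1", "UUID": "a1b2c3d4-0000-4000-8000-000000000001"},
    {"SamplingFeatureCode": "SITE-2", "UUID": ""},
    {"SamplingFeatureCode": "SITE-3"},
]
fill_missing_uuids(rows)
for row in rows:
    print(row["SamplingFeatureCode"], row["UUID"])
```

Running the function a second time on the same rows changes nothing, which is the property that lets a data manager re-export a template without its UUIDs drifting.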

SRGDamia1 commented 9 years ago

The UUID fields are required in ODM2 because the database automatically generates them. Is it worth making Excel generate UUIDs when whatever database the YODA file is being read into should already have the tools internally to generate them? We're already asking Excel to do a lot...

I definitely won't remove the ability for users to add text UUIDs if they have them.

Right now, the way I'm setting it up, it would be pretty tough for a data manager to actually pick apart all of the UUIDs that Excel would assign to their data, especially for the results. I doubt anyone would do it. That being said, while I couldn't stop Excel from generating a new UUID for the sample when it's put in a second time, it would take a lot of logic and intervention to keep any database that generates UUIDs from doing the same.

horsburgh commented 9 years ago

We should talk to the SDSC guys about this and think through the workflow from start to finish. The biggest implications are for matching things up within metadata cataloging and aggregating data within centralized instance(s) of ODM2. But it is also an issue for any data loaders we write for ODM2 that need to understand YODA files.

There is a SQL function that can be used to create GUIDs for the UUID fields (at least for MS SQL Server), but I don't think it's going to be consistent across all of the RDBMSs, and so it isn't quite the case that the "database automatically generates them." And it may not be the case that we want the database to do that (at least in all cases).
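One way around the differences between RDBMSs is for the loader, rather than the database, to supply the UUID. In SQL Server the built-in function is `NEWID()`, but SQLite, for example, has no equivalent. The sketch below is a minimal Python loader step against an in-memory SQLite database; the table is a simplified stand-in for an ODM2 table, and the function name and sample values are hypothetical.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
# Simplified stand-in table; the UUID is stored as text and must be unique.
conn.execute(
    "CREATE TABLE SamplingFeatures ("
    " SamplingFeatureUUID TEXT NOT NULL UNIQUE,"
    " SamplingFeatureCode TEXT NOT NULL)"
)


def insert_sampling_feature(conn, code, feature_uuid=None):
    """Insert a row, using the UUID from the YODA file when one is given
    and generating a new one in the loader otherwise."""
    feature_uuid = feature_uuid or str(uuid.uuid4())
    conn.execute(
        "INSERT INTO SamplingFeatures (SamplingFeatureUUID, SamplingFeatureCode)"
        " VALUES (?, ?)",
        (feature_uuid, code),
    )
    return feature_uuid


supplied = insert_sampling_feature(conn, "SITE-1", "a1b2c3d4-0000-4000-8000-000000000001")
generated = insert_sampling_feature(conn, "SITE-2")
print(conn.execute("SELECT COUNT(*) FROM SamplingFeatures").fetchone()[0])
```

Because the UNIQUE constraint rejects a second row with the same UUID, reloading a file that carries its own UUIDs fails loudly instead of silently duplicating the object.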

emiliom commented 9 years ago

Yes, thinking through likely workflows from start to finish is key. But ideally that will involve not just the needs of the aggregation catalog, but also scenarios of how data providers (CZO site data managers) are likely to be managing their data for themselves and using the Excel templates to generate YODA files. Expecting providers to know when to create new UUIDs, and to save the UUIDs they create for future reuse (likely not just within the Excel template and YODA files), will probably take a lot of ongoing education and some degree of testing/validation to catch likely cases of linked pieces of information using different UUIDs. My 0.5 cent, anyway.
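A validation pass of the kind mentioned here could, for instance, flag any sampling feature code that shows up with more than one UUID across submitted files. This is only a sketch of one possible check: the field names, the function name, and the assumption that a shared code means "the same object" are all hypothetical.

```python
from collections import defaultdict


def find_conflicting_uuids(records, code_field="SamplingFeatureCode", uuid_field="UUID"):
    """Group records by code and return the codes that carry more than
    one distinct UUID, mapped to the set of UUIDs seen for each."""
    uuids_by_code = defaultdict(set)
    for record in records:
        uuids_by_code[record[code_field]].add(record[uuid_field].strip().lower())
    return {code: ids for code, ids in uuids_by_code.items() if len(ids) > 1}


# Made-up records collected from two submissions of the same site.
records = [
    {"SamplingFeatureCode": "SITE-1", "UUID": "a1b2c3d4-0000-4000-8000-000000000001"},
    {"SamplingFeatureCode": "SITE-1", "UUID": "a1b2c3d4-0000-4000-8000-000000000099"},
    {"SamplingFeatureCode": "SITE-2", "UUID": "a1b2c3d4-0000-4000-8000-000000000002"},
]
conflicts = find_conflicting_uuids(records)
print(sorted(conflicts))
```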

horsburgh commented 8 years ago

The Time Series template now has the capability to generate UUIDs for the user. I'm not closing this issue, given that the other templates do not have this capability yet.