Non-unique names for different sites/profiles/layers

alkalifly commented 6 years ago

At the moment, there is nothing to prevent name duplication between entries, so we have sites, profiles, and layers with non-unique names. This makes it tricky to get upstream information. For example, if I'm working with a layer and I want the land cover of that layer's profile, I cannot just check based on the pro_name of that layer, because there could be more than one profile with the same pro_name. Instead, I have to check entry_name, site_name, and pro_name all together to make sure I'm getting the correct information.
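A minimal sketch of the composite-key lookup described above, for anyone working with flattened rows outside of R. The function name `find_profile`, the sample rows, and the `pro_land_cover` column are assumptions for illustration; only `entry_name`, `site_name`, and `pro_name` come from this thread.

```python
# Hypothetical helper: match a layer to its profile on all three name fields.
KEY_FIELDS = ("entry_name", "site_name", "pro_name")

def find_profile(layer, profiles):
    """Return the one profile row sharing the layer's composite key."""
    key = tuple(layer[f] for f in KEY_FIELDS)
    matches = [p for p in profiles if tuple(p[f] for f in KEY_FIELDS) == key]
    if len(matches) != 1:
        # Zero matches or several: the names alone cannot resolve the link.
        raise LookupError(f"{len(matches)} profiles match {key}")
    return matches[0]

# Made-up rows: two profiles share site_name and pro_name across entries.
profiles = [
    {"entry_name": "A_2001", "site_name": "S1", "pro_name": "P1",
     "pro_land_cover": "forest"},
    {"entry_name": "B_2002", "site_name": "S1", "pro_name": "P1",
     "pro_land_cover": "cultivated"},
]
layer = {"entry_name": "B_2002", "site_name": "S1", "pro_name": "P1",
         "lyr_name": "P1_0_10"}

land_cover = find_profile(layer, profiles)["pro_land_cover"]
```

The lookup raises instead of returning the first match, so a non-unique combination is reported rather than silently resolved to the wrong profile.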

Is there any way that we could enforce unique names for these entries? Perhaps something in the compile function could check each name as it's being added, and if it finds it to be a duplicate, could append a random or sequential string to it? I imagine this would add a lot of processing time to the function, so it could certainly be optional, but it would save a lot of effort in the future when working with the data.
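One way the sequential-suffix idea could look, as a rough sketch rather than anything in the actual compile function. The name `make_unique` and the `_1`, `_2` suffix format are assumptions.

```python
from collections import Counter

def make_unique(names):
    """Append _1, _2, ... to every name that occurs more than once."""
    totals = Counter(names)   # how often each name occurs overall
    seen = Counter()          # how many copies have been renamed so far
    result = []
    for name in names:
        if totals[name] == 1:
            result.append(name)          # already unique, leave untouched
        else:
            seen[name] += 1
            result.append(f"{name}_{seen[name]}")
    return result

renamed = make_unique(["P1", "P2", "P1"])
```

This is a single pass over the names after one counting pass, so the cost grows linearly with the number of names. The sketch does not guard against a generated name such as `P1_1` colliding with a name that already exists, and a real version would also need to carry the renaming down to the child tables.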

jb388 commented 6 years ago

This issue has cropped up a number of times. Enforcing unique names upstream of the user is next to impossible. While it's doable to reference multiple columns when looking for unique IDs, I find it easier to assign new names. In my typical workflow I create new index columns (just sequential numbers) with whatever the current version of the database is, at each level. I maintain the old names for reference back to the templates/papers, but use the new names for sorting. This could be added as a vignette/ISRaD.extra function.

alkalifly commented 6 years ago

Thanks, Jeff. I am curious to understand your workflow with a new index column. How does that tell you, e.g., which entry on the profile tab goes with the layer you are looking at, if there is more than one profile with that layer's pro_name?

Note: I am working with ISRaD_list.xlsx outside of R. If the R object has some way of linking each layer to its unique profile, even when the profile name is not unique, that doesn't help me here.

My temporary workaround has been to cross reference entries using a combination of entry_name, site_name, and pro_name. It's cumbersome, but more importantly, even that combination is not enforced as unique. There are profiles in the current version of ISRaD for which it is not possible to tell the site coordinates, because there is more than one site with the same name. I don't think your approach of creating new index columns could be of help with this.

At the moment, this is only a problem with one entry: McClung_de_Tapia_2005.xlsx. This template was generated from the He compilation, so I was able to go back to Yujie's original data and fix it (see issue #93). But these identical site names passed QAQC in this case, meaning this problem could come up with new templates in the future. It seems that at the very least, QAQC should make sure that each site and profile name is unique in each individual template, so that as long as each new entry name is unique, there will be some unique combination of entry, site, and profile names.

jb388 commented 6 years ago

Hmm, I'm not quite sure I understand the issue. But I added an additional QAQC check to make sure site names aren't duplicated for a given entry (the prior version of QAQC only checked to make sure the site coordinates weren't duplicated).

I guess if you are working with the data in Excel this doesn't help, but within the R list object each layer is linked to its rightful profile even when the profile name itself isn't unique (i.e. dependencies are preserved).

alkalifly commented 6 years ago

Okay, it seems this is only an issue for those of us working outside of R, i.e., with the Excel file or flattened CSV.

The issue is that there could be more than one site with the same name on the same template, so the entry name would also be the same. This means that if you are outside of R, it is actually impossible to tell which of the identically named sites go with each of the profiles. In addition, if more than one profile from any entry/site has the same name, then it is impossible to tell which profile goes with each of the layers.

Your addition to the QAQC to prevent duplicate sites definitely will help. It would be helpful also to make sure that profile names are not duplicated, as that will help with the "in addition" case I mentioned above. And I suppose we also need to make sure that layer names aren't duplicated either, to avoid any ambiguity at lower levels (interstitial and fraction).
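A sketch of what per-level duplicate checks on a single template could look like. This is not the QAQC code itself: the table keys (`site`, `profile`, `layer`), the function names, and the sample rows are assumptions, and each table is assumed to carry the name columns of its parent levels.

```python
from collections import Counter

# Assumed key columns that should be unique at each level of a template.
LEVEL_KEYS = {
    "site": ("entry_name", "site_name"),
    "profile": ("entry_name", "site_name", "pro_name"),
    "layer": ("entry_name", "site_name", "pro_name", "lyr_name"),
}

def find_duplicates(rows, key_fields):
    """Return the sorted key tuples that occur in more than one row."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)

def check_template(tables):
    """Map each level name to its list of duplicated keys."""
    return {level: find_duplicates(tables[level], keys)
            for level, keys in LEVEL_KEYS.items()}

# Made-up template with two identically named sites.
tables = {
    "site": [
        {"entry_name": "E1", "site_name": "S1"},
        {"entry_name": "E1", "site_name": "S1"},
    ],
    "profile": [
        {"entry_name": "E1", "site_name": "S1", "pro_name": "P1"},
        {"entry_name": "E1", "site_name": "S1", "pro_name": "P2"},
    ],
    "layer": [
        {"entry_name": "E1", "site_name": "S1", "pro_name": "P1",
         "lyr_name": "L1"},
    ],
}
report = check_template(tables)
```

A template would pass only if every list in `report` is empty. Here the site level reports `("E1", "S1")`, the case where profiles can no longer be tied to a single site outside of R.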

With those checks in place in the QAQC, we are almost 100% safe from these potential ambiguities. The only remaining risk is a template added with a non-unique entry name whose sites and/or profiles are named identically to those in the existing template with the same entry name. That template would still pass QAQC on its own, but could introduce profiles with identical site and entry names, and/or layers with identical profile, site, and entry names.

So, the final step would be to make sure that we have a policy in place to manually check for identical entry names before adding any new templates, and hope that human error doesn't get in the way there.

jb388 commented 6 years ago

We definitely need a procedure in place to prevent duplication of whole entries. We've talked about this since day 1 of the MPI involvement, but nothing is currently implemented (other than manually checking). But, it would be simple enough to add a duplicate check for the entry_name field in the metadata table of the list object once it is compiled. I can work on that.
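For illustration, a duplicate check on the `entry_name` field of a compiled metadata table could be as small as the following. The function name and the sample rows are assumptions, and the real check would presumably live in the R package rather than in Python.

```python
from collections import Counter

def duplicated_entry_names(metadata_rows):
    """Return the sorted entry names that appear more than once."""
    counts = Counter(row["entry_name"] for row in metadata_rows)
    return sorted(name for name, n in counts.items() if n > 1)

# Made-up metadata table after compiling three templates.
metadata = [
    {"entry_name": "Smith_2010"},
    {"entry_name": "Jones_2015"},
    {"entry_name": "Smith_2010"},
]
dupes = duplicated_entry_names(metadata)
```

This only catches exact name matches. Two templates of the same study under different entry names would still need the manual check.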

FYI, QAQC already does check for duplication at the profile, layer, fraction, interstitial, incubation, and flux levels (for a given entry). This is implemented as a search for duplicate names at the respective level: pro_name, lyr_name, etc. Name combinations (profile/layer/etc.) are checked across levels as well.

greymonroe commented 5 years ago

We discussed this today at Irvine and I think it's an issue that we have figured out. New datasets should be checked by the ISRaD editor to verify that they are not a duplicate of another template.

International-Soil-Radiocarbon-Database / ISRaD #90