International-Soil-Radiocarbon-Database / ISRaD

Repository for the development and release of ISRaD data and tools
https://international-soil-radiocarbon-database.github.io/ISRaD/

Merging datasets with overlapping sites/profiles/layers #167

Closed coreylawrence closed 5 years ago

coreylawrence commented 5 years ago

@aahoyt @jb388 @ShaneStoner @greymonroe @crlsierra

I am working through an expert review of a dataset that I originally entered but Ariel supplemented. The dataset references two separate publications (Hall et al., 2015 and 2018) but reports data measured on the exact same samples. This brings up a couple of questions (below). I would like to come to agreement on #1 below before I finish this expert review. #2 is more open-ended.

  1. In the case where multiple studies (i.e., publications) add new measurements to the exact same samples, is it reasonable to just add those new values to the existing rows within a template? Or is it preferred to enter those data as separate rows that reference the entry from which the data were obtained?

Merging the two during data entry would result in a simpler template and eliminate the need to merge the data after the fact, but that would come at the expense of losing a record of which data came from which publication.
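The trade-off in question 1 can be made concrete with a minimal sketch that merges rows describing the same sample while keeping a per-field record of which publication supplied each value. This is only an illustration: the field names (`site`, `profile`, `layer`, `source`, `c14`, `c_pct`), the source labels, and the values are hypothetical and are not ISRaD template columns or data from any study.

```python
def merge_same_sample_rows(rows, key_fields=("site", "profile", "layer")):
    """Merge rows that describe the same sample, remembering the source of each value.

    Hypothetical helper: 'source' names the publication a row came from.
    """
    merged = {}
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        target = merged.setdefault(key, {"values": {}, "sources": {}})
        for field, value in row.items():
            # Skip identifiers, the source label itself, and empty cells.
            if field in key_fields or field == "source" or value is None:
                continue
            # Refuse to silently overwrite a different value for the same field.
            if field in target["values"] and target["values"][field] != value:
                raise ValueError(f"conflicting {field} for sample {key}")
            target["values"][field] = value
            target["sources"].setdefault(field, row["source"])
    return merged


# Made-up example: two publications measured different things on one sample.
rows = [
    {"site": "A", "profile": "P1", "layer": "0-10",
     "source": "study_2015", "c14": -50.0, "c_pct": None},
    {"site": "A", "profile": "P1", "layer": "0-10",
     "source": "study_2018", "c14": None, "c_pct": 2.1},
]
merged = merge_same_sample_rows(rows)
```

The result holds one merged record per sample, and the `sources` mapping preserves the record of which publication each measurement came from.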

  2. In cases where multiple studies report data from the same sites and/or profiles but the data were not measured on the exact same samples, is there still value in including an ISRaD.extra function for merging the datasets?

This is the more general case of question 1. In my own work, I have a huge amount of data collected from the Santa Cruz and Mattole chronosequences. A lot of those data were measured and reported in a single publication (which I will add to ISRaD soon), but a significant amount of supplemental data was collected in a way that was intended to be comparable with the other measurements. However, those supplemental data were generated on different samples, collected on a different date and/or with slightly different depth intervals. There is clearly value in being able to merge such datasets, but is the code required to do it worth the effort? And is it within the scope of ISRaD?
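To give a sense of what such code might involve, here is a rough sketch of one possible approach: pairing layers from two datasets at the same site and profile when their depth intervals overlap by at least some fraction of the thinner layer. The field names (`id`, `site`, `profile`, `top`, `bottom`), the 50% threshold, and the example depths are all assumptions for illustration, not an existing ISRaD function or real data.

```python
def depth_overlap_cm(a_top, a_bot, b_top, b_bot):
    """Thickness (cm) shared by two depth intervals; 0 if they do not overlap."""
    return max(0.0, min(a_bot, b_bot) - max(a_top, b_top))


def match_layers(primary, supplemental, min_fraction=0.5):
    """Pair layers from the same site/profile whose depths overlap enough.

    min_fraction is the required overlap relative to the thinner layer
    (an arbitrary, hypothetical threshold).
    """
    matches = []
    for p in primary:
        for s in supplemental:
            if (p["site"], p["profile"]) != (s["site"], s["profile"]):
                continue
            overlap = depth_overlap_cm(p["top"], p["bottom"], s["top"], s["bottom"])
            thinner = min(p["bottom"] - p["top"], s["bottom"] - s["top"])
            if thinner > 0 and overlap / thinner >= min_fraction:
                matches.append((p["id"], s["id"], overlap))
    return matches


# Made-up layers sampled with slightly different depth intervals.
primary = [
    {"id": "p1", "site": "A", "profile": "P1", "top": 0, "bottom": 10},
    {"id": "p2", "site": "A", "profile": "P1", "top": 10, "bottom": 20},
]
supplemental = [
    {"id": "s1", "site": "A", "profile": "P1", "top": 0, "bottom": 8},
    {"id": "s2", "site": "A", "profile": "P1", "top": 8, "bottom": 25},
]
matches = match_layers(primary, supplemental)
```

Even this simple version requires a judgment call on the overlap threshold and ignores sampling date, which hints at why the effort and scope question is not trivial.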

Kate-Heckman commented 5 years ago
  1. I have just been adding new values to the existing rows, and I think I've seen others take this approach.
  2. I have no firm opinion on this. I think if the measurements were made on a different sample set (sampled at a slightly different depth or year) then they shouldn't be merged. I think the fact that the two datasets will have the same lat/long should let the end user know that they're co-located. It is then on the end user to go in and see what the differences in the two datasets are.

My two cents.

Kate

aahoyt commented 5 years ago

1a. I think it is acceptable (but not required) to merge them. That is what I would have done if they are measurements on the same samples. Make sure to add the new/other reference under "associated_datasets" on the metadata tab. I believe we created that field to handle this type of case. I think this is more commonly how people have been handling it, although there are obviously benefits to both approaches, and it will probably be difficult to standardize. So, I'd say go ahead and merge, but when working with ISRaD, recognize that in some cases they won't be merged (e.g., two papers get entered separately and no one notices).

1b. This brings up the point that currently associated datasets may not be represented in our complete list of studies if they are only referenced there, and the DOI is not included anywhere. We should see if we are missing many studies from our reference list as a result and how we want to handle that. (for example, maybe we could request that field be filled with a DOI instead of dataset name and also use it for the reference list?)

  2. I think this function sounds great & is something we should add to our wishlist for ISRaD_extra! We had talked about doing something like that at the site level for a long time, since people might not always enter the exact same coordinates. In the short-term, I think it's up to each person analyzing the data to deal with it though.
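As a rough idea of what site-level matching could look like when coordinates are not entered identically, the sketch below flags pairs of sites from different entries that fall within a distance threshold, using the haversine great-circle formula. The field names (`entry`, `name`, `lat`, `lon`), the 1 km threshold, and the coordinates are hypothetical; this is not an existing ISRaD_extra function.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def colocated_pairs(sites, max_km=1.0):
    """Return name pairs of sites from different entries within max_km of each other."""
    pairs = []
    for i, a in enumerate(sites):
        for b in sites[i + 1:]:
            if a["entry"] == b["entry"]:
                continue  # only compare sites across different entries
            if haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_km:
                pairs.append((a["name"], b["name"]))
    return pairs


# Made-up coordinates: the first two sites differ only slightly.
sites = [
    {"entry": "entry_A", "name": "site_1", "lat": 37.000, "lon": -122.000},
    {"entry": "entry_B", "name": "site_1b", "lat": 37.001, "lon": -122.001},
    {"entry": "entry_B", "name": "site_2", "lat": 38.000, "lon": -121.000},
]
pairs = colocated_pairs(sites)
```

A flagged pair would still need a human check, since nearby coordinates do not guarantee that two entries describe the same site.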