Data Integration - Githubissues

Documenting the proposed data integration here for discussion.

From my understanding, there is a survey-template.csv file that will be populated with survey-specific metadata, including ids for the nutrient and chlorophyll data derived through lab analyses. Whether these ids will be stored as a string in a single column (sample_ids) or split into separate columns (e.g., nutrient_sample_id and chl_sample_id) is yet to be decided. The data collected by the miniDOT and ctd-diver, however, is more continuous and, if I'm not mistaken, each timestamp will not have a unique ID. This data is connected to the survey metadata and the associated lab-derived data by proximity of data collection.

In practical terms, this means that if the sample_time_collection associated with the nutrient_sample_id and chl_sample_id overlaps with, or is in close proximity to, the time interval recorded by the ctd-diver or miniDOT (what does this mean specifically - within 5 minutes?), we can reliably say that they are part of the same survey.
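To make the proximity rule concrete, here is a minimal Python sketch of how the check could work. It assumes the instrument interval is taken from the first and last timestamps in the ctd-diver or miniDOT record, and it uses the 5-minute tolerance only because that is the value floated above, not because it has been agreed. The function name `within_proximity` and the example timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed tolerance: 5 minutes is the open question above, not a decision.
TOLERANCE = timedelta(minutes=5)

def within_proximity(sample_time, instrument_times, tolerance=TOLERANCE):
    """Return True if sample_time falls inside the instrument's recorded
    interval (first to last timestamp), widened by tolerance on both sides."""
    start = min(instrument_times)
    end = max(instrument_times)
    return start - tolerance <= sample_time <= end + tolerance

# Hypothetical instrument record: readings from 10:00 to 10:12.
instrument_times = [
    datetime(2024, 6, 1, 10, 0),
    datetime(2024, 6, 1, 10, 6),
    datetime(2024, 6, 1, 10, 12),
]

# A water sample taken 3 minutes after the last reading still matches.
print(within_proximity(datetime(2024, 6, 1, 10, 15), instrument_times))  # True
# A sample taken 18 minutes after the last reading does not.
print(within_proximity(datetime(2024, 6, 1, 10, 30), instrument_times))  # False
```

A check like this only works if the tolerance is wide enough to cover a repeated cast, which is what the second question below is getting at.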

Questions this raises for me:

[ ] Should we record 'time_in' and 'time_out' for the ctd-diver/miniDOT?
[ ] Are the ctd-diver and miniDOT part of a single instrument? If not, can it happen that a water sample is collected by the Niskin bottle but the ctd cast fails, so the ctd has to be dropped again outside the range that would be considered 'in close proximity to' the timestamp associated with the nutrient_sample_id and chl_sample_id? How would we then ensure that the ctd-diver/miniDOT data is associated with the correct nutrient/chl data?

To answer the questions:

No need to record time_in and time_out, from my understanding. The times that are recorded are when the instrument measurements are taken and when the water samples are collected. Given the occasional double cast, these two times are not always the same.
If a water sample is collected, it will have an associated sampling event #. The measurements (ids) will then be nested under this sampling event #. If the entire sampling event fails, then perhaps the sampling event # should be omitted, but if only part of the sampling event fails, it will be important to keep the sampling event # because there will still be some measurements associated with it.

The survey.csv will not include a column for sample_ids. Lists of sample_ids in a single cell would make data entry more challenging than it needs to be. Instead, the Sampling Event Number will also be recorded within the derived data sheets, which will help nest sample_ids under each event.
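As a rough illustration of that nesting, the Python sketch below groups sample ids from two derived data sheets under the sampling event number recorded in the survey sheet. All column names (`sampling_event_number`, `site`, `nutrient_sample_id`, `chl_sample_id`) and all values are assumptions made for the example; the actual templates may differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sheet contents; column names and values are illustrative only.
survey_csv = """sampling_event_number,site
1,site_a
2,site_b
"""

nutrients_csv = """sampling_event_number,nutrient_sample_id
1,N001
1,N002
2,N003
"""

chl_csv = """sampling_event_number,chl_sample_id
1,C001
"""

def ids_by_event(text, id_column):
    """Group the ids in one derived data sheet by sampling event number."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        grouped[row["sampling_event_number"]].append(row[id_column])
    return grouped

nutrient_ids = ids_by_event(nutrients_csv, "nutrient_sample_id")
chl_ids = ids_by_event(chl_csv, "chl_sample_id")

# Nest the sample ids under each sampling event listed in the survey sheet.
nested = {}
for event in csv.DictReader(io.StringIO(survey_csv)):
    number = event["sampling_event_number"]
    nested[number] = {
        "site": event["site"],
        "nutrient_sample_ids": nutrient_ids.get(number, []),
        "chl_sample_ids": chl_ids.get(number, []),
    }

for number, record in nested.items():
    print(number, record)
```

In this example, event 2 has a nutrient sample but no chlorophyll sample, so it stays in the result with an empty chlorophyll list. That mirrors the partial-failure case described above, where the sampling event # is kept because some measurements are still associated with it.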

HakaiInstitute / GEM-in-a-box-dataset-repository-template

Data Integration #5