Add reference station and survey log to repo

JessyBarrette commented 2 months ago

here's an example of common log used to track a project specific stations surveyed and survey log.

Each repo should have a single stations.csv

station	latitude	longitude	commissioned_time	decommisioned_time	comments	...

For cruise/surveys, a survey log should be saved within the data/2000-01-01-Example_Survey/survey.csv

station	depth	collection_time	niskin_id	sample_ids	comments	...

Derived sampled data

sample_id	result data	...	comments

where:

station in survey.csv should match a station in station.csv
sample_time_collection: is the time when the niskin bottle was closed
sample_ids: is a list of sample_id collected which will be use to match each derived sample data.

Suggested Workflow

A user should add all the stations coordinates they plan to survey for this project.

Field steps

On a survey, a user should fill a paper copy of the survey.csv file. Take a picture of that sheet and leave a copy within the survey directory.
Go to a sampling site, and let the instruments and the niskin bottle soak at the target depth for at least 5min.
Send niskin bottle messenger down and save that time as sample_time_collection
Write the station
Sample the water from the niskin and write all the sample_ids collected
Go to the next site...

In the lab

Download data from the minidot and ctd-diver
Upload those files to the repo in the survey specific directory
Process all the samples and add the results to each specific derived sample results files
Copy all the information within the survey log to the survey.csv

Processing that data

All those files are pushed to the repository within the specific survey subdirecory.

Once, all that data is available. The suggested workflow would be:

Load the survey.csv file, the instrument data and sample derived data.
Match each derived sample data to a each survey row by using the sample_id.
Match, the location where the data was retrived by matching the station column with the station log, and
Finally match each survey rows to the nearest in time (after?) record recorded by the minidot and ctd-diver
Create a new file (csv) with the survey name and a suffix like (_FINAL.csv) which is the aggregation of all those data.
ERDDAP can then scan through all the subdirectories and look for those specific (*_FINAL.csv) and serve them.
Metadatat associated with this dataset should be made available within the base of the directory, ideally with a dataset.xml file, you could use the Citation.cff to generate ACDD equivalent to be use by ERDDAP.

steviewanders commented 2 months ago

Thanks @JessyBarrette!

ERDDAP can then scan through all the subdirectories and look for those specific (*_FINAL.csv) and serve them. Metadatat associated with this dataset should be made available within the base of the directory, ideally with a dataset.xml file, you could use the Citation.cff to generate ACDD equivalent to be use by ERDDAP.

Based on the conversation with DFO et al yesterday, the ERDDAP involved will likely be 'theirs' so I would recommend leaving the related steps and data out of this and allow whoever is in charge of that to work on those elements and workflows.

Load the survey.csv file, the instrument data and sample derived data. Match each derived sample data to a each survey row by using the sample_id. Match, the location where the data was retrived by matching the station column with the station log, and Finally match each survey rows to the nearest in time (after?) record recorded by the minidot and ctd-diver Create a new file (csv) with the survey name and a suffix like (_FINAL.csv) which is the aggregation of all those data.

@timvdstap I am guessing these steps would have likely been done by @JessyBarrette in Python at some stage, are you comfortable automating tasks like this in R? Or is this something I/others should do?

JessyBarrette commented 2 months ago

Based on the conversation with DFO et al yesterday, the ERDDAP involved will likely be 'theirs' so I would recommend leaving the related steps and data out of this and allow whoever is in charge of that to work on those elements and workflows.

Ok that sounds good to me, as long as DFO as a consistent way to retrieve the metadata from each repos then it should be alright. That's why I'm thinking the citation.cff file would be a good source which would be useful at multiple places through the lifecycle of the dataset.

steviewanders commented 2 months ago

That's why I'm thinking the citation.cff file would be a good source which would be useful at multiple places through the lifecycle of the dataset.

Makes sense.

If you want to add steps to generate a citation.cff to the workflow? Or did I miss it

JessyBarrette commented 2 months ago

It is the initial checklist https://github.com/HakaiInstitute/GEM-in-a-box-dataset-repository-template/blob/main/.github/ISSUE_TEMPLATE/init-data-repository-body.md

Which is an issue that gets generated when the template is use to create a new repo.

steviewanders commented 2 months ago

Great! Thanks.

timvdstap commented 2 months ago

Thanks for this @JessyBarrette and @steviewanders !

Sample the water from the niskin and write all the sample_ids collected

Just to confirm - each sample_id needs to be unique and on a separate row, correct?

All those files are pushed to the repository within the specific survey subdirecory.

Confirmation that the proposed repository structure is as follows (specific for these data files):

root

stations.csv
dataset.xml??

root/data/{YYYY-MM-DD} --- /data/{YYYY-MM-DD}/survey.csv --- /data/{YYYY-MM-DD}/survey.jpeg --- /data/{YYYY-MM-DD}/survey_data_FINAL.csv ------ /data/{YYYY-MM-DD}/minidot -------- /data/{YYYY-MM-DD}/minidot/{original file (format)} -------- /data/{YYYY-MM-DD}/minidot/{processed file (format)} ----- /data/{YYYY-MM-DD}/ctd-diver ----- /data/{YYYY-MM-DD}/aquafluor ----- /data/{YYYY-MM-DD}/DR1900

Ok that sounds good to me, as long as DFO as a consistent way to retrieve the metadata from each repos then it should be alright. That's why I'm thinking the citation.cff file would be a good source which would be useful at multiple places through the lifecycle of the dataset.

Sorry I'm not quite familiar with how citation.cff will help ensure that metadata from each repo is consistently retrieved at different stages.

Metadatat associated with this dataset should be made available within the base of the directory, ideally with a dataset.xml file, you could use the Citation.cff to generate ACDD equivalent to be use by ERDDAP.

I might need some more explanation on how the dataset.xml file is generated, created?

@timvdstap I am guessing these steps would have likely been done by @JessyBarrette in Python at some stage, are you comfortable automating tasks like this in R? Or is this something I/others should do?

I have joined data tables based on primary keys in R, haven't automated this process. Would like to take a crack with some guidance perhaps if this is expected of Hakai for this project.

JessyBarrette commented 2 months ago

Just to confirm - each sample_id needs to be unique and on a separate row, correct?

I was thinking all of them within the same row within the csv, perhaps white space separated. It's more just a way to links samples to when they were collected. Distributing them on multiple rows would end up having a lot of duplicated rows for anything but the sample_ids column. Though I leave it to you to decide.

Confirmation that the proposed repository structure is as follows (specific for these data files):...

Yes that's what I meant, though forget about the dataset.xml. assuming that all those resulting files are using the same column names then we don't need to have one specific there.

citation.cff

Unless you want to them to fill a hakai metadata record each time they generate a new repo. Then I was thinking that was the more general way to capture most of the metadata associated to a dataset. The Hakai form is better though, up to you to decide which one you want to them to use. Ideally a citation.cff file should be made available here. If you use the hakai form than the erddap xml can taken from https://github.com/HakaiInstitute/hakai-metadata-entry-form-files

I might need some more explanation on how the dataset.xml file is generated, created?

No dataset.xml will be generated here but, if all the different projects are producing similar csv files, a single dataset.xml to which we change the metadata (global attributes) would be fine to serve all those datasets.

I have joined data tables based on primary keys in R, haven't automated this process. Would like to take a crack with some guidance perhaps if this is expected of Hakai for this project.

Not sure how it can be done in R, but in python with pandas. pd.merge_asof and pd.merge would be your friend :)

JessyBarrette commented 2 months ago

With also pd.read_csv and the ocean_data_parser.parsers.van_essen.MON and ocean_data_parser.pme.minidot.minidot_txts would be helpful for all that.

timvdstap commented 2 months ago

Sounds good, thanks for the clarification @JessyBarrette!

I was thinking all of them within the same row within the csv, perhaps white space separated. It's more just a way to links samples to when they were collected. Distributing them on multiple rows would end up having a lot of duplicated rows for anything but the sample_ids column. Though I leave it to you to decide.

My gut feel is to have unique sample_id per row, but that may be just because Darwin Core is very focused on tidy data. There was talk of having some of the data perhaps be mobilized to OBIS (though not entirely sure what data would be best served...) Assuming harvesting by ERDDAP servers, having multiple sample_ids within a single cell does not complicate things?

Create a new file (csv) with the survey name and a suffix like (_FINAL.csv) which is the aggregation of all those data.

Can you clarify what you mean by 'all those data'? Would it be all the derived data (i.e. that can be associated with a sample_id? Timestamped data collected by the miniDOT and ctd-diver is not given a sample_id correct?

fostermh commented 2 months ago

I know nothing about this project, but.. if you find you have a large amount of repeated data in a table (csv), perhaps you need two tables (csv's).

JessyBarrette commented 2 months ago

My gut feel is to have unique sample_id per row,

I'll leave up to you to decided, it can be fairly straight foward to stack lists of values within a row to a tall format programmatically. I guess my main intention is to simplify as much as possible the work of the user which is filling those survey logs. having to rewrite multiple times the same row just to have separated samples IDs seems not optimal and pontentially prone to errors.

This is mostly to access the data generated by the user. What ever comes out of it to be redirected to OBIS or ERDDAP or ... is a separated topic. In the best world a webform like our magic devices would be ideal but I don't think we want to go that way with that project.

Another option, is to have for each sample type a sample_id column:

station	collection_time	depth	bottle	nutrient_sample_id	chl_sample_id

the main problem with this is how you deal with replicated samples and potentially many other issues I can't think of.

JessyBarrette commented 2 months ago

I've added a stations.csv and a survey-template.csv file to the repo. Feel free to change those for whatever you think is best.

timvdstap commented 2 months ago

I'll leave up to you to decided, it can be fairly straight foward to stack lists of values within a row to a tall format programmatically. I guess my main intention is to simplify as much as possible the work of the user which is filling those survey logs. having to rewrite multiple times the same row just to have separated samples IDs seems not optimal and pontentially prone to errors.

That's a fair point; thanks for making those template files Jessy!

fostermh commented 2 months ago

This seems very similar to the issues I dealt with when building the EIMS. Rather than reinventing the wheel, I suggest there are some lessons learned from the EIMS that could be applied here.

steviewanders commented 2 months ago

I suggest there are some lessons learned from the EIMS that could be applied here.

Thanks for volunteering lessons from past suffering!

Could you add specifics as they pertain to this? 😅

fostermh commented 2 months ago

a few things we discussed:

collect instrument data in UTC time
add a timezone to the paper field forms as data collection will be "world wide"
how to identify a sampling 'event'
- As each participant will get one 'kit' the event can be [kit #]_[event #]
- a sampling event will be at a site, in a date/time range (start to end), at a depth
- this will allow linking of bottle samples with instrument data
Initially the thinking was to include all bottle sample id's on the same line as the survey data in the survey.csv file
- This could work if the only metadata required for the bottle samples are their id's
- if more then just the id is required, ie any other metadata, then the bottle samples should be recorded in their own csv and the event id must be included as one of the columns.
There was some discussion around including significant metadata in event and sample id's. I consider this to be a bad idea.
- I think the suggestion of basing bottle sample id's on the event id is reasonable
- nutrient id example: [kit #]_[event #]_NUT_[replicate #]

@timvdstap may have notes of other things we discussed but that is what I can remember from our meetings

HakaiInstitute / GEM-in-a-box-dataset-repository-template