HakaiInstitute / GEM-in-a-box-dataset-repository-template


Add reference station and survey log to repo #3

Open JessyBarrette opened 1 week ago

JessyBarrette commented 1 week ago

Here's an example of the common logs used to track the stations surveyed for a project and the survey log.

Each repo should have a single stations.csv with the following columns:

| station | latitude | longitude | commissioned_time | decommissioned_time | comments | ... |
| --- | --- | --- | --- | --- | --- | --- |

For cruises/surveys, a survey log should be saved as data/2000-01-01-Example_Survey/survey.csv with the following columns:

| station | depth | collection_time | niskin_id | sample_ids | comments | ... |
| --- | --- | --- | --- | --- | --- | --- |

Derived sample data:

| sample_id | result | data | ... | comments |
| --- | --- | --- | --- | --- |

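For illustration only, here is a minimal pandas sketch of what rows in these files could look like, assuming the column names proposed above; the station name, sample IDs, values, and the chlorophyll.csv file name are made-up placeholders.

```python
# Made-up example rows; only the column names follow the proposal above.
from pathlib import Path

import pandas as pd

Path("data/2000-01-01-Example_Survey").mkdir(parents=True, exist_ok=True)

stations = pd.DataFrame([{
    "station": "STN1", "latitude": 50.123, "longitude": -125.456,
    "commissioned_time": "2000-01-01T00:00:00Z", "decommissioned_time": None,
    "comments": "",
}])
stations.to_csv("stations.csv", index=False)

survey = pd.DataFrame([{
    "station": "STN1", "depth": 5, "collection_time": "2000-01-01T10:15:00Z",
    "niskin_id": "N1", "sample_ids": "CHL-001 NUT-001", "comments": "",
}])
survey.to_csv("data/2000-01-01-Example_Survey/survey.csv", index=False)

# One derived-sample results file per analysis type, e.g. chlorophyll (placeholder name)
chl = pd.DataFrame([{"sample_id": "CHL-001", "result": 1.2, "comments": ""}])
chl.to_csv("data/2000-01-01-Example_Survey/chlorophyll.csv", index=False)
```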

Suggested Workflow

A user should add to stations.csv the coordinates of all the stations they plan to survey for this project.

Field steps

  1. On a survey, a user should fill in a paper copy of the survey.csv file. Take a picture of that sheet and leave a copy within the survey directory.
  2. Go to a sampling site, and let the instruments and the niskin bottle soak at the target depth for at least 5 min.
  3. Send the niskin bottle messenger down and save that time as the collection_time.
  4. Write down the station.
  5. Sample the water from the niskin and write down all the sample_ids collected.
  6. Go to the next site...

In the lab

  1. Download data from the minidot and ctd-diver
  2. Upload those files to the repo in the survey-specific directory
  3. Process all the samples and add the results to the corresponding derived sample results files
  4. Copy all the information from the survey log into the survey.csv

Processing that data

All those files are pushed to the repository within the specific survey subdirectory.

Once all that data is available, the suggested workflow would be:

  1. Load the survey.csv file, the instrument data, and the derived sample data.
  2. Match each derived sample record to a survey row by using the sample_id.
  3. Match the location where the data was retrieved by matching the station column with the station log, and
  4. Finally match each survey row to the nearest-in-time (after?) record recorded by the minidot and ctd-diver.
  5. Create a new file (csv) named after the survey with a suffix like _FINAL.csv, which is the aggregation of all those data (a rough sketch of steps 1-5 is given after this list).
  6. ERDDAP can then scan through all the subdirectories, look for those specific *_FINAL.csv files, and serve them.
  7. Metadata associated with this dataset should be made available within the base of the directory, ideally with a dataset.xml file; you could use the Citation.cff to generate ACDD equivalents to be used by ERDDAP.
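A rough pandas sketch of steps 1-5, assuming the column names proposed above; the chlorophyll.csv and minidot.csv file names are placeholders, and the instrument data is assumed to have already been parsed into a table with a time column:

```python
# Sketch only: file names other than survey.csv and stations.csv are placeholders.
from pathlib import Path

import pandas as pd

survey_dir = Path("data/2000-01-01-Example_Survey")

# 1. Load the survey log, station list, derived sample results and instrument data
survey = pd.read_csv(survey_dir / "survey.csv", parse_dates=["collection_time"])
stations = pd.read_csv("stations.csv")
samples = pd.read_csv(survey_dir / "chlorophyll.csv")                 # derived sample results
instrument = pd.read_csv(survey_dir / "minidot.csv", parse_dates=["time"])

# 2. Match derived sample results to survey rows via sample_id
#    (split the whitespace-separated sample_ids column into one row per id first)
survey = survey.assign(sample_id=survey["sample_ids"].str.split()).explode("sample_id")
merged = survey.merge(samples, on="sample_id", how="left")

# 3. Attach station coordinates from the station log
merged = merged.merge(stations, on="station", how="left")

# 4. Match each survey row to the nearest-in-time instrument record
merged = pd.merge_asof(
    merged.sort_values("collection_time"),
    instrument.sort_values("time"),
    left_on="collection_time",
    right_on="time",
    direction="forward",  # or "nearest"/"backward", depending on what makes sense
)

# 5. Write the aggregated file next to the survey log
merged.to_csv(survey_dir / "2000-01-01-Example_Survey_FINAL.csv", index=False)
```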
steviewanders commented 1 week ago

Thanks @JessyBarrette!

> ERDDAP can then scan through all the subdirectories, look for those specific *_FINAL.csv files, and serve them. Metadata associated with this dataset should be made available within the base of the directory, ideally with a dataset.xml file; you could use the Citation.cff to generate ACDD equivalents to be used by ERDDAP.

Based on the conversation with DFO et al yesterday, the ERDDAP involved will likely be 'theirs', so I would recommend leaving the related steps and data out of this and allowing whoever is in charge of that to work on those elements and workflows.

> Load the survey.csv file, the instrument data, and the derived sample data. Match each derived sample record to a survey row by using the sample_id. Match the location where the data was retrieved by matching the station column with the station log, and finally match each survey row to the nearest-in-time (after?) record recorded by the minidot and ctd-diver. Create a new file (csv) named after the survey with a suffix like _FINAL.csv, which is the aggregation of all those data.

@timvdstap I am guessing these steps would likely have been done by @JessyBarrette in Python at some stage. Are you comfortable automating tasks like this in R, or is this something I/others should do?

JessyBarrette commented 1 week ago

> Based on the conversation with DFO et al yesterday, the ERDDAP involved will likely be 'theirs', so I would recommend leaving the related steps and data out of this and allowing whoever is in charge of that to work on those elements and workflows.

Ok, that sounds good to me. As long as DFO has a consistent way to retrieve the metadata from each repo, it should be alright. That's why I'm thinking the citation.cff file would be a good source, which would be useful at multiple places throughout the lifecycle of the dataset.
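For reference, a minimal citation.cff sketch with placeholder values (the title, author, and keywords below are made up, and the repository URL points at the template itself) that a repo created from the template could ship at its root:

```yaml
cff-version: 1.2.0
message: "If you use this dataset, please cite it as below."
type: dataset
title: "Example GEM-in-a-box survey dataset"   # placeholder title
authors:
  - family-names: "Doe"                        # placeholder author
    given-names: "Jane"
    affiliation: "Hakai Institute"
keywords:
  - oceanography
  - water samples
license: CC-BY-4.0
repository-code: "https://github.com/HakaiInstitute/GEM-in-a-box-dataset-repository-template"
```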

steviewanders commented 1 week ago

> That's why I'm thinking the citation.cff file would be a good source, which would be useful at multiple places throughout the lifecycle of the dataset.

Makes sense.

Do you want to add steps to generate a citation.cff to the workflow? Or did I miss it?

JessyBarrette commented 1 week ago

It is in the initial checklist: https://github.com/HakaiInstitute/GEM-in-a-box-dataset-repository-template/blob/main/.github/ISSUE_TEMPLATE/init-data-repository-body.md

That is an issue that gets generated when the template is used to create a new repo.

steviewanders commented 1 week ago

Great! Thanks.

timvdstap commented 1 week ago

Thanks for this @JessyBarrette and @steviewanders !

> Sample the water from the niskin and write down all the sample_ids collected.

Just to confirm - each sample_id needs to be unique and on a separate row, correct?

> All those files are pushed to the repository within the specific survey subdirectory.

Confirming that the proposed repository structure is as follows (specific to these data files):

- root
  - /data/{YYYY-MM-DD}
    - /data/{YYYY-MM-DD}/survey.csv
    - /data/{YYYY-MM-DD}/survey.jpeg
    - /data/{YYYY-MM-DD}/survey_data_FINAL.csv
    - /data/{YYYY-MM-DD}/minidot
      - /data/{YYYY-MM-DD}/minidot/{original file (format)}
      - /data/{YYYY-MM-DD}/minidot/{processed file (format)}
    - /data/{YYYY-MM-DD}/ctd-diver
    - /data/{YYYY-MM-DD}/aquafluor
    - /data/{YYYY-MM-DD}/DR1900

> Ok, that sounds good to me. As long as DFO has a consistent way to retrieve the metadata from each repo, it should be alright. That's why I'm thinking the citation.cff file would be a good source, which would be useful at multiple places throughout the lifecycle of the dataset.

Sorry I'm not quite familiar with how citation.cff will help ensure that metadata from each repo is consistently retrieved at different stages.

> Metadata associated with this dataset should be made available within the base of the directory, ideally with a dataset.xml file; you could use the Citation.cff to generate ACDD equivalents to be used by ERDDAP.

I might need some more explanation on how the dataset.xml file is generated/created?

> @timvdstap I am guessing these steps would likely have been done by @JessyBarrette in Python at some stage. Are you comfortable automating tasks like this in R, or is this something I/others should do?

I have joined data tables based on primary keys in R, but haven't automated this process. I would like to take a crack at it, perhaps with some guidance, if this is expected of Hakai for this project.

JessyBarrette commented 1 week ago

> Just to confirm - each sample_id needs to be unique and on a separate row, correct?

I was thinking of keeping all of them within the same row of the csv, perhaps whitespace separated. It's more just a way to link samples to when they were collected. Distributing them over multiple rows would end up duplicating a lot of values in every column other than sample_ids. Though I leave it to you to decide.

> Confirming that the proposed repository structure is as follows (specific to these data files): ...

Yes, that's what I meant, though forget about the dataset.xml. Assuming that all those resulting files use the same column names, we don't need a specific one there.

> citation.cff

Unless you want them to fill out a Hakai metadata record each time they generate a new repo, I was thinking that the citation.cff was the more general way to capture most of the metadata associated with a dataset. The Hakai form is better though; up to you to decide which one you want them to use. Ideally a citation.cff file should be made available here. If you use the Hakai form, then the ERDDAP xml can be taken from https://github.com/HakaiInstitute/hakai-metadata-entry-form-files

> I might need some more explanation on how the dataset.xml file is generated/created?

No dataset.xml will be generated here, but if all the different projects are producing similar csv files, a single dataset.xml in which we only change the metadata (global attributes) would be fine to serve all those datasets.

> I have joined data tables based on primary keys in R, but haven't automated this process. I would like to take a crack at it, perhaps with some guidance, if this is expected of Hakai for this project.

Not sure how it would be done in R, but in Python with pandas, pd.merge_asof and pd.merge would be your friends :)
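For example, a toy pd.merge_asof call (the DataFrames, column names, and values below are placeholders) showing the nearest-in-time matching with an optional tolerance:

```python
# Toy example of pd.merge_asof: both tables must be sorted on their time key.
import pandas as pd

survey = pd.DataFrame({"collection_time": pd.to_datetime(["2000-01-01 10:15"]),
                       "station": ["STN1"]})
minidot = pd.DataFrame({"time": pd.to_datetime(["2000-01-01 10:14", "2000-01-01 10:16"]),
                        "oxygen": [7.1, 7.2]})

matched = pd.merge_asof(
    survey.sort_values("collection_time"),
    minidot.sort_values("time"),
    left_on="collection_time",
    right_on="time",
    direction="nearest",              # or "forward" to only match records after collection
    tolerance=pd.Timedelta("10min"),  # discard matches further apart than this
)
print(matched)
```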

JessyBarrette commented 1 week ago

pd.read_csv, along with ocean_data_parser.parsers.van_essen.MON and ocean_data_parser.pme.minidot.minidot_txts, would also be helpful for all that.

timvdstap commented 1 week ago

Sounds good, thanks for the clarification @JessyBarrette!

> I was thinking of keeping all of them within the same row of the csv, perhaps whitespace separated. It's more just a way to link samples to when they were collected. Distributing them over multiple rows would end up duplicating a lot of values in every column other than sample_ids. Though I leave it to you to decide.

My gut feeling is to have a unique sample_id per row, but that may just be because Darwin Core is very focused on tidy data. There was talk of having some of the data perhaps be mobilized to OBIS (though I'm not entirely sure what data would be best served...). Assuming harvesting by ERDDAP servers, does having multiple sample_ids within a single cell not complicate things?

> Create a new file (csv) named after the survey with a suffix like _FINAL.csv, which is the aggregation of all those data.

Can you clarify what you mean by 'all those data'? Would it be all the derived data (i.e. data that can be associated with a sample_id)? Timestamped data collected by the miniDOT and ctd-diver is not given a sample_id, correct?

fostermh commented 1 week ago

I know nothing about this project, but... if you find you have a large amount of repeated data in a table (csv), perhaps you need two tables (csvs).

JessyBarrette commented 1 week ago

> My gut feeling is to have a unique sample_id per row,

I'll leave it up to you to decide; it can be fairly straightforward to stack the lists of values within a row into a tall format programmatically (see the sketch below). I guess my main intention is to simplify as much as possible the work of the user who is filling out those survey logs. Having to rewrite the same row multiple times just to get separate sample IDs seems not optimal and potentially prone to errors.
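A minimal sketch of that wide-to-tall conversion, assuming the whitespace-separated sample_ids column proposed earlier (the values are placeholders):

```python
# Split a whitespace-separated sample_ids cell into one tidy row per sample_id.
import pandas as pd

survey = pd.DataFrame({
    "station": ["STN1"],
    "collection_time": ["2000-01-01T10:15:00Z"],
    "sample_ids": ["CHL-001 NUT-001 NUT-002"],
})

tall = (
    survey.assign(sample_id=survey["sample_ids"].str.split())
    .explode("sample_id")
    .drop(columns="sample_ids")
)
print(tall)  # three rows, one per sample_id, with the other columns repeated
```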

This is mostly about accessing the data generated by the user. Whatever comes out of it to be redirected to OBIS or ERDDAP or ... is a separate topic. In a perfect world, a webform like our magic devices would be ideal, but I don't think we want to go that way for this project.

Another option is to have a sample_id column for each sample type:

| station | collection_time | depth | bottle | nutrient_sample_id | chl_sample_id |
| --- | --- | --- | --- | --- | --- |

The main problem with this is how you deal with replicate samples, and there are potentially many other issues I can't think of.

JessyBarrette commented 1 week ago

I've added a stations.csv and a survey-template.csv file to the repo. Feel free to change those to whatever you think is best.

timvdstap commented 1 week ago

> I'll leave it up to you to decide; it can be fairly straightforward to stack the lists of values within a row into a tall format programmatically. I guess my main intention is to simplify as much as possible the work of the user who is filling out those survey logs. Having to rewrite the same row multiple times just to get separate sample IDs seems not optimal and potentially prone to errors.

That's a fair point; thanks for making those template files Jessy!

fostermh commented 1 week ago

This seems very similar to the issues I dealt with when building the EIMS. Rather than reinventing the wheel, I suggest there are some lessons learned from the EIMS that could be applied here.

steviewanders commented 2 days ago

> I suggest there are some lessons learned from the EIMS that could be applied here.

Thanks for volunteering lessons from past suffering!

Could you add specifics as they pertain to this? 😅

fostermh commented 2 days ago

a few things we discussed:

@timvdstap may have notes of other things we discussed, but that is what I can remember from our meetings.