JessyBarrette opened this issue 2 months ago
Thanks @JessyBarrette!
ERDDAP can then scan through all the subdirectories and look for those specific files (*_FINAL.csv) and serve them. Metadata associated with this dataset should be made available within the base of the directory, ideally with a dataset.xml file; you could use the citation.cff to generate ACDD equivalents to be used by ERDDAP.
Based on the conversation with DFO et al yesterday, the ERDDAP involved will likely be 'theirs', so I would recommend leaving the related steps and data out of this and allowing whoever is in charge of that to work on those elements and workflows.
- Load the survey.csv file, the instrument data and the derived sample data.
- Match each derived sample result to a survey row using the sample_id.
- Match the location where the data was retrieved by matching the station column against the station log.
- Finally, match each survey row to the nearest-in-time (after?) record recorded by the minidot and ctd-diver.
- Create a new file (csv) with the survey name and a suffix like (_FINAL.csv) which is the aggregation of all those data.
@timvdstap I am guessing these steps would have likely been done by @JessyBarrette in Python at some stage. Are you comfortable automating tasks like this in R? Or is this something I/others should do?
Based on the conversation with DFO et al yesterday, the ERDDAP involved will likely be 'theirs', so I would recommend leaving the related steps and data out of this and allowing whoever is in charge of that to work on those elements and workflows.
Ok that sounds good to me; as long as DFO has a consistent way to retrieve the metadata from each repo then it should be alright. That's why I'm thinking the citation.cff file would be a good source which would be useful at multiple places through the lifecycle of the dataset.
That's why I'm thinking the citation.cff file would be a good source which would be useful at multiple places through the lifecycle of the dataset.
Makes sense.
Do you want to add steps to generate a citation.cff to the workflow? Or did I miss it?
It is in the initial checklist https://github.com/HakaiInstitute/GEM-in-a-box-dataset-repository-template/blob/main/.github/ISSUE_TEMPLATE/init-data-repository-body.md, which is an issue that gets generated when the template is used to create a new repo.
Great! Thanks.
Thanks for this @JessyBarrette and @steviewanders !
- Sample the water from the niskin and write all the sample_ids collected
Just to confirm - each sample_id needs to be unique and on a separate row, correct?
All those files are pushed to the repository within the specific survey subdirectory.
Confirmation that the proposed repository structure is as follows (specific for these data files):
root
- root/data/{YYYY-MM-DD}
  - /data/{YYYY-MM-DD}/survey.csv
  - /data/{YYYY-MM-DD}/survey.jpeg
  - /data/{YYYY-MM-DD}/survey_data_FINAL.csv
  - /data/{YYYY-MM-DD}/minidot
    - /data/{YYYY-MM-DD}/minidot/{original file (format)}
    - /data/{YYYY-MM-DD}/minidot/{processed file (format)}
  - /data/{YYYY-MM-DD}/ctd-diver
  - /data/{YYYY-MM-DD}/aquafluor
  - /data/{YYYY-MM-DD}/DR1900
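For what it's worth, a minimal sketch (paths are hypothetical) of listing the *_FINAL.csv files that a harvester such as ERDDAP would be pointed at under this layout:

```python
from pathlib import Path

# Hypothetical location of the repository checkout.
repo_root = Path(".")

# Each survey subdirectory data/{YYYY-MM-DD} is expected to contain one
# aggregated *_FINAL.csv alongside survey.csv, survey.jpeg and the
# instrument subfolders (minidot, ctd-diver, aquafluor, DR1900).
for final_csv in sorted(repo_root.glob("data/*/*_FINAL.csv")):
    print(final_csv)  # e.g. data/2000-01-01/survey_data_FINAL.csv
```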
Ok that sounds good to me; as long as DFO has a consistent way to retrieve the metadata from each repo then it should be alright. That's why I'm thinking the citation.cff file would be a good source which would be useful at multiple places through the lifecycle of the dataset.
Sorry I'm not quite familiar with how citation.cff will help ensure that metadata from each repo is consistently retrieved at different stages.
- Metadata associated with this dataset should be made available within the base of the directory, ideally with a dataset.xml file; you could use the citation.cff to generate ACDD equivalents to be used by ERDDAP.
I might need some more explanation on how the dataset.xml file is generated/created?
@timvdstap I am guessing these steps would have likely been done by @JessyBarrette in Python at some stage. Are you comfortable automating tasks like this in R? Or is this something I/others should do?
I have joined data tables based on primary keys in R, but haven't automated this process. I would like to take a crack at it, with some guidance perhaps, if this is expected of Hakai for this project.
Just to confirm - each sample_id needs to be unique and on a separate row, correct?
I was thinking all of them within the same row within the csv, perhaps whitespace separated. It's more just a way to link samples to when they were collected. Distributing them on multiple rows would end up having a lot of duplicated rows for everything but the sample_ids column. Though I leave it to you to decide.
Confirmation that the proposed repository structure is as follows (specific for these data files):...
Yes that's what I meant, though forget about the dataset.xml: assuming that all those resulting files are using the same column names, we don't need to have a specific one there.
citation.cff
Unless you want them to fill a Hakai metadata record each time they generate a new repo, I was thinking the citation.cff was the more general way to capture most of the metadata associated with a dataset. The Hakai form is better though; up to you to decide which one you want them to use. Ideally a citation.cff file should be made available here. If you use the Hakai form, then the ERDDAP xml can be taken from https://github.com/HakaiInstitute/hakai-metadata-entry-form-files
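To make the citation.cff idea a bit more concrete, here's a rough sketch (the CFF keys are the standard ones, but the CFF-to-ACDD mapping and the file name CITATION.cff are my assumptions, not an agreed convention) of pulling a few global attributes out of it with Python:

```python
import yaml  # PyYAML

# Read the citation.cff sitting at the root of the data repository.
with open("CITATION.cff") as f:
    cff = yaml.safe_load(f)

# Map a handful of standard CFF keys onto ACDD-style global attributes.
# This mapping is only a guess at what an ERDDAP dataset.xml would want.
acdd = {
    "title": cff.get("title", ""),
    "summary": cff.get("abstract", ""),
    "creator_name": ", ".join(
        f"{a.get('given-names', '')} {a.get('family-names', '')}".strip()
        for a in cff.get("authors", [])
    ),
    "keywords": ", ".join(cff.get("keywords", [])),
    "license": cff.get("license", ""),
}
print(acdd)
```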
I might need some more explanation on how the dataset.xml file is generated/created?
No dataset.xml will be generated here, but if all the different projects are producing similar csv files, a single dataset.xml in which we change only the metadata (global attributes) would be fine to serve all those datasets.
I have joined data tables based on primary keys in R, but haven't automated this process. I would like to take a crack at it, with some guidance perhaps, if this is expected of Hakai for this project.
Not sure how it can be done in R, but in Python with pandas, pd.merge_asof and pd.merge would be your friends :)
pd.read_csv, along with ocean_data_parser.parsers.van_essen.MON and ocean_data_parser.pme.minidot.minidot_txts, would also be helpful for all that.
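Something along these lines, for example (file paths and most column names are placeholders; only station, collection_time and sample_id follow the survey log discussed here):

```python
import pandas as pd

# Survey log, station list and derived sample results for one survey day.
survey = pd.read_csv("data/2000-01-01/survey.csv", parse_dates=["collection_time"])
stations = pd.read_csv("stations.csv")
samples = pd.read_csv("data/2000-01-01/derived_samples.csv")
minidot = pd.read_csv(
    "data/2000-01-01/minidot/minidot_processed.csv", parse_dates=["time"]
)

# 1. Attach the derived sample results to each survey row via sample_id.
merged = survey.merge(samples, on="sample_id", how="left")

# 2. Attach station coordinates by matching the station column to stations.csv.
merged = merged.merge(stations, on="station", how="left")

# 3. Attach the nearest-in-time minidot record to each survey row.
merged = pd.merge_asof(
    merged.sort_values("collection_time"),
    minidot.sort_values("time"),
    left_on="collection_time",
    right_on="time",
    direction="nearest",  # or "forward" if only records after collection count
)

# 4. Write the aggregated file with the agreed suffix.
merged.to_csv("data/2000-01-01/survey_data_FINAL.csv", index=False)
```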
Sounds good, thanks for the clarification @JessyBarrette!
I was thinking all of them within the same row within the csv, perhaps whitespace separated. It's more just a way to link samples to when they were collected. Distributing them on multiple rows would end up having a lot of duplicated rows for everything but the sample_ids column. Though I leave it to you to decide.
My gut feeling is to have a unique sample_id per row, but that may just be because Darwin Core is very focused on tidy data. There was talk of having some of the data perhaps be mobilized to OBIS (though I'm not entirely sure what data would be best served...). Assuming harvesting by ERDDAP servers, doesn't having multiple sample_ids within a single cell complicate things?
Create a new file (csv) with the survey name and a suffix like (_FINAL.csv) which is the aggregation of all those data.
Can you clarify what you mean by 'all those data'? Would it be all the derived data (i.e. data that can be associated with a sample_id)? Timestamped data collected by the miniDOT and ctd-diver is not given a sample_id, correct?
I know nothing about this project, but.. if you find you have a large amount of repeated data in a table (csv), perhaps you need two tables (csv's).
My gut feeling is to have a unique sample_id per row,
I'll leave it up to you to decide; it can be fairly straightforward to stack lists of values within a row into a tall format programmatically. I guess my main intention is to simplify as much as possible the work of the user who is filling those survey logs. Having to rewrite the same row multiple times just to have separate sample_ids seems not optimal and potentially prone to errors.
This is mostly to access the data generated by the user. Whatever comes out of it to be redirected to OBIS or ERDDAP or ... is a separate topic. In a perfect world a webform like our magic devices would be ideal, but I don't think we want to go that way with this project.
Another option is to have a sample_id column for each sample type:
station | collection_time | depth | bottle | nutrient_sample_id | chl_sample_id |
---|---|---|---|---|---|
The main problem with this is how you deal with replicated samples, and potentially many other issues I can't think of.
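For reference, a minimal sketch (assuming the whitespace-separated sample_ids column suggested above; the path and column name are hypothetical) of stacking the survey log into a tall, one-sample_id-per-row format:

```python
import pandas as pd

survey = pd.read_csv("data/2000-01-01/survey.csv")

# Split the whitespace-separated sample_ids cell into a list, then give
# each sample_id its own row; the other survey columns get duplicated
# programmatically rather than by the person filling in the log.
tall = (
    survey.assign(sample_id=survey["sample_ids"].str.split())
    .explode("sample_id")
    .drop(columns="sample_ids")
)
```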
I've added a stations.csv and a survey-template.csv file to the repo. Feel free to change those for whatever you think is best.
I'll leave it up to you to decide; it can be fairly straightforward to stack lists of values within a row into a tall format programmatically. I guess my main intention is to simplify as much as possible the work of the user who is filling those survey logs. Having to rewrite the same row multiple times just to have separate sample_ids seems not optimal and potentially prone to errors.
That's a fair point; thanks for making those template files Jessy!
This seems very similar to the issues I dealt with when building the EIMS. Rather than reinventing the wheel, I suggest there are some lessons learned from the EIMS that could be applied here.
I suggest there are some lessons learned from the EIMS that could be applied here.
Thanks for volunteering lessons from past suffering!
Could you add specifics as they pertain to this? 😅
A few things we discussed:
- [kit #]_[event #]
- [kit #]_[event #]_NUT_[replicate #]
@timvdstap may have notes of other things we discussed, but that is what I can remember from our meetings.
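If those are the intended sample_id patterns, a tiny sketch (values are made up) of how they could be constructed:

```python
# Purely illustrative values; the real kit/event/replicate numbering is TBD.
kit, event, replicate = 3, 12, 1

event_sample_id = f"{kit}_{event}"                         # [kit #]_[event #]
nutrient_sample_id = f"{event_sample_id}_NUT_{replicate}"  # [kit #]_[event #]_NUT_[replicate #]
print(event_sample_id, nutrient_sample_id)  # 3_12 3_12_NUT_1
```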
Here's an example of the common log used to track the stations surveyed for a specific project, and of the survey log.
Each repo should have a single stations.csv
For cruises/surveys, a survey log should be saved within the survey subdirectory, e.g. data/2000-01-01-Example_Survey/survey.csv
Derived sampled data
where:
Suggested Workflow
A user should add all the station coordinates they plan to survey for this project.
Field steps
sample_time_collection
In the lab
derived sample results files
Processing that data
All those files are pushed to the repository within the specific survey subdirectory.
Once all that data is available, the suggested workflow would be: