emo-bon / governance-data

Holds the governance content for the emo-bon data management

change the source mat id for sediment? #28

Open kmexter opened 1 week ago

kmexter commented 1 week ago

Currently all of the sediment logsheets except one have this equation in the final column of the sampling tab (the source_mat_id): =CONCATENATE(observatory!$A$2,"_",H2,"_",L2,"_",N2)

For EM21 (https://docs.google.com/spreadsheets/d/1sBB0x6h-prnUHMcOfms_7qSqTn6zeGVriitTysceoT0/edit?gid=124596284#gid=124596284), however, the equation is =CONCATENATE(observatory!$A$2,"_",H2,"_",L2,"_",P2,"mm","_",N2), which differs in that it includes size_frac_up. It needs to, because otherwise it would have rows with the same ID (rows 10 and 14, for example), which is a big no-no. The suggestion is that we implement this longer equation in all the sediment logsheets.
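To make the collision concrete, here is a minimal Python sketch of the two equations. It is only an illustration: the field names (`h`, `l`, `n` for columns H, L and N, `size_frac_up` for column P), the `EMOBON_XXX` prefix and the row values are all hypothetical and not taken from the EM21 sheet, and the `_` separator is assumed from the existing IDs.

```python
from collections import Counter

def short_id(prefix, h, l, n):
    """Current equation: prefix plus columns H, L and N, joined by underscores."""
    return "_".join([prefix, h, l, n])

def long_id(prefix, h, l, size_frac_up, n):
    """EM21 equation: as above, with size_frac_up (column P) plus 'mm' before N."""
    return "_".join([prefix, h, l, f"{size_frac_up}mm", n])

def duplicates(ids):
    """Return the IDs that occur more than once, in first-seen order."""
    counts = Counter(ids)
    return [i for i in counts if counts[i] > 1]

# Hypothetical rows: same event and replicate, different upper size fraction.
rows = [
    {"h": "So", "l": "210901", "size_frac_up": "1", "n": "1"},
    {"h": "So", "l": "210901", "size_frac_up": "0.5", "n": "1"},
]
prefix = "EMOBON_XXX"  # placeholder, not a real observatory ID

short_ids = [short_id(prefix, r["h"], r["l"], r["n"]) for r in rows]
long_ids = [long_id(prefix, r["h"], r["l"], r["size_frac_up"], r["n"]) for r in rows]

print(duplicates(short_ids))  # the two rows collide under the short equation
print(duplicates(long_ids))   # no collision once size_frac_up is included
```

With these made-up rows the short equation gives the same ID twice, while the longer one keeps them apart.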

To be considered: the events from this station go back to 2021, so we have to change it for all sediment samples for all observatories. That means changing the IDs for the batches already in ENA (if there are any sediment samples there) and the IDs in the governance spreadsheets in GH.

melinalou commented 1 week ago

I do not understand the last sentence: "the events from this station goes back to 2021, so we have to change it for all samples from sediment for all observatories: that means changing the IDs for the batches in ENA already (if there are any sediment there) and the IDs in the governance spreadsheets in GH".

What I do understand is that I need to change the equation of source_mat_id in all the sediment logsheets except for EM21.

kmexter commented 1 week ago

Well, we need to decide IF we should change. Changing the equation in the logsheet (and the description and example) is easy (albeit cumbersome), but we also have to change the IDs that are in ENA (Christina has to do that) and then the IDs in the various governance files (I have to do that). For me it is OK to make that change, but Christina - are you OK with changing the IDs in ENA? And Cymon - you may also have to change the IDs in your file where you have the mat_samp_id vs genoscope ID information?
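If the change goes ahead, one way to keep ENA, the governance files and the genoscope file in step might be to build a single old-to-new ID table and apply it everywhere. A rough Python sketch, in which the field names `old_id` and `new_id` and the example IDs are hypothetical:

```python
def build_id_map(rows):
    """Map each old source_mat_id to its new one; refuse ambiguous old IDs."""
    mapping = {}
    for row in rows:
        old, new = row["old_id"], row["new_id"]
        if old in mapping and mapping[old] != new:
            # An old ID shared by two rows cannot be renamed automatically.
            raise ValueError(f"old ID {old!r} maps to more than one new ID")
        mapping[old] = new
    return mapping

def apply_id_map(ids, mapping):
    """Replace IDs found in the mapping; leave all others untouched."""
    return [mapping.get(i, i) for i in ids]

# Made-up example: one sediment ID changes, one water ID stays as it is.
id_map = build_id_map([{"old_id": "OBS_So_210901_1", "new_id": "OBS_So_210901_1mm_1"}])
print(apply_id_map(["OBS_So_210901_1", "OBS_Wa_210901_1"], id_map))
```

Note that rows which currently share an ID would trip the `ValueError` here, so those would still need to be sorted out by hand.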

cymon commented 1 week ago

Yes, the "source_mat_id" must be unique - as far as I am aware there are only 2 duplicates in the sheets at the moment: "EMOBON_ROSKOGO_Wa_210618_3um_1", "EMOBON_PiEGetxo_Wa_210824_3um_blank"

Provided the "source_mat_id" of an observatory sampling event is the same in the "sampling" and "measured" sheets, and it matches the "source_material_id"* in the "run-information-batch-00(X).csv" sheet, all will be well.
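As a sketch of how those conditions could be checked, assuming the ID columns have already been read into plain Python lists (the function name, the report keys and the example IDs are all made up for illustration):

```python
from collections import Counter

def check_ids(sampling_ids, measured_ids, run_info_ids):
    """Report IDs that would break the conditions described above."""
    sampling, measured = set(sampling_ids), set(measured_ids)
    counts = Counter(sampling_ids)
    return {
        # IDs used for more than one row of the sampling sheet
        "duplicated_in_sampling": sorted(i for i, c in counts.items() if c > 1),
        # IDs present in one of the two observatory sheets but not the other
        "only_in_sampling": sorted(sampling - measured),
        "only_in_measured": sorted(measured - sampling),
        # run-information IDs with no counterpart in the sampling sheet
        "run_info_not_in_sampling": sorted(set(run_info_ids) - sampling),
    }

# Made-up IDs, for illustration only.
report = check_ids(
    sampling_ids=["OBS_A_1", "OBS_A_2", "OBS_A_2"],
    measured_ids=["OBS_A_1", "OBS_A_2"],
    run_info_ids=["OBS_A_1", "OBS_A_3"],
)
print(report)
```

If every list in the report is empty, the three sources agree.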

Please also change the "source_material_id" in the Bergen "sampling" and "measured" sheets back to "source_mat_id" so that it is consistent with the other observatories: https://github.com/emo-bon/observatory-profile/issues/16

C.

--


Cymon J. Cox

Senior Researcher Plant Systematics and Bioinformatics Digital Laboratory Centro de Ciencias do Mar (CCMAR) - CIMAR-Lab. Assoc.

Mailing address: CCMAR - Centro de Ciencias do Mar, Universidade do Algarve Campus de Gambelas Edif. 7 8005-139 Faro Portugal

Phone: +351 289800051 ext 7380 Fax: +351 289800051 Email: @.***

@CCMAR https://ccmar.ualg.pt/users/cymon Google Scholar https://scholar.google.co.uk/citations?user=f5M7DhkAAAAJ&hl=en&oi=ao Scopus http://www.scopus.com/inward/authorDetails.url?authorID=7402112716&partnerID=MN8TOARS
Orcid http://orcid.org/0000-0002-4927-979X CienciaVitae

https://www.cienciavitae.pt/6B15-9771-1D04 GPG: Public key on keyserver.ubuntu.com


cymon commented 1 week ago

Note also that HCMR-1 take more than one "blank" sample at each event. Consequently they have to have "blank1" and "blank2" in the replicate field, else the "source_mat_id"s would be the same.

https://docs.google.com/spreadsheets/d/13DcVK2mzSxMJoFydSBaIMmj7Td1_JapEvcY2bmZTyLc/edit?gid=1225064690#gid=1225064690

They are the only station to do this and it causes consistency problems that require ad hoc solutions. It would be better if the second blank replicate were not recorded as a "full" sampling event (i.e. just noted as a duplicate of this "source_mat_id").

C.

* On the "source_material_id" mentioned in my previous comment: in the run-information sheets it should be renamed "source_mat_id" so that we know it's the same data/key/identifier. It doesn't really matter, but it makes things easier to interpret.

cpavloud commented 1 week ago

It will be easier to do it now (while not all the batches are in ENA) than later. I'm not happy about it because I have to change them one by one, but what can we do :) Either way, it is very likely that I will have to do it anyhow because of the size fraction issues (see here).

By the way @melinalou, I saw that the size fraction definitions and values were swapped (so wrong) in the sediment logsheets too. So we should make the same changes in the sediment logsheets that were already made in the water logsheets.

kmexter commented 1 week ago

Note also that HCMR-1 take more than one "blank" sample at each event. Consequently they have to have "blank1" and "blank2" in the replicate field, else the "source_mat_id"s would be the same. https://docs.google.com/spreadsheets/d/13DcVK2mzSxMJoFydSBaIMmj7Td1_JapEvcY2bmZTyLc/edit?gid=1225064690#gid=1225064690 They are the only station to do this and it causes consistency problems that require ad hoc solutions. It would be better if the second blank replicate were not recorded as a "full" sampling event (ie just say duplicate of this "source_mat_id"). C.


They are recorded as blank1 and blank2 in the googlesheets... @cymon can you remind me which files you are looking at (give me the URLs)? I know that once we have finally decided on the mat samp ids we will need to update https://github.com/emo-bon/sequencing-data/blob/main/shipment/batch-001/ena-accession-numbers-batch-001.csv and https://github.com/emo-bon/sequencing-data/blob/main/shipment/batch-001/run-information-batch-001.csv (and ditto for all batches). Is that enough? Is it one of these that you use to match to MGF?

cpavloud commented 1 week ago

@cymon this issue is already discussed here

melinalou commented 1 week ago

@cpavloud I am sorry for this. So in the sediment logsheets we should have the definitions in the original format? Before changing the size frac up and low? If so, I could bring them back to their original form. If not, I can make the changes needed, to reverse the columns size_frac_up and low as I've done in the sampling logsheets.

cpavloud commented 1 week ago

Oh sorry, I just saw that you have updated the definitions. Let me check one more thing.

cymon commented 1 week ago

On Wed, 4 Sept 2024 at 11:56, Katrina Exter @.***> wrote:

They are recorded as blank1 and blank2 in the googlesheets... @cymon can you remind me which files you are looking at (give me the URLs) - I know that once we have finally decided on the mat samp ids we will need to update https://github.com/emo-bon/sequencing-data/blob/main/shipment/batch-001/ena-accession-numbers-batch-001.csv and https://github.com/emo-bon/sequencing-data/blob/main/shipment/batch-001/run-information-batch-001.csv (and ditto for all batches), is that enough, is it one of these that you use to match to MGF?

Yes, I use the run-information-batch-00{X}.csv files to match the "source_mat(erial)_id"s from the sampling sheets and get the ref_codes:

https://raw.githubusercontent.com/emo-bon/sequencing-data/main/shipment/batch-001/run-information-batch-001.csv

https://raw.githubusercontent.com/emo-bon/sequencing-data/main/shipment/batch-002/run-information-batch-002.csv
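For anyone wanting to reproduce that matching, here is a small Python sketch of the idea. It is not the actual script: the column headers `source_material_id` and `ref_code` are assumptions about the run-information files, and the two-row CSV and the IDs are invented.

```python
import csv
import io

def ref_codes_by_id(csv_text, id_column="source_material_id", ref_column="ref_code"):
    """Build a source ID -> ref_code lookup from the text of a run-information CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[id_column]: row[ref_column] for row in reader}

def match_ids(sampling_ids, lookup):
    """Split sampling-sheet IDs into those with a ref_code and those without."""
    matched = {i: lookup[i] for i in sampling_ids if i in lookup}
    unmatched = [i for i in sampling_ids if i not in lookup]
    return matched, unmatched

# Invented two-row file; the real files have more columns.
example_csv = "ref_code,source_material_id\nREF_0001,OBS_A_1\nREF_0002,OBS_A_2\n"
lookup = ref_codes_by_id(example_csv)
matched, unmatched = match_ids(["OBS_A_1", "OBS_A_9"], lookup)
print(matched, unmatched)
```

Any ID that changes in the logsheets but not in the run-information file would show up in `unmatched`.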

C.


cpavloud commented 1 week ago

Ok, so @melinalou you have updated all the definitions, thank you very much for this.

I did notice though that on EMO_BON_Metadata_Soft_Sediment_EMT21_UVigo, the size_frac_low and size_frac_up columns should be swapped. E.g. in line 6 (but in general, throughout the file), size_frac_low should be 0.5 and size_frac_up should be 1 and so on

Also in EMO_BON_Metadata_Soft_Sediment_UMF_UmU
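A quick way to spot (and fix) swapped rows could look like the Python sketch below. It assumes each row is a dict of strings keyed by the column names `size_frac_low` and `size_frac_up`, and that non-numeric cells such as NA should be left alone; the example row is made up.

```python
def fix_size_fractions(row):
    """Return a copy of row with size_frac_low <= size_frac_up when both are numeric."""
    fixed = dict(row)
    try:
        low, up = float(row["size_frac_low"]), float(row["size_frac_up"])
    except ValueError:
        return fixed  # e.g. "NA": nothing to compare, leave the row as it is
    if low > up:
        # The two values are swapped: put the smaller one in size_frac_low.
        fixed["size_frac_low"] = row["size_frac_up"]
        fixed["size_frac_up"] = row["size_frac_low"]
    return fixed

swapped = {"size_frac_low": "1", "size_frac_up": "0.5"}
print(fix_size_fractions(swapped))
```

Rows that are already in the right order, or that contain NA, come back unchanged.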

@kmexter the thing is that the vast majority of size fractions in the sediment logsheets are NA (and my guess is that this is why the IDs did not originally include the size fractions). So how are we going to update the equation to create the IDs if we only have NAs?
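To illustrate the question, here is a short Python sketch of what the longer equation would give for an NA row, next to one hypothetical alternative that skips the size part when it is NA. Neither is a decision; the field names, the `EMOBON_XXX` prefix, the values and the `_` separator are assumptions for the example.

```python
def long_id(prefix, h, l, size_frac_up, n):
    """Longer equation, taken literally: NA ends up in the ID as 'NAmm'."""
    return "_".join([prefix, h, l, f"{size_frac_up}mm", n])

def long_id_skip_na(prefix, h, l, size_frac_up, n):
    """Hypothetical variant: leave the size part out when size_frac_up is NA."""
    parts = [prefix, h, l]
    if size_frac_up != "NA":
        parts.append(f"{size_frac_up}mm")
    parts.append(n)
    return "_".join(parts)

literal = long_id("EMOBON_XXX", "So", "210901", "NA", "1")
skipped = long_id_skip_na("EMOBON_XXX", "So", "210901", "NA", "1")
print(literal)
print(skipped)
```

Under these assumptions, the second variant would leave the IDs of NA rows exactly as they are today, so only rows with a real size fraction would change.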

melinalou commented 1 week ago

Yes, that's true, I have just realised it. It's my fault, because I thought there were no such columns in the sediment logsheets (I had seen specific cards that have NA in every cell of these columns, and I fixated on that). I will go through each one to correct these columns and come back to confirm when it is done.

cymon commented 1 week ago

(Parenthetically: does OOB really not have a water_column observatory https://github.com/emo-bon/governance-data/blob/main/logsheets.csv? I thought they were mandatory?)


melinalou commented 1 week ago

@cpavloud the changes have been made! I hope everything is correct now.

cpavloud commented 1 week ago

@cymon you are right, but still there are observatories with no water sampling.