Document the fields in the ena-accession-numbers-batch-XXX files

emo-bon / sequencing-data

The files controlling and describing the sequencing metadata

Apache License 2.0


Document the fields in the ena-accession-numbers-batch-XXX files #15

Closed cymon closed 1 month ago

cymon commented 1 month ago

Could we have some brief documentation for the different accession numbers in this table?

shipment/batch-001/ena-accession-numbers-batch-001.csv

kmexter commented 1 month ago

What would you like to see wrt documentation? And while I have your attention: we want to add the run accession numbers to that same file once they exist. I suggest one column for the run accession number per gene type? If so, can you remind me what the "gene types" are that we are producing? Metagenomics is one "gene type"; for the metabarcoding we have 18S, COI, ITS (?) for the ARMS and ?? for Wa and So?

cpavloud commented 1 month ago

What kind of documentation? Do you need something other than what is included here?

cpavloud commented 1 month ago

@kmexter We have the metagenomics and we also have the metabarcoding for the 18S and the metabarcoding for the COI. No ITS.

cymon commented 1 month ago

> what would you like to see wrt documentation?

Christina just supplied this information:

ena_accession_number_sample: ENA Accession Number of sequence data
biosamples_accession_number: BioSamples Accession Number
ena_accession_number_project: Observatory ENA Project Accession Number
ena_accession_number_umbrella: EMO BON Project Accession Number

> And while I have your attention, we want to add the run accession numbers to that same file

What is the "run accession number" field? In which table does it occur?

The table as it is currently defined has the "ref_code", which is the unique identifier in "run-information-batch-001.csv", and the "source_material_id", the (not, but should be) unique identifier linking it to the Google logsheets. I don't think it needs anything else, or rather, adding another identifier would be redundant.
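To make the linkage concrete, here is a minimal sketch of joining the two tables on "ref_code" and checking whether "source_material_id" really is unique. Only the column names "ref_code", "source_material_id" and "ena_accession_number_sample" come from this thread; the function names, the "reads_count" column and all sample values are made up for illustration, and the real files may look different.

```python
import csv
import io
from collections import Counter


def duplicated_values(rows, column):
    """Return the values of `column` that occur more than once, sorted."""
    counts = Counter(row[column] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def join_on_ref_code(accession_rows, run_rows):
    """Merge each accession row with the run-information row sharing its ref_code.

    Accession rows with no matching ref_code in run_rows are skipped.
    """
    runs_by_ref = {row["ref_code"]: row for row in run_rows}
    joined = []
    for row in accession_rows:
        match = runs_by_ref.get(row["ref_code"])
        if match is not None:
            joined.append({**match, **row})
    return joined


# Hypothetical sample data; real files would be read with open(path, newline="").
accessions_csv = """ref_code,source_material_id,ena_accession_number_sample
REF_0001,SITE_A_sample_1,ACC_0001
REF_0002,SITE_A_sample_1,ACC_0002
"""
runs_csv = """ref_code,reads_count
REF_0001,1000
REF_0002,2000
"""

accession_rows = list(csv.DictReader(io.StringIO(accessions_csv)))
run_rows = list(csv.DictReader(io.StringIO(runs_csv)))

joined = join_on_ref_code(accession_rows, run_rows)
print(len(joined))
# Any value listed here breaks the "should be unique" expectation.
print(duplicated_values(accession_rows, "source_material_id"))
```

In this made-up data both rows share one "source_material_id", so the second print reports it as a duplicate, while the join on "ref_code" still works.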

> once they exist I suggest one column for the run accession number per gene type? If so, can you remind me what the "gene types" are that we are producing? metagenomics is one "gene type" for the metabarcoding we have 18S, COI, ITS (?) for the ARMS and ?? for Wa and So?

As you note, this is metagenomics, not metabarcoding. So there is only one "type" of sequence in metagenomics, and these are not genes; they are just sequence reads.

cymon commented 1 month ago

> What kind of documentation? Do you need something other than what is included here?

Nope that's what I was looking for...

cpavloud commented 1 month ago

I would make some corrections, if we want to be 1-1 with the ENA definitions

biosamples_accession_number: Sample Accession Number / Biosample Accession Number (example)
ena_accession_number_sample: Secondary Sample Accession Number (example)
--> this is like an umbrella sample accession in ENA
ena_accession_number_project: Study Accession Number / Project (example)
--> In ENA, project == study
--> For EMO BON, there will be one project per observatory
--> Each project is a component project under the EMO BON Umbrella Study, else known as the Parent Project
ena_accession_number_umbrella: EMO BON Umbrella Study Number
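For anyone who wants these definitions next to the data, here is a rough sketch of keeping them in code and checking a file's header against them. The four column names and their descriptions are taken from the list above; the helper name `undocumented_columns`, the sample header, and the "run_accession_number" column (standing in for the proposed addition) are assumptions for illustration only.

```python
import csv
import io

# Descriptions follow the ENA-aligned wording suggested in this thread.
FIELD_DESCRIPTIONS = {
    "biosamples_accession_number": "Sample Accession Number / Biosample Accession Number",
    "ena_accession_number_sample": "Secondary Sample Accession Number",
    "ena_accession_number_project": "Study Accession Number / Project (one per observatory)",
    "ena_accession_number_umbrella": "EMO BON Umbrella Study Number",
}

# Identifier columns mentioned in the thread; they are described elsewhere.
IDENTIFIER_COLUMNS = ("ref_code", "source_material_id")


def undocumented_columns(header, documented, ignore=()):
    """Return the header columns that have no description, in file order."""
    return [col for col in header if col not in documented and col not in ignore]


# Hypothetical header line; a real check would read the first row of the CSV file.
sample = (
    "ref_code,source_material_id,ena_accession_number_sample,"
    "biosamples_accession_number,ena_accession_number_project,"
    "ena_accession_number_umbrella,run_accession_number\n"
)
header = next(csv.reader(io.StringIO(sample)))

missing = undocumented_columns(header, FIELD_DESCRIPTIONS, ignore=IDENTIFIER_COLUMNS)
print(missing)
```

A check like this would flag any newly added column, such as a run accession number, until a description is written for it.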

kmexter commented 1 month ago

OK, so @cymon what would you like to see that is different to what we have now? Slightly update the descriptions in that table (following Christina's suggestions), or add something in a README.md?

cymon commented 1 month ago

Either, both, or none... I'm good.

Just wanted to know the definition of the fields. The naming of the file "run-information-batch-001_column-descriptions.csv" is misleading, as it also contains descriptions of the fields in the "ena-accession-numbers-batch-001.csv" file, which I hadn't realised. But no big deal.

I've closed this issue above...