storage of data on NCBI/EMBL

lzinger commented 4 years ago

Last paragraph of section 1.4: maybe tone this down or reformulate it a bit. There are actually numerous studies whose data are not stored in these repositories, because the repositories were built to host metabarcoding data generated in a particular way (each sequencing library is a different sample). Many scientists use another sample tagging strategy (tagging during the PCR, and therefore sequencing only one library), which makes it impossible to fulfil NCBI/EMBL requirements. In this case, authors use alternative storage facilities, such as Dryad Digital Repository or Zenodo.

dschigel commented 3 years ago

This could be added to the critical review of the sequence repositories requested in https://github.com/gbif/doc-publishing-dna-derived-data/issues/37. However, this is the data behavior we want to discourage: sequence data needs to be added to INSDC, and only then resurfaced (repackaged, enriched with better metadata, etc.) and indexed elsewhere, including general repositories. We need to find good words for that, as the user situations described here may not be uncommon. In general, if the requirements of discovery systems do not fit users' data, far too often users retreat to drop-and-forget solutions as described above. While the data are open, the I and R in FAIR are compromised. I agree this needs reformulation and maybe additions.

CecSve commented 3 years ago

@dschigel do you know which of the INSDC databases to submit to if you have metabarcoding data? As lzinger mentions, I can't see which database works with 'many-samples-per-library', and it is difficult to encourage submission of data when that is not feasible in practice...

dschigel commented 3 years ago

I used https://www.ncbi.nlm.nih.gov/sra and have not heard that much has changed in that respect. For Pensoft journals, including MBMG, the submission instructions for sequence data are here: https://riojournal.com/article/12431/instance/3566238/

A quick check of the author instructions for Molecular Ecology returns a slightly more compact version of the same advice:

"Sequence Data Nucleotide sequence data can be submitted in electronic form to any of the three major collaborative databases: DDBJ, EMBL, or GenBank. It is only necessary to submit to one database as data are exchanged between DDBJ, EMBL, and GenBank on a daily basis. The suggested wording for referring to accession-number information is: ‘These sequence data have been submitted to the DDBJ/EMBL/GenBank databases under accession number U12345’."

CecSve commented 3 years ago

Great - I was about to suggest SRA, but since I haven't tried to submit my data yet I didn't want to encourage use of that database. Then I think the last paragraph in 1.4 is sufficient, but perhaps add the Pensoft submission instructions, since they give a nice overview, e.g. for early-career scientists? I suggest editing the final paragraph in section 1.4 to this (bold = change):

"Note that most holders of genetic sequence data are expected to upload and archive genetic sequence data in raw sequence data repositories such as NCBI’s SRA or EMBL’s ENA. This topic is not covered here, but e.g. Penev et al. (2017) provide a general overview of the importance of data submission and guidelines in connection with scientific publication. Biodiversity data platforms such as ALA, GBIF, and most national biodiversity portals are not archives or repositories for raw sequence reads and associated files. We do, however, stress the importance of maintaining links between such primary data and derived occurrences in Section 2."

dschigel commented 3 years ago

Looks good, a few small edits to your version:

remove "most"
add DDBJ after ENA, as the three pillars of INSDC are equal and the Japanese mirror should not be left unmentioned
"This topic" -> "The sequence archival topic"
add the Penev et al. reference to the guide's reference list

gbif / doc-publishing-dna-derived-data

storage of data on NCBI/EMBL #76