Open andersgs opened 2 years ago
ahh, you are proposing an additional column, not to describe long reads specifically, but to capture an additional SRR accession? maybe we make it more generic to maximize its use?
I've been thinking about this more and more, and I think, going by this tweet, Justin is right. Every new technology introduces a new column to the scheme, forcing new versions of this repo. I think I'd rather leave it as is. It will force duplicate rows of metadata, but that would only be overcome by yet another table, which I'd also rather not design. @retimme does that make sense to you too?
yeah, i'm torn between solving this for immediate use and coming up with a better standard that will stand up over time. We could bring this up in the PHA4GE data_structures working group for ideas on modernizing the structure? I don't need to be involved in the solution (my plate is full) - @andersgs, what do you think?
I am happy with that. Possibly a JSON format might be better suited to the task. I am, however, running my own battle with cancer. So, I am not really able to spearhead this. I could check with Emma to see what she thinks.
My goal was to instigate a debate more than arrive at a solution. 😎
understood, @andersgs. I'm sorry to hear that. I agree that this should be moved to JSON. Let's ping Emma for ideas.
Thank you @retimme... I have been thinking about how this may work and came up with this YAML file as perhaps a launching pad for discussions.
---
name: my dataset
n_samples: 1
description: |
  An example dataset
pmid: 123456
test: hybrid assemblies
data:
  - biosample: SAMEA123456
    collection_date: 2000-01-01
    organism: Klebsiella pneumoniae
    reads:
      - accession: SRR123456
        url: ftp://ftp.ena/SRR123456_1.fastq.gz
        md5sum: gahquhgq174
        bytes: 9999
        read_type: short
        library_format: paired-end
        library_strategy: shotgun
      - accession: SRR123456
        url: ftp://ftp.ena/SRR123456_2.fastq.gz
        md5sum: hjutd1356
        bytes: 9999
        read_type: short
        library_format: paired-end
        library_strategy: shotgun
      - accession: SRR123457
        url: ftp://ftp.ena/SRR123457.fastq.gz
        md5sum: gujrsa12367
        bytes: 98989
        read_type: long
        library_format: single-end
        library_strategy: shotgun
    tests:
      - name: assembly length
        expected_value: 5,146,787
        method: unicycler hybrid assembly
        tools:
          - name: unicycler
            version: 0.5.7
            cmd_opts: "-m"
      - name: number of circular plasmids
        expected_value: 3
        method: unicycler hybrid assembly
        tools:
          - name: unicycler
            version: 0.5.7
            cmd_opts: "-m"
There are essentially four bits.
A preamble with data about the dataset:
name: my dataset
n_samples: 1
description: |
  An example dataset
pmid: 123456
test: hybrid assemblies
Then a data section that has three elements per sample.
First information about the sample data:
- biosample: SAMEA123456
  collection_date: 2000-01-01
  organism: Klebsiella pneumoniae
And, then information about the read data:
reads:
  - accession: SRR123456
    url: ftp://ftp.ena/SRR123456_1.fastq.gz
    md5sum: gahquhgq174
    bytes: 9999
    read_type: short
    library_format: paired-end
    library_strategy: shotgun
  - accession: SRR123456
    url: ftp://ftp.ena/SRR123456_2.fastq.gz
    md5sum: hjutd1356
    bytes: 9999
    read_type: short
    library_format: paired-end
    library_strategy: shotgun
  - accession: SRR123457
    url: ftp://ftp.ena/SRR123457.fastq.gz
    md5sum: gujrsa12367
    bytes: 98989
    read_type: long
    library_format: single-end
    library_strategy: shotgun
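A minimal sketch of how a consumer might pair up read files by accession, assuming the reads list has already been loaded into plain Python dicts by some YAML or JSON parser. The helper names (`group_reads_by_accession`, `expected_file_count`) are hypothetical, and the rule that a paired-end run ships as exactly two files is an assumption made for illustration, not something the proposal specifies.

```python
from collections import defaultdict

# The reads of the example sample, reduced to the fields used here.
reads = [
    {"accession": "SRR123456", "url": "ftp://ftp.ena/SRR123456_1.fastq.gz",
     "library_format": "paired-end"},
    {"accession": "SRR123456", "url": "ftp://ftp.ena/SRR123456_2.fastq.gz",
     "library_format": "paired-end"},
    {"accession": "SRR123457", "url": "ftp://ftp.ena/SRR123457.fastq.gz",
     "library_format": "single-end"},
]


def group_reads_by_accession(reads):
    """Hypothetical helper: map each accession to its list of file entries."""
    grouped = defaultdict(list)
    for read in reads:
        grouped[read["accession"]].append(read)
    return dict(grouped)


def expected_file_count(library_format):
    """Assumption: paired-end runs have two files, everything else one."""
    return 2 if library_format == "paired-end" else 1


grouped = group_reads_by_accession(reads)

# Accessions whose number of files does not match their library format.
mismatched = [
    accession
    for accession, files in grouped.items()
    if len(files) != expected_file_count(files[0]["library_format"])
]
```

Because each file is its own entry, the short and long reads of a hybrid sample sit side by side under one biosample without needing a dedicated column per technology.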
Finally, information about the test and expected result for the sample:
tests:
  - name: assembly length
    expected_value: 5,146,787
    method: unicycler hybrid assembly
    tools:
      - name: unicycler
        version: 0.5.7
        cmd_opts: "-m"
  - name: number of circular plasmids
    expected_value: 3
    method: unicycler hybrid assembly
    tools:
      - name: unicycler
        version: 0.5.7
        cmd_opts: "-m"
With this last tests section being per sample.
It is a little more descriptive, but it provides various elements that would allow for comparison, and it allows for multiple tests per sample. One could imagine someone releasing a dataset with typing info (e.g., MLST and serotyping) per sample, or detection of different AMR profiles per sample. So, it expands on the initial idea of using it for phylogenetic-based surveillance.
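A rough sketch of how the per-sample tests could drive a comparison, assuming the tests have been parsed into Python dicts. The helpers `parse_expected` and `compare` are hypothetical, and the `observed` values are made up purely for illustration. Note that an unquoted `5,146,787` would most likely be loaded as a string rather than a number, so the sketch normalizes it first.

```python
def parse_expected(value):
    """Hypothetical helper: turn '5,146,787' or 3 into an int."""
    return int(str(value).replace(",", ""))


# The two tests of the example sample, reduced to the fields used here.
tests = [
    {"name": "assembly length", "expected_value": "5,146,787"},
    {"name": "number of circular plasmids", "expected_value": 3},
]


def compare(tests, observed):
    """Return {test name: True/False} for each test against observed values."""
    results = {}
    for test in tests:
        expected = parse_expected(test["expected_value"])
        results[test["name"]] = observed.get(test["name"]) == expected
    return results


# Made-up results from a pipeline run, for illustration only.
observed = {"assembly length": 5146787, "number of circular plasmids": 2}

report = compare(tests, observed)
for name, passed in report.items():
    print(f"{name}: {'pass' if passed else 'FAIL'}")
```

Typing or AMR tests would need a different comparison than integer equality, so a real schema would probably have to say how each expected value should be compared.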
Curious to hear what you guys think.
I suggest adding an optional SRArun_acc_long column to the metadata format to support hybrid datasets with short and long sequencing data. We would need a supporting sha256sumLongRead column too.
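As a sketch of what "optional" could mean for a reader of the table, the snippet below treats a missing or empty `SRArun_acc_long` cell as "no long reads". Only `SRArun_acc_long` and `sha256sumLongRead` come from the suggestion above; the other column names, the second sample row, and the checksum placeholder are assumptions for illustration.

```python
import csv
import io

# A tiny tab-separated table. The second row is a made-up short-read-only
# sample that leaves both optional long-read columns empty.
tsv = (
    "biosample_acc\tSRArun_acc\tSRArun_acc_long\tsha256sumLongRead\n"
    "SAMEA123456\tSRR123456\tSRR123457\tplaceholder-checksum\n"
    "SAMEA123457\tSRR123458\t\t\n"
)


def long_read_accession(row):
    """Return the long-read accession, or None if the column is absent or empty."""
    value = (row.get("SRArun_acc_long") or "").strip()
    return value or None


rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))

# Samples that have both short and long reads.
hybrid = [row["biosample_acc"] for row in rows if long_read_accession(row)]
```

Existing tables without the new columns would still load under this reading, since `row.get` returns `None` for a column that is not present.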