WGS-standards-and-analysis / datasets

Benchmark datasets for WGS analysis
Proposal to extend format to allow for hybrid (short/long sequencing) data #14

Open andersgs opened 2 years ago

andersgs commented 2 years ago

I suggest adding an optional SRArun_acc_long column to the metadata format to support hybrid datasets with short and long sequencing data.

We would need a supporting sha256sumLongRead column too.

retimme commented 2 years ago

ahh, you are proposing an additional column, not to describe long reads specifically, but to capture an additional SRR accession? maybe we make it more generic to maximize its use?

lskatz commented 2 years ago

I've been thinking about this more and more, and I think, going by this tweet, Justin is right. Every new technology introduces a new column to the scheme, forcing new versions of this repo. I think I'd rather leave it as is. It will force duplicate rows of metadata, but that could only be overcome by yet another table, which I'd also rather not design. @retimme does that make sense to you too?

retimme commented 2 years ago

yeah, i'm torn between solving this for immediate use and coming up with a better standard that will hold up over time. We could bring this up in the PHA4GE data_structures working group for ideas on modernizing the structure? I don't need to be involved in the solution (my plate is full) - @andersgs, what do you think?

andersgs commented 2 years ago

I am happy with that. Possibly a JSON format might be better suited to the task. I am, however, running my own battle with cancer, so I am not really able to spearhead this. I could check with Emma to see what she thinks.

andersgs commented 2 years ago

My goal was to instigate a debate more than arrive at a solution. 😎

retimme commented 2 years ago

understood, @andersgs. I'm sorry to hear that. I agree that this should be moved to JSON. Let's ping Emma for ideas.

andersgs commented 2 years ago

Thank you @retimme... I have been thinking about how this may work and came up with this YAML file as perhaps a launching pad for discussions.

---
name: my dataset
n_samples: 1
description: |
 An example dataset
pmid: 123456
test: hybrid assemblies
data:
 - biosample: SAMEA123456
   collection_date: 2000-01-01
   organism: Klebsiella pneumoniae
   reads:
   - accession: SRR123456
     url: ftp://ftp.ena/SRR123456_1.fastq.gz 
     md5sum: gahquhgq174
     bytes: 9999
     read_type: short
     library_format: paired-end
     library_strategy: shotgun
   - accession: SRR123456
     url: ftp://ftp.ena/SRR123456_2.fastq.gz 
     md5sum: hjutd1356
     bytes: 9999
     read_type: short
     library_format: paired-end
     library_strategy: shotgun
   - accession: SRR123457
     url: ftp://ftp.ena/SRR123457.fastq.gz 
     md5sum: gujrsa12367
     bytes: 98989
     read_type: long
     library_format: single-end
     library_strategy: shotgun
   tests:
   - name: assembly length
     expected_value: 5146787
     method: unicycler hybrid assembly
     tools:
     - name: unicycler
       version: 0.5.7
       cmd_opts: "-m"
   - name: number of circular plasmids
     expected_value: 3
     method: unicycler hybrid assembly
     tools:
     - name: unicycler
       version: 0.5.7
       cmd_opts: "-m"

There are essentially four bits.

A preamble with data about the dataset:

name: my dataset
n_samples: 1
description: |
 An example dataset
pmid: 123456
test: hybrid assemblies
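
As a rough illustration of why the preamble is useful, here is a minimal Python sketch that checks it for internal consistency. It assumes the file has been rendered as JSON (so only the standard library is needed), and the list of required keys and the `preamble_problems` helper are my own hypothetical choices, not part of any agreed spec.

```python
import json

# Hypothetical JSON rendering of the preamble, with a stub "data" list.
doc = json.loads("""
{
  "name": "my dataset",
  "n_samples": 1,
  "description": "An example dataset",
  "pmid": 123456,
  "test": "hybrid assemblies",
  "data": [{"biosample": "SAMEA123456"}]
}
""")


def preamble_problems(doc):
    """Return a list of human-readable problems found in the preamble."""
    problems = []
    # Assumed set of required keys; the proposal does not fix one.
    for key in ("name", "n_samples", "description", "data"):
        if key not in doc:
            problems.append(f"missing key: {key}")
    # n_samples should agree with the number of entries under data.
    if "n_samples" in doc and "data" in doc:
        if doc["n_samples"] != len(doc["data"]):
            problems.append(
                f"n_samples is {doc['n_samples']} but data lists "
                f"{len(doc['data'])} sample(s)"
            )
    return problems


print(preamble_problems(doc))
```

An empty list means the declared sample count matches the number of entries under `data`.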

Then a data section that has three elements per sample.

First information about the sample data:

 - biosample: SAMEA123456
   collection_date: 2000-01-01
   organism: Klebsiella pneumoniae

And, then information about the read data:

   reads:
   - accession: SRR123456
     url: ftp://ftp.ena/SRR123456_1.fastq.gz 
     md5sum: gahquhgq174
     bytes: 9999
     read_type: short
     library_format: paired-end
     library_strategy: shotgun
   - accession: SRR123456
     url: ftp://ftp.ena/SRR123456_2.fastq.gz 
     md5sum: hjutd1356
     bytes: 9999
     read_type: short
     library_format: paired-end
     library_strategy: shotgun
   - accession: SRR123457
     url: ftp://ftp.ena/SRR123457.fastq.gz 
     md5sum: gujrsa12367
     bytes: 98989
     read_type: long
     library_format: single-end
     library_strategy: shotgun
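
To show how the hybrid case falls out of this layout without any new column, here is a small Python sketch. It assumes the sample has already been parsed into a dictionary; `files_by_read_type` and `is_hybrid` are hypothetical helper names of mine.

```python
from collections import defaultdict

# One sample from the example above, already parsed into a dict.
sample = {
    "biosample": "SAMEA123456",
    "reads": [
        {"accession": "SRR123456",
         "url": "ftp://ftp.ena/SRR123456_1.fastq.gz",
         "read_type": "short", "library_format": "paired-end"},
        {"accession": "SRR123456",
         "url": "ftp://ftp.ena/SRR123456_2.fastq.gz",
         "read_type": "short", "library_format": "paired-end"},
        {"accession": "SRR123457",
         "url": "ftp://ftp.ena/SRR123457.fastq.gz",
         "read_type": "long", "library_format": "single-end"},
    ],
}


def files_by_read_type(sample):
    """Group the read file URLs of one sample by their read_type."""
    grouped = defaultdict(list)
    for read in sample["reads"]:
        grouped[read["read_type"]].append(read["url"])
    return dict(grouped)


def is_hybrid(sample):
    """A sample counts as hybrid if it has both short and long reads."""
    types = {read["read_type"] for read in sample["reads"]}
    return {"short", "long"} <= types


print(files_by_read_type(sample))
print(is_hybrid(sample))
```

A new technology would only add a new `read_type` value, not a new field.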

Finally, information about the test and expected result for the sample:

   tests:
   - name: assembly length
     expected_value: 5146787
     method: unicycler hybrid assembly
     tools:
     - name: unicycler
       version: 0.5.7
       cmd_opts: "-m"
   - name: number of circular plasmids
     expected_value: 3
     method: unicycler hybrid assembly
     tools:
     - name: unicycler
       version: 0.5.7
       cmd_opts: "-m"

With this last tests section being per sample.
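
A sketch of how someone might use the per-sample tests: compare what their own pipeline observed against each `expected_value`. This is illustration only. The `compare` helper and the observed numbers are made up, and `expected_value` is written as a plain integer so that an equality check works.

```python
# The tests block for one sample, already parsed into Python objects.
tests = [
    {"name": "assembly length", "expected_value": 5146787},
    {"name": "number of circular plasmids", "expected_value": 3},
]


def compare(tests, observed):
    """Map each test name to True or False, or None if nothing was observed."""
    results = {}
    for test in tests:
        name = test["name"]
        if name not in observed:
            results[name] = None
        else:
            results[name] = observed[name] == test["expected_value"]
    return results


# Made-up observations from a hypothetical pipeline run.
observed = {"assembly length": 5146787, "number of circular plasmids": 2}
print(compare(tests, observed))
```

Exact equality is probably too strict for something like assembly length, so a real comparison would likely need a per-test tolerance.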

It is a little more verbose, but it provides the elements needed for comparison and allows for multiple tests per sample. One could imagine someone releasing a dataset with typing info (e.g., MLST and serotyping) per sample, or detection of different AMR profiles per sample. So it expands on the initial idea of using these datasets for phylogenetic-based surveillance.

Curious to hear what you guys think.