Add info about the sequencing tech TAG and reflect that on the reports

TORCH-Consortium / MAGMA

A pipeline for comprehensive genomic analyses of Mycobacterium tuberculosis with a focus on clinical decision making as well as research

https://doi.org/10.1371/journal.pcbi.1011648

GNU General Public License v3.0

Add info about the sequencing tech TAG and reflect that on the reports #150

Open abhi18av opened 1 year ago

abhi18av commented 1 year ago

As part of 4-APR meeting.

Focus on homogeneous (single-technology) sequencing datasets
(IN FUTURE) Accommodate hybrid datasets (Nanopore/Illumina) and reflect that in the final results

@vrennie @TimHHH, where exactly do we need to add this sequencing platform information, i.e. in which summary files?

TimHHH commented 1 year ago

I would think a column in the summary stats file.

vrennie commented 1 year ago

Yes, I agree with Tim, just a column that looks like this:

Sequencing Technology
Illumina
ONT
ONT
Illumina
Illumina
Illumina
...
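A minimal sketch of what adding such a column could look like, assuming the summary stats file is tab-separated and keyed by a sample column. The column names `Sample` and `Depth`, the function `add_technology_column`, and the TSV layout are hypothetical and are not taken from the MAGMA code base.

```python
import csv
import io


def add_technology_column(summary_tsv, technology_by_sample,
                          sample_column="Sample",
                          new_column="Sequencing Technology"):
    """Return the summary table with one extra column appended.

    summary_tsv: text of a tab-separated table (assumed layout).
    technology_by_sample: dict mapping sample name -> technology label.
    """
    reader = csv.DictReader(io.StringIO(summary_tsv), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=[*reader.fieldnames, new_column],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        # A sample missing from the mapping raises KeyError instead of
        # silently writing an empty cell.
        row[new_column] = technology_by_sample[row[sample_column]]
        writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    summary = "Sample\tDepth\nS1\t80\nS2\t45\n"
    print(add_technology_column(summary, {"S1": "Illumina", "S2": "ONT"}))
```

Failing on a missing sample is a deliberate choice in this sketch: an empty technology cell would be easy to overlook in a report.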

abhi18av commented 1 year ago

Okay, I understand this would be added to the summary stats file 👍

However, there's one more detail worth mentioning here: we currently hard-code the sequencing technology in the bam_rg_string, see https://github.com/TORCH-Consortium/MAGMA/blob/786d13dfe1988784f870499bb878f47cc945a493/workflows/validate_fastqs_wf.nf#L30

Should we not add this column to the input-samplesheet as well?

vrennie commented 1 year ago

Yes, good catch @abhi18av, let's add this as a column to the samplesheet.

TimHHH commented 1 year ago

Yes, ideally the user provides the sequencing technology in the sample sheet and this is then used in the bam_rg_string along the lines of PL:${technology}. The documentation has to be clear that only one technology is allowed per sample sheet.
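The two requirements above (one technology per sample sheet, and that value feeding `PL:` in the read group) could be sketched as follows. This is an illustration only: the `Technology` column name, both function names, and the exact `@RG` layout are assumptions, and the real bam_rg_string in MAGMA may contain other fields.

```python
import csv
import io


def single_technology(samplesheet_text, column="Technology"):
    """Return the one technology declared in the sample sheet.

    Raises ValueError if the sheet is empty or mixes technologies.
    Values are compared case-insensitively and returned upper-cased.
    """
    rows = csv.DictReader(io.StringIO(samplesheet_text))
    technologies = {row[column].strip().upper() for row in rows}
    if len(technologies) != 1:
        raise ValueError(
            f"expected exactly one technology, found: {sorted(technologies)}"
        )
    return technologies.pop()


def bam_rg_string(sample_name, technology):
    """Build a read-group string with a literal backslash-t between fields.

    Aligners such as `bwa mem -R` accept this escaped form; ID, SM and PL
    are standard SAM @RG tags.
    """
    return f"@RG\\tID:{sample_name}\\tSM:{sample_name}\\tPL:{technology}"


if __name__ == "__main__":
    sheet = "Sample,Technology\nS1,illumina\nS2,Illumina\n"
    tech = single_technology(sheet)
    print(bam_rg_string("S1", tech))
```

Note that the SAM specification restricts `PL` to a fixed vocabulary (for example `ILLUMINA`, `ONT`, `PACBIO`), so the sample sheet values would need to match it or be mapped onto it before use.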

abhi18av commented 1 year ago

Guys, what about reflecting that on the actual sample name as well? Something like Shea2017_2021_396.SRR16089406.LNA.A1.ILMN.1.1.1

NCBI currently lists the following sequencing platforms:

ILLUMINA
ION_TORRENT
ABI_SOLID
PACBIO_SMRT
CAPILLARY
OXFORD_NANOPORE
LS454
BGISEQ

To avoid long names, we could standardize on acronyms such as ILMN / ONT / PCB / ION etc. What do you think?
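As a rough sketch of the acronym idea: a lookup table from platform name to short tag, falling back to the full name where no acronym has been agreed. Only the four acronyms mentioned above are included, and pairing PCB with PACBIO_SMRT and ION with ION_TORRENT is my reading of them. The function names and the position of the tag within the sample name are hypothetical.

```python
# Short tags proposed in this thread; other platforms keep their full name.
SHORT_TAGS = {
    "ILLUMINA": "ILMN",
    "OXFORD_NANOPORE": "ONT",
    "PACBIO_SMRT": "PCB",
    "ION_TORRENT": "ION",
}


def platform_tag(platform):
    """Map a platform name to its short tag, or return it unchanged."""
    key = platform.strip().upper()
    return SHORT_TAGS.get(key, key)


def tagged_sample_name(leading_fields, platform, trailing_fields):
    """Join the name fields with '.', placing the platform tag between them."""
    return ".".join([*leading_fields, platform_tag(platform), *trailing_fields])


if __name__ == "__main__":
    name = tagged_sample_name(
        ["Shea2017_2021_396", "SRR16089406", "LNA", "A1"],
        "ILLUMINA",
        ["1", "1", "1"],
    )
    print(name)
```

Keeping the fallback means a platform without an agreed acronym still produces a valid, if longer, name.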

vrennie commented 1 year ago

I think that, unless the full name messes up the .csv, it's better to keep the full name.