epam / fonda

Fonda is a framework which offers scalable and automatic analysis of multiple NGS sequencing data types
Apache License 2.0

Fix @RG tag in sam/bam file output #196

Open syansanofi opened 3 years ago

syansanofi commented 3 years ago

Issue 1: Currently FONDA does not distinguish between lanes of a single sample. All lanes receive identical @RG ID: tags.

Approach: Since alignment is done on a per-lane basis for DNA-based workflows (e.g. DNACapVar_Fastq), add the lane number to the read group ID. This would align better with standard practice (link).

Example _samplemanifest.txt:

parameterType  shortName  Parameter1                       Parameter2
fastqFile      SampleA    SampleA_S1_L001_R1_001.fastq.gz  SampleA_S1_L001_R2_001.fastq.gz
fastqFile      SampleA    SampleA_S2_L002_R1_001.fastq.gz  SampleA_S2_L002_R2_001.fastq.gz

The @RG ID: tag would be:

parameterType  @RG ID
fastqFile      SampleA_L001
fastqFile      SampleA_L002

I would prefer that the lane numbers be iterated and appended to the sample name:

SampleA+L001

rather than extracted from the longest common substring of the sample's read file names. This would keep the lane numbering consecutive and make it easier to enforce, because it would not depend on sample name prefixes.
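To make the proposal concrete, here is a minimal sketch of the iterate-and-append idea. The class and method names are hypothetical and not part of FONDA. The `_L%03d` suffix is also an assumption, chosen to match the `SampleA_L001` form shown in the example above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of FONDA: builds one read group ID per
// fastqFile row of a sample, numbering lanes by their order in the manifest.
final class LaneReadGroupIds {

    static List<String> readGroupIds(String sampleName, int laneCount) {
        List<String> ids = new ArrayList<>();
        for (int lane = 1; lane <= laneCount; lane++) {
            // Assumed format: <sampleName>_L<zero-padded lane index>
            ids.add(String.format("%s_L%03d", sampleName, lane));
        }
        return ids;
    }

    public static void main(String[] args) {
        // SampleA has two fastqFile rows in the example manifest
        for (String id : readGroupIds("SampleA", 2)) {
            System.out.println(id);
        }
    }
}
```

Because the index comes from the row order rather than from the file name, the IDs stay consecutive even if the fastq files do not follow the Illumina naming scheme.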

Please let me know if this is clear.

Issue 2: All workflows should get the LB tag, not only amplicon sequencing. The rationale is the same as for Issue 1: to align with current best practice.

https://github.com/epam/fonda/blob/4a651caa0ab4bdb4ff92516d2294331c9723f134/src/main/java/com/epam/fonda/tools/impl/BwaSort.java#L108-L110

https://github.com/epam/fonda/blob/4a651caa0ab4bdb4ff92516d2294331c9723f134/src/main/java/com/epam/fonda/tools/impl/NovoalignSort.java#L117-L119

Approach: Remove this check and use @RG\\tID:%s\\tSM:%s\\tLB:%s\\tPL:Illumina for all workflows.
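As an illustration, the unconditional template could be filled in roughly like this. The class and method names are hypothetical and this is not the actual code in BwaSort.java or NovoalignSort.java. Using the sample name as the LB value is also only an assumption for the example.

```java
// Hypothetical sketch, not FONDA's implementation: formats the read group
// header with the LB tag for every workflow, with no amplicon-only check.
final class ReadGroupHeader {

    // Template quoted in the issue. In Java source, "\\t" is a literal
    // backslash followed by 't'; the aligner is expected to expand it to a tab.
    static final String RG_TEMPLATE = "@RG\\tID:%s\\tSM:%s\\tLB:%s\\tPL:Illumina";

    static String build(String readGroupId, String sampleName, String library) {
        return String.format(RG_TEMPLATE, readGroupId, sampleName, library);
    }

    public static void main(String[] args) {
        // Library value is assumed to be the sample name for illustration only
        System.out.println(build("SampleA_L001", "SampleA", "SampleA"));
    }
}
```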