Refactor pipeline to move away from use of sed (failure during phylogeny snp-sites step)

abhi18av commented 1 year ago

Context: Running the pipeline on a macOS machine with condo-local environment

The resulting FASTA didn’t have any variants owing to small differences between sed on Linux and macOS leading to a failure at snp-sites process since no SNPs can be detected.

I strongly recommend that we move from sed to python script in this https://github.com/TORCH-Consortium/MAGMA/blob/e5011cbb8675e82ef626687733b00491f5da6ef8/modules/gatk/variants_to_table.nf#L22-L29

TimHHH commented 1 year ago

@LennertVerboven Could we bluejeans about this some time to discuss how this can be translated to python? We just need a recent MAGMA run to generate the preceding GATK VariantsToTale output as an example.

LennertVerboven commented 1 year ago

Yes,

I am available the entire day, send me a text message when you want to meet

LennertVerboven commented 1 year ago

We use sed twice more when annotating vcf files (see here and here)

The entire code in this workflow can be changed by the following command after activating magma-env-1 or loading the proper container. python3 ../../../bin/rename_vcf_chrom.py --vcf ./PRJEB47429.raw_variants.vcf.gz --source 'NC-000962-3-H37Rv' --target 'Chromosome' | snpEff -nostats -ud 100 Mycobacterium_tuberculosis_h37rv | python3 ../../../bin/rename_vcf_chrom.py --target 'NC-000962-3-H37Rv' --source 'Chromosome' > ./PRJEB47429.raw_variants.annotated.vcf

You would need to change it back to use the parameter instead of the hardcoded params and paths you see here, which should look like this ( still need to change for the script and python)

python3 ../../../bin/rename_vcf_chrom.py --vcf !{rawJointVariantsFile} --source !{ref_fasta.getBaseName()} --target 'Chromosome' | !{params.snpeff_path} !{params.arguments} | python3 ../../../bin/rename_vcf_chrom.py --target !{ref_fasta.getBaseName()} --source 'Chromosome' > !{joint_name}.raw_variants.annotated.vcf

I have also added the required python file in the bin folder

TimHHH commented 1 year ago

Nice Lennert.

You can generate an example output as such: /home/shared_data_ghi/CWGSMTB/Software/jre-8.131/bin/java -jar /home/shared_data_ghi/CWGSMTB/Software/gatk-4.2.4.1/gatk-package-4.2.4.1-local.jar VariantsToTable -V /home/shared_data_ghi/CWGSMTB/xbs-nf_output/xbs-nf-SMARTT_branch/MAGMA/work/3a/d9e45a085c6badfe57fa3fa859a54a/joint.ExDR.IncComplex_SNP.vcf.gz -GF GT -O /dev/stdout

This output is then piped through various sed commands with the following purposes:


sed -e '2~1 s/*/-/g' # replaces * with - but not in header row
sed -e '2~1 s/\\./-/g' # replaces . with - but not in header row
sed '2,${/^.*\\(-.*\\)\\{'"!{params.median_coverage_cutoff}"',\\}.*$/d}'  # remove rows with less than 95% representation
!{params.datamash_path} transpose # transposes columns to rows
sed -e 's/^/>/g' # add > before sample names
sed -e 's/\\.GT/\\n/g' # replace .GT sample name ending with newline
sed -e 's/\\t//g' # removes tabs everywhere```

Let's discuss.

abhi18av commented 1 year ago

Thanks for the effort guys! 🎉

I think once the https://github.com/TORCH-Consortium/MAGMA/blob/ad887bb8d450483601c93230c9bf123a4af0633d/modules/gatk/variants_to_table.nf#L22-L29 is refactored, we can do the test and then merge 👍

P.S. To avoid direct pushes to master, I have create a Ruleset for the same. Since the pipeline is being used in multiple context, pushing code to master should be avoided.

LennertVerboven commented 1 year ago

I have created a new branch 'replace_sed' and have added the script to convert the table to fasta there. The following two commands should replace everyting in the variants to table wf

!{params.gatk_path} VariantsToTable --java-options "-Xmx!{task.memory.giga}G" -V !{vcf} !{params.arguments} -O !{joint_name}.!{prefix}.table python3 variant_table_to_fasta.py !{joint_name}.!{prefix}.table !{joint_name}.!{prefix}.fa !{params.site_representation_cutoff}

abhi18av commented 1 year ago

Thanks Lennert, I'll test this one today and create the PR 🚧

TORCH-Consortium / MAGMA

Refactor pipeline to move away from use of sed (failure during phylogeny snp-sites step) #163