Closed abhi18av closed 1 year ago
@LennertVerboven Could we bluejeans about this some time to discuss how this can be translated to python? We just need a recent MAGMA run to generate the preceding GATK VariantsToTale output as an example.
Yes,
I am available the entire day, send me a text message when you want to meet
We use sed twice more when annotating vcf files (see here and here)
The entire code in this workflow can be changed by the following command after activating magma-env-1 or loading the proper container.
python3 ../../../bin/rename_vcf_chrom.py --vcf ./PRJEB47429.raw_variants.vcf.gz --source 'NC-000962-3-H37Rv' --target 'Chromosome' | snpEff -nostats -ud 100 Mycobacterium_tuberculosis_h37rv | python3 ../../../bin/rename_vcf_chrom.py --target 'NC-000962-3-H37Rv' --source 'Chromosome' > ./PRJEB47429.raw_variants.annotated.vcf
You would need to change it back to use the parameter instead of the hardcoded params and paths you see here, which should look like this ( still need to change for the script and python)
python3 ../../../bin/rename_vcf_chrom.py --vcf !{rawJointVariantsFile} --source !{ref_fasta.getBaseName()} --target 'Chromosome' | !{params.snpeff_path} !{params.arguments} | python3 ../../../bin/rename_vcf_chrom.py --target !{ref_fasta.getBaseName()} --source 'Chromosome' > !{joint_name}.raw_variants.annotated.vcf
I have also added the required python file in the bin folder
Nice Lennert.
You can generate an example output as such:
/home/shared_data_ghi/CWGSMTB/Software/jre-8.131/bin/java -jar /home/shared_data_ghi/CWGSMTB/Software/gatk-4.2.4.1/gatk-package-4.2.4.1-local.jar VariantsToTable -V /home/shared_data_ghi/CWGSMTB/xbs-nf_output/xbs-nf-SMARTT_branch/MAGMA/work/3a/d9e45a085c6badfe57fa3fa859a54a/joint.ExDR.IncComplex_SNP.vcf.gz -GF GT -O /dev/stdout
This output is then piped through various sed commands with the following purposes:
sed -e '2~1 s/*/-/g' # replaces * with - but not in header row
sed -e '2~1 s/\\./-/g' # replaces . with - but not in header row
sed '2,${/^.*\\(-.*\\)\\{'"!{params.median_coverage_cutoff}"',\\}.*$/d}' # remove rows with less than 95% representation
!{params.datamash_path} transpose # transposes columns to rows
sed -e 's/^/>/g' # add > before sample names
sed -e 's/\\.GT/\\n/g' # replace .GT sample name ending with newline
sed -e 's/\\t//g' # removes tabs everywhere```
Let's discuss.
Thanks for the effort guys! 🎉
I think once the https://github.com/TORCH-Consortium/MAGMA/blob/ad887bb8d450483601c93230c9bf123a4af0633d/modules/gatk/variants_to_table.nf#L22-L29 is refactored, we can do the test and then merge 👍
P.S. To avoid direct pushes to master, I have create a Ruleset
for the same. Since the pipeline is being used in multiple context, pushing code to master should be avoided.
I have created a new branch 'replace_sed' and have added the script to convert the table to fasta there. The following two commands should replace everyting in the variants to table wf
!{params.gatk_path} VariantsToTable --java-options "-Xmx!{task.memory.giga}G" -V !{vcf} !{params.arguments} -O !{joint_name}.!{prefix}.table
python3 variant_table_to_fasta.py !{joint_name}.!{prefix}.table !{joint_name}.!{prefix}.fa !{params.site_representation_cutoff}
Thanks Lennert, I'll test this one today and create the PR 🚧
Context: Running the pipeline on a
macOS
machine with condo-local environmentThe resulting FASTA didn’t have any variants owing to small differences between sed on Linux and macOS leading to a failure at
snp-sites
process since no SNPs can be detected.I strongly recommend that we move from sed to python script in this https://github.com/TORCH-Consortium/MAGMA/blob/e5011cbb8675e82ef626687733b00491f5da6ef8/modules/gatk/variants_to_table.nf#L22-L29