Extension of the workflow to generate TSV output using vembrane

biolancer commented 1 year ago

The PR is linked to issue #7

Added

The two modules bcftools index and bcftools norm from the nf-core suite have been added.
bcftools index runs with the -t argument to generate .tbi index files in the first step and passes them to bcftools norm, which requires the .tbi indices for normalization.
bcftools norm splits multiallelic sites into biallelic sites for compatibility with downstream vembrane processes. It runs with -m-any (split multiallelic sites) and --do-not-normalize (suppresses checking InDels against the reference and left-aligning them based on the provided FASTA file) to preprocess the input VCF files. The VCF input will not be atomized.
Added vembrane table as a module for converting VEP-annotated VCF output into CSV files. The desired CSV output fields can be provided using the --extraction_fields option. A provisional minimal default for extraction_fields was set to CHROM, POS, REF, ALT.
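
To illustrate what splitting a multiallelic record means for the downstream table, here is a minimal Python sketch. It is not how bcftools implements the split: it only rewrites the ALT column, copies INFO and genotype fields unchanged (bcftools norm adjusts those), and, mirroring --do-not-normalize, does not trim or left-align alleles. The function name split_multiallelic and the example record are hypothetical.

```python
def split_multiallelic(record: str) -> list[str]:
    """Split one tab-separated VCF data line into one line per ALT allele.

    Simplification: only the ALT column is rewritten. Per-allele INFO and
    FORMAT values are copied unchanged, which bcftools norm would adjust.
    """
    fields = record.rstrip("\n").split("\t")
    alts = fields[4].split(",")  # column 5 of a VCF data line holds ALT
    rows = []
    for alt in alts:
        row = fields.copy()
        row[4] = alt
        rows.append("\t".join(row))
    return rows


# Hypothetical record with two ALT alleles (a SNV and an insertion).
line = "chr1\t100\t.\tA\tG,AT\t50\tPASS\t."
for row in split_multiallelic(line):
    print(row)
```

Each ALT allele ends up on its own line, which is why the number of records handed to VEP grows after this step.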

Changed

The ensemblvep/vep module was moved to local modules because of a conflict between input and output names. Default output names combine the sample name from the samplesheet with the extension .vcf.gz, and the pipeline breaks if input and output names are identical. Using --force-overwrite to overwrite the input data causes the annotated VCF files to be omitted and the subsequent workflow steps to be skipped. Adding .ann. to the output file extension (as also shown in the ensemblvep-main.nf stub) required moving the module to local modules so that nf-core lint doesn't flag the change as an error.
The ensemblvep/vep module got an additional argument: --vcf_info_field ANN. This allows direct downstream compatibility with the vembrane suite.
README, Usage, schema and workflow were adapted for the new modules.
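
The name clash behind the module move can be sketched in a few lines of Python. This is only an illustration of the naming logic, assuming the input file is named after the sample; the helper names vep_output_name and clashes and the sample name are hypothetical and do not come from the module.

```python
def vep_output_name(sample: str, infix: str = "ann") -> str:
    """Build the annotated VCF name, e.g. sample1 -> sample1.ann.vcf.gz."""
    return f"{sample}.{infix}.vcf.gz"


def clashes(input_name: str, output_name: str) -> bool:
    """True when the annotation step would write over its own input file."""
    return input_name == output_name


sample = "sample1"  # hypothetical sample name from the samplesheet
input_name = f"{sample}.vcf.gz"

print(clashes(input_name, f"{sample}.vcf.gz"))       # default naming
print(clashes(input_name, vep_output_name(sample)))  # naming with .ann. infix
```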

Comments

Integrating bcftools norm via nf-core requires a FASTA reference as a mandatory input to the module. Even if it is not used for normalization, the FASTA file is staged from AWS, which increases runtime. A potential workaround would be a local copy of the module that doesn't require the FASTA file, as it isn't needed for downstream processes yet.
Splitting multiallelic sites into biallelic sites increases the total number of records in the VCF file, so VEP annotation runtime increases (unavoidable).

sci-kai commented 1 year ago

I encountered some problems and have some optimizations:

The filter_vep module was broken because filter_vep could not find the CSQ string after it was changed to 'ANN'. I changed those back to CSQ and adapted the vembrane options to use the CSQ string, which worked for me.
I added a parameter, tsv, to deactivate creation of the TSV output.
The header line in the TSV output should be clearer, e.g. the very repetitive ANN[""] in every column name should be removed. There is an option in vembrane for renaming the header.
The current solution of printing all columns by explicitly naming them in the nextflow.config is problematic, as the list is specific to the input VCF (in this case our test dataset) and very lengthy. I am currently working on implementing a parameter like "--annotation = all" for vembrane table that detects all columns (from the VCF header) and automatically includes them.
(Optional) The output gives a separate line for each transcript. I always preferred one line per variant with multiple transcripts annotated, similar to bcftools split-vep. We can implement it like this for now, but maybe add this as an additional feature in the future.
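
One possible way to detect the annotation columns from the VCF header and to produce shorter column names is sketched below in Python. It is not the implementation being worked on. It assumes the VEP header line lists the sub-fields after "Format: " separated by "|", and that table expressions refer to a sub-field as ANN["<name>"], as in the column names mentioned above. The function names annotation_fields and table_expression and the example header are hypothetical.

```python
import re


def annotation_fields(header_lines, key="CSQ"):
    """Return the sub-field names declared in the INFO header line for `key`."""
    pattern = re.compile(rf'^##INFO=<ID={re.escape(key)},.*Format: ([^"]+)"')
    for line in header_lines:
        match = pattern.match(line)
        if match:
            return match.group(1).strip().split("|")
    return []


def table_expression(fields, fixed=("CHROM", "POS", "REF", "ALT")):
    """Build the column expression and a header without the repeated ANN[""]."""
    columns = list(fixed) + [f'ANN["{name}"]' for name in fields]
    header = list(fixed) + list(fields)
    return ", ".join(columns), ", ".join(header)


# Hypothetical, shortened header; a real VEP header lists many more sub-fields.
header_lines = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations '
    'from Ensembl VEP. Format: Allele|Consequence|SYMBOL">',
]
expression, header = table_expression(annotation_fields(header_lines))
print(expression)
print(header)
```

The first string could serve as the extraction fields and the second as input for the header-renaming option, so that neither has to be hard-coded for a specific input VCF.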

biolancer commented 1 year ago

Sounds good, I wasn't aware this broke filter_vep. Thanks for the improvements, let me know if I can help!

sci-kai commented 1 year ago

So I added the changes for points 3 & 4. It works with and without transcript filtering, and I also checked that no variants are silently dropped. If @feiloo approves this, we can merge it with dev.
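
The comment does not say how the check for dropped variants was done. One way such a check could look is sketched here in Python, assuming a biallelic VCF and a TSV that contains CHROM, POS, REF and ALT columns; all function names and the example data are hypothetical.

```python
import csv
import io


def variant_keys_from_vcf(vcf_text):
    """Collect (CHROM, POS, REF, ALT) for every data line of a biallelic VCF."""
    keys = set()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta lines and the column header
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        keys.add((chrom, pos, ref, alt))
    return keys


def variant_keys_from_tsv(tsv_text):
    """Collect the same key from a TSV with CHROM, POS, REF, ALT columns."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {(r["CHROM"], r["POS"], r["REF"], r["ALT"]) for r in reader}


def dropped_variants(vcf_text, tsv_text):
    """Variants present in the VCF but absent from the TSV."""
    return variant_keys_from_vcf(vcf_text) - variant_keys_from_tsv(tsv_text)


vcf_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\t.\tA\tG\t.\t.\t.\n"
    "chr1\t200\t.\tC\tT\t.\t.\t.\n"
)
# Two rows for the first variant (one per transcript), none for the second.
tsv_text = "CHROM\tPOS\tREF\tALT\nchr1\t100\tA\tG\nchr1\t100\tA\tG\n"
print(dropped_variants(vcf_text, tsv_text))
```

Comparing sets of keys rather than line counts keeps the check valid when the TSV has several rows per variant.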

feiloo commented 1 year ago

I just tried running it, but it fails to find the biocontainers/bcftools:1.17--haef29d1_0 container because quay.io is not in our server's "unqualified registries" (see https://unix.stackexchange.com/questions/701784/podman-no-longer-searches-dockerhub-error-short-name-did-not-resolve-to-an). I think it would be good practice overall to always specify the registry, container versions and paths fully and explicitly.

Good: quay.io/biocontainers/bcftools:1.17--haef29d1_0

Bad: biocontainers/bcftools, biocontainers/bcftools:1.17--haef29d1_0, quay.io/biocontainers/bcftools:latest

This is because using "unqualified registries" carries spoofing risks. E.g. someone could create docker.io/biocontainers/bcftools to spoof quay.io/biocontainers/bcftools.

This way we would also always know which container versions are actually used. Nice read: https://vsupalov.com/docker-latest-tag/
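
A check along these lines could be automated. The Python sketch below is a heuristic, not an official reference parser: it treats the first path component as a registry only if it looks like a host name, and it flags a missing or "latest" tag. The function name check_image_reference is hypothetical.

```python
def check_image_reference(ref: str) -> list[str]:
    """Return the problems found in a container image reference (empty if none)."""
    problems = []

    # The first path component counts as a registry only if it looks like a
    # host name: it contains "." or ":" or is "localhost".
    first, _, rest = ref.partition("/")
    looks_like_host = "." in first or ":" in first or first == "localhost"
    if not rest or not looks_like_host:
        problems.append("no registry")

    # The tag, if any, follows the last ":" of the final path component.
    last = ref.rsplit("/", 1)[-1]
    if "@" not in last:  # references pinned by digest need no tag
        _name, sep, tag = last.partition(":")
        if not sep or not tag:
            problems.append("no tag")
        elif tag == "latest":
            problems.append("floating tag")
    return problems


for ref in (
    "quay.io/biocontainers/bcftools:1.17--haef29d1_0",
    "biocontainers/bcftools",
    "biocontainers/bcftools:1.17--haef29d1_0",
    "quay.io/biocontainers/bcftools:latest",
):
    print(ref, check_image_reference(ref))
```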

(edit: correct the comment issue to be the registry name instead of the container tag)

cio-abcd / variantinterpretation