cio-abcd / variantinterpretation

Collaborative Interpretation-Pipeline workflow based on nf-core pipeline structure
MIT License
7 stars 1 forks source link

Extension of the workflow to generate TSV output using vembrane #16

Closed biolancer closed 1 year ago

biolancer commented 1 year ago

The PR is linked to issue #7

Added

Changed

Comments

sci-kai commented 1 year ago

I encountered some problems and have some optimizations:

  1. The filter_vep module is broken. This was due to filter_vep not finding the CSQ string since it was changed to 'ANN'. Changed those back to CSQ and adapted vembrane options to use the CSQ string, which worked for me.
  2. I added the option to deactivate creating a TSV output as parameter tsv.
  3. The header line in the TSV output should be more clear, e.g. the very repetitive ANN[""] in every column name should be removed. There is an option in vembrane for renaming the header.
  4. The current solution with printing all columns by explicitely naming them in the nextflow.config is problematic, as this is specific to the input VCF (in this case our test dataset) and very lengthy. I am currently working on implementing a parameter like "--annotation = all" for vembrane table that detects all columns (from the VCF header) and automatically includes them.
  5. (Optional) The output gives a separate line for each transcript. I always preferred one line per variant with multiple transcripts annotated, similar to bcftools split-vep. We can implement it like this for now, but maybe add this as additional feature in the future.
biolancer commented 1 year ago

Sounds good, I wasn't aware this broke the filter_vep. Thanks for the improvements, let me know if I can help!

sci-kai commented 1 year ago

So I added the changed for points 3 & 4. It works with and without transcript filtering and I also checked that no variants are silently dropped. If @feiloo approves this, we can merge it with dev.

feiloo commented 1 year ago

I just tried running it, but it fails to find the biocontainers/bcftools:1.17--haef29d1_0 container because quay.io is not in our servers "unqualified registries" (see https://unix.stackexchange.com/questions/701784/podman-no-longer-searches-dockerhub-error-short-name-did-not-resolve-to-an). I think it would be overall a good practice to always fully specify the registry, container versions and paths explicitely.

Good: quay.io/biocontainers/bcftools:1.17--haef29d1_0

Bad: biocontainers/bcftools biocontainers/bcftools:1.17--haef29d1_0 quay.io/biocontainers/bcftools:latest

This is because there are spoofing risks of using "unqualified registries". E.g. someone could create docker.io/biocontainers/bcftools to spoof quay.io/biocontainers/bcftools.

This way we also would always know what container versions are actually used. Nice read: https://vsupalov.com/docker-latest-tag/

(edit: correct the comment issue to be the registry name instead of the container tag)