epi2me-labs / wf-human-variation


No STR results with Ensembl GRCh38 FASTA file #103

Closed asalhab closed 4 months ago

asalhab commented 8 months ago

Operating System

Other Linux (please specify below)

Other Linux

No response

Workflow Version

v1.8.3

Workflow Execution

Command line

EPI2ME Version

No response

CLI command run

nextflow run epi2me-labs/wf-human-variation -r v1.8.3 -w /gpfs/scratch/ONT/0201382701 -profile singularity -c wf-human-variation-config.cfg --snp --sv --cnv --mod --str --mapula --phase_vcf --phase_mod --GVCF --joint_phasing --bam 0201382701.merged.bam --ref Homo_sapiens.GRCh38.dna.toplevel.110.fa --basecaller_cfg dna_r10.4.1_e8.2_400bps_hac@v4.2.0 --sample_name 0201382701.merged --sex male --out_dir /data/0201382701/2D_PAS66250_9ff9ec6a.0201382701/hg38/wfhv.1.8.3 --threads 8 --ubam_map_threads 16 --merge_threads 8 --ubam_bam2fq_threads 8

Workflow Execution - CLI Execution Profile

singularity

What happened?

The run finished successfully. All expected results were generated except the short tandem repeat (STR) results. My guess is that this is because I used the Ensembl GRCh38 genome (which has no "chr" prefix), while the files variant_catalog_hg38.json and wf_str_repeats.bed have "chr" in their chromosome names. In a different run, where I used a FASTA file downloaded from UCSC (chromosomes have the "chr" prefix), the STR results were generated. Is there a way to provide these files as arguments, or to modify the pipeline to deal with the "chr" prefix?

Thanks, Abdulrahman

Relevant log output

.

Application activity log entry

No response

vlshesketh commented 8 months ago

Hi @asalhab, thank you for reporting this. As you suggest, it is likely that the discrepancy in chromosome naming is the reason you didn't get any STR results, but I'll confirm this and let you know when we have issued a fix.
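For readers who want to check for this kind of naming mismatch themselves, here is a minimal sketch that compares the sequence names in a reference FASTA with the contig names in a BED file. It is not part of wf-human-variation; the function names (`fasta_contigs`, `bed_contigs`, `naming_report`) are hypothetical, and it assumes a plain-text (uncompressed) FASTA and a tab- or space-delimited BED file.

```python
def fasta_contigs(path):
    """Return the set of sequence names found in FASTA header lines."""
    names = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                # The name is the first whitespace-delimited token after '>'.
                parts = line[1:].split()
                if parts:
                    names.add(parts[0])
    return names


def bed_contigs(path):
    """Return the set of contig names in the first column of a BED file."""
    names = set()
    with open(path) as handle:
        for line in handle:
            # Skip blank lines, comments, and optional BED header lines.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            names.add(line.split()[0])
    return names


def naming_report(fasta_path, bed_path):
    """Summarise how well BED contig names match the FASTA sequence names."""
    fa = fasta_contigs(fasta_path)
    bed = bed_contigs(bed_path)
    # Names present in both files, which is what a region lookup needs.
    shared = fa & bed
    # BED names missing from the FASTA that would match without 'chr'.
    prefix_only = {
        name for name in bed - fa
        if name.startswith("chr") and name[3:] in fa
    }
    return {"shared": shared, "prefix_only": prefix_only}
```

An empty `shared` set together with a non-empty `prefix_only` set would be consistent with the situation described in this issue: the reference uses names such as `1` and `X` while the BED file uses `chr1` and `chrX`. For a full-size genome, reading the names from a `.fai` index instead of the FASTA itself would likely be much faster.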

asalhab commented 8 months ago

Thank you @vlshesketh

asalhab commented 7 months ago

Any update on this issue, @vlshesketh?

vlshesketh commented 7 months ago

Hi @asalhab apologies for the delay with this - a fix will be released within the next couple of weeks.

vlshesketh commented 4 months ago

Hi @asalhab, I'm sorry that it has taken a while to respond to this.

We wanted to address the problem you reported, but due to variations in human genome versions, supporting all possible genomes and builds is challenging for us. Having done some testing with the Ensembl genome and the --str subworkflow, the results are not always consistent due to some supplementary alignments skewing the called STRs and generating false positives. As a result, we will instead be making a recommendation in the documentation for wf-human-variation regarding genome selection when working with human data, following the advice set out in this blog post: https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use. As you have already noticed, the repeats BED file is based on a genome with chr prefixes, and to preserve the integrity of the analysis, we feel it's safer not to modify this file.