epi2me-labs / wf-human-variation

Process `getGenome (1)` terminated with an error exit status #166

Closed Aravind-mss closed 2 months ago

Aravind-mss commented 3 months ago

I am unable to progress past an error at the getGenome() step, similar to one of the previous questions about using the pipeline for non-model genomes. In my case, however, it happens with the human genome. I tried two genome builds (hg38 and T2Tv2) and the run aborts at the same getGenome() step each time. I tried both mapped and unmapped BAM as input, with no luck. Any pointers to get around this issue are much appreciated. Thanks.

I even tried with --annotation false, as suggested in the non-model genomes thread, but the issue persists. Please see my command below and help me with this.

$ nextflow run epi2me-labs/wf-human-variation --basecaller_cfg 'dna_r10.4.1_e8.2_400bps_hac@v4.2.0' --mod --ref $mydir/reference/GRCh38.p14_genomic.fasta --sample_name 'PGXXXF7' --threads 36 --snp --str --phased --annotation false --include_all_ctgs --bam $mydir/PGXXXF7-hg38mapped.sort.bam --out_dir $mydir/PGXXXF7_output_hg38mappedBAM -profile singularity

Thanks.

RenzoTale88 commented 3 months ago

Hi @Aravind-mss, to skip the getGenome step you need to set --annotation false --cnv false --str false. You can find the guidelines on how to run it on unsupported genomes here. However, I can see that you are using a GRCh38 genome, so that step should complete successfully. Can you share the URL of the input reference genome?

Aravind-mss commented 3 months ago

Hi @RenzoTale88 Thanks for your response. I would like to perform the annotation as well, as I am running the pipeline on human samples. Here is the URL of the input reference genome I downloaded: https://ftp.ncbi.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.fna.gz

RenzoTale88 commented 3 months ago

Hi @Aravind-mss, the issue with the reference genome you're using is that its chromosome names do not follow the chrN or N pattern (the sequences are named by RefSeq accession, e.g. NC_000001.11):

$ wget -O - https://ftp.ncbi.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.fna.gz | zcat | head
--2024-03-26 09:50:45--  https://ftp.ncbi.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.fna.gz
Resolving ftp.ncbi.nih.gov (ftp.ncbi.nih.gov)... 130.14.250.12, 130.14.250.11, 2607:f220:41e:250::13, ...
Connecting to ftp.ncbi.nih.gov (ftp.ncbi.nih.gov)|130.14.250.12|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 972898531 (928M) [application/x-gzip]
Saving to: ‘STDOUT’

-                                        0%[                                                                             ] 216.00K   499KB/s              
>NC_000001.11 Homo sapiens chromosome 1, GRCh38.p14 Primary Assembly
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

You need to provide a genome with the appropriate chromosome naming (chrN or N) so that the workflow can identify it correctly.
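For readers hitting the same problem, the renaming described above could be sketched in Python as follows. This is a minimal illustration, not part of the workflow: the helper names (`REFSEQ_TO_CHR`, `SUPPORTED_NAME`, `rename_header`, `rename_fasta`) are hypothetical, the accession table covers only the GRCh38 primary chromosomes and the mitochondrion, and the regular expression is just one assumed reading of "chrN or N". Unplaced scaffolds and alternate contigs are left unchanged.

```python
import re

# Assumption: RefSeq accessions of the GRCh38 primary chromosomes.
# The version suffix (e.g. ".11") changes between patches, so it is ignored.
REFSEQ_TO_CHR = {f"NC_{n:06d}": f"chr{n}" for n in range(1, 23)}
REFSEQ_TO_CHR.update({"NC_000023": "chrX", "NC_000024": "chrY", "NC_012920": "chrM"})

# Assumed reading of the "chrN or N" pattern mentioned in the thread.
SUPPORTED_NAME = re.compile(r"^(chr)?([1-9]|1[0-9]|2[0-2]|X|Y|M|MT)$")


def rename_header(line: str) -> str:
    """Rewrite a FASTA header to a chrN name when its accession is known.

    Sequence lines and headers with unknown accessions keep their name;
    the free-text description after the name is dropped from headers.
    """
    if not line.startswith(">"):
        return line
    fields = line[1:].split()
    if not fields:
        return line
    name = fields[0]                  # e.g. "NC_000001.11"
    accession = name.split(".")[0]    # e.g. "NC_000001"
    return ">" + REFSEQ_TO_CHR.get(accession, name)


def rename_fasta(lines):
    """Yield the lines of a FASTA file with headers renamed."""
    for line in lines:
        yield rename_header(line.rstrip("\n"))
```

Applied to the header shown in the output above, `rename_header` turns `>NC_000001.11 Homo sapiens chromosome 1, GRCh38.p14 Primary Assembly` into `>chr1`. Any existing FASTA index would need to be regenerated after renaming, and a BAM already aligned against the old names would no longer match the renamed reference.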

Aravind-mss commented 3 months ago

Hi @RenzoTale88 Thanks for that. I will change to the chrN pattern and try again. Does the pipeline still work with an unmapped BAM, or do I need to align the reads up front?

RenzoTale88 commented 3 months ago

The workflow accepts either a mapped or an unmapped BAM as input. If it is unmapped, it will perform the alignment internally.

RenzoTale88 commented 3 months ago

@Aravind-mss, you can find instructions and recommendations regarding the reference genome in the README input section.

Aravind-mss commented 2 months ago

@RenzoTale88 I was able to run the workflow successfully, including annotation, with the hg38 reference build using the chrN pattern. Thanks for your help. However, when I try the same with a T2T genome build (https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_009914755.1), it fails at the getGenome() stage, with or without --annotation false. I have provided this genome with the appropriate chromosome naming (chrN pattern). I want to get this T2T build working with the pipeline. Any pointers are much appreciated.

RenzoTale88 commented 2 months ago

@Aravind-mss, as shown in the README input section, and in more detail in the genome compatibility section, the T2T genome is supported only for the SNP, SV and MOD components.