Open gspirito opened 2 years ago
That's odd for a BAM - HTSlib seems to be treating this file as though it's a CRAM, then throwing an error when it tries to download reference sequences from ENA.
Is there any chance I could get you to please provide one of the commands that's generating the error?
Hi, thanks for the answer, actually some of the files in my cohort are cram and others are bam, however, in the manual I saw that I should use "--mode=bam" for both cram and bam; furthermore the error occurs only for some cram files and not other cram.
Here's the command I use:
superstr --mode=bam -o /homenfs/user/HG01813/ -t 0.49 /homenfs/user/HG01813.final.cram
Thanks
Yep! You've got the correct command.
The issue is in htslib and how it reads CRAMs (which superSTR relies on, as well as samtools and many other pieces of software). htslib is failing to download the reference sequence that you've aligned to for some reason from ENA; this happens sometimes transiently I've found, but can also happen in server/workflow contexts.
If you're aligned to GRCh38, running the following should work (from here):
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.fai
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.ref_cache.tar.gz
tar xzf Homo_sapiens_assembly38.ref_cache.tar.gz
export REF_PATH="$(pwd)/ref/cache/%2s/%2s/%s:http://www.ebi.ac.uk/ena/cram/md5/%s"
export REF_CACHE="$(pwd)/ref/cache/%2s/%2s/%s"
Full documentation is available here. If you're running in an HPC, you'll need to make sure that REF_CACHE and REF_PATH are correctly set in the job environment.
Unfortunately, you'll need to re-run those CRAMs for which the error was generated :(
Oh, as a note to myself - I'll add some comments on this in the documentation before closing this out.
If there's any other issues you hit with superSTR let me know!
Hi, I ran superSTR preprocessing on a cohort of WGS samples (in bam format). The output seems ok for all samples, however, for some of them I get this message in the log:
Is it safe to proceed with the analysis anyway?
Thanks.