bahlolab / superSTR

A lightweight, alignment-free utility for detecting repeat-containing reads in short-read WGS, WES and RNA-seq data.
GNU General Public License v2.0
17 stars 7 forks source link

Preprocessing warning message #10

Open gspirito opened 2 years ago

gspirito commented 2 years ago

Hi, I ran superSTR preprocessing on a cohort of WGS samples (in bam format). The output seems ok for all samples, however, for some of them I get this message in the log:


[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/90a8a5267a890ccd78717e88de59232d": Protocol not supported
[E::fai_build3_core] Failed to open the file /gpfs/internal/sweng/production/Resources/GRCh38_1000genomes/GRCh38_full_analysis_set_plus_decoy_hla.fa
[E::refs_load_fai] Failed to open reference file '/gpfs/internal/sweng/production/Resources/GRCh38_1000genomes/GRCh38_full_analysis_set_plus_decoy_hla.fa'
[E::cram_get_ref] Failed to populate reference for id 3241
[E::cram_next_slice] Failure to decode slice
Successful run; processed 793068206 reads.

Is it safe to proceed with the analysis anyway?

Thanks.

lfearnley commented 2 years ago

That's odd for a BAM - HTSlib seems to be treating this file as though it's a CRAM, then throwing an error when it tries to download reference sequences from ENA.

Is there any chance I could get you to please provide one of the commands that's generating the error?

gspirito commented 2 years ago

Hi, thanks for the answer, actually some of the files in my cohort are cram and others are bam, however, in the manual I saw that I should use "--mode=bam" for both cram and bam; furthermore the error occurs only for some cram files and not other cram.

Here's the command I use:

superstr --mode=bam -o /homenfs/user/HG01813/ -t 0.49 /homenfs/user/HG01813.final.cram

Thanks

lfearnley commented 2 years ago

Yep! You've got the correct command.

The issue is in htslib and how it reads CRAMs (which superSTR relies on, as well as samtools and many other pieces of software). htslib is failing to download the reference sequence that you've aligned to for some reason from ENA; this happens sometimes transiently I've found, but can also happen in server/workflow contexts.

If you're aligned to GRCh38, running the following should work (from here):

wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.fai 
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.ref_cache.tar.gz

tar xzf Homo_sapiens_assembly38.ref_cache.tar.gz
export REF_PATH="$(pwd)/ref/cache/%2s/%2s/%s:http://www.ebi.ac.uk/ena/cram/md5/%s"
export REF_CACHE="$(pwd)/ref/cache/%2s/%2s/%s"

Full documentation is available here. If you're running in an HPC, you'll need to make sure that REF_CACHE and REF_PATH are correctly set in the job environment.

Unfortunately, you'll need to re-run those CRAMs for which the error was generated :(

lfearnley commented 2 years ago

Oh, as a note to myself - I'll add some comments on this in the documentation before closing this out.

If there's any other issues you hit with superSTR let me know!