cortes-ciriano-lab / savana

Somatic structural variant caller for long-read data
Apache License 2.0
46 stars 2 forks source link

Failed to read reference bug #25

Closed zhemingfan closed 1 year ago

zhemingfan commented 1 year ago

Hi @helrick,

Hope you're doing well. I downloaded the image from https://quay.io/repository/biocontainers/savana?tab=tags with singularity pull docker://quay.io/biocontainers/savana:1.0.0--pyhdfd78af_0. Running the following command:

singularity exec -B /projects,/home envs/savana_1.0.0--pyhdfd78af_0.sif savana --tumour tumour.cram --normal tumour.bam --outdir output/savana_out_dir --ref hg38_no_alt.fa --threads 72 --length 50 --sample SAMPLE

led to this log:


██████  █████  ██    ██  █████  ███    ██  █████
██      ██   ██ ██    ██ ██   ██ ████   ██ ██   ██  
███████ ███████ ██    ██ ███████ ██ ██  ██ ███████
     ██ ██   ██  ██  ██  ██   ██ ██  ██ ██ ██   ██  
███████ ██   ██   ████   ██   ██ ██   ████ ██   ██  

Version 1.0.0 - beta
Source: /usr/local/lib/python3.10/site-packages/savana/savana.py

Running as sample SAMPLE
Using path/hg38_no_alt.fa.fai as reference fasta index
Using multiprocessing with 72 threads

Submitting 195 "get_potential_breakpoints" tasks to 72 worker threads
Identified potential breakpoints        613.819 seconds
Clustered potential breakpoints         32.113 seconds
Called consensus breakpoints            39.174 seconds
Length after: 49930
Total breakpoints: 49930 (18851 insertions)
Using 15 as binsize, there are 5407 redistributed intervals
Max binsize 29, min binsize 1
Setting maxtasksperchild to 344 
[W::find_file_url] Failed to read reference "https://www.ebi.ac.uk/ena/cram/md5/f98db672eb0993dcfdabafe2a882905c": Broken pipe
[W::find_file_url] Failed to read reference "https://www.ebi.ac.uk/ena/cram/md5/f98db672eb0993dcfdabafe2a882905c": Broken pipe
[W::find_file_url] Failed to read reference "https://www.ebi.ac.uk/ena/cram/md5/f98db672eb0993dcfdabafe2a882905c": Broken pipe
[W::find_file_url] Failed to read reference "https://www.ebi.ac.uk/ena/cram/md5/f98db672eb0993dcfdabafe2a882905c": Broken pipe
[W::find_file_url] Failed to read reference "https://www.ebi.ac.uk/ena/cram/md5/f98db672eb0993dcfdabafe2a882905c": Broken pipe
[W::find_file_url] Failed to read reference "https://www.ebi.ac.uk/ena/cram/md5/f98db672eb0993dcfdabafe2a882905c": Broken pipe
[W::find_file_url] Failed to read reference "https://www.ebi.ac.uk/ena/cram/md5/f98db672eb0993dcfdabafe2a882905c": Broken pipe
[W::find_file_url] Failed to read reference "https://www.ebi.ac.uk/ena/cram/md5/f98db672eb0993dcfdabafe2a882905c": Broken pipe
[E::easy_errno] Libcurl reported error 52 (Server returned nothing (no headers, no data))
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/f98db672eb0993dcfdabafe2a882905c": Input/output error

Happy to share anything else to help you troubleshoot this.

helrick commented 1 year ago

Hi there! This may be related to the use of cram files (also raised in #23). Currently cram files are not supported, only bam files. It looks like the error is happening when SAVANA is trying to access the cram file. If you try with bam files does this resolve the issue?

I'll also look into adding cram support for a future release :)

zhemingfan commented 1 year ago

Hi @helrick,

Thanks for the quick response and adding in CRAM file support. It was a little unintuitive to me that both normal/tumour pair had be the same file type. Something like this could be a quick fix?

aln_files = {
'tumour': pysam.AlignmentFile(args.tumour, "rb") if args.tumour.endswith('bam') else pysam.AlignmentFile(args.tumour, "rc") if args.tumour.endswith('cram') else sys.exit('Unrecognized file extension. Tumour and normal files must be BAM/CRAM'),
'normal': pysam.AlignmentFile(args.normal, "rb") if args.normal.endswith('bam') else pysam.AlignmentFile(args.normal, "rc") if args.normal.endswith('cram') else sys.exit('Unrecognized file extension. Tumour and normal files must be BAM/CRAM')
}

Otherwise, if this is intended behaviour - that would be helpful to know too :) I've closed the ticket. Thank you!

helrick commented 1 year ago

Hi @zhemingfan,

Unfortunately, since CRAM files don't store information about the number of mapped reads per contig, the very first step (finding potential breakpoints from aligned reads) needs to be parallelized differently depending on whether BAM or CRAM files were used (see code link)

https://github.com/cortes-ciriano-lab/savana/blob/041435b4570bdeb78955d1c0861306d5379ac333/savana/run.py#L35-L83

I can see a way to avoid this, either by using the CRAM parallelization method when at least one files is a CRAM, or parallelizing differently depending on the filetype, but it would require a bit of re-engineering. To avoid this, I made the assumption that people would be storing their tumour and normal files in the same format. If this isn't the case though I can look into adding this feature in future.

zhemingfan commented 1 year ago

Thank you @helrick for the update!

I was able to resolve this internally - it's not too much of an issue from our end.