cortes-ciriano-lab / savana

Somatic structural variant caller for long-read data
Apache License 2.0

KeyError cram support #23

Closed kcleal closed 1 year ago

kcleal commented 1 year ago

Hi, thanks for the nice tool. I'm running into this error, which pops up straight away after running:

~/miniconda3/bin/savana --tumour COLO829/PAO32033.cram --normal COLO829BL/PAO33946.cram --outdir savana_out --ref GCA_000001405.15_GRCh38_no_alt_analysis_set.fna

Version 0.2.3 - beta
Source: /Users/kezcleal/miniconda3/lib/python3.10/site-packages/savana/savana.py

Running as sample PAO32033
Creating directory /Volumes/Kez6T/colo829/kit14/savana_out to store results
Using GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai as reference fasta index
Using multiprocessing with 8 threads

Traceback (most recent call last):
  File "/Users/kezcleal/miniconda3/bin/savana", line 10, in <module>
    sys.exit(main())
  File "/Users/kezcleal/miniconda3/lib/python3.10/site-packages/savana/savana.py", line 281, in main
    consensus_clusters, checkpoints, time_str = spawn_processes(args, bam_files, checkpoints, time_str, outdir)
  File "/Users/kezcleal/miniconda3/lib/python3.10/site-packages/savana/savana.py", line 160, in spawn_processes
    validated_breakpoints = call_breakpoints(clusters, args.buffer)
  File "/Users/kezcleal/miniconda3/lib/python3.10/site-packages/savana/breakpoints.py", line 205, in call_breakpoints
    for insertion_cluster in clustered_breakpoints[bp_type]:
KeyError: '<INS>'

helrick commented 1 year ago

Hi there, thanks for raising this issue. Unfortunately at the moment SAVANA doesn't support the use of .cram files - only .bam. This is functionality I can look into adding though.

For now, if you convert your cram files to bam does this resolve the issue?
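
The conversion suggested above can be sketched as follows. This is a minimal sketch that assumes samtools is installed and on the PATH; the helper names `cram_to_bam_commands` and `convert_cram_to_bam` are hypothetical and are not part of SAVANA.

```python
import subprocess
from pathlib import Path


def cram_to_bam_commands(cram_path, reference_fasta, threads=4):
    """Build the samtools commands that turn a CRAM into an indexed BAM.

    Hypothetical helper: it only assembles argument lists, so it can be
    inspected without samtools being installed.
    """
    bam_path = str(Path(cram_path).with_suffix(".bam"))
    convert = [
        "samtools", "view",
        "-b",                    # write BAM output
        "-T", reference_fasta,   # reference FASTA used to decode the CRAM
        "-@", str(threads),      # additional threads
        "-o", bam_path,
        cram_path,
    ]
    index = ["samtools", "index", bam_path]
    return bam_path, [convert, index]


def convert_cram_to_bam(cram_path, reference_fasta, threads=4):
    """Run the conversion; raises CalledProcessError if samtools fails."""
    bam_path, commands = cram_to_bam_commands(cram_path, reference_fasta, threads)
    for command in commands:
        subprocess.run(command, check=True)
    return bam_path
```

For the files in this issue, `convert_cram_to_bam("COLO829/PAO32033.cram", "GCA_000001405.15_GRCh38_no_alt_analysis_set.fna")` would write `COLO829/PAO32033.bam` plus its index, and the same call would be repeated for the normal sample.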

kcleal commented 1 year ago

I converted a few Mb to BAM and I think it works, thanks. I had a go at changing some of the source code to get it working, but no luck yet. Thanks for the quick reply.

helrick commented 1 year ago

Hi there, I've implemented CRAM support in the latest commit (v1.0.3 - I'll create a release for it soon). However, SAVANA will run much more slowly on CRAM files than BAM files. This is because I rely on the get_index_statistics functionality of pysam to optimise memory usage. This information isn't available for CRAM files though (see: https://github.com/pysam-developers/pysam/issues/1060).
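
To illustrate the kind of information a BAM index provides, here is a rough sketch of one way per-contig read counts could be used to skip empty contigs. It is not SAVANA's actual code: the helper names are hypothetical, and the example records are made-up stand-ins shaped like the entries pysam returns from `get_index_statistics()` (fields `contig`, `mapped`, `unmapped`, `total`).

```python
from collections import namedtuple


def contigs_with_reads(index_stats):
    """Return the names of contigs that have at least one mapped read."""
    return [stat.contig for stat in index_stats if stat.mapped > 0]


def load_index_stats(bam_path):
    """Read per-contig counts from a BAM index (needs pysam and a .bai file)."""
    import pysam  # imported lazily so the filtering logic runs without pysam

    with pysam.AlignmentFile(bam_path, "rb") as bam:
        return bam.get_index_statistics()


# Stand-in records with illustrative values, not real data
Stat = namedtuple("Stat", ["contig", "mapped", "unmapped", "total"])
example_stats = [
    Stat("chr1", 1200, 3, 1203),
    Stat("chrUn_KI270302v1", 0, 0, 0),
]
print(contigs_with_reads(example_stats))
```

With a BAM file this lookup comes straight from the index, without reading any alignments. The comment above says the same information is not available for CRAM.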

So I'd still recommend converting to BAM if you'd like faster speeds, but if that's not a concern, CRAM files should now work. If you know your chromosomes of interest, it's highly recommended that you supply them via the --contigs argument. This will prevent worker threads being spawned for thousands of contigs which have no reads mapped to them.
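
A list of chromosomes of interest can be derived from the reference index. The sketch below assumes that `--contigs` accepts a plain-text file with one contig name per line; check the SAVANA documentation for the expected format. The helper names and the choice of "primary" contigs (autosomes plus X and Y, with or without a `chr` prefix) are illustrative assumptions.

```python
import re

# Assumption: "primary" means numbered autosomes plus X and Y; chrM, alt,
# random and unplaced contigs are deliberately left out.
PRIMARY = re.compile(r"^(chr)?([0-9]{1,2}|X|Y)$")


def primary_contigs(fai_lines):
    """Pick primary contig names from the lines of a .fai index.

    The contig name is the first tab-separated column of each line.
    """
    names = (line.split("\t", 1)[0].strip() for line in fai_lines if line.strip())
    return [name for name in names if PRIMARY.match(name)]


def write_contigs_file(fai_path, out_path):
    """Write the selected contig names to out_path, one per line."""
    with open(fai_path) as fai:
        contigs = primary_contigs(fai)
    with open(out_path, "w") as out:
        out.write("\n".join(contigs) + "\n")
    return contigs
```

For example, `write_contigs_file("GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai", "contigs.txt")` would produce a file that could then be passed as `--contigs contigs.txt`, if the assumption about the format holds.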