Error running pavfinder on a large sample

af8 commented 3 years ago

Hi @readmanchiu

I launched fusionbloom on a large sample (~400M 2*151bp pairs) and I got an error at the last step (pavfinder) :

pavfinder fusion --nproc 20 --only_fusions --include_non_exon_bound_fusion --min_support 2 \
--gbam LIB00002320_S14-c2g.bam \
--tbam LIB00002320_S14-c2t.bam \
--r2c LIB00002320_S14-r2c.bam \
--transcripts_fasta .../fusionbloom-1.7.1-db/gencode.v32.basic.annotation.exon.cds.gtf.longest.fa \
--genome_index .../fusionbloom-1.7.1-db/hg38-au hg38-au \
               rnabloom/LIB00002320_S14.transcripts.filtered.fa \
               gencode.v32.basic.annotation.exon.cds.gtf.gz hg38-au.pri.fa pavfinder

The error message was :

No paths found for 2881917:fusion-chr1-179102032-L-chr20-45252239-R:0
  No paths found for 3995850:fusion-chr14-23356663-R-chr19-34684173-R:0
  No paths found for 2834775:fusion-chr3-143985747-R-chr16-13239597-L:0
  No paths found for 1915192:read_through-chrX-15601014-L-chrX-15659016-R:1
  Processed 25684 queries in 212.12 seconds (121.08 queries/sec)
  [E::idx_find_and_load] Could not retrieve index file for 'pavfinder/subseqs.bam'
  Note: gmap.avx512 does not exist.  For faster speed, may want to compile package on an AVX512 machine
  Note: gmap.avx2 does not exist.  For faster speed, may want to compile package on an AVX2 machine
  GMAP version 2020-10-14 called with args: gmap.sse42 -D /gpfs/home/anthony.ferrari/data/references/hg38-au/fusionbloom-1.7.1-db/hg38-au -d hg38-au pavfi
nder/probes.fa -n0 -f samse -t 12
  Checking compiler assumptions for SSE2: 6B8B4567 327B23C6 xor=59F066A1
  Checking compiler assumptions for SSE4.1: -103 -58 max=198 => compiler zero extends
  Checking compiler options for SSE4.2: 6B8B4567 __builtin_clz=1 __builtin_ctz=0 _mm_popcnt_u32=17 __builtin_popcount=17
  Finished checking compiler assumptions
  Pre-loading compressed genome (oligos)......done (1,162,470,968 bytes, 283807 pages, 0.15 sec)
  Looking for genome hg38-au in directory /gpfs/home/anthony.ferrari/data/references/hg38-au/fusionbloom-1.7.1-db/hg38-au
  Looking for index files in directory /gpfs/home/anthony.ferrari/data/references/hg38-au/fusionbloom-1.7.1-db/hg38-au
    Pointers file is hg38-au.ref153offsets64meta
    Offsets file is hg38-au.ref153offsets64strm
    Positions file is hg38-au.ref153positions
  Offsets compression type: bitpack64
  Allocating memory for ref offset pointers, kmer 15, interval 3...done (134,217,744 bytes, 0.13 sec)
  Allocating memory for ref offsets, kmer 15, interval 3...done (470,734,416 bytes, 0.40 sec)
  Pre-loading ref positions, kmer 15, interval 3......done (3,913,147,304 bytes, 0.30 sec)
  Starting alignment
  No paths found for 2374505:fusion-chr1-7919317-L-chr8-27598194-R:na
  No paths found for 2396459:fusion-chr8-27604976-L-chr16-74453253-R:na
  No paths found for 2329109:fusion-chr8-27604976-L-chr16-74453253-R:na
  No paths found for 4123594:fusion-chr2-231808847-L-chr5-10435666-R:na
  No paths found for 2964927:fusion-chr7-7573607-R-chr19-20545192-R:na
  No paths found for 1971989:fusion-chr6-144403184-L-chr7-156763795-L:na
  No paths found for 2508122:fusion-chr6-149591278-R-chr7-88277853-R:na
  No paths found for 1504185:fusion-chr3-100851922-R-chr4-17581789-R:na
  No paths found for 3164915:fusion-chr4-102827452-L-chr20-49949273-L:na
  Processed 17020 queries in 186.25 seconds (91.38 queries/sec)
  [E::idx_find_and_load] Could not retrieve index file for 'pavfinder/probes.bam'
  [W::fai_insert_index] Ignoring duplicate sequence "4586255:read_through-chr1-155679512-L-chr1-155777622-R:na" at byte offset 2425769
  Traceback (most recent call last):
    File "/opt/conda/envs/fusionbloom-1.7.1/bin/find_sv_transcriptome.py", line 259, in <module>
      main()
    File "/opt/conda/envs/fusionbloom-1.7.1/bin/find_sv_transcriptome.py", line 245, in main
      find_support(events_merged, args.r2c, args.query_fasta, min_overlap=args.min_overhang, num_procs=args.nproc, debug=args.debug)
    File "/opt/conda/envs/fusionbloom-1.7.1/lib/python3.8/site-packages/pavfinder/transcriptome/read_support.py", line 71, in find_support
      support = scan_all(coords,
    File "/opt/conda/envs/fusionbloom-1.7.1/lib/python3.8/site-packages/pavfinder/transcriptome/read_support.py", line 382, in scan_all
      batches = list(create_batches(bam_file, coords, contigs, len(contigs)//num_procs, overlap_buffer, contig_fasta_file, perfect, get_seq, debug=debug))
    File "/opt/conda/envs/fusionbloom-1.7.1/lib/python3.8/site-packages/pavfinder/transcriptome/read_support.py", line 349, in create_batches
      bam = pysam.Samfile(bam_file, 'rb')
    File "pysam/libcalignmentfile.pyx", line 742, in pysam.libcalignmentfile.AlignmentFile.__cinit__
    File "pysam/libcalignmentfile.pyx", line 991, in pysam.libcalignmentfile.AlignmentFile._open
  ValueError: file has no sequences defined (mode='rb') - is it SAM/BAM format? Consider opening with check_sq=False

I do not really understand this message. Do you have any hints to provide and is there any easy turnaround ?

Also, in the logs there are a lot of complains about missing BAM indexes. Should something be done about it ? (although in other samples this has not been an issue).

The only difference I can see with other samples (that completed successfully) is that this one is very large (3 times more reads).

Many thanks for your help, Anthony

readmanchiu commented 3 years ago

seems like there is something wrong with r2c.bam could you check if there is alignments where the read sequence is not outputted:

samtools view LIB00002320_S14-r2c.bam | awk '$10=="*"'

just suspecting that the soft-clipped alignments do not have read sequences kept in the bam file. If that's the case, adding -Y to the minimap2 r2c step will solve the problem

af8 commented 3 years ago

Well spotted. The problem is indeed that r2c.bam file is empty. The error was not seen before because I still got a successful exit code (0) even though minimap2 failed. I have broken down the Makefile to transform it in a nextflow pipeline and I forgot to put set -o pipefail before the call to the piped sequence so this is my bad.

The problem was the same as the one reported in issue #12 (even though -f was supplied to samtools command) :

[M::mm_idx_gen::77.962*1.58] collected minimizers
[M::mm_idx_gen::82.369*2.48] sorted minimizers
[WARNING] For a multi-part index, no @SQ lines will be outputted. Please use --split-prefix.
[M::main::82.369*2.48] loaded/built the index for 1820985 target sequence(s)
[M::mm_mapopt_update::82.369*2.48] mid_occ = 1000
[M::mm_idx_stat] kmer size: 21; skip: 11; is_hpc: 0; #seq: 1820985
[M::mm_idx_stat::84.495*2.44] distinct minimizers: 142973084 (47.36% are singletons); average occurrences: 4.628; average spacing: 6.045
[E::sam_parse1] no SQ lines present in the header
[W::sam_read1] Parse error at line 2
samtools view: error reading file "-"

Reading this issue from minimap2 repository, it looks that providing more memory to minimap2 through the -I option can fix the problem as minimap2 will be able to index all the sequence in memory in one shot.

I have almost 5M sequences in the file "transcripts.filtered.fa" and minimap2 is able to load/index approximately 1M with the default 4GB memory used. I have now re-launched with 80GB memory (certainly too much ;-)) minimap2 -ax sr -t20 -I80g ... and it should solve the problem. I will keep you updated in this thread.

Maybe you should give users the possibility to customise the memory provided to minimap2 as you do for samtools sort.

readmanchiu commented 3 years ago

Great, let me know how it goes, I can add a memory variable for r2c if -I takes care of the issue

readmanchiu commented 3 years ago

Will close the issue for now unless there is more feedback

af8 commented 3 years ago

Yes, adding -I option to minimap2 with enough memory solved the problem. For fasta file with large number of sequences --split_prefix is no longer needed by minimap2 to process the whole sample.

bcgsc / pavfinder

Error running pavfinder on a large sample #13