`scsnv collapse` SegFaulting

Alf-Kenz commented 1 year ago

Hi,

I'm sorry to bother you again, I tried really hard to figure it out myself (and can keep looking), but maybe you can help give me pointers. I'm running scsnv collapse and it's segfaulting, I verified with valgrind and got the resulting stack trace[1]. I also used gdb to step through (in Debug not Release mode of CMake) and understand the code much better, but still can't figure it out :-(

Interestingly, it does write successfully write 3852 lines to the collapsed.bam (224K) worth. But then stops, and ultimately segfaults.

I had a hard time tracing through the pthread'ing to find more details. Can you help me understand gwsc::CollapsedBamWriter::operator()(), when is that functor being called? I couldn't find it even with gdb because of the threading.

The commands I've run (only the collapse failed) I also updated the external/htslib to the latest one and recompiled, didn't help.

~/06_SNVs/01_scsnv/build/src/scsnv index
--gtf fromillumina/GRCh38/Annotation/Archives/archive-2015-08-11-09-31-31/Genes.gencode/genes.gtf
--ref fromillumina/GRCh38/Sequence/BWAIndex/genome.fa
scsnvPREFIX

 ~/06_SNVs/01_scsnv/build/src/scsnv map
-l V3
-i fromillumina/scsnvPREFIX
-g fromillumina/GRCh38/Sequence/BWAIndex/genome.fa
-b lib01/barcode
-t 15 --bam-write 10 --quant-threads 10
-c ~/06_SNVs/01_scsnv/data/gene_groups.txt
-o lib01
lib01

01_scsnv/build/src/scsnv collapse -l V3
-r fromillumina/GRCh38/Sequence/BWAIndex/genome.fa
-i tmp_fromillumina/GRCh38/scsnvPREFIX
-o lib01
--threads 1 --bam-write 1
-b lib01/barcode_counts.txt.gz
lib01merged.bam

Which prints

[00:00:00] Loading the genome
[00:00:07] Loading the transcriptome index
[00:00:08] Loaded 2650971 known barcodes from lib01/barcode_counts.txt.gz
[00:00:08] Writing collapsed alignments to lib01collapsed.bam
Segmentation fault

and then just dies (rerunning with valgrind to see if it gets more details, my notes don't have them).

Maybe tomorrow I can try to get it to output sam not bam (hard bc it duplicates the header) to see if the threading + bgzf compression is the problem.

[1]

Valgrind result

``` ==3546730== Use of uninitialised value of size 8 ==3546730== at 0x485B057: crc32_z (in /usr/lib/x86_64-linux-gnu/libz.so.1.2.11) ==3546730== by 0x30BF4E: bgzf_compress (bgzf.c:667) ==3546730== by 0x30C181: deflate_block (bgzf.c:702) ==3546730== by 0x30E419: bgzf_flush.part.0 (bgzf.c:1963) ==3546730== by 0x2AC165: bam_write1 (sam.c:834) ==3546730== by 0x256D94: gwsc::CollapsedBamWriter::operator()() (collapse_aux.cpp:329) ==3546730== by 0x5128ECF: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28) ==3546730== by 0x4C71EA6: start_thread (pthread_create.c:477) ==3546730== by 0x5483A2E: clone (clone.S:95) ``` ... Repeats many times ... Then ``` ==3404587== Syscall param write(buf) points to uninitialised byte(s) ==3404587== at 0x4C7BFEF: __libc_write (write.c:26) ==3404587== by 0x4C7BFEF: write (write.c:24) ==3404587== by 0x2E403A: fd_write (hfile.c:547) ==3404587== by 0x2E3E17: flush_buffer (hfile.c:357) ==3404587== by 0x2E5057: hwrite2 (hfile.c:398) ==3404587== by 0x2DF8DB: hwrite (hfile.h:287) ==3404587== by 0x2DF8DB: bgzf_flush.part.0 (bgzf.c:1811) ==3404587== by 0x2A6DD5: bam_write1 (sam.c:687) ==3404587== by 0x256CF4: gwsc::CollapsedBamWriter::operator()() (collapse_aux.cpp:329) ==3404587== by 0x5128ECF: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28) ==3404587== by 0x4C71EA6: start_thread (pthread_create.c:477) ==3404587== by 0x5483A2E: clone (clone.S:95) ==3404587== Address 0x834684e is 2,318 bytes inside a block of size 4,096 alloc'd ==3404587== at 0x483877F: malloc (vg_replace_malloc.c:307) ==3404587== by 0x2E4AC2: hfile_init (hfile.c:114) ==3404587== by 0x2E555B: hopen_fd (hfile.c:624) ==3404587== by 0x2E5CBA: hopen (hfile.c:1103) ==3404587== by 0x299C4D: hts_open_format (hts.c:620) ==3404587== by 0x25462F: gwsc::CollapsedBamWriter::CollapsedBamWriter(std::__cxx11::basic_string, s td::allocator > const&, unsigned int, sam_hdr_t const*) (collapse_aux.cpp:294) ==3404587== by 0x195435: int gwsc::ProgCollapse::run_wrap_() (pcollapse.cpp:68) ==3404587== by 0x158393: gwsc::ProgBase::parse(int, char**) (pbase.cpp:59) ==3404587== by 0x1407F0: main (scsnv.cpp:56) ==3546730== ==3546730== Syscall param write(buf) points to uninitialised byte(s) ==3546730== at 0x4C7BFEF: __libc_write (write.c:26) ==3546730== by 0x4C7BFEF: write (write.c:24) ==3546730== by 0x312FCA: fd_write (hfile.c:547) ==3546730== by 0x313DF2: hwrite2 (hfile.c:403) ==3546730== by 0x30E48B: hwrite (hfile.h:301) ==3546730== by 0x30E48B: bgzf_flush.part.0 (bgzf.c:1968) ==3546730== by 0x2AC165: bam_write1 (sam.c:834) ==3546730== by 0x256D94: gwsc::CollapsedBamWriter::operator()() (collapse_aux.cpp:329) ==3546730== by 0x5128ECF: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28) ==3546730== by 0x4C71EA6: start_thread (pthread_create.c:477) ==3546730== by 0x5483A2E: clone (clone.S:95) ==3546730== Address 0xa78f1259 is 67,977 bytes inside a block of size 131,072 alloc'd ==3546730== at 0x483877F: malloc (vg_replace_malloc.c:307) ==3546730== by 0x30A826: bgzf_write_init (bgzf.c:455) ==3546730== by 0x30BE81: bgzf_hopen (bgzf.c:542) ==3546730== by 0x29A9EA: hts_hopen (hts.c:1471) ==3546730== by 0x29AD93: hts_open_format (hts.c:887) ==3546730== by 0x2546CF: gwsc::CollapsedBamWriter::CollapsedBamWriter(std::__cxx11::basic_string, std::allocator > const&, unsigned int, sam_hdr_t const*) (collapse_aux.cpp:294) ==3546730== by 0x195485: int gwsc::ProgCollapse::run_wrap_() (pcollapse.cpp:68) ==3546730== by 0x1583E3: gwsc::ProgBase::parse(int, char**) (pbase.cpp:59) ==3546730== by 0x140840: main (scsnv.cpp:56) ```

GWW commented 1 year ago

Hi @Alf-Kenz,

That is a strange error and I have not encountered it before. The way the bam writing works is that it writes in the background while the next set of reads are being processed (to speed up the code a bit). The main loop for the program is in pcollapsed.cpp in line 92 to 128. The bout.start() call on line 126 is when the writing processor is started. I do it this way to ensure that the collapsed bam file is sorted to avoid needing to call samtools sort.

Could you provide me with a gdb backtrace of what is happening and can try to figure out what is going on. Just out of curiosity is there disk space available for the bam file?

The other issue I have had is if there are a huge number of reads mapping to the same gene, this can lead to the step running out of memory. I have pushed an update (you'll need to run git pull in the repository and recompile) that limits the number of reads from a gene region to 5,000,000. If it exceeds that number that gene is skipped. This should prevent a gene with very very high expression from crashing the program.

If it seems like the program is running out of memory you can reduce the number of reads using -m 2000000 (or any other number) and you can increase this number if you get a lot of warnings using -m 10000000. Eventually, I am going to look into a way to splitting collapse to use a subset of the barcodes and then merge the resultant bam files to avoid this issue altogether.

I tested it on a single sample and it seems to be working.

Hopefully this helps the issue.

Gavin

GWW commented 1 year ago

Sorry, I made a few mistakes with how the warning works and fixed those and updated the README file. Please make sure you have pulled the most recent version of the code.

Thanks,

Gavin

Alf-Kenz commented 1 year ago

Hi,

Amazing! I think that could have been the problem. I get these mega regions which have way too many reads. When I add this in, it seems to make progress, I'll let it run and check back after dinner if it's writing to the bam

It gave:

[Warning] exceeded 5000000 skipping this gene region 56329735 reads. Region chr1:188021 - 1113179642
[Warning] exceeded 5000000 skipping this gene region 53416157 reads. Region chr2:218135 - 1543818173

As a test, I reduced the max reads all the way to 2, so I could see what happens [1].

There are more normal sized regions and then humongous regions which were why it was failing

[Warning] exceeded 1 skipping this gene region 417 reads. Region chr1:14606 - 17367
[Warning] exceeded 1 skipping this gene region 10227 reads. Region chr1:17412 - 29569
[Warning] exceeded 1 skipping this gene region 2 reads. Region chr1:30818 - 31108
[Warning] exceeded 1 skipping this gene region 15 reads. Region chr1:89254 - 91628
[Warning] exceeded 1 skipping this gene region 754 reads. Region chr1:93187 - 129222
[Warning] exceeded 1 skipping this gene region 7 reads. Region chr1:134027 - 134835
[Warning] exceeded 1 skipping this gene region 318 reads. Region chr1:135091 - 135894
[Warning] exceeded 1 skipping this gene region 10 reads. Region chr1:137720 - 137964
[Warning] exceeded 1 skipping this gene region 54 reads. Region chr1:148879 - 155830
[Warning] exceeded 1 skipping this gene region 49 reads. Region chr1:164440 - 173861
[Warning] exceeded 1 skipping this gene region 2353 reads. Region chr1:184924 - 187885
[Warning] exceeded 1 skipping this gene region 56329735 reads. Region chr1:188021 - 1113179642
[Warning] exceeded 1 skipping this gene region 2475 reads. Region chr2:38807 - 46869
[Warning] exceeded 1 skipping this gene region 53416157 reads. Region chr2:218135 - 1543818173
[Warning] exceeded 1 skipping this gene region 56 reads. Region chr3:53408 - 54345

and I'll look tonight if I just gave a super weird reference somehow. This library is deeply sequenced but those regions seem wrong. Maybe it's accidentally using toplevel chromsome-level regions. If it is, I'm sorry for wasting your time

[1] Small thing but I had trouble getting the CLI paramater to make a difference, I ended up having to modify

args_["reads"].as<unsigned int>(5000000);

GWW commented 1 year ago

Those are definitely strange results. chr1 is only ~250M base pairs and your last region in chromosome 1 is nearly 1 billion bases in length (chr1:188021 - 1113179642). Could it be something to do with your transcriptome annotation file? It looks to be quite old. I usually use ensembl's gtf.

In the test sample I ran this on, I only had one warning region in the mitochondria which isn't surprising. I updated the command line options to take the reads in milliions and fixed the parameter parsing error (sorry about that, it's been a long time since I've worked with this code). I tested it with -r 20 and the warning below went away:

[Warning] exceeded 10000000 skipping this gene region 13395811 reads. Region MT:581 - 4328

Alf-Kenz commented 1 year ago

Hi I found it! I checked separately for when the max_rgt_ increased by a very large number and found some very strange lines in the bam [1], they key being the crazy cigar strings like 41M172N48M 268435408N 1M3S (space added for emphasis). So the stacktrace goes:

crazy long 268435408 N=splice -> huge bam_cigar2rlen return -> huge bam_endpos -> huge max_rgt.

Looks like ENSG00000273874.1 is a microRNA so weird DNA as it is.

Possibly the alignment should have a max splice distance. How would you suggest having it ignore these reads? Modify bam_endpos so if max_rgt jumps too big, it continue's and restores the max_rgt to a previous value. Modify and rerun the alignment to not output if huge splice. Add a quick filter to the bam before read in again

I don't want later processing to also get confused by it, I gusss

[1] Weird lines, where I replaced the exact nucleotide sequences with NUCSEQ

A00440:302:HVGJKDSXX:2:2513:19669:2487   1040  chr1  188226  255  41M172N48M268435408N1M3S  *  0  -1  NUCSEQ  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  NM:i:0  XS:A:-  UB:Z:AGGGTATTCTCT  CB:Z:CATTGAGTCCGCTTAC  AS:i:90  XG:Z:ENSG00000273874.1  XI:i:20  RE:A:E  RA:A:E

A00440:302:HVGJKDSXX:2:2512:18927:23469  16    chr1  188226  255  41M172N48M268435408N1M3S  *  0  -1  NUCSEQ  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F  NM:i:0  XS:A:-  UB:Z:AGGGTATTCTCT  CB:Z:CATTGAGTCCGCTTAC  AS:i:90  XG:Z:ENSG00000273874.1  XI:i:20  RE:A:E  RA:A:E

Alf-Kenz commented 1 year ago

I will also note that I don't know what XA means, but it does seem to implicated many reads both in chromosome 1 and chromosomes X/2/12/9 just in this small region. (i.e.. the 3rd column in the bam is chr1 for all of these)

Which seems like it could be a little suspect. Did you think I set up the reference wrong so it doesn't make these large scale chromosomal boundaries apparent

GWW commented 1 year ago

The XA tag are alternative alignments emitted by BWA-mem. They shouldn't be an issue as these reads will be assigned a very low mapping quality and ultimately filtered.

I am more concerned about your reference build it's strange to have a gene annotation with a splice site like that. I would seriously considering updating to a newer version if you can.

When I look for that gene in my reference file it doesn't appear to have that large splice junction, so I am assuming it's a bug in your reference. Could you grep the gene ENSG00000273874 in your gtf for me? I want to make sure it isn't a GTF parsing error on my end.

I will also add a limit to the intron size on the scsnv indexing step and push it on the repo tomorrow.

zcat Homo_sapiens.GRCh38.102.gtf.gz  | grep ENSG00000273874
       mirbase gene    187891  187958  .       -       .       gene_id "ENSG00000273874"; gene_version "1"; gene_name "MIR6859-2"; gene_source "mirbase"; gene_biotype "miRNA";
1       mirbase transcript      187891  187958  .       -       .       gene_id "ENSG00000273874"; gene_version "1"; transcript_id "ENST00000612080"; transcript_version "1"; gene_name "MIR6859-2"; gene_source "mirbase"; gene_biotype "miRNA"; transcript_name "MIR6859-2-201"; transcript_source "mirbase"; transcript_biotype "miRNA"; tag "basic"; transcript_support_level "NA";
1       mirbase exon    187891  187958  .       -       .       gene_id "ENSG00000273874"; gene_version "1"; transcript_id "ENST00000612080"; transcript_version "1"; exon_number "1"; gene_name "MIR6859-2"; gene_source "mirbase"; gene_biotype "miRNA"; transcript_name "MIR6859-2-201"; transcript_source "mirbase"; transcript_biotype "miRNA"; exon_id "ENSE00003737837"; exon_version "1"; tag "basic"; transcript_support_level "NA";

Alf-Kenz commented 1 year ago

As always, thanks very much for responding

If I look at that I get [1].

But I should say I'm not attached to this reference in any way, I'll just bite the bullet and wait for bwa index to run on Homo_sapiens.GRCh38.109.gtf.gz and Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz like we had discussed in the other issue (?)

[1]

rg ENSG00000273874 Genes.gencode/genes.gtf

8434:chr1       ENSEMBL exon    187891  187958  .       -       .       exon_id "ENSE00003737837.1"; exon_number "1"; gene_id "ENSG00000273874.1"; gene_name "MIR6859-2"; gene_status "KNOWN"; gene_type "miRNA"; level "3"; tag "basic"; transcript_id "ENST00000612080.1"; transcript_name "MIR6859-2-201"; transcript_status "KNOWN"; transcript_support_level "NA"; transcript_type "miRNA"; tss_id "TSS7590";
8435:chr1       ENSEMBL transcript      187891  187958  .       -       .       gene_id "ENSG00000273874.1"; gene_name "MIR6859-2"; gene_status "KNOWN"; gene_type "miRNA"; level "3"; tag "basic"; transcript_id "ENST00000612080.1"; transcript_name "MIR6859-2-201"; transcript_status "KNOWN"; transcript_support_level "NA"; transcript_type "miRNA"; tss_id "TSS7590";

GWW commented 1 year ago

It looks like there may be some kind of bug. That gene should not have any reference skips and it should be filtered by the default length filtering in the scsnv index step. I automatically filter out transcripts less than 100 bp and this one is only 67 bp. Do you know which version of gencode you are using? I will try to download it and investigate further to see if I can find a bug.

It's weird to get an alignment with such a massive intron. Could you grep for ENST00000612080 in the fromillumina/scsnvPREFIX.transcripts.txt.gz file?

Alf-Kenz commented 1 year ago

Sure, that gives

Homo_sapiens/NCBI/GRCh38/scsnv_transcripts.txt.gz
33:chr1 0       20      31      ENST00000612080.1       MIR6859-2-201   187890  200321  -       19      187890,188021,188101,188125,188129,188129,188438,188438,188438,188488,188488,188488,188790,188790,188790,195258,195258,195262,200049       187957,188027,188104,188265,188265,188265,188485,188485,188485,188583,188583,188583,188888,188891,188901,195410,195415,195410,200321       -1      -1

But for now I'll start direct from those newest files and see if that fixes it, it will take a little bit of time to run the alignment but I can search through some of the in progress bams just to see if it pops up

GWW commented 1 year ago

It's strange that it has all these splice junctions when the gene shouldn't have any. Which gencode build is your GTF from? I can try indexing it and debugging what is going on.

Thanks,

Gavin

Alf-Kenz commented 1 year ago

Oh ok, so it was using the iGenome pre indexed set from here although in my rush to try to get it running I didn't follow it's symlinks so I now see it says '2015-08-11-09-31-31' for the gencode ID, which I'm sad about.

So now I'm rerunning with the most recent set

Alf-Kenz commented 1 year ago

Well the newest reference seems to work great! 🙃 What a roller-coaster, I'm sorry for taking you down a rabbit hole.

I added a little if statement to make sure that there aren't huge jumps in the splicing, so will let you know if it comes up again with a more modern reference.

Thanks for your help again

GWW commented 1 year ago

Thanks for letting me know and I am glad everything seems to be working.

On Fri, Apr 21, 2023, 4:44 p.m. Alf-Kenz @.***> wrote:

Closed #24 https://github.com/GWW/scsnv/issues/24 as completed.

— Reply to this email directly, view it on GitHub https://github.com/GWW/scsnv/issues/24#event-9070880032, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAHAEADU7KNW44XWYFT4ZVLXCLWRRANCNFSM6AAAAAAXET7BQM . You are receiving this because you commented.Message ID: @.***>

GWW / scsnv

`scsnv collapse` SegFaulting #24