Closed Alf-Kenz closed 1 year ago
Hi @Alf-Kenz,
That is a strange error and I have not encountered it before. The way the bam writing works is that it writes in the background while the next set of reads are being processed (to speed up the code a bit). The main loop for the program is in pcollapsed.cpp
in line 92 to 128. The bout.start()
call on line 126 is when the writing processor is started. I do it this way to ensure that the collapsed bam file is sorted to avoid needing to call samtools sort
.
Could you provide me with a gdb
backtrace of what is happening and can try to figure out what is going on. Just out of curiosity is there disk space available for the bam file?
The other issue I have had is if there are a huge number of reads mapping to the same gene, this can lead to the step running out of memory. I have pushed an update (you'll need to run git pull
in the repository and recompile) that limits the number of reads from a gene region to 5,000,000. If it exceeds that number that gene is skipped. This should prevent a gene with very very high expression from crashing the program.
If it seems like the program is running out of memory you can reduce the number of reads using -m 2000000
(or any other number) and you can increase this number if you get a lot of warnings using -m 10000000
. Eventually, I am going to look into a way to splitting collapse to use a subset of the barcodes and then merge the resultant bam files to avoid this issue altogether.
I tested it on a single sample and it seems to be working.
Hopefully this helps the issue.
Gavin
Sorry, I made a few mistakes with how the warning works and fixed those and updated the README file. Please make sure you have pulled the most recent version of the code.
Thanks,
Gavin
Hi,
Amazing! I think that could have been the problem. I get these mega regions which have way too many reads. When I add this in, it seems to make progress, I'll let it run and check back after dinner if it's writing to the bam
It gave:
[Warning] exceeded 5000000 skipping this gene region 56329735 reads. Region chr1:188021 - 1113179642
[Warning] exceeded 5000000 skipping this gene region 53416157 reads. Region chr2:218135 - 1543818173
As a test, I reduced the max reads all the way to 2, so I could see what happens [1].
There are more normal sized regions and then humongous regions which were why it was failing
[Warning] exceeded 1 skipping this gene region 417 reads. Region chr1:14606 - 17367
[Warning] exceeded 1 skipping this gene region 10227 reads. Region chr1:17412 - 29569
[Warning] exceeded 1 skipping this gene region 2 reads. Region chr1:30818 - 31108
[Warning] exceeded 1 skipping this gene region 15 reads. Region chr1:89254 - 91628
[Warning] exceeded 1 skipping this gene region 754 reads. Region chr1:93187 - 129222
[Warning] exceeded 1 skipping this gene region 7 reads. Region chr1:134027 - 134835
[Warning] exceeded 1 skipping this gene region 318 reads. Region chr1:135091 - 135894
[Warning] exceeded 1 skipping this gene region 10 reads. Region chr1:137720 - 137964
[Warning] exceeded 1 skipping this gene region 54 reads. Region chr1:148879 - 155830
[Warning] exceeded 1 skipping this gene region 49 reads. Region chr1:164440 - 173861
[Warning] exceeded 1 skipping this gene region 2353 reads. Region chr1:184924 - 187885
[Warning] exceeded 1 skipping this gene region 56329735 reads. Region chr1:188021 - 1113179642
[Warning] exceeded 1 skipping this gene region 2475 reads. Region chr2:38807 - 46869
[Warning] exceeded 1 skipping this gene region 53416157 reads. Region chr2:218135 - 1543818173
[Warning] exceeded 1 skipping this gene region 56 reads. Region chr3:53408 - 54345
and I'll look tonight if I just gave a super weird reference somehow. This library is deeply sequenced but those regions seem wrong. Maybe it's accidentally using toplevel chromsome-level regions. If it is, I'm sorry for wasting your time
[1] Small thing but I had trouble getting the CLI paramater to make a difference, I ended up having to modify
args_["reads"].as<unsigned int>(5000000);
Those are definitely strange results. chr1
is only ~250M base pairs and your last region in chromosome 1 is nearly 1 billion bases in length (chr1:188021 - 1113179642). Could it be something to do with your transcriptome annotation file? It looks to be quite old. I usually use ensembl's gtf.
In the test sample I ran this on, I only had one warning region in the mitochondria which isn't surprising. I updated the command line options to take the reads in milliions and fixed the parameter parsing error (sorry about that, it's been a long time since I've worked with this code). I tested it with -r 20
and the warning below went away:
[Warning] exceeded 10000000 skipping this gene region 13395811 reads. Region MT:581 - 4328
Hi
I found it! I checked separately for when the max_rgt_
increased by a very large number and found some very strange lines in the bam [1], they key being the crazy cigar strings like 41M172N48M 268435408N 1M3S
(space added for emphasis). So the stacktrace goes:
crazy long 268435408 N
=splice -> huge bam_cigar2rlen
return -> huge bam_endpos
-> huge max_rgt
.
Looks like ENSG00000273874.1
is a microRNA so weird DNA as it is.
Possibly the alignment should have a max splice distance. How would you suggest having it ignore these reads? Modify bam_endpos
so if max_rgt
jumps too big, it continue
's and restores the max_rgt to a previous value. Modify and rerun the alignment to not output if huge splice. Add a quick filter to the bam before read in again
I don't want later processing to also get confused by it, I gusss
[1] Weird lines, where I replaced the exact nucleotide sequences with NUCSEQ
A00440:302:HVGJKDSXX:2:2513:19669:2487 1040 chr1 188226 255 41M172N48M268435408N1M3S * 0 -1 NUCSEQ FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 XS:A:- UB:Z:AGGGTATTCTCT CB:Z:CATTGAGTCCGCTTAC AS:i:90 XG:Z:ENSG00000273874.1 XI:i:20 RE:A:E RA:A:E
A00440:302:HVGJKDSXX:2:2512:18927:23469 16 chr1 188226 255 41M172N48M268435408N1M3S * 0 -1 NUCSEQ FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F NM:i:0 XS:A:- UB:Z:AGGGTATTCTCT CB:Z:CATTGAGTCCGCTTAC AS:i:90 XG:Z:ENSG00000273874.1 XI:i:20 RE:A:E RA:A:E
I will also note that I don't know what XA means, but it does seem to implicated many reads both in chromosome 1 and chromosomes X/2/12/9 just in this small region. (i.e.. the 3rd column in the bam is chr1
for all of these)
Which seems like it could be a little suspect. Did you think I set up the reference wrong so it doesn't make these large scale chromosomal boundaries apparent
The XA tag are alternative alignments emitted by BWA-mem. They shouldn't be an issue as these reads will be assigned a very low mapping quality and ultimately filtered.
I am more concerned about your reference build it's strange to have a gene annotation with a splice site like that. I would seriously considering updating to a newer version if you can.
When I look for that gene in my reference file it doesn't appear to have that large splice junction, so I am assuming it's a bug in your reference. Could you grep the gene ENSG00000273874
in your gtf for me? I want to make sure it isn't a GTF parsing error on my end.
I will also add a limit to the intron size on the scsnv indexing step and push it on the repo tomorrow.
zcat Homo_sapiens.GRCh38.102.gtf.gz | grep ENSG00000273874
mirbase gene 187891 187958 . - . gene_id "ENSG00000273874"; gene_version "1"; gene_name "MIR6859-2"; gene_source "mirbase"; gene_biotype "miRNA";
1 mirbase transcript 187891 187958 . - . gene_id "ENSG00000273874"; gene_version "1"; transcript_id "ENST00000612080"; transcript_version "1"; gene_name "MIR6859-2"; gene_source "mirbase"; gene_biotype "miRNA"; transcript_name "MIR6859-2-201"; transcript_source "mirbase"; transcript_biotype "miRNA"; tag "basic"; transcript_support_level "NA";
1 mirbase exon 187891 187958 . - . gene_id "ENSG00000273874"; gene_version "1"; transcript_id "ENST00000612080"; transcript_version "1"; exon_number "1"; gene_name "MIR6859-2"; gene_source "mirbase"; gene_biotype "miRNA"; transcript_name "MIR6859-2-201"; transcript_source "mirbase"; transcript_biotype "miRNA"; exon_id "ENSE00003737837"; exon_version "1"; tag "basic"; transcript_support_level "NA";
As always, thanks very much for responding
If I look at that I get [1].
But I should say I'm not attached to this reference in any way, I'll just bite the bullet and wait for bwa index
to run on
Homo_sapiens.GRCh38.109.gtf.gz
and
Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
like we had discussed in the other issue (?)
[1]
rg ENSG00000273874 Genes.gencode/genes.gtf
8434:chr1 ENSEMBL exon 187891 187958 . - . exon_id "ENSE00003737837.1"; exon_number "1"; gene_id "ENSG00000273874.1"; gene_name "MIR6859-2"; gene_status "KNOWN"; gene_type "miRNA"; level "3"; tag "basic"; transcript_id "ENST00000612080.1"; transcript_name "MIR6859-2-201"; transcript_status "KNOWN"; transcript_support_level "NA"; transcript_type "miRNA"; tss_id "TSS7590";
8435:chr1 ENSEMBL transcript 187891 187958 . - . gene_id "ENSG00000273874.1"; gene_name "MIR6859-2"; gene_status "KNOWN"; gene_type "miRNA"; level "3"; tag "basic"; transcript_id "ENST00000612080.1"; transcript_name "MIR6859-2-201"; transcript_status "KNOWN"; transcript_support_level "NA"; transcript_type "miRNA"; tss_id "TSS7590";
It looks like there may be some kind of bug. That gene should not have any reference skips and it should be filtered by the default length filtering in the scsnv index
step. I automatically filter out transcripts less than 100 bp and this one is only 67 bp. Do you know which version of gencode you are using? I will try to download it and investigate further to see if I can find a bug.
It's weird to get an alignment with such a massive intron. Could you grep for ENST00000612080
in the fromillumina/scsnvPREFIX.transcripts.txt.gz
file?
Sure, that gives
Homo_sapiens/NCBI/GRCh38/scsnv_transcripts.txt.gz
33:chr1 0 20 31 ENST00000612080.1 MIR6859-2-201 187890 200321 - 19 187890,188021,188101,188125,188129,188129,188438,188438,188438,188488,188488,188488,188790,188790,188790,195258,195258,195262,200049 187957,188027,188104,188265,188265,188265,188485,188485,188485,188583,188583,188583,188888,188891,188901,195410,195415,195410,200321 -1 -1
But for now I'll start direct from those newest files and see if that fixes it, it will take a little bit of time to run the alignment but I can search through some of the in progress bams just to see if it pops up
It's strange that it has all these splice junctions when the gene shouldn't have any. Which gencode build is your GTF from? I can try indexing it and debugging what is going on.
Thanks,
Gavin
Oh ok, so it was using the iGenome pre indexed set from here although in my rush to try to get it running I didn't follow it's symlinks so I now see it says '2015-08-11-09-31-31' for the gencode ID, which I'm sad about.
So now I'm rerunning with the most recent set
Well the newest reference seems to work great! 🙃 What a roller-coaster, I'm sorry for taking you down a rabbit hole.
I added a little if statement to make sure that there aren't huge jumps in the splicing, so will let you know if it comes up again with a more modern reference.
Thanks for your help again
Thanks for letting me know and I am glad everything seems to be working.
On Fri, Apr 21, 2023, 4:44 p.m. Alf-Kenz @.***> wrote:
Closed #24 https://github.com/GWW/scsnv/issues/24 as completed.
— Reply to this email directly, view it on GitHub https://github.com/GWW/scsnv/issues/24#event-9070880032, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAHAEADU7KNW44XWYFT4ZVLXCLWRRANCNFSM6AAAAAAXET7BQM . You are receiving this because you commented.Message ID: @.***>
Hi,
I'm sorry to bother you again, I tried really hard to figure it out myself (and can keep looking), but maybe you can help give me pointers. I'm running
scsnv collapse
and it's segfaulting, I verified with valgrind and got the resulting stack trace[1]. I also used gdb to step through (in Debug not Release mode of CMake) and understand the code much better, but still can't figure it out :-(Interestingly, it does write successfully write 3852 lines to the collapsed.bam (224K) worth. But then stops, and ultimately segfaults.
I had a hard time tracing through the pthread'ing to find more details. Can you help me understand
gwsc::CollapsedBamWriter::operator()()
, when is that functor being called? I couldn't find it even with gdb because of the threading.The commands I've run (only the collapse failed) I also updated the
external/htslib
to the latest one and recompiled, didn't help.Which prints
and then just dies (rerunning with valgrind to see if it gets more details, my notes don't have them).
Maybe tomorrow I can try to get it to output sam not bam (hard bc it duplicates the header) to see if the threading + bgzf compression is the problem.
[1]
Valgrind result
``` ==3546730== Use of uninitialised value of size 8 ==3546730== at 0x485B057: crc32_z (in /usr/lib/x86_64-linux-gnu/libz.so.1.2.11) ==3546730== by 0x30BF4E: bgzf_compress (bgzf.c:667) ==3546730== by 0x30C181: deflate_block (bgzf.c:702) ==3546730== by 0x30E419: bgzf_flush.part.0 (bgzf.c:1963) ==3546730== by 0x2AC165: bam_write1 (sam.c:834) ==3546730== by 0x256D94: gwsc::CollapsedBamWriter::operator()() (collapse_aux.cpp:329) ==3546730== by 0x5128ECF: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28) ==3546730== by 0x4C71EA6: start_thread (pthread_create.c:477) ==3546730== by 0x5483A2E: clone (clone.S:95) ``` ... Repeats many times ... Then ``` ==3404587== Syscall param write(buf) points to uninitialised byte(s) ==3404587== at 0x4C7BFEF: __libc_write (write.c:26) ==3404587== by 0x4C7BFEF: write (write.c:24) ==3404587== by 0x2E403A: fd_write (hfile.c:547) ==3404587== by 0x2E3E17: flush_buffer (hfile.c:357) ==3404587== by 0x2E5057: hwrite2 (hfile.c:398) ==3404587== by 0x2DF8DB: hwrite (hfile.h:287) ==3404587== by 0x2DF8DB: bgzf_flush.part.0 (bgzf.c:1811) ==3404587== by 0x2A6DD5: bam_write1 (sam.c:687) ==3404587== by 0x256CF4: gwsc::CollapsedBamWriter::operator()() (collapse_aux.cpp:329) ==3404587== by 0x5128ECF: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28) ==3404587== by 0x4C71EA6: start_thread (pthread_create.c:477) ==3404587== by 0x5483A2E: clone (clone.S:95) ==3404587== Address 0x834684e is 2,318 bytes inside a block of size 4,096 alloc'd ==3404587== at 0x483877F: malloc (vg_replace_malloc.c:307) ==3404587== by 0x2E4AC2: hfile_init (hfile.c:114) ==3404587== by 0x2E555B: hopen_fd (hfile.c:624) ==3404587== by 0x2E5CBA: hopen (hfile.c:1103) ==3404587== by 0x299C4D: hts_open_format (hts.c:620) ==3404587== by 0x25462F: gwsc::CollapsedBamWriter::CollapsedBamWriter(std::__cxx11::basic_string