WGLab / LinkedSV

MIT License
20 stars 8 forks source link

Corrupted aux data #6

Closed bioinfornatics closed 5 years ago

bioinfornatics commented 5 years ago

Dear,

The tool named output_bam_coreinfo segfault around line 52

https://github.com/WGLab/LinkedSV/blob/4ca5333b6f1a1c187aaf3d9f10b9a9cf5fee385c/src/output_bam_coreinfo.c#L52

And print to stderr a wall or errors:

[E::sam_format1] Corrupted aux data for read X1:1:H2GKLCCXY:1:1202:2098:22458
[E::sam_format1] Corrupted aux data for read X1:1:H2GKLCCXY:1:2104:29954:72403
[E::sam_format1] Corrupted aux data for read X1:1:H2GKLCCXY:1:1122:2443:17535
[E::sam_format1] Corrupted aux data for read X1:1:H2GKLCCXY:1:1202:31730:47632
[E::sam_format1] Corrupted aux data for read X1:1:H2GKLCCXY:1:2123:4513:56458
[E::sam_format1] Corrupted aux data for read X1:1:H2GKLCCXY:1:1118:25885:53856
[E::sam_format1] Corrupted aux data for read X1:1:H2GKLCCXY:1:1221:1164:27732
[E::sam_format1] Corrupted aux data for read X1:1:H2GKLCCXY:1:1209:3143:6970
fangli80 commented 5 years ago

Hi, Thank you for using LinkedSV. Looks like there are issues with the aux field.

Could you please use samtools view inputbam | grep "X1:1:H2GKLCCXY:1:1202:2098:22458" to show the entire information for this read?

Thank you! Li

bioinfornatics commented 5 years ago

Hi,

Thanks for your help

$ samtools view inputbam | grep "X1:1:H2GKLCCXY:1:1202:2098:22458"
X1:1:H2GKLCCXY:1:1202:2098:22458        83      chr1    9997    19      87S41M  =       10004   -34     CCCCAACCCGAACAATAACCCTAATCCTCACGCAAAAGACAGCACACACCCTCCCATCTGCTGACTCGAGTTCAAACGCGTGCCCTACCGATCACCCTAACCCTAACCCTAACCCTAACCCTAACCCT      7))))))<7-FA7--7-J77--7------A-7-A-<-7-7--FA-7--7--F7-A7-7--7--77---F7--F7-77--7--A-7<-7<--7--F-JJJJJJJJJJJFJJJJJJJFJJJJJJJJJJJJ        QT:Z:<AAFFJJJ   BC:Z:CAGATGGG   QX:Z:AAAFFFJJJJJFJJJJ AM:A:0  XM:A:0  TR:Z:TAGGGTT    TQ:Z:JFJJJJJ    AS:f:-107       XS:f:-107       XT:i:1  RX:Z:ACTACAGGGTTAGGGT   OM:i:1  RG:Z:B00IF30:LibraryNotSpecified:1:unknown_fc:0
X1:1:H2GKLCCXY:1:1202:2098:22458        163     chr1    10004   0       52M99S  =       9997    34      CGCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTGTAGTAGATCGTAACAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAATAAACTAATCCAACGGCACAGGCTTACGA       AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJFJ-7-77-7F7F--AF--7-77FF<FJF77AFFFF<7-A<<<JFFJJFJ-F7<<F<AAAAFFJFF<-AA7-7-------<------))))))))<)-----       QT:Z:<AAFFJJJ   BC:Z:CAGATGGG   QX:Z:AAAFFFJJJJJFJJJJ   AM:A:0  XM:A:0  RX:Z:ACTACAGGGTTAGGGT   AS:f:-107       XS:f:-107       XT:i:1  OM:i:0  RG:Z:B00IF30:LibraryNotSpecified:1:unknown_fc:0

I would like to know:

Here the bam was generated with longranger 2.1.4

GDB trace

Program received signal SIGSEGV, Segmentation fault.
0x0000000000400a47 in output_coreinfo_bam (
    in_bam=0x7fffffffb7da "input.bam") at output_bam_coreinfo.c:52
52                  *p1 = *p2;
(gdb) bt
#0  0x0000000000400a47 in output_coreinfo_bam (
    in_bam=0x7fffffffb7da "input.bam") at output_bam_coreinfo.c:52
#1  0x0000000000400bc9 in main (argc=2, argv=0x7fffffffb0d8) at output_bam_coreinfo.c:92
(gdb) print p2
$1 = (uint8_t *) 0xa27 <Address 0xa27 out of bounds>
(gdb) where
#0  0x0000000000400a47 in output_coreinfo_bam (
    in_bam=0x7fffffffb7da "input.bam") at output_bam_coreinfo.c:52
#1  0x0000000000400bc9 in main (argc=2, argv=0x7fffffffb0d8) at output_bam_coreinfo.c:92
(gdb) l
47              aux = bam_get_aux(b);
48              p1 = seq;
49              p2 = aux;
50              while (p2 < b->data + b->l_data)
51              {
52                  *p1 = *p2;
53                  p1 ++;
54                  p2 ++;
55              }
56
fangli80 commented 5 years ago

Thank you. I have tested on bam files generated by longranger-2.1.2 and longranger-2.1.6. There were no errors. Also, your bam file looks normal and the auxiliary field looks quite similar to mine. But this is in plain text format. To test on the binary format, could you please upload a small example bam file with this error so that I can test on my side?

Thank you. Li

fangli80 commented 5 years ago

and what is your gcc version?

bioinfornatics commented 5 years ago

and what is your gcc version?

$ gcc --version
gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23)
bioinfornatics commented 5 years ago

To test on the binary format, could you please upload a small example bam file with this error so that I can test on my side?

it is a zipped bam as github complain about .bam extension I removed it. So you have to put it back

B00IF30.samtools1.9.test.header_mod.zip

fangli80 commented 5 years ago

Hi,

It works on my side. I cloned the latest version of LinkedSV and ran the following commands:

git clone git@github.com:WGLab/LinkedSV.git
cd LinkedSV
sh build.sh

And then run the binary file output_bam_coreinfo under LinkedSV/bin/

path/to/LinkedSV/bin/output_bam_coreinfo  B00IF30.samtools1.9.test.header_mod.bam  > B00IF30.samtools1.9.test.header_mod.bam.coreinfo.sam

The sam file B00IF30.samtools1.9.test.header_mod.bam.coreinfo.sam can be generated without errors.

My gcc version is

$ gcc --version 
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

My OS is Centos 7

I also ran these commands:

gdb --args /home/fangl/LinkedSV/bin/output_bam_coreinfo  B00IF30.samtools1.9.test.header_mod.bam
b output_bam_coreinfo.c:52
r

And I get these infor:

Breakpoint 1, output_coreinfo_bam (in_bam=0x7fffffffd5bb "B00IF30.samtools1.9.test.header_mod.bam") at output_bam_coreinfo.c:52
52              *p1 = *p2;
(gdb) p p2
$1 = (uint8_t *) 0x68de1c "QTZAAFFFJJJ"
(gdb) p p1
$2 = (uint8_t *) 0x68dd39 "(\204\210\201\030A\201$BA\"\022$\022\030(\022\022\202\210\202\"\201!$\022B\202\210\"A\202\204\202\"\030\021\"(\021\"(\021\"(\021\"(\021\"(\021\"(\021\"(\021\"(\021\"(\021\"(\021\"(\021\"(\021\"(\200\f\f\f\f\026\f\026\f\f%\033\f\f\033\026\f\f% \026) \026\033\033\026\033\f \f\033\026%% %\026\033\f\f\026)%\026\f\f)\033 %)%\033))%)%)   %) \f%)))%\033%%)%%)%%)))))))))%%\033)))%) ))))%", ')' <repeats 18 times>, "%))"...
(gdb) 

This information looks normal and I cannot find errors. Could you please run the same commands and see what you get.

Thanks.

bioinfornatics commented 5 years ago
$ gdb --args ./output_bam_coreinfo B00IF30.samtools1.9.test.header_mod.bam
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-92.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from src/output_bam_coreinfo...done.
(gdb) b output_bam_coreinfo.c:52
Breakpoint 1 at 0x400a43: file output_bam_coreinfo.c, line 52.
(gdb) r
Starting program: B00IF30.samtools1.9.test.header_mod.bam
[Thread debugging using libthread_db enabled]
@HD VN:1.3  SO:coordinate
@SQ SN:chr1 LN:249250621    AS:chr1 SP:human
@SQ SN:chr2 LN:243199373    AS:chr2 SP:human
@SQ SN:chr3 LN:198022430    AS:chr3 SP:human
@SQ SN:chr4 LN:191154276    AS:chr4 SP:human
…
…
[E::sam_format1] Corrupted aux data for read X1:1:H2GKLCCXY:1:1221:1164:27732
[E::sam_format1] Corrupted aux data for read X1:1:H2GKLCCXY:1:1209:3143:6970
[E::sam_format1] Corrupted aux data for read X1:1:H2GKLCCXY:1:1111:32126:70926
…
Breakpoint 1, output_coreinfo_bam (
    in_bam=0x7fffffffb667 "…B00IF30.samtools1.9.test.header_mod.bam") at output_bam_coreinfo.c:52
52              *p1 = *p2;
(gdb) p p2
$1 = (uint8_t *) 0xa27 <Address 0xa27 out of bounds>
(gdb) p p1
$2 = (uint8_t *) 0xa24 <Address 0xa24 out of bounds>

I will rebuild with another gcc version Thanks a lot for your help

Edit same result with gcc 6.4.0 I will try tomorrow on centos 7

bioinfornatics commented 5 years ago

Here the core dump dump_gcc_6_4_0.zip

fangli80 commented 5 years ago

OK. In line 47, bam_get_aux(b) is a macro defined in sam.h in htslib. So when you rebuild LinkedSV, please rebuild the whole thing and see if there are any errors during the whole process. Since LinkedSV did not provide a "make clean" command, my suggestion is that delete the whole folder and re-download and rebuild.

Please let me know the new results after rebuild. Thank you.

bioinfornatics commented 5 years ago

Ok thanks @fangli08 On my side we can not used bundled library thus I can not use the htslib provided here. I can not use on my side htslib lesser than 1.6 as your tools call samtools sort with -t parameter. So here I build with samtools 1.9 and htslib 1.9 and that seem to be not compatible with your tool see the commit: https://github.com/samtools/htslib/commit/95b1034e93b345fef7aa4980750ba1dedde2c284

so:

I confirm if I build we bundled htslib 1.3 that work which mean that is not compatible with recent release.

I will be very happy to see the support of htslib 1.6 or 1.9

fangli80 commented 5 years ago

Thank you for your information. I will test on htslib 1.9 tomorrow.

fangli80 commented 5 years ago

Hi, I have tested on htslib-1.9 and it works with no errors. My commands are:

cd /home/fangl/
git clone https://github.com/WGLab/LinkedSV.git
cd /home/fangl/LinkedSV/src
rm -r htslib-1.3
wget https://github.com/samtools/htslib/releases/download/1.9/htslib-1.9.tar.bz2
tar xaf htslib-1.9.tar.bz2
cd htslib-1.9/
mkdir ../lib
./configure --disable-lzma --disable-bz2 CPPFLAGS=-I/home/fangl/local/include LDFLAGS=-L/home/fangl/local/lib --prefix=/home/fangl/LinkedSV/src/
make
make install ## htslib is installed to /home/fangl/LinkedSV/src/ and the `libhts.a` file is under `/home/fangl/LinkedSV/src/lib`

cd /home/fangl/LinkedSV/src
gcc -g -O0 -std=c99 -I ./include -L ./lib/ -o output_bam_coreinfo  output_bam_coreinfo.c               -l hts -l z -l m -l pthread

check the version of htslib:

$ ls /home/fangl/LinkedSV/src/bin 
bgzip  htsfile  tabix
$ ./htsfile  --version 
htsfile (htslib) 1.9
Copyright (C) 2018 Genome Research Ltd.

Run output_bam_coreinfo on your bam:

/home/fangl/LinkedSV/src/output_bam_coreinfo B00IF30.samtools1.9.test.header_mod.bam >  B00IF30.samtools1.9.test.header_mod.bam.core.sam

The output sam file was generated without warnings or errors.

Could you please try the exact same command and see what you get. By the way, what environment are you using?

fangli80 commented 5 years ago

although there are differences between 1.3 and 1.9, the bam_get_aux(b) macro is the same.

bioinfornatics commented 5 years ago

you get any problem with latest htslib 1.9 ?

fangli80 commented 5 years ago

I haven't tested on all codes of LinkedSV. But for the output_bam_coreinfo program, I didn't get any problem yet.

fangli80 commented 5 years ago

One difference is that, if using htslib-1.9, the htslib should be dynamically linked. If I remove the .so files, output_bam_coreinfo cannot be compiled.