COMBINE-lab / quark

semi-reference-based short read compression
GNU General Public License v3.0

Cannot decompress or compress paired-end reads #8

Closed yuansliu closed 6 years ago

yuansliu commented 6 years ago

Here are the details of my issue, from cloning the source to running the program. I just re-cloned quark from https://github.com/COMBINE-lab/quark.git (the master branch).

[yuansliu@titan20 build]$ cmake ..
-- The C compiler identification is GNU 5.3.1
-- The CXX compiler identification is GNU 5.3.1
-- Check for working C compiler: /opt/rh/devtoolset-4/root/usr/bin/cc
-- Check for working C compiler: /opt/rh/devtoolset-4/root/usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working CXX compiler: /opt/rh/devtoolset-4/root/usr/bin/c++
-- Check for working CXX compiler: /opt/rh/devtoolset-4/root/usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
Making Default build type
-- Found ZLIB: /usr/lib64/libz.so (found version "1.2.7") 
-- Boost version: 1.53.0
-- Found the following Boost libraries:
--   iostreams
--   filesystem
--   system
--   thread
--   timer
--   chrono
--   program_options
--   serialization
BOOST_INCLUDEDIR = 
BOOST_LIBRARYDIR = 
Boost_FOUND = 1
BOOST INCLUDE DIR = /usr/include
BOOST INCLUDE DIRS = /usr/include
BOOST LIB DIR = /usr/lib64
BOOST LIBRAREIS = /usr/lib64/libboost_iostreams-mt.a;/usr/lib64/libboost_filesystem-mt.a;/usr/lib64/libboost_system-mt.a;/usr/lib64/libboost_thread-mt.a;/usr/lib64/libboost_timer-mt.a;/usr/lib64/libboost_chrono-mt.a;/usr/lib64/libboost_program_options-mt.a;/usr/lib64/libboost_serialization-mt.a
...
TBB_LIBRARIES = /usr/lib64/libtbb.so;/usr/lib64/libtbbmalloc.so
Boost_LIBRARIES = /usr/lib64/libboost_iostreams-mt.a;/usr/lib64/libboost_filesystem-mt.a;/usr/lib64/libboost_system-mt.a;/usr/lib64/libboost_thread-mt.a;/usr/lib64/libboost_timer-mt.a;/usr/lib64/libboost_chrono-mt.a;/usr/lib64/libboost_program_options-mt.a;/usr/lib64/libboost_serialization-mt.a
-- Configuring done
-- Generating done
-- Build files have been written to: /home/yuansliu/Data/quark/build

Then I ran make and hit the following error:

[ 71%] Building CXX object src/CMakeFiles/sailfish_core.dir/SailfishIndexer.cpp.o
cd /home/yuansliu/Data/quark/build/src && /opt/rh/devtoolset-4/root/usr/bin/c++    -pthread -funroll-loops -fPIC -fomit-frame-pointer -Ofast -DHAVE_ANSI_TERM -DHAVE_SSTREAM -DRAPMAP_SALMON_SUPPORT -Wall -g -std=c++11 -Wreturn-type -Werror=return-type -static-libstdc++ -Wno-deprecated-register -Wno-unused-local-typedefs -I/home/yuansliu/Data/quark/tests -I/home/yuansliu/Data/quark/include -I/home/yuansliu/Data/quark/include/eigen3 -I/home/yuansliu/Data/quark/external -I/home/yuansliu/Data/quark/external/cereal/include -I/home/yuansliu/Data/quark/external/install/include -I/home/yuansliu/Data/quark/external/install/include/rapmap -I/home/yuansliu/Data/quark/external/install/include/jellyfish-2.2.3    -o CMakeFiles/sailfish_core.dir/SailfishIndexer.cpp.o -c /home/yuansliu/Data/quark/src/SailfishIndexer.cpp
In file included from /home/yuansliu/Data/quark/include/SailfishIndex.hpp:16:0,
                 from /home/yuansliu/Data/quark/src/SailfishIndexer.cpp:62:
/home/yuansliu/Data/quark/include/SailfishIndexVersionInfo.hpp:4:28: fatal error: spdlog/fmt/fmt.h: No such file or directory
compilation terminated.
make[2]: *** [src/CMakeFiles/sailfish_core.dir/SailfishIndexer.cpp.o] Error 1

I think this was caused by the spdlog version; I fixed the error by swapping in an older version of spdlog. After that, the build finished:

/usr/bin/cmake -E cmake_progress_report /home/yuansliu/Data/quark/build/CMakeFiles  60
[100%] Built target unitTests  
make[1]: Leaving directory "/data/yuansliu/quark/build"  
/usr/bin/cmake -E cmake_progress_start /home/yuansliu/Data/quark/build/CMakeFiles 0

With everything prepared, I ran

./quark.sh -r SRR635193_1.fastq.gz -i transindex -p 24 -o SRR635193_1_quark

This ran correctly.

Then I tried decompression and got the following error:

[yuansliu@titan20 quark]$ ./quark.sh -d decode -l S -i SRR635193_1_quark -p 24 -o SRR635193_1_quark_dec
/data/yuansliu/quark
10
0
[ decode] => { }
[ input] => { m_ }
[ output] => { um_ }
[2018-07-23 16:38:36.669] [jointLog] [info] reading seq buffer with command plzip -d -c -n 17 m_.seqs.lz
[2018-07-23 16:38:36.673] [jointLog] [info] reading offset buffer with command plzip -d -c -n 1 m_.offs.lz
[2018-07-23 16:38:36.675] [jointLog] [info] reading Ns buffer with command plzip -d -c -n 1 m_.nlocs.lz
[2018-07-23 16:38:36.773] [jointLog] [info] opened flip file "m_.flips.lz"
[2018-07-23 16:38:36.776] [jointLog] [info] bucket string length = 15
[2018-07-23 16:38:36.776] [jointLog] [info] effective read length = 54
[2018-07-23 16:38:36.776] [jointLog] [info] num onsies = 
[2018-07-23 16:38:38.101] [jointLog] [info] done with onsies.
[2018-07-23 16:38:38.101] [jointLog] [info] reset effective read length to 10
wrote read 1900000[2018-07-23 16:38:42.341] [jointLog] [info] reached EOF in seq
/home/yuansliu/Data/quark

 Starting Quark Decoder Module ...

 Input directory : SRR635193_1_quark
./quark.sh: line 159: 95129 Segmentation fault      $decoder $inputdir $out S
"

I think this is caused by no quality parameter being passed to $decoder, so I changed line 155 of quark.sh from `$decoder $inputdir $out S` to `$decoder $inputdir $out S N`. Then I got another error:

Input directory : SRR635193_1_quark
 Output directory : SRR635193_1_quark_dec
 No quality score file found 
 Library type : SINGLE END
Sequence File : { "SRR635193_1_quark/reads.quark.lz" }
Offset File: { "SRR635193_1_quark/offset.quark.lz" }
Island file : { "SRR635193_1_quark/islands.txt" }
 50% [||||||||||||||||||||||||||||||                              ]terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr: __pos (which is 101) > this->size() (which is 58)
./quark.sh: line 159: 96483 Aborted                 $decoder $inputdir $out S N
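For reference, the exception itself is easy to reproduce: `std::string::substr` throws `std::out_of_range` whenever the requested position exceeds the string size. A standalone snippet (my own, not quark code) using the numbers from the log:

```cpp
#include <iostream>
#include <stdexcept>
#include <string>

int main() {
    std::string read(58, 'A');   // the decoder's string was 58 chars long
    try {
        read.substr(101);        // requesting __pos = 101, as in the log
    } catch (const std::out_of_range& e) {
        // with libstdc++ this prints a message like:
        // basic_string::substr: __pos (which is 101) > this->size() (which is 58)
        std::cerr << e.what() << '\n';
    }
    return 0;
}
```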

Then I tested paired-end read compression. I ran the command

./quark.sh -1 SRR635193_1.fastq.gz -2 SRR635193_2.fastq.gz -i transindex -p 24 -o SRR635193_pe_quark

and got the following error:

[2018-07-23 16:52:20.109] [jointLog] [info] Gathered fragment lengths from all threads
[2018-07-23 16:52:20.109] [jointLog] [info] Computing effective length factors --- max length = 1000
[2018-07-23 16:52:20.109] [jointLog] [info] finished computing effective length factors
[2018-07-23 16:52:20.109] [jointLog] [info] mean fragment length = 163.724
[2018-07-23 16:52:20.109] [jointLog] [info] Estimating effective lengths
[2018-07-23 16:52:20.578] [jointLog] [info] Computed 346365 rich equivalence classes for further processing
[2018-07-23 16:52:20.578] [jointLog] [info] Counted 24885954 total reads in the equivalence classes 
[2018-07-23 16:52:28.903] [jointLog] [info] Computed 346365 rich equivalence classes for further processing
[2018-07-23 16:52:28.903] [jointLog] [info] Counted 24885954 total reads in the equivalence classes 

 start writing unmapped
./quark.sh: line 159: 102463 Segmentation fault      $quark quant -i $ind -l IU -1 <(gunzip -c $left) -2 <(gunzip -c $right) -p $th -o $out
/home/yuansliu/Data/quark/SRR635193_pe_quark/aux

I think the program fails at line 908 of GZipWriter.cpp:

if(qualities[0] == "GGGGGGAGGGGGGGTTGTTAGGGGGTCGGAGGAAAAGGTTGGGGAACAGCTAAATAGGTTGTTGTTGATTTGGTTA"

Here, qualityScore has the default value `false`, so `qualities` is empty and accessing `qualities[0]` is an error.

I deleted these three lines of code (the `if` block), since they are useless, then ran cmake and make again.
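As a minimal sketch of why the original block crashes (the function shape and names below are my assumptions, not the actual GZipWriter.cpp code): `operator[]` on an empty `std::vector` is undefined behavior, which matches the segfault; either deleting the block or guarding the access avoids it.

```cpp
#include <string>
#include <vector>

// Sketch only: when quality scores are disabled, `qualities` stays empty,
// so the original debug check indexed an empty vector (undefined behavior,
// typically a segfault). A guard like this would make the access safe.
bool matchesDebugQuality(const std::vector<std::string>& qualities,
                         const std::string& probe) {
    return !qualities.empty() && qualities[0] == probe;
}
```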

I also deleted line 128 of quark.sh to preserve unmapped_1.fastq and unmapped_2.fastq.

I ran the compression command again and inspected the unmapped output:

[yuansliu@titan20 aux]$ head unmapped_1.fastq 
@0
CTTCGTGTGGCACTATAGCCTTTAGC
+
IIIIIIIIIIIIIIIIIIIIIIIIII

I found that the unmapped reads are not 54 bp long.

I think this is because `seq` at line 913 does not contain the quality strings when qualityScore == false. After `int len = (seq.size() - 3) / 4;`, only a substring of each read is saved.
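A small worked example supports this (the field layout, read|quality|read|quality with three separators, is taken from the reply quoted later in this thread; the helper is mine, not quark's code):

```cpp
#include <cassert>
#include <string>

// (seq.size() - 3) / 4 recovers the field length only when seq really
// holds four equal-length fields joined by three separator characters.
int perFieldLength(const std::string& seq) {
    return (static_cast<int>(seq.size()) - 3) / 4;
}

int main() {
    // With quality scores: read|qual|read|qual, each field 54 chars long.
    std::string withQual(4 * 54 + 3, 'A');
    assert(perFieldLength(withQual) == 54);

    // If, with quality scores disabled, seq holds only read|read
    // (2 * 54 + 1 = 109 chars), the same formula yields 26: exactly the
    // truncated 26-bp reads seen in unmapped_1.fastq above.
    std::string noQual(2 * 54 + 1, 'A');
    assert(perFieldLength(noQual) == 26);
    return 0;
}
```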

Thank you very much.

hiraksarkar commented 6 years ago

Hi @yuansliu, thanks for reporting this; I reformatted the question a bit for readability. I am unable to reproduce the error for the paired-end file. I did not get a chance to run single-end mode.

Here is what I ran.

./quark.sh -1 SRR635193_1.fastq.gz -2 SRR635193_2.fastq.gz -i index_dir -p 20 -o SRR635193_q

to encode, and to decode

`./quark.sh -d decode -l P -i SRR635193_q -p 10 -o SRR635193_q_d`

Both ran without error.

Two things to make sure of: 1. you are on the master branch and have done a git pull to the latest commit; 2. the index is correct. I would strongly suggest first running the above two commands on the paired-end files.

Sorry for the inconvenience.

hiraksarkar commented 6 years ago

Hi @yuansliu, to me the proper way to solve this problem is to ship a Docker image with the program, but that would take some time. My code base is up to date with the master branch.

In the meantime, please make sure that you have the correct fastq.gz files.

yuansliu commented 6 years ago

Hi @hiraksarkar, here is a simple bug. Please help me check it.

In quark.sh, the decoder is invoked at lines 147 and 155 with only three parameters.

[screenshot: decoder invocation in quark.sh]

Meanwhile, the main function of BinaryDecoder.cpp needs four parameters, as the code at line 644 shows. At line 638, the vector args has only three entries: args[0] is the input directory, args[1] the output directory, and args[2] the library type. However, line 656 accesses args[3], so I think an out-of-range error always occurs here. I really cannot run the decompression program at all.

[screenshot: main function of BinaryDecoder.cpp]
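As a minimal sketch of the mismatch (argument names are my guesses; the real BinaryDecoder.cpp differs): the script passes three arguments while main() reads a fourth, so without a bounds check the access is out of range.

```cpp
#include <iostream>
#include <string>
#include <vector>

int main(int argc, char* argv[]) {
    std::vector<std::string> args(argv + 1, argv + argc);
    // quark.sh passes only three arguments, so args[3] below would be out
    // of range unless a check like this rejects the short list first.
    if (args.size() < 4) {
        std::cerr << "usage: decoder <input> <output> <libtype> <quality>\n";
        return 1;
    }
    const std::string& inputDir  = args[0]; // input directory
    const std::string& outputDir = args[1]; // output directory
    const std::string& libType   = args[2]; // library type (S or P)
    const std::string& quality   = args[3]; // quality flag missing from quark.sh
    std::cout << inputDir << ' ' << outputDir << ' '
              << libType << ' ' << quality << '\n';
    return 0;
}
```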

Are these two files on the master branch? I would also like to know the decoder command you used in quark.sh.

hiraksarkar commented 6 years ago

Okay, I just realized quark.sh is old and not equipped for quality scores. Please use the instructions given in the README to run the Snakemake-based pipeline.

I will update the shell file as well soon.

yuansliu commented 6 years ago

Hi, @hiraksarkar

Unfortunately, even Snakemake is not working. For quark.sh I can make some changes by adding a parameter, but the C++ source code is the same, as the following screenshot shows.

[screenshot: snakemake pipeline failure]

I am wondering which version you used to compress/decompress.

I am quite sure that the fastq.gz files are correct.

yuansliu commented 6 years ago

Here is more information to help with fixing the bugs.

When compressing paired-end reads, the command is ./quark.sh -1 SRR635193_1.fastq.gz -2 SRR635193_2.fastq.gz -i transindex -p 24 -o SRR635193_pe_quark

At line 904, accessing qualities[0] is an out-of-range error: qualities is an empty vector, so index 0 is invalid.

Here is your response: "You are absolutely right, those lines are useless, although they should not stop the program from running. I put those in to test the inclusion of quality scores, which should never appear in production code. I will definitely remove the part related to quality scores."

[screenshot: GZipWriter.cpp around line 900]

They should never appear in production code, so just delete them.

"The second problem: line 913 of source file "GZipWriter.cpp". Why the length of reads is (seq.size()-3)/4? After writing unmapped sequences, I find that the two unmapped files contain reads much short than original length of reads.

_No, that should not happen, while keeping quality scores I created the seq as a concatenation of 4 strings |||._" To verify this, I add a line to print 'seq' as following picture. changed The result is: gzwerror The seq do not contain information of qual.

Therefore, line 913 is an error.

hiraksarkar commented 6 years ago

Thanks for the heads-up; I think all these bugs were introduced when I tested the quality score feature. I will definitely fix them as soon as I get some cycles. Unfortunately I am away right now, so I have not had a window of time yet. If you want, you can fork the repo and push the changes, and I will merge them. Otherwise, I will get back to you Saturday with the fixes. If that's too late, let me know.

yuansliu commented 6 years ago

Thank you very much. I have just listed more details; I hope they give you some leads. I am waiting for your working code.

yuansliu commented 6 years ago

Hi, @hiraksarkar and Prof. @rob-p

Just a friendly reminder.

I am still waiting for your code so I can run some experiments.

Best,

Yuansheng

hiraksarkar commented 6 years ago

@yuansliu I am working on it; it will be pushed before the end of tonight (Pacific time).

hiraksarkar commented 6 years ago

Hi @yuansliu, after fixing the peripheral bugs (the wrong function call and the useless debug code), I tested the code on a sample dataset of 10K reads from the above file and it worked. When I tested on the entire file, it failed again. While I am still trying to track down the bug, I strongly suggest you start benchmarking with an older version of the repo: do a clean clone and then `git reset --hard 6527bee8288b9cc7dab68f5551c1bf31cd719c81`. That version does not have quality scores, though. In the meantime I will keep investigating the problem. Sorry for the inconvenience.

I have also pushed the recent code to the repo with the corrections you pointed out (basically removing the debug prints and adding quality-score support to quark.sh).

yuansliu commented 6 years ago

Hi @hiraksarkar

This old version looks better, but when I compress SRR635193_1.fastq, it cannot be decompressed.

It works OK on SRR1294116.fastq (both compression and decompression succeed).

[screenshot: decompression failure on SRR635193_1.fastq]

Before, I was always trying to compress/decompress SRR635193_1.fastq and SRR635193_2.fastq, so I kept hitting bugs. I guess the program works well only on some datasets.

rob-p commented 6 years ago

Hi @yuansliu,

I'm sorry to see you having these problems. @hiraksarkar, is there some way to determine whether this results from the FASTQ file (as opposed to FASTA)? Also, can we verify that this particular version, to which we directed @yuansliu, works on all of the datasets we used in the paper? Is there any issue with variable read lengths?

hiraksarkar commented 6 years ago

Hi @rob-p,

Not sure, but we tested against SRR635193 too, and that should not be a problem. It seems like the error has something to do with the quality scores.

yuansliu commented 6 years ago

Hi Prof. Patro @rob-p

I have sent Hirak the compressed file (the one that cannot be decompressed, as in the screenshot above).

Thank you both very much for your help.

hiraksarkar commented 6 years ago

After some false starts I found the bug causing the quark failures, and I also tracked down why it ran from my old folder. A bit of background is needed to discuss the problem and the solution. When we developed quark, we assumed there would be at most 255 islands per equivalence class, and from our extensive testing we found that was sufficient for compressing reads. The recent index of the human transcriptome adds a new set of sequences to the reference, which leads to more islands per equivalence class. In this particular example, when the number of islands exceeds 255, the island id overflows and is truncated, which can yield a shorter number of nucleotides.

When I was running the program on my old machine, I had the older reference transcriptome, for which the number of islands never crossed 255, so it ran. Long story short, an immediate fix is to widen the island id type used when writing and reading the island files. In hindsight, this will increase the compressed file size. If you are using quark as a competing program in your benchmark, please feel free to use the new sizes; it would be great if you mention the reason as well.
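A minimal sketch of the wraparound (the actual on-disk types in quark may differ; `uint8_t`/`uint16_t` here are illustrative):

```cpp
#include <cstdint>
#include <iostream>

int main() {
    // Island ids were stored in a single byte, assuming at most 255
    // islands per equivalence class.
    std::uint32_t numIslands = 300;  // newer human index: more than 255
    std::uint8_t  narrowId   = static_cast<std::uint8_t>(numIslands);
    std::cout << static_cast<int>(narrowId) << '\n'; // prints 44 (300 % 256)

    // Widening the stored type removes the wraparound, at the cost of a
    // slightly larger compressed file.
    std::uint16_t widenedId = static_cast<std::uint16_t>(numIslands);
    std::cout << widenedId << '\n';  // prints 300
    return 0;
}
```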

As a solution, I just pushed a version of the code that works with the aforementioned file we were stuck on. This is how you should run it:

./quark.sh -1 SRR635193_1.fastq.gz -2 SRR635193_2.fastq.gz -i index_dir -p 10 -o encoded_folder -q 1

and to decompress:

./quark.sh -d garbage -l P -i encoded_folder -p 10 -o decoded_folder -q 1

I would advise you NOT to run one mate of a paired-end file as single-end; that might break things, since paired-end files are compressed on the basis of mapping information obtained from the underlying mapper, RapMap. So always run paired-end files as paired-end with the flag -l P, as shown.

Thanks for your patience; I will clean up the code whenever I get a chance. If you have more questions, send them this way.

With the above push I am closing this issue for now; feel free to reopen.