Closed: yuansliu closed this issue 6 years ago
Hi @yuansliu Thanks for reporting this; I reformatted the question a bit for readability. I am unable to reproduce the error for the paired-end file. I did not get a chance to run the single-end mode.
Here is what I ran.
`./quark.sh -1 SRR635193_1.fastq.gz -2 SRR635193_2.fastq.gz -i index_dir -p 20 -o SRR635193_q`
to encode, and to decode
`./quark.sh -d decode -l P -i SRR635193_q -p 10 -o SRR635193_q_d`
Both ran without error.
Two things to make sure of: 1. you are on the master branch and have run git pull to get the latest code; 2. the index is correct. I would strongly suggest first running the above two commands on the paired-end files.
Sorry for the inconvenience.
Hi @yuansliu The proper way to solve this, in my view, is to ship a Docker image with the program, but that would take some time. My code base is up to date with the master branch.
In the meantime, please make sure that you have the correct fastq.gz files.
Hi @hiraksarkar I have found a simple bug, described below. Please help me check it.
In quark.sh, the decoder is invoked with only three parameters at lines 147 and 155.
However, the main function in BinaryDecoder.cpp expects four parameters, as in the code at line 644. At line 638, the vector args has only three elements: args[0] is the input directory, args[1] is the output directory, and args[2] is the library type. Yet line 656 accesses args[3], so an out-of-bounds access always occurs there. As a result, I cannot run the decompression program at all.
Are these two files on the master branch? I would also like to know the exact commands you used with quark.sh.
Okay, I just realized that quark.sh is old and not equipped for quality scores. Please use the instructions given in the README to run the snakemake-based pipeline.
I will update the shell file as well soon.
Hi, @hiraksarkar
Unfortunately, even snakemake is not working. For quark.sh, I can make some changes by adding a parameter, but the C++ source code is the same, as shown in the following picture.
I am wondering which version you used to compress/decompress.
I am very sure that the fastq.gz files are correct.
Here is more information to help you fix the bugs.
When compressing paired-end reads, the command is `./quark.sh -1 SRR635193_1.fastq.gz -2 SRR635193_2.fastq.gz -i transindex -p 24 -o SRR635193_pe_quark`
At line 904, accessing qualities[0] is an out-of-bounds error, because qualities is an empty vector, so index 0 does not exist.
Your earlier response was: "You are absolutely right, those lines are useless, although they should not stop the program from running. I put those in to test the inclusion of quality scores, which should never come in production code. I will definitely remove the part related to quality scores." Since it should never appear in production code, please just delete it.
The second problem: at line 913 of the source file GZipWriter.cpp, why is the length of the reads (seq.size()-3)/4? After writing the unmapped sequences, I find that the two unmapped files contain reads much shorter than the original read length.
Your earlier response was: "No, that should not happen; while keeping quality scores I created seq as a concatenation of 4 strings."
Therefore, line 913 is an error.
Thanks for the heads up. I think all of these bugs were introduced when I tested the quality-score feature. I will definitely fix them as soon as I get some cycles. Unfortunately I am away now, so I have not had a long enough window of time yet. If you want, you can fork the repo and push the changes, and I will merge them. Otherwise, I will get back to you on Saturday with the fixes. If that is too late, let me know.
Thank you very much. I have listed more details above; I hope they give you some leads. I am waiting for your corrected code.
Hi, @hiraksarkar and Prof. @rob-p
Just a friendly reminder.
I am still waiting for your code so I can run some experiments.
Best,
Yuansheng
@yuansliu I am working on it; it will be pushed before the end of tonight (Pacific time).
Hi @yuansliu,
After fixing the peripheral bugs (the wrong function call and the useless comments), I tested the code on a sample dataset of 10K reads drawn from the above file, and it worked. When I tested on the entire file, it failed again. While I am still trying to track down the bug, I strongly suggest you start benchmarking with an older version of the repo.
That is, you can run `git reset --hard 6527bee8288b9cc7dab68f5551c1bf31cd719c81`
after doing a clean clone of the repo. That version will not have quality scores, though. In the meantime I will keep investigating the problem. Sorry for the inconvenience.
I have also pushed the recent code to the repo, just to make the corrections you pointed out (basically removing the prints and adding support to quark.sh).
Hi @hiraksarkar
This old version looks better. However, when I try to compress SRR635193_1.fastq, it cannot be decompressed.
It works on the data SRR1294116.fastq (both compression and decompression are OK).
Previously, I always tried to compress/decompress SRR635193_1.fastq and SRR635193_2.fastq, so I always hit bugs. I guess the program works well only on some datasets.
Hi @yuansliu,
I'm sorry to see you having these problems. @hiraksarkar, is there some way to determine whether this results from the FASTQ file (as opposed to FASTA)? Also, can we verify that this particular version, to which we directed @yuansliu, works on all of the datasets we used in the paper? Is there any issue with variable read lengths?
Hi @rob-p ,
Not sure, but we tested against SRR635193 too, so that should not be the problem. It seems the error has something to do with the quality scores.
Hi Prof. Patro @rob-p
I have sent the compressed file (which cannot be decompressed, as shown in the picture above) to Hirak.
Thank you very much for both your helps.
After some false attempts, I found the bug that is causing the quark failures. I also tracked down why it ran from my old folder. A bit of background is needed to discuss the problem and the solution. When we developed quark, we assumed there would be at most 255 islands for each equivalence class, and from our extensive testing we found that sufficient for compressing reads. The recent index of the human transcriptome has added a new set of sequences to the reference, which leads to more islands per equivalence class. In this particular example, when the number of islands exceeds 255, the counter overflows and produces a truncated island id, which may contain a shorter number of nucleotides.
When I ran the program on my old machine, I had the older reference transcriptome, for which the number of islands never crossed 255, so it ran. Long story short, an immediate fix is to increase the integer size used when writing and reading the island files. In hindsight, this will increase the compressed file size. If you are using quark as a competing program in your benchmark, please feel free to use the new sizes; it would be great if you also mentioned the reason.
As a solution, I just pushed a copy of the code that works with the aforementioned file we were stuck on. This is how you should run it:
`./quark.sh -1 SRR635193_1.fastq.gz -2 SRR635193_2.fastq.gz -i index_dir -p 10 -o encoded_folder -q 1`
and to decompress:
`./quark.sh -d garbage -l P -i encoded_folder -p 10 -o decoded_folder -q 1`
I would advise you NOT to run a portion of a paired-end file as single-end; that might break things. Paired-end files are compressed on the basis of mapping information obtained from the underlying mapper RapMap, so always run paired-end files as paired-end with the flag -l P, as shown.
Thanks for your patience; I will clean up the code whenever I get a chance. If you have more questions, send them this way.
With the above push I am closing this issue for now; feel free to reopen.
Here I give the details of my problem, from cloning the source to running the program. I just re-cloned quark from https://github.com/COMBINE-lab/quark.git (the master branch).
Then I ran make, and there was an error, as shown below.
I think this was caused by the version of spdlog; I fixed the error by substituting an older version of spdlog. Finally,
everything was prepared OK. I ran the encoder, and it was
CORRECT.
Then I moved to decompression and obtained the following error.
I think this was caused by no quality parameter being passed to $decoder, so I changed line 155 of quark.sh from `$decoder $inputdir $out S` to `$decoder $inputdir $out S N`. Then I obtained another error.
Then I tested paired-end read compression.
I ran the command
and got an error, as follows.
I think the program fails at line 908 of GZipWriter.cpp.
Here, qualityScore has the default value false, which makes the access `qualities[0]` an error.
I deleted these three lines (the `if ...` block); they are useless. Then I ran cmake and make again.
I also deleted line 128 of quark.sh to preserve unmapped_1.fastq and unmapped_2.fastq.
I ran this command again.
I found that the length of the unmapped reads is not 54.
I think this is because the size of seq at line 913 does not include the quality strings when qualityScore == false; after `int len = (seq.size()-3)/4;`, only a substring is saved.
Thank you very much.