JoseBlanca / seq_crumbs

Little sequence file utilities meant to work within Unix pipelines

pair_matcher generates huge files #10

Open coreywischmeyer opened 10 years ago

coreywischmeyer commented 10 years ago

I'm attempting to use seq_crumbs in a pipeline that takes sff files and converts them into adaptor-trimmed fastq files, and I'm finding that pair_matcher is producing files larger than the original. I had to stop one run because the orphan file had reached 30 GB.

I ran it again to illustrate the error:

-rw-rw-r--+ 1 cwischmeyer ooo 1.6G Jul 2 15:32 orphan
-rw-rw-r--+ 1 cwischmeyer ooo 1.0M Jul 2 15:32 test
-rw-rw-r--+ 1 cwischmeyer ooo 326M Jul 2 11:16 test.split.fastq

Also, when I run grep ^@test | sort | uniq -c, I find that some reads are being written to the orphan file thousands of times.
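For anyone who wants to repeat this check without grep, here is a minimal Python sketch that counts how often each read name appears in a FASTQ file. It assumes plain four-line FASTQ records (no wrapped sequence lines), which may not hold for every file; the function names `count_read_names` and `duplicated_reads` are made up for illustration and are not part of seq_crumbs. Unlike a `^@` pattern, it only looks at the first line of each record, so quality lines that happen to start with `@` are not miscounted.

```python
from collections import Counter


def count_read_names(fastq_lines):
    """Count occurrences of each read name.

    Assumption: every record is exactly four lines
    (header, sequence, '+', quality).
    """
    counts = Counter()
    for index, line in enumerate(fastq_lines):
        if index % 4 != 0:
            continue  # only the first line of each record is a header
        if not line.strip():
            continue  # tolerate a trailing blank line
        if not line.startswith("@"):
            raise ValueError(f"line {index + 1} is not a FASTQ header: {line!r}")
        # The read name is the first whitespace-delimited token after '@'.
        name = line[1:].split()[0]
        counts[name] += 1
    return counts


def duplicated_reads(fastq_lines):
    """Return only the read names that occur more than once."""
    return {name: n for name, n in count_read_names(fastq_lines).items() if n > 1}
```

To run it on a file, pass an open handle, for example `with open("orphan") as handle: print(duplicated_reads(handle))`. The file is read line by line, so memory use grows with the number of distinct read names rather than with the file size.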

This bug appears (for me) in both the binary and the source versions.

pziarsolo commented 10 years ago

Hi Corey, I need some more information. Which version of seq_crumbs are you using? We haven't made a new release in a long time and there are a lot of bugfixes in HEAD.

I am trying to reproduce the error with a test file and HEAD, and I can't. Could you send me the exact command you used and the input file?

Thanks

coreywischmeyer commented 10 years ago

I hit the bug with both the source for seq_crumbs 0.1.8 and the binary available from the COMAV site (also 0.1.8). Sadly, the data that I'm using cannot be shared. If you aren't getting the bug, it may be a problem with my Python setup; I did have some trouble getting it to work.

-Corey

pziarsolo commented 10 years ago

Hi Corey, I think the bug is solved in the GitHub master branch. Could you test it? p.

binzo21 commented 10 years ago

I don't think the bug is solved. I am using version seq_crumbs-0.1.8-x64-linux. The command I gave was:

pair_matcher -o pairs.1.454.fastq -p orphan_1.454.fastq 1.454Reads.qual.fastq

The input file, 1.454Reads.qual.fastq, was 327.9 MiB. The orphan file, orphan_1.454.fastq, had reached 97.0 GiB by the time my account ran out of memory (I was processing four files at the same time, and all produced similarly sized orphan files).

The issue is that the same read appears to be printing multiple times in the orphan file:

grep -c '@HD6LUZQ01AJRPA' orphan_1.454.fastq
27237
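A quick way to see how far the output has drifted is to compare record counts. The sketch below assumes four-line FASTQ records, and it also assumes that pair_matcher is meant to write each input read exactly once, to either the pairs file or the orphan file; that expectation is my reading of the tool's purpose and is not confirmed in this thread. The names `count_records` and `summarize_outputs` are hypothetical helpers, not seq_crumbs functions.

```python
def count_records(fastq_lines):
    """Count FASTQ records, assuming exactly four lines per record."""
    return sum(
        1
        for index, line in enumerate(fastq_lines)
        if index % 4 == 0 and line.strip()  # non-blank header positions only
    )


def summarize_outputs(n_input, n_paired, n_orphan):
    """Compare the input record count with the combined output count.

    Assumption: each input read should end up in exactly one output file,
    so any positive 'excess' points to reads being written more than once.
    """
    n_output = n_paired + n_orphan
    return {
        "input": n_input,
        "output": n_output,
        "excess": n_output - n_input,
        "consistent": n_output == n_input,
    }
```

Each count can be obtained with `with open(path) as handle: n = count_records(handle)` for the input, pairs, and orphan files, and the three numbers then go into `summarize_outputs`.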

pziarsolo commented 10 years ago

Hi Binzo. Are you using GitHub's master branch? peio

binzo21 commented 10 years ago

Not totally sure. I may have gotten it from http://bioinf.comav.upv.es/, but I can't remember.

Not sure if you can deduce this from the file name, but the file I downloaded was "seq_crumbs-0.1.8-x64-linux.tar.gz".

I'll reinstall the version on github and let you know if the problem goes away.

Lindsay

On 10/17/14, Peio wrote:

Hi Binzo. Are you using GitHub's master branch? peio