COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0

Salmon wants pairs in different files #29

Open ctb opened 8 years ago

ctb commented 8 years ago

It would be nice to be able to specify sequences in interleaved form as well.

ctb commented 8 years ago

In particular, this would enable streaming behavior, e.g. streaming transcript quantification downstream of khmer error trimming; cf. http://khmer.readthedocs.org/en/stable/whats-new-2.0.html cc @MacManes @Blahah

rob-p commented 8 years ago

@ctb — One thing that would be required for this (apart from some engineering of the command-line parsing / validation code) is a trustworthy, efficient, multithreaded FAST(A/Q) parser for interleaved-format reads. Right now, Salmon (& Sailfish, & RapMap, & most of the other HTS-centric methods we're developing) uses the Jellyfish 2 read parser. I've made this choice since it's fairly simple to use, yet provides nice parallel performance and, most importantly, is fairly well-tested and trustworthy. Can you suggest a reliable, well-tested, concurrency-enabled library for parsing reads in interleaved format?

macmanes commented 8 years ago

how about https://github.com/lh3/seqtk?

rob-p commented 8 years ago

@macmanes — Heng's code is very well-tested but, afaik, completely serialized. Of course, that's nothing we couldn't handle internally by throwing the reported reads into our concurrent queue. Actually, I think that the Jellyfish 2 parser (for a single FAST(A/Q) file) would be easy to make work in this context. The trick is to require that the read "batches" always end on an even-indexed boundary, so that we never have an (interleaved) read pair split across batch boundaries. I'm not sure how easy or difficult that is to enforce. I might just ask Guillaume about the best way to enforce it.
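To make that concrete, here is a minimal sketch of the even-boundary idea (Python rather than the actual C++ Jellyfish code; all names here are purely illustrative):

# Illustrative sketch only -- not the Jellyfish parser. The point is that a
# batch is only cut after an even number of records, so mate 1 and mate 2
# of an interleaved pair always land in the same batch.

def fastq_records(handle):
    """Yield one 4-line FASTQ record at a time from an open handle."""
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0]:
            return
        yield record

def even_batches(handle, batch_size=64):
    """Yield batches of records whose length is always even."""
    assert batch_size % 2 == 0
    batch = []
    for record in fastq_records(handle):
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # trailing batch; even-length whenever the input is properly paired
        yield batch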

macmanes commented 8 years ago

I'd be interested in seeing runtime differences between the two doing equivalent tasks. My experience with seqtk is that it's very fast for the tasks I use it for (mostly interleaving reads).

rob-p commented 8 years ago

I, too, would like to see the relative performance of the two libraries. The only challenge is in making the comparison apples-to-apples (i.e. enabling multi-threaded parsing in seqtk with minimal overhead — a concurrent queue is cheap, but not free).

ctb commented 8 years ago


Other points worth considering:

So I think it'd be great to have the basic functionality, identify where the performance problems are, and then simply note them for the future ;).

I would like to enable -1 and -2 in khmer scripts, but for our usual use cases (multiple sequencing files being normalized and/or partitioned and/or error trimmed) the command line syntax is too confusing ATM.

macmanes commented 8 years ago

Agree with the pain associated with different formats - it's for this reason that I yammer on about streaming, and about tools reading from stdin/writing to stdout where possible. I'm always looking for alternatives that allow for streaming (e.g., using Skewer (https://github.com/relipmoc/skewer) over Trimmomatic). I'm hopeful natural selection will weed out non-streaming/non-interleaving tools..

Sorry for off-topic ranting..

ctb commented 8 years ago

+1 rant ;)

mdshw5 commented 8 years ago

If salmon would just read stdin via a symbolic -, then you should be able to do something like:

interleaved_fastq_emitter | salmon --mates1 - --mates2 - 

...as long as the fastq-reading logic operates on 4 lines at a time. This way no new arguments are added to salmon.
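For illustration, a minimal Python sketch of that logic (hypothetical names; it assumes strictly interleaved input and that both mate arguments resolve to the same underlying stream):

import sys

def next_record(handle):
    lines = [handle.readline() for _ in range(4)]
    return lines if lines[0] else None

def interleaved_pairs(handle):
    # Works only if every request consumes exactly one whole 4-line record
    # and the two logical "files" take strict turns on the shared stream.
    while True:
        r1 = next_record(handle)
        if r1 is None:
            return
        r2 = next_record(handle)
        if r2 is None:
            raise ValueError("odd number of records; input not interleaved?")
        yield r1, r2

# e.g.: for r1, r2 in interleaved_pairs(sys.stdin): ...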

rob-p commented 8 years ago

:+1:

vals commented 8 years ago

Having now spent a couple of hours trying to pass two streams without having to mess with FIFOs, I also think an interleaved mode in the fastq parsing would be amazing..

rob-p commented 8 years ago

@mdshw5 --- if you replaced - with /dev/fd/0, would that work right now?

mdshw5 commented 8 years ago

I think it might, as long as the IO is chunking 4 lines at a time from each file.

vals commented 8 years ago

I just tried the /dev/fd/0 approach.

First I ran

salmon quant -i /nfs/research2/teichmann/reference/mus-musculus/salmon/quasi/mouse_cdna_38.p3.78_repbase_ercc.fa -l IU -1 reads_1.fastq -2 reads_2.fastq -o normal_salmon_out

In this case, the following is the content of salmon_quant.log:

[2016-01-03 00:33:37.001] [jointLog] [info] parsing read library format
[2016-01-03 00:33:37.510] [jointLog] [info] Loading Quasi index
[2016-01-03 00:33:53.646] [jointLog] [info] done
[2016-01-03 00:34:14.501] [jointLog] [info] Computed 13742 rich equivalence classes for further processing
[2016-01-03 00:34:14.501] [jointLog] [info] Counted 335230 total reads in the equivalence classes
[2016-01-03 00:34:14.501] [fileLog] [info]
At end of round 0
==================
Observed 3835342 total fragments (3835342 in most recent round)

[2016-01-03 00:34:20.992] [jointLog] [warning] Only 335230 fragments were mapped, but the number of burn-in fragments was set to 5000000.
The effective lengths have been computed using the observed mappings.

[2016-01-03 00:34:20.992] [jointLog] [info] Mapping rate = 8.74055%

[2016-01-03 00:34:20.992] [jointLog] [info] finished quantifyLibrary()
[2016-01-03 00:34:20.992] [jointLog] [info] Starting optimizer
[2016-01-03 00:34:21.028] [jointLog] [info] Marked 0 weighted equivalence classes as degenerate
[2016-01-03 00:34:21.030] [jointLog] [info] iteration = 0 | max rel diff. = 23.4889
[2016-01-03 00:34:21.167] [jointLog] [info] iteration = 100 | max rel diff. = 0.150549
[2016-01-03 00:34:21.304] [jointLog] [info] iteration = 200 | max rel diff. = 0.0517672
[2016-01-03 00:34:21.447] [jointLog] [info] iteration = 300 | max rel diff. = 0.0368208
[2016-01-03 00:34:21.578] [jointLog] [info] iteration = 400 | max rel diff. = 0.0237254
[2016-01-03 00:34:21.705] [jointLog] [info] iteration = 500 | max rel diff. = 0.0147784
[2016-01-03 00:34:21.834] [jointLog] [info] iteration = 600 | max rel diff. = 0.0131134
[2016-01-03 00:34:21.961] [jointLog] [info] iteration = 700 | max rel diff. = 0.0130094
[2016-01-03 00:34:22.092] [jointLog] [info] iteration = 800 | max rel diff. = 0.0100546
[2016-01-03 00:34:22.196] [jointLog] [info] iteration = 882 | max rel diff. = 0.00861472
[2016-01-03 00:34:22.205] [jointLog] [info] Finished optimizer
[2016-01-03 00:34:22.205] [jointLog] [info] writing output

[2016-01-03 00:34:22.433] [jointLog] [warning] NOTE: Read Lib [( reads_1.fastq, reads_2.fastq )] :

Greater than 5% of the alignments (but not, necessarily reads) disagreed with the provided library type; check the file: normal_salmon_out/libFormatCounts.txt for details

Then I ran

cat all_reads.fastq | salmon quant  -i /nfs/research2/teichmann/reference/mus-musculus/salmon/quasi/mouse_cdna_38.p3.78_repbase_ercc.fa -l IU -1 /dev/fd/0 -2 /dev/fd/0 -o interlaced_salmon_out

Now I get

[2016-01-03 00:36:48.844] [jointLog] [info] parsing read library format
[2016-01-03 00:36:49.995] [jointLog] [info] Loading Quasi index
[2016-01-03 00:37:08.293] [jointLog] [info] done
[2016-01-03 00:37:25.106] [jointLog] [info] Computed 23484 rich equivalence classes for further processing
[2016-01-03 00:37:25.106] [jointLog] [info] Counted 667333 total reads in the equivalence classes
[2016-01-03 00:37:25.106] [fileLog] [info]
At end of round 0
==================
Observed 3060000 total fragments (3060000 in most recent round)

[2016-01-03 00:37:31.905] [jointLog] [warning] Only 667333 fragments were mapped, but the number of burn-in fragments was set to 5000000.
The effective lengths have been computed using the observed mappings.

[2016-01-03 00:37:31.905] [jointLog] [info] Mapping rate = 21.8083%

[2016-01-03 00:37:31.905] [jointLog] [info] finished quantifyLibrary()
[2016-01-03 00:37:31.905] [jointLog] [info] Starting optimizer
[2016-01-03 00:37:33.275] [jointLog] [info] Marked 0 weighted equivalence classes as degenerate
[2016-01-03 00:37:33.279] [jointLog] [info] iteration = 0 | max rel diff. = 35.6186
[2016-01-03 00:37:33.533] [jointLog] [info] iteration = 100 | max rel diff. = 0.12044
[2016-01-03 00:37:33.755] [jointLog] [info] iteration = 200 | max rel diff. = 0.0493504
[2016-01-03 00:37:33.970] [jointLog] [info] iteration = 300 | max rel diff. = 0.0275491
[2016-01-03 00:37:34.194] [jointLog] [info] iteration = 400 | max rel diff. = 0.0216294
[2016-01-03 00:37:34.418] [jointLog] [info] iteration = 500 | max rel diff. = 0.0214024
[2016-01-03 00:37:34.640] [jointLog] [info] iteration = 600 | max rel diff. = 0.0132335
[2016-01-03 00:37:34.850] [jointLog] [info] iteration = 700 | max rel diff. = 0.0132363
[2016-01-03 00:37:35.066] [jointLog] [info] iteration = 800 | max rel diff. = 0.0122673
[2016-01-03 00:37:35.287] [jointLog] [info] iteration = 900 | max rel diff. = 0.012951
[2016-01-03 00:37:35.510] [jointLog] [info] iteration = 1000 | max rel diff. = 0.0131479
[2016-01-03 00:37:35.643] [jointLog] [info] iteration = 1062 | max rel diff. = 0.00666119
[2016-01-03 00:37:35.653] [jointLog] [info] Finished optimizer
[2016-01-03 00:37:35.653] [jointLog] [info] writing output

[2016-01-03 00:37:35.920] [jointLog] [warning] NOTE: Read Lib [( /dev/fd/0, /dev/fd/0 )] :

Greater than 5% of the alignments (but not, necessarily reads) disagreed with the provided library type; check the file: interlaced_salmon_out/libFormatCounts.txt for details

(FYI, this might be a failed sample, I just grabbed one at random, hence the low mapping rate.)

There's a discrepancy of about 800k observed reads, and the number of mapped fragments is roughly double for the interleaved version. So it seems this strategy doesn't work right now.

rob-p commented 8 years ago

I can imagine that this approach may not work with the way jellyfish parses paired-end files. I'll take a look at the parsing code.

ctb commented 8 years ago

Right, any buffering at all is going to kill this.
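To see why, here is a small self-contained Python demonstration (an analogy, not salmon's actual code): two buffered readers over the same descriptor each grab a large block on their first refill, so the two logical streams stop alternating record-by-record.

import io, os

r, w = os.pipe()
payload = b"".join(b"@read%d\nACGT\n+\nIIII\n" % i for i in range(500))
os.write(w, payload)  # ~10 KB; fits within the pipe buffer
os.close(w)

# Two independent buffered readers on the SAME descriptor, mimicking
# `-1 /dev/fd/0 -2 /dev/fd/0`.
mates1 = io.BufferedReader(io.FileIO(r, closefd=False))
mates2 = io.BufferedReader(io.FileIO(r, closefd=False))

print(mates1.readline())  # b'@read0\n'
print(mates2.readline())  # a line ~8 KB in, likely mid-record -- not b'@read1\n'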

rob-p commented 8 years ago

Yea; so I think the only way to do this currently (without me modifying the jellyfish parser) is to just use 2 fifos. I could put together a simple bash or python script for this if there is interest (until we support interleaved format natively, which I'll add to the feature list).

vals commented 8 years ago

The most annoying thing about FIFOs for me is the need to remove them after having used them, especially if something fails between making them and finishing. Do you know a good solution for this?

ctb commented 8 years ago


It'd be straightforward, I think, to provide an option in khmer's read-processing scripts to output as FIFOs; then all of this could be handled programmatically inside the script.

vals commented 8 years ago

I haven't kept up with khmer so I'm not sure what it does. Can I do something equivalent to this with khmer?

mkfifo /tmp/1.fastq && \
mkfifo /tmp/2.fastq && \
samtools sort -n input.cram | samtools fastq -1 /tmp/1.fastq -2 /tmp/2.fastq & salmon quant -i index -l IU -1 /tmp/1.fastq -2 /tmp/2.fastq -o /tmp/salmon_out && \
mv /tmp/salmon_out . && \
rm /tmp/1.fastq && \
rm /tmp/2.fastq

I just want to be able to do these sorts of things without wrapper scripts that obfuscate and complicate things... But maybe a robust bash wrapper is the best way to go.

rob-p commented 8 years ago

I'm also not quite convinced about the superiority of either approach. On one hand, it would, of course, be nice to have sailfish and salmon natively support as many file formats as possible. On the other hand, every extra file format increases the complexity of the parsing code paths, probably putting us in some technical debt, and creating more opportunities for bugs to creep in. From this perspective, anything that can be handled cleanly and robustly from the command line seems a win (of course, this is only true if it can actually be handled cleanly and robustly).

mdshw5 commented 8 years ago

There should be easy ways to handle reading line-buffered input from two file descriptors, where both file descriptors could be identical, and then passing these streams internally to buffers to be chunked for multithreaded processing. This would give you one code path for ingesting data, and the command-line interface could remain the same as it is currently, with the possible addition of mapping the - symbol to /dev/fd/0. Is there really much to be gained from buffering all input in byte chunks up front, remembering that unix pipes are buffered somewhat by default anyway? There has to be an acceptable way to handle line-based input more flexibly. In Python I would do:

import argparse

example_parser = argparse.ArgumentParser()
example_parser.add_argument('-fq1', type=argparse.FileType('r'))
example_parser.add_argument('-fq2', type=argparse.FileType('r'))
args = example_parser.parse_args()

for line1, line2 in zip(args.fq1, args.fq2):
  do_stuff_with_lines(line1, line2)

You could then call the program flexibly:

$ example -fq1 file1.fq -fq2 file2.fq
$ example -fq1 <(gzip -dc file1.fq.gz) -fq2 <(gzip -dc file2.fq.gz)
$ other_interleaved_process | example -fq1 - -fq2 -

The caveat for the code above is that you would want to replace argparse.FileType with some class that reads 4 lines at a time - I'm sure there's no shortage of Python FASTQ readers that do that. And I know that you're looking for C++ libraries that perform well for your purposes, and my Python example is just a toy, but I think designing the option parser to at least accept streams and file-like objects and handle them using the same code path would be a worthy reason to refactor a bit.
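For example, a hypothetical stand-in for such a class (illustrative only; it also maps - to stdin, as suggested above, so sharing one buffered handle keeps the two logical streams alternating):

import argparse
import sys

class FastqRecordReader:
    """Iterate over whole 4-line FASTQ records instead of single lines."""
    def __init__(self, handle):
        self.handle = handle
    def __iter__(self):
        return self
    def __next__(self):
        lines = [self.handle.readline() for _ in range(4)]
        if not lines[0]:
            raise StopIteration
        if any(not line for line in lines):
            raise ValueError("truncated FASTQ record")
        return lines

def fastq_arg(path):
    # '-' means stdin; '-fq1 - -fq2 -' then shares one handle.
    return FastqRecordReader(sys.stdin if path == '-' else open(path))

example_parser = argparse.ArgumentParser()
example_parser.add_argument('-fq1', type=fastq_arg)
example_parser.add_argument('-fq2', type=fastq_arg)
args = example_parser.parse_args()

for rec1, rec2 in zip(args.fq1, args.fq2):
    pass  # do_stuff_with_records(rec1, rec2)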

rob-p commented 8 years ago

Actually, @mdshw5 --- it's not quite clear to me why the parser isn't doing the right thing in this case. If you take a look at how the paired-end sequence parser is actually populating the internal buffer (e.g. here), it is reading one entry from stream1 and then one entry from stream2. I'm guessing there may be some issue with having two different handles open to the same fifo? However, that doesn't seem like it should be a problem. Given the way the code is actually reading from the different streams, it's not clear to me why it's not currently working as expected. I'll try and take a deeper look.

mdshw5 commented 8 years ago

I actually didn't test it :). I'll confirm the current behavior tomorrow. Thanks for following up on this!


rob-p commented 8 years ago

Here's a sketch of a shell script-based solution that might work. It relies on this shell script to do the de-interleaving (but it can use whichever tool we might decide is best for the job). You'd run it with the interleaved file like so:

./runner.sh salmon quant -i index -l IU --interleaved interleaved.fq -o interleaved_quant

Basically, the script checks to see if the --interleaved parameter is present. If so, it handles making the fifos and constructing the proper salmon command with them. Otherwise, if there is no --interleaved file, it simply runs the command as given.
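For anyone who would rather not shell out, here is the same strategy sketched in Python (this is not runner.sh itself; the argument handling is simplified, and it assumes salmon opens the -1 fifo before the -2 fifo):

import os, shutil, subprocess, tempfile, threading

def deinterleave(src_path, fifo1, fifo2):
    # Opening a fifo for writing blocks until salmon opens it for reading,
    # which is why this runs in a background thread.
    with open(src_path) as src, open(fifo1, 'w') as out1, open(fifo2, 'w') as out2:
        while True:
            rec1 = [src.readline() for _ in range(4)]
            if not rec1[0]:
                break
            rec2 = [src.readline() for _ in range(4)]
            out1.writelines(rec1)
            out2.writelines(rec2)

def run_interleaved(interleaved_fq, salmon_args):
    workdir = tempfile.mkdtemp()
    fifo1 = os.path.join(workdir, 'r1.fq')
    fifo2 = os.path.join(workdir, 'r2.fq')
    try:
        os.mkfifo(fifo1)
        os.mkfifo(fifo2)
        feeder = threading.Thread(target=deinterleave,
                                  args=(interleaved_fq, fifo1, fifo2),
                                  daemon=True)
        feeder.start()
        subprocess.check_call(['salmon', 'quant'] + salmon_args
                              + ['-1', fifo1, '-2', fifo2])
        feeder.join()
    finally:
        shutil.rmtree(workdir)  # fifos are removed even if salmon fails

# e.g. run_interleaved('interleaved.fq', ['-i', 'index', '-l', 'IU', '-o', 'quant_out'])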

vals commented 8 years ago

That works!

I wrote something very similar yesterday, but using a here document generated from a Makefile going into a job scheduler, and I couldn't get it to work. I'll file that under overcomplicating things...

rob-p commented 8 years ago

Yup; I think there are some potential places for improvement (e.g. the interleaved splitting code could be incorporated directly into this script, and the parsing could be improved to handle multiple interleaved files directly), but it seems to work pretty well. Also, when making this, I learned about the trap command, which should do what we want in terms of ensuring that any created fifos are cleaned up.

rob-p commented 8 years ago

I've updated the runner script so that it is now self-contained (i.e. it no longer relies on deinterleave_fastq.sh as an external script). Once we're able to modify it to work with multiple input interleaved fastq files, I'll add it to the official repository.

rob-p commented 8 years ago

I'll be adding this script to the next release (which should happen very soon). At that point, I'll close this issue unless anyone objects. I may still look into supporting interleaved format natively in the future, but for now this approach seems to work well.

mr-c commented 8 years ago

@rob-p I've had good experiences with https://github.com/seqan/seqan, and I like their team.