Daniel-Liu-c0deb0t / UMICollapse

Accelerating the deduplication and collapsing process for reads with Unique Molecular Identifiers (UMI). Heavily optimized for scalability and orders of magnitude faster than a previous tool.
MIT License
57 stars, 7 forks

Support for pipes #26

Open gcambray opened 1 year ago

gcambray commented 1 year ago

Hi Daniel,

Thanks for the tool. I've just started using it instead of UMI-tools and it is much faster!

At the moment, I'm using it in fastq mode as part of a Python pipeline. Since I want the tool to run on the UMI only, not on the read sequence, I write a FASTQ file of the UMIs parsed in a previous step. Instead of writing all the data to a file and reading it back, I'd rather pipe the data into UMICollapse. Likewise, I'd rather pipe the output instead of reading the generated file and then deleting it.

It would be great to support e.g. the '-' notation for the -i and -o arguments to specify reading from stdin and writing to stdout, respectively. Would this be possible to implement?

In the meantime, I tried to emulate this by passing /dev/fd/n 'files' (on Unix) as arguments to both -i and -o. This works great until I use the --tag option, in which case I receive nothing on the output stream. If I provide a real file as -i, then I do get output on the stream. I suspect that internally the input is tagged to produce the output, and that this somehow doesn't work if the input is a stream...

With thanks and best regards

Daniel-Liu-c0deb0t commented 1 year ago

For tracking clusters with --tag, two passes need to be made over the input. Therefore, this is only possible with an input file. The reason why UMICollapse is designed this way is to avoid having to load all the reads into memory in one pass. I would suggest using a temporary file as input.
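To illustrate the general constraint (this is not UMICollapse's actual code, only a minimal Python sketch of why a second pass needs a reopenable input): the hypothetical helper `count_lines_twice` scans its input twice, once from a regular file and once from a pipe. The FASTQ record is made up for the demonstration.

```python
import os
import tempfile

# One made-up FASTQ record (4 lines), used only for this illustration.
FASTQ = "@umi1\nACGT\n+\nIIII\n"


def count_lines_twice(open_input):
    """Scan the input twice, as a two-pass tool would, counting lines each time."""
    counts = []
    for _ in range(2):
        with open_input() as handle:
            counts.append(sum(1 for _ in handle))
    return tuple(counts)


def demo_file():
    # A regular file can be reopened, so both passes see all the data.
    with tempfile.NamedTemporaryFile("w", suffix=".fastq", delete=False) as tmp:
        tmp.write(FASTQ)
        path = tmp.name
    try:
        return count_lines_twice(lambda: open(path))
    finally:
        os.remove(path)


def demo_pipe():
    # A pipe is drained by the first pass; the second pass finds nothing left.
    read_fd, write_fd = os.pipe()
    os.write(write_fd, FASTQ.encode())
    os.close(write_fd)
    try:
        # Each pass gets its own duplicate of the same underlying pipe.
        return count_lines_twice(lambda: os.fdopen(os.dup(read_fd)))
    finally:
        os.close(read_fd)


if __name__ == "__main__":
    print("file:", demo_file())
    print("pipe:", demo_pipe())
```

With the file, both passes count the same four lines; with the pipe, the second pass reads nothing, because the data was consumed by the first pass and a pipe cannot be rewound.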

gcambray commented 1 year ago

Many thanks for the explanation. Yes, I did switch to a temporary file as input! Piping the output is still possible. Best.
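For anyone landing here with the same setup, the arrangement described above (temporary file as input, output read from a pipe) could look roughly like the following Python sketch. Several things are assumptions rather than facts from this thread: the executable name `umicollapse` (adjust to however you launch the tool), the use of `/dev/stdout` as the `-o` path (a /dev/fd/n path was reported to work above), and the helper names `write_umi_fastq`, `build_command` and `collapse_umis`, which are hypothetical.

```python
import os
import subprocess
import tempfile


def write_umi_fastq(records, handle):
    """Write one FASTQ record per (name, umi, quality) tuple."""
    for name, umi, qual in records:
        handle.write(f"@{name}\n{umi}\n+\n{qual}\n")


def build_command(input_path, executable="umicollapse"):
    # Assumed invocation: fastq mode with -i, -o and --tag as discussed in
    # this thread. The executable name and /dev/stdout are assumptions.
    return [executable, "fastq", "-i", input_path, "-o", "/dev/stdout", "--tag"]


def collapse_umis(records, make_command=build_command):
    """Run the tool on a temporary FASTQ file and return its piped output."""
    # The input must be a real file because --tag needs two passes over it.
    with tempfile.NamedTemporaryFile("w", suffix=".fastq", delete=False) as tmp:
        write_umi_fastq(records, tmp)
        path = tmp.name
    try:
        result = subprocess.run(
            make_command(path),
            stdout=subprocess.PIPE,
            check=True,
            text=True,
        )
        return result.stdout
    finally:
        # Clean up the temporary input whether or not the run succeeded.
        os.remove(path)
```

Passing `make_command` as a parameter keeps the temporary-file handling separate from the exact command line, so the invocation can be swapped without touching the rest.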

Guillaume Cambray, PhD

Team Leader 'Synthetic, Functional and Evolutionary Genomics'

Center for Structural Biochemistry (CBS) CNRS-INSERM-Université de Montpellier
