-o option only works for sketching databases, but not samples

bluenote-1577 / sylph

ultrafast taxonomic profiling and genome querying for metagenomic samples by abundance-corrected minhash.

MIT License


-o option only works for sketching databases, but not samples #7

Closed: fplazaonate closed this issue 10 months ago

fplazaonate commented 10 months ago

Hi @bluenote-1577,

The -o option seems to be ignored when sketching samples.

Could you fix this?

bluenote-1577 commented 10 months ago

This is somewhat of a tricky issue...

The way the CLI is designed, -o only works for genomes because all genomes are grouped together, so they can all be renamed at once. There is no ambiguity.

But because sylph can sketch both reads and genomes in a single sketch command, it's not clear how -o should work for reads when genomes are also present. This is why -d is reserved for reads and -o for genomes.

In sylph v0.5, I am adding an option called --sample-names so that users can rename read sketch files to a list of sample names. This is probably what one wants for the -o option for reads.

If you have specific ideas on what -o should output for reads, let me know. For now, I will add a warning for when the user only uses -o for sketching reads.

fplazaonate commented 10 months ago

IMO, sylph sketch should treat all the reads it is given as coming from a single sample and generate a single sylsp file, no matter how many fastq files are provided. In this case, multiple fastq files would be multiple sequencing runs of the same library.

My lab and others generate multiple fastq files per sample to reach a target sequencing depth. Currently, the sylph interface is not very convenient for that purpose. The workaround is to decompress and concatenate all the files on the fly: -r <(zcat *.fastq.gz)

In the end, the output file has the name of the file descriptor (e.g., 63.sylsp) and has to be renamed later.
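The workaround above (pool the runs with process substitution, then rename the descriptor-named sketch) could be wrapped roughly as follows. This is only a sketch under assumptions: it takes `-d` to be the output directory for read sketches, as the thread suggests, and it assumes one `.sylsp` file is written per `-r` input. The function names `sketch_sample` and `rename_single_sketch` and the `SYLPH` variable are hypothetical helpers, not part of sylph. Each sample gets its own directory so the single sketch inside it can be found without guessing the descriptor number.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the workaround from this thread; not part of sylph.
set -euo pipefail

# Overridable binary name (assumption: `sylph` is on PATH).
SYLPH="${SYLPH:-sylph}"

# rename_single_sketch DIR NAME
# Rename the only *.sylsp file in DIR to NAME.sylsp; fail if there is not exactly one.
rename_single_sketch() {
    local dir="$1" name="$2"
    local target="$dir/$name.sylsp"
    local sketches=("$dir"/*.sylsp)
    if [ "${#sketches[@]}" -ne 1 ] || [ ! -e "${sketches[0]}" ]; then
        echo "expected exactly one .sylsp file in $dir" >&2
        return 1
    fi
    # Nothing to do if the sketch already carries the sample name.
    [ "${sketches[0]}" = "$target" ] || mv -- "${sketches[0]}" "$target"
}

# sketch_sample NAME OUTDIR RUN1.fastq.gz [RUN2.fastq.gz ...]
# Pool all runs of one sample into a single read stream, sketch it, then rename.
sketch_sample() {
    local name="$1" outdir="$2"
    shift 2
    mkdir -p "$outdir/$name"
    # gzip -dc is equivalent to zcat; the sketch is named after the
    # file descriptor (e.g. 63.sylsp), hence the rename afterwards.
    "$SYLPH" sketch -r <(gzip -dc "$@") -d "$outdir/$name"
    rename_single_sketch "$outdir/$name" "$name"
}
```

With these definitions loaded, something like `sketch_sample sampleA sketches sampleA_run*.fastq.gz` would be expected to leave `sketches/sampleA/sampleA.sylsp`, provided the assumptions about `-d` and the output naming hold.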

bluenote-1577 commented 10 months ago

Hmm very interesting. Thanks for the input.

I think I will keep this format for now because most software I'm aware of only processes one read pair per sample. What you're saying makes sense, perhaps as an optional mode of input.

I will add an option for renaming in sylph v0.5 though.