RabbitBio / RabbitTClust

RabbitTClust: enabling fast clustering analysis of millions of bacteria genomes with MinHash sketches

extract representative genomes/sequences #11

Open Jia-Xiu opened 3 months ago

Jia-Xiu commented 3 months ago

Is it possible to add a function to extract representative genomes/sequences from the FASTA file?

ZekunYin commented 3 months ago

Normally, this can be done by retrieving sequences from the input FASTA file according to the clustering result. Since RabbitTClust is based on sketching algorithms, similarity is calculated from the sketches instead of the original sequences, so the original sequences are not stored in memory. If we stored the original sequences of the representative genomes, in some circumstances we could not guarantee a low memory footprint, for example when each cluster contains only one sequence. But if this function is popular or urgently needed, I think we can try to add it to a develop branch. Best, Zekun
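
The retrieval described above can be done outside RabbitTClust in a single streaming pass over the input FASTA, so memory use stays low no matter how many clusters there are. A minimal Python sketch follows, assuming the representative names are already known and match the first whitespace-delimited token of each FASTA header. `filter_fasta` is a hypothetical helper, not part of RabbitTClust.

```python
import io

def filter_fasta(handle, wanted, out):
    """Copy records whose ID (first token after '>') is in `wanted` to `out`.

    Reads line by line, so only the current line is held in memory.
    Returns the number of records written.
    """
    keep = False
    written = 0
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            record_id = fields[0] if fields else ""
            keep = record_id in wanted
            if keep:
                written += 1
        if keep:
            out.write(line)
    return written

# Tiny in-memory demonstration (sequence names are made up)
fasta = io.StringIO(">seqA desc\nACGT\nAC\n>seqB\nGGGG\n>seqC\nTTTT\n")
result = io.StringIO()
count = filter_fasta(fasta, {"seqA", "seqC"}, result)
print(count)
print(result.getvalue(), end="")
```

With real data, `handle` and `out` would be file objects opened on the input FASTA and the output file.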

Jia-Xiu commented 3 months ago

Dear Zekun,

Thanks for your reply. I used RabbitTClust to cluster viral contigs, so in my case getting representative sequences is essential for my further analyses. I managed to extract representative sequences (the longest one in each cluster). First, I used R to get a list of sequence names by selecting "0" (global index of the genome) in the second column of the output file. Then I extracted the representative sequences by name using seqtk. But it is not a very elegant way, since I cannot read the output file into a well-formed data.frame in R with read.delim(): for example, "the cluster 0 is:" stands alone in the first column. That's why I asked whether extracting representative sequences could be a feature in RabbitTClust. It's not an urgent need for me. If other users also request this function, I would be happy to see you add it to the develop branch.

Best, Xiu
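
The R step described above could be replaced by a small parser that treats the header lines separately and picks the longest member of each cluster. This is only a sketch: it assumes each cluster starts with a line like `the cluster 0 is:` (as quoted above) and that member lines are whitespace-separated with local index, global index, a length field such as `1500nt`, and the sequence name, in that order. That column layout is an assumption and should be checked against the actual output file. `longest_per_cluster` is a hypothetical name.

```python
import re

HEADER = re.compile(r"^\s*the cluster (\d+) is:")

def longest_per_cluster(lines):
    """Return {cluster_id: (name, length)} for the longest member of each cluster."""
    best = {}
    cluster = None
    for line in lines:
        m = HEADER.match(line)
        if m:
            cluster = int(m.group(1))
            continue
        fields = line.split()
        if cluster is None or len(fields) < 4:
            continue  # blank line or unexpected layout
        # Assumed columns: local index, global index, length ("1500nt"), name
        length_field = fields[2]
        digits = length_field[:-2] if length_field.endswith("nt") else length_field
        if not digits.isdigit():
            continue
        length = int(digits)
        name = fields[3]
        if cluster not in best or length > best[cluster][1]:
            best[cluster] = (name, length)
    return best

# Made-up example in the assumed layout
example = """the cluster 0 is:
\t0\t0\t1500nt\tcontig_1
\t1\t3\t2100nt\tcontig_4

the cluster 1 is:
\t0\t1\t900nt\tcontig_2
""".splitlines()

reps = longest_per_cluster(example)
names = [reps[c][0] for c in sorted(reps)]
print("\n".join(names))
```

The printed names can be saved to a file, one per line, and passed to `seqtk subseq in.fa names.txt` to pull the representative sequences from the original FASTA.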
