RabbitBio / RabbitTClust

RabbitTClust: enabling fast clustering analysis of millions of bacteria genomes with MinHash sketches

Output file format #1

Closed jamesPet closed 1 year ago

jamesPet commented 1 year ago

Hello,

What is the format of the output file? In particular, what do the tab-delimited values mean in the lines that do not begin with "the cluster"?

Thank you! jamie

XiaomingXu1995 commented 1 year ago

Hello,

The output follows the CD-HIT output format. The lines beginning with a tab hold the genome information for each cluster. The columns differ slightly depending on which input option (-l or -i) was used when running clust-mst or clust-greedy. Option -l takes a list of FASTA files, one file per genome; option -i takes a single FASTA file, one sequence per genome.

For both -l and -i, the first three tab-delimited values, from left to right, are the local index within the cluster, the global index, and the genome length.

For the -l option, the remaining values are the genome file name (which includes the genome assembly accession number), the name of the first sequence in that genome file, and that sequence's comment. For the -i option, the remaining values are the sequence name and that sequence's comment.
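As a rough illustration of the layout described above, here is a minimal Python sketch of a reader for this kind of file. It is not part of RabbitTClust; the names `Member`, `Cluster`, and `parse_clusters` are hypothetical. It assumes that any non-blank line not starting with a tab is a cluster header (such as the "the cluster" lines mentioned in the question), that member lines start with a tab, and that the length column begins with an integer (a trailing unit suffix, if any, is ignored). The columns after the length are kept as a plain list, since their meaning depends on whether -l or -i was used.

```python
import re
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class Member:
    local_index: int    # position of the genome inside its cluster
    global_index: int   # position of the genome in the whole input
    genome_length: int  # leading integer of the length column (assumption)
    rest: List[str]     # remaining columns; meaning depends on -l vs -i


@dataclass
class Cluster:
    header: str
    members: List[Member] = field(default_factory=list)


def _parse_length(text: str) -> int:
    """Read the leading digits of the length column, ignoring any suffix."""
    match = re.match(r"\d+", text)
    if match is None:
        raise ValueError(f"unexpected genome length field: {text!r}")
    return int(match.group())


def parse_clusters(lines: Iterable[str]) -> List[Cluster]:
    """Group tab-indented member lines under the preceding header line."""
    clusters: List[Cluster] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank separator lines
        if not line.startswith("\t"):
            # Assumption: every non-indented line opens a new cluster.
            clusters.append(Cluster(header=line))
            continue
        if not clusters:
            raise ValueError("member line found before any cluster header")
        fields = line.lstrip("\t").split("\t")
        if len(fields) < 3:
            raise ValueError(f"expected at least 3 columns, got: {line!r}")
        clusters[-1].members.append(
            Member(
                local_index=int(fields[0]),
                global_index=int(fields[1]),
                genome_length=_parse_length(fields[2]),
                rest=fields[3:],
            )
        )
    return clusters
```

With a file on disk this could be called as `parse_clusters(open(path))`, where `path` points to the clustering output. For -l runs, `rest` would then hold the genome file name, first sequence name, and comment; for -i runs, the sequence name and comment.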

Best, Xiaoming Xu