algbio / ggcat

Compacted and colored de Bruijn graph construction and querying
MIT License

Coloring options for an input file list #17

Open tmaklin opened 1 year ago

tmaklin commented 1 year ago

Hi,

Would it be possible to add an option to supply a list of colors to use when building a colored DBG from an input file list? The current method of using the -l argument seems to give each sequence in the list its own color, but in some cases it would be desirable to color several sequences with the same color (for example according to some taxonomy).

Currently this can be done by concatenating the files that should get the same color into separate FASTA files, but this is quite cumbersome to set up for large inputs with many colors. It also requires duplicating the entire dataset in the temporary FASTA files, which can get pretty large.

Example of input I would like to use (first column is colors, second is the file path):

color-0    GCF_000160075.2_ASM16007v2_genomic.fna.gz
color-0    GCF_013267415.1_ASM1326741v1_genomic.fna.gz
color-1    GCF_000963925.1_ASM96392v1_genomic.fna.gz
color-1    GCF_002153775.1_ASM215377v1_genomic.fna.gz
color-1    GCF_007989305.1_ASM798930v1_genomic.fna.gz

where the colors and the sequence paths could be supplied either in the same file as above or as separate files.
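
For reference, the concatenation workaround mentioned above can be scripted along these lines. This is only a rough sketch, not anything ggcat provides: the function names `group_by_color` and `concatenate_per_color` are made up for illustration, and it assumes the two-column layout shown in the example and that all files of a color share the same compression. It relies on the fact that gzip files concatenated byte-for-byte still form a valid gzip stream. It still duplicates the whole dataset on disk, which is the drawback described above.

```python
import shutil
from collections import defaultdict
from pathlib import Path


def group_by_color(mapping_lines):
    """Group file paths by color from 'color<whitespace>path' lines (hypothetical helper)."""
    groups = defaultdict(list)
    for line in mapping_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first run of whitespace: color label, then file path.
        color, path = line.split(None, 1)
        groups[color].append(path)
    return dict(groups)


def concatenate_per_color(groups, out_dir, suffix=".fna.gz"):
    """Write one concatenated file per color into out_dir (hypothetical helper).

    Files are copied byte-for-byte, so this assumes every input of a color
    uses the same compression (all gzip or all plain text).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for color, paths in groups.items():
        out_path = out_dir / f"{color}{suffix}"
        with open(out_path, "wb") as out:
            for path in paths:
                with open(path, "rb") as src:
                    shutil.copyfileobj(src, out)
        outputs.append(out_path)
    return outputs
```

The resulting per-color files would then be the ones listed in the file given to -l, so that each concatenated file receives its own color.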

Thanks for creating ggcat!

aryakaul commented 1 year ago

Hello, just jumping on this thread as I am also interested in this option! Have you found a solution beyond concatenating files, @tmaklin?

tmaklin commented 1 year ago

Hi, no, I haven't, unfortunately.

Guilucand commented 1 year ago

Hi, sorry for the late response. I just found an easy way of adapting the ggcat code to add this feature. I pushed the code to the dev branch if you want to test it. It works by passing a new flag, -d, to the tool.

aryakaul commented 1 year ago

Much appreciated, thank you! I'll aim to try it out sometime early next week!