algbio / ggcat

Compacted and colored de Bruijn graph construction and querying
MIT License

Graph format issue (?) #16

Closed: pierrepeterlongo closed this issue 1 year ago

pierrepeterlongo commented 1 year ago

Hello,

I'm running a few tests with ggcat and I'm hitting an issue at query time. Here are the commands I used:

ggcat build -c -l fof -k 25 -s 1 -o index_first_3_humans

Here fof lists these files, downloaded from your Zenodo repository:

HG00096.fa
HG00097.fa
HG00099.fa

The computation takes 45 minutes and creates these files:

5313314765 Jan 16 15:54 index_first_3_humans
       181 Jan 16 15:38 index_first_3_humans.colors.dat
   2050740 Jan 16 15:54 index_first_3_humans.stats.log
       189 Jan 16 17:10 output.stats.log

I query the created graph with this command:

ggcat query --colors -k 25 -j 16 index_first_3_humans ../query_reads/head_D3_S1_L001_R1_001.fasta

It ends quickly with this error:

Thread panicked at location: /scratch/ppeterlo/ggcat/pipeline/common/io/src/sequences_reader.rs:82:21
Error message: Cannot recognize file type of 'index_first_3_humans'

Any idea? Am I doing something wrong? Thanks! Pierre

Guilucand commented 1 year ago

Hi Pierre,

The problem is that the function parsing the reference graph at query time uses the file extension to determine the file type, and the build phase does not currently append an extension automatically.
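
As a rough illustration of why the extensionless name fails, here is a minimal shell sketch of extension-based type detection. It is not ggcat's actual parser: the function name guess_file_type is hypothetical, and the accepted list is an assumption limited to the two extensions confirmed to work in this thread (.fa and .fasta).

```shell
# Hypothetical illustration only, not ggcat's real code.
# Classifies a path purely by its name, never by its content.
guess_file_type() {
    case "$1" in
        *.fa|*.fasta) echo "fasta" ;;   # assumed list: only extensions seen in this thread
        *)            echo "unknown" ;; # no recognized extension, so the type cannot be inferred
    esac
}

# Example usage:
#   guess_file_type index_first_3_humans      (no extension)
#   guess_file_type index_first_3_humans.fa   (recognized extension)
```

Under this kind of scheme, a valid graph file with no extension is rejected before its content is ever read, which matches the "Cannot recognize file type" message above.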

To fix it, you can rename the file index_first_3_humans to index_first_3_humans.fa, and the colormap file from index_first_3_humans.colors.dat to index_first_3_humans.fa.colors.dat.
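
The two renames can be sketched as a small shell helper. The function name add_graph_extension and its default extension argument are hypothetical conveniences, not part of ggcat; the helper only applies the file naming described above.

```shell
# Hypothetical helper, not part of ggcat: appends an extension to a built
# graph and renames its colormap to match (<graph>.<ext>.colors.dat).
add_graph_extension() {
    graph="$1"
    ext="${2:-fa}"   # assumption: default to .fa, as suggested in this thread

    if [ ! -f "$graph" ]; then
        echo "graph file not found: $graph" >&2
        return 1
    fi

    mv -- "$graph" "$graph.$ext"

    # The colormap only exists for colored builds (-c), so rename it if present.
    if [ -f "$graph.colors.dat" ]; then
        mv -- "$graph.colors.dat" "$graph.$ext.colors.dat"
    fi
}

# Example usage:
#   add_graph_extension index_first_3_humans fa
```

After running it, the query command should be pointed at index_first_3_humans.fa instead of index_first_3_humans.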

To avoid this kind of problem in the future, you can include the extension directly in the output name during the build phase, like this:

ggcat build -c -l fof -k 25 -s 1 -o index_first_3_humans.fa

Also, it is better to always specify the number of threads explicitly (-j flag), since it currently defaults to 16 threads.

Regards, Andrea

pierrepeterlongo commented 1 year ago

Perfect, thanks!

shenwei356 commented 10 months ago

Oh, same problem here. Solved by adding the file extension .fasta.