algbio / ggcat

Compacted and colored de Bruijn graph construction and querying
MIT License

Error 68 : Unfinished stream #42

Closed by adamant-pwn 6 months ago

adamant-pwn commented 7 months ago

Hi there! In https://github.com/jermp/sshash/pull/39, @jermp suggested that I use ggcat to produce input datasets for sshash.

I tried using ggcat, but unfortunately something seems wrong:

$ gzip -d se.ust.k31.fa.gz 
$ ggcat build -k 31 -j 8 --eulertigs se.ust.k31.fa 
...
Final output saved to: output.fasta.lz4
$ lz4 -d output.fasta.lz4 
Decoding file output.fasta 
Error 68 : Unfinished stream 

This uses se.ust.k31.fa.gz as the input dataset for ggcat. Ultimately, I want to use ggcat to compute eulertigs of Homo_sapiens.GRCh38.dna.toplevel.fa.gz for k=127, but with that dataset I also get Unfinished stream errors, and the resulting file is much smaller than I expected. Could you please advise whether I'm doing anything wrong here?

Guilucand commented 6 months ago

Hi! This problem is due to 2 different things:

I just fixed the second problem, but probably all you need is to pass the flag -s 1 to ggcat to lower the minimum k-mer multiplicity cutoff.
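
For reference, the suggestion above amounts to rerunning the original command with -s 1 added. A minimal sketch, reusing the flags and input file from the report; the wrapper name rerun_with_lower_cutoff is hypothetical and only keeps the invocation in one place:

```shell
# Hypothetical wrapper (not part of ggcat): same invocation as in the
# report, with -s 1 added as suggested in the comment above.
rerun_with_lower_cutoff() {
    # $1: path to the input FASTA file
    ggcat build -k 31 -j 8 -s 1 --eulertigs "$1"
}
```

Usage would be `rerun_with_lower_cutoff se.ust.k31.fa`, followed by the same `lz4 -d output.fasta.lz4` step as before.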

enricorox commented 4 months ago

Same problem here. It would be nice if ggcat printed a warning when the FASTA file is empty.
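
Until such a warning exists, a user-side check is possible. The sketch below is an assumption, not ggcat functionality: the helper name warn_if_fasta_empty is hypothetical, and it treats a file as empty when it has no header lines starting with `>`.

```shell
# Hypothetical helper (not part of ggcat): warn when a FASTA file has no records.
warn_if_fasta_empty() {
    # Count header lines, i.e. lines starting with '>'.
    # grep -c prints 0 when nothing matches; a missing file leaves the count unset.
    records=$(grep -c '^>' "$1" 2>/dev/null) || true
    if [ "${records:-0}" -eq 0 ]; then
        echo "warning: $1 contains no FASTA records" >&2
        return 1
    fi
}
```

It can be run on any uncompressed FASTA file, for example `warn_if_fasta_empty output.fasta` after a successful `lz4 -d`.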