COMBINE-lab / pufferfish

An efficient index for the colored, compacted, de Bruijn graph
GNU General Public License v3.0

Pufferfish index decoy option fixFasta doesn't recognize FASTA header(s) and fails #37

Closed hermidalc closed 2 years ago

hermidalc commented 2 years ago

I tried to load the latest GENCODE human genome FASTA as a decoy into pufferfish index, and the fixFasta component immediately gives the following error:

[2022-08-19 14:27:01.875] [puff::index::jointLog] [info] Running fixFasta
[2022-08-19 14:27:01.875] [puff::index::jointLog] [critical] The decoy name NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN was encountered more than once --- please ensure all decoy names and sequences are unique.
[2022-08-19 14:27:01.875] [puff::index::jointLog] [error] The fixFasta phase failed with exit code 1

Even though the genome file head looks like this:

$ head -10 GRCh38.primary_assembly.genome.fa
>chr1 1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

I first thought the leading Ns were the cause, but even after clipping all leading and trailing Ns from every sequence entry, it still gives the same error on the first sequence. It appears that fixFasta has a bug when processing the decoy FASTA file, because it doesn't recognize the > header(s).

rob-p commented 2 years ago

Hi @hermidalc,

The decoy sequence itself should be included at the end of the normal input reference FASTA. The decoy file instead contains the names of the decoy records in the reference file. The error you are seeing is because it's interpreting the FASTA header as a name, and that record isn't found in the normal reference file.
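
A minimal sketch of that layout, for anyone landing here with the same error. The file names (`transcripts.fa`, `genome.fa`, `decoys.txt`, `gentrome.fa`) and the tiny stand-in sequences are hypothetical, chosen only so the example runs on its own; it assumes the record name is the part of the header before the first space.

```shell
#!/bin/sh
set -eu

# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir"

# Tiny stand-in inputs (hypothetical); use your own transcriptome and genome.
printf '>tx1 gene=A\nACGTACGT\n>tx2 gene=B\nTTGGCCAA\n' > transcripts.fa
printf '>chr1 1\nNNNNACGT\nNNNNACGT\n>chr2 2\nGGGGCCCC\n' > genome.fa

# 1. Decoy names file: one record name per line, with the leading ">"
#    and everything after the first space removed. No sequence lines.
grep '^>' genome.fa | cut -d ' ' -f 1 | sed 's/^>//' > decoys.txt

# 2. Combined reference: normal records first, decoy sequences appended last.
cat transcripts.fa genome.fa > gentrome.fa

# Show the names that would be passed as the decoy list.
cat decoys.txt
```

The combined file would then be given to the indexer as the reference and `decoys.txt` as the decoy list. Passing `genome.fa` itself as the decoy list makes every line, including the repeated N lines, look like a decoy name, which matches the "encountered more than once" message above.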

hermidalc commented 2 years ago

OK, thank you @rob-p, that makes sense! Closing.