eblerjana / pangenie

Pangenome-based genome inference
MIT License

pangenie-index failed: std::runtime_error, "Hash full" #74

Closed: leleory closed 5 months ago

leleory commented 6 months ago

Hi Jana,

I was in the process of indexing a cactus-pangenome based variant callset (seven diploids). The variants were filtered with vcfbub. I tested pangenie by running it with data on a single chromosome and it worked OK.

When I tried it on the full dataset the indexing step failed with the following lines:

pangenie-index -r $infa -v $invcf -o idx -e 100000

GraphBuilder: skip variant at sscro11_Y:42522731 since alleles contain undefined nucleotides: T,TCTCTCTCTCTCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCTCTCTCT
GraphBuilder: skip variant at sscro11_Y:43174887 since alleles contain undefined nucleotides: CAACACAAAACCAGGTTCAAGGAATATTGAAAGGGCTTCTCTAAACCAAAAAGTAAGGAAGGAAAGGAAGAAAAAGAAAAGAAAAAAAAAAGAAGAAGAAGAAGAGGAAGAACTAGGACTGAGGAAACCGCAATCAGAGAGCAGTCACTC,CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
Identified 21347831 variants in total from VCF-file.
Write path segments to file ...
Found 347 chromosome(s) in the VCF.
Count kmers in graph ...
terminate called after throwing an instance of 'std::runtime_error'
what(): Hash full

Can you please tell me what the error "Hash full" means and how I can overcome this problem?

Thank you, Lel

eblerjana commented 5 months ago

Hi Lel,

I'd assume this is caused either by too little memory being available on the machine or by the -e 100000 parameter, which sets the hash size. For a whole-genome dataset, I would not recommend using this parameter; it is mainly meant for demo purposes (see README).

Best, Jana

leleory commented 5 months ago

Then it must be the -e parameter, as pangenie-index used only half of the allocated memory. Thank you, Lel

leleory commented 5 months ago

Thanks, Jana! Problem was solved after -e 100000 was left out. Lel
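
For readers landing here with the same error, the fix described above amounts to rerunning the original command without -e 100000. A minimal sketch follows; the file names are hypothetical placeholders, and the flags -r, -v and -o are taken unchanged from the command in the first post. The snippet prints the command and only runs it if pangenie-index is on the PATH and both input files exist.

```shell
# Hypothetical placeholder paths; substitute your own reference and VCF.
infa="reference.fa"
invcf="callset.vcf"

# Same invocation as in the original report, with "-e 100000" left out
# so that pangenie-index uses its default hash size.
set -- pangenie-index -r "$infa" -v "$invcf" -o idx
cmd_line="$*"
echo "$cmd_line"

# Only execute when the tool and both input files are actually present.
if command -v "$1" >/dev/null 2>&1 && [ -f "$infa" ] && [ -f "$invcf" ]; then
    "$@"
fi
```

As the maintainer notes, -e is intended mainly for demo-sized inputs, so it is left out here for a whole-genome callset.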