StephenHwang / MEMO

MEMO: MEM-based pangenome indexing for k-mer queries
MIT License

MEMO doesn't scale well to large numbers of genomes #3

Open marade opened 1 month ago

marade commented 1 month ago

This is an incomplete list, but for example:

  1. "Too many open files" errors occur in Bash when the number of genomes is large. Raising the ulimit would be a minimal fix, but it would be better to handle this in Python and avoid opening so many files in the first place.

  2. Code like this does not account for large numbers of genomes. It raises "ValueError: invalid literal for int() with base 10" because the 1010111... number on each line is too long for int:

# plot_conservation.py line 52
num_docs_per_pos = list(map(int, list(fileReader(path))))
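
For point 2, one possible direction is sketched below. It assumes, as the report suggests, that each line is a string of 0/1 flags with one character per genome, and that the value wanted at each position is the number of flagged genomes. Neither assumption is confirmed against MEMO's actual file format, and `count_docs` and `read_doc_counts` are hypothetical names rather than part of MEMO.

```python
def count_docs(line):
    """Return how many genomes are flagged in one line of 0/1 characters.

    Assumption: one character per genome, '1' meaning "present".
    """
    bits = line.strip()
    if not bits or set(bits) - {"0", "1"}:
        # Fail with a short, readable message instead of echoing a huge line.
        raise ValueError(f"expected only 0/1 characters, got {bits[:20]!r}")
    return bits.count("1")


def read_doc_counts(lines):
    """Hypothetical stand-in for list(map(int, lines)); skips blank lines."""
    return [count_docs(line) for line in lines if line.strip()]


# A line with 10,000 flags is handled the same way as a short one.
wide_line = "10" * 5000 + "\n"
counts = read_doc_counts(["1010111\n", wide_line])
print(counts)  # [5, 5000]
```

Counting characters never converts the whole line to a number, so the cost grows linearly with the number of genomes and the line length is not a constraint.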
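
For point 1, a minimal sketch of keeping the number of simultaneously open files bounded in Python. The task shown (summing one integer per line across many per-genome files) is a made-up stand-in, not MEMO's actual pipeline, and `batched` and `sum_per_position` are hypothetical helpers. The idea is only that files are opened in fixed-size batches and closed before the next batch starts.

```python
import tempfile
from contextlib import ExitStack
from itertools import islice
from pathlib import Path


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def sum_per_position(paths, batch_size=256):
    """Add up one integer per line across many files.

    At most `batch_size` files are open at any moment, so the number of
    inputs can exceed the per-process open-file limit.
    Assumption: all files have the same number of lines.
    """
    totals = []
    for batch in batched(paths, batch_size):
        with ExitStack() as stack:  # closes every handle in the batch on exit
            handles = [stack.enter_context(open(p)) for p in batch]
            for i, rows in enumerate(zip(*handles)):
                subtotal = sum(int(r) for r in rows)
                if i == len(totals):
                    totals.append(subtotal)
                else:
                    totals[i] += subtotal
    return totals


# Demo: five tiny files, never more than two open at once.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for n in range(5):
        p = Path(tmp) / f"genome_{n}.txt"
        p.write_text("1\n0\n1\n")
        paths.append(p)
    totals = sum_per_position(paths, batch_size=2)

print(totals)  # [5, 0, 5]
```

With this shape the batch size, not the number of genomes, determines how many descriptors are in use, so raising the ulimit becomes optional rather than required.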

StephenHwang commented 3 weeks ago

Thank you for bringing up these issues! I'll keep this in mind and see what I can do to make MEMO more scalable.

marade commented 3 weeks ago

I'm not bringing them up just to complain. The number of available genomes is increasing at a fast rate, so we really need tools like this one to scale well. I appreciate your work.