aziele / alfpy

http://www.combio.pl/alfree
MIT License

Running calc_word.py effectively on a large protein fasta file #3

Open · GilEshel opened this issue 3 years ago

GilEshel commented 3 years ago

Hello,

I want to use alfpy to count k-mers and calculate the Canberra distance for clustering proteomes of multiple species.

The input is usually >500K protein sequences, so I want to be able to run it in a reasonable time while balancing the memory load.

I used the following command: calc_word.py --fasta combined.faa --word_size 8 --distance canberra --vector freqs --outfmt phylip --out combined_canberra.phy

It ran on a single processor and crashed after 7 minutes with an out-of-memory error:

Traceback (most recent call last):
  File "/home/ge30/anaconda_ete/bin/calc_word.py", line 177, in <module>
    main()
  File "/home/ge30/anaconda_ete/bin/calc_word.py", line 152, in main
    vec = veccls[args.vector](seq_records.length_list, p)
  File "/home/ge30/anaconda_ete/lib/python3.6/site-packages/alfpy/word_vector.py", line 103, in __init__
    Counts.__init__(self, seq_lengths, patterns)
  File "/home/ge30/anaconda_ete/lib/python3.6/site-packages/alfpy/word_vector.py", line 41, in __init__
    self.data = self._get_counts_occurrence(len(seq_lengths), patterns)
  File "/home/ge30/anaconda_ete/lib/python3.6/site-packages/alfpy/word_vector.py", line 52, in _get_counts_occurrence
    data = np.empty((seq_count, patterns.count))
MemoryError
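
If I read the traceback correctly, the failing line tries to allocate a dense (number of sequences × number of patterns) array in one piece. A rough back-of-envelope estimate of its size for my input (the distinct 8-mer count below is only a guess for illustration):

# Back-of-envelope size of the array requested by np.empty() above.
# The distinct 8-mer count is an assumption, not a measured value.
seq_count = 500_000          # ~number of protein sequences in combined.faa
pattern_count = 5_000_000    # assumed number of distinct 8-mers in the data
bytes_per_value = 8          # np.empty() defaults to float64

size_gb = seq_count * pattern_count * bytes_per_value / 1024**3
print(f"{size_gb:,.0f} GB")  # ~18,626 GB, far more than a 250GB node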

I was running the job on a 250GB node, and it used only 34GB:

       JobID    Elapsed      NCPUS   NTasks      State     ReqMem  AveVMSize  MaxVMSize     MaxRSS 
------------ ---------- ---------- -------- ---------- ---------- ---------- ---------- ---------- 
14131079       00:07:05         20           COMPLETED      250Gn                                  
14131079.ba+   00:07:05         20        1  COMPLETED      250Gn  34523968K  34523968K  34070716K 
14131079.ex+   00:07:05         20        1  COMPLETED      250Gn    724756K    584412K      6676K

A. I'm not sure why it crashed, given that only 34GB of the 250GB was used. B. Is there a way to run it more efficiently?

Many thanks, Gil

cocoMA2020 commented 1 month ago

Hi,
I ran into the same problem. Did you manage to solve it?

aziele commented 1 month ago

Hi,

I'm sorry for the inconvenience. I wrote alfpy primarily for educational purposes and small datasets, so it doesn't perform well with large-scale data and long k-mers in protein sequences. While I do plan to rewrite alfpy to handle large datasets, I can't provide an estimated time for this update. In the meantime, I recommend using specialized tools for k-mer counting in protein sequences, such as MerCat2 or count_kmers, and then calculating the Canberra distance manually (e.g., with NumPy).
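
For what it's worth, here is a minimal sketch of that manual route (not alfpy code; the sequences and names are made up, and it uses a small dense matrix for clarity). For 500K proteins you would replace the counting step with a dedicated k-mer counter and a sparse or chunked matrix:

# Minimal sketch: k-mer frequencies per sequence, then pairwise
# Canberra distances with SciPy. Toy data for illustration only.
from collections import Counter
import numpy as np
from scipy.spatial.distance import pdist, squareform

def kmer_counts(seq, k):
    """Count overlapping k-mers in one protein sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# In practice, read the sequences from combined.faa and use k=8.
seqs = {"sp1": "MKVLITGAGG", "sp2": "MKVLITGSGG", "sp3": "MAVLITGAGG"}
counts = {name: kmer_counts(s, k=4) for name, s in seqs.items()}

# Dense (sequences x observed k-mers) matrix of relative frequencies.
vocab = sorted(set().union(*counts.values()))
mat = np.array([[c[w] for w in vocab] for c in counts.values()], dtype=float)
freqs = mat / mat.sum(axis=1, keepdims=True)

# Square Canberra distance matrix, ready to be written out in PHYLIP format.
dist = squareform(pdist(freqs, metric="canberra"))
print(dist)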

Again, I apologize for the trouble.

Best regards, Andrzej