Open GilEshel opened 3 years ago
Hi~
I also ran into the same problem. Did you solve it?
Hi,
I'm sorry for the inconvenience. I wrote alfpy primarily for educational purposes and small datasets, so it doesn't perform well with large-scale data and long k-mers in protein sequences. While I do plan to rewrite alfpy to handle large datasets, I can't provide an estimated time for this update. In the meantime, I recommend using specialized tools for k-mer counting in protein sequences, such as MerCat2 or count_kmers, and then calculating the Canberra distance manually (e.g., with NumPy).
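As a rough illustration of the manual route, here is a minimal NumPy sketch of the Canberra distance between two k-mer frequency vectors. The vectors `a` and `b` are made-up toy data, not output from any real k-mer counter:

```python
import numpy as np

def canberra(x, y):
    # Canberra distance: sum over i of |x_i - y_i| / (|x_i| + |y_i|),
    # skipping positions where both values are zero (0/0 is treated as 0).
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    num = np.abs(x - y)
    den = np.abs(x) + np.abs(y)
    mask = den > 0
    return float(np.sum(num[mask] / den[mask]))

# Hypothetical k-mer frequency vectors for two sequences
a = np.array([3.0, 0.0, 1.0, 2.0])
b = np.array([1.0, 1.0, 0.0, 2.0])
print(canberra(a, b))  # 2/4 + 1/1 + 1/1 + 0/4 = 2.5
```

For many sequences, the same function can fill a pairwise distance matrix (or you can use `scipy.spatial.distance.pdist` with `metric="canberra"`), which keeps memory bounded by processing one pair of vectors at a time.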
Again, I apologize for the trouble.
Best regards, Andrzej
Hello,
I want to use alfpy to count k-mers and calculate the Canberra distance for clustering proteomes of multiple species.
The input is usually >500K protein sequences, so I want to be able to run it in a reasonable time while balancing the memory load.
I used the following command:
calc_word.py --fasta combined.faa --word_size 8 --distance canberra --vector freqs --outfmt phylip --out combined_canberra.phy
It ran on a single processor and crashed after 7 minutes with an out-of-memory error.
I was running the job on a 250GB node, and it used only 34GB.
A. I'm not sure why it crashed. B. Is there a way to run it more efficiently?
Many thanks, Gil