PNNL-CompBio / Snekmer

Pipeline to apply encoded Kmer analysis to protein sequences
BSD 3-Clause "New" or "Revised" License

Speed issue #77

Closed biodataganache closed 2 years ago

biodataganache commented 2 years ago

For even moderately sized input files (e.g., 5k), kmerize takes a long time (an hour or more), which is far too slow. The problem was introduced by the previous fix for the memory issue. It lies in the make_feature_matrix function in vectorize.py, which uses a very slow way of constructing a matrix from individual lists.
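
To illustrate the kind of slowdown described above, here is a minimal sketch. It is not the actual make_feature_matrix code from vectorize.py. The function names build_matrix_slow and build_matrix_fast, the use of NumPy, and the sample data are all assumptions made for illustration. The sketch assumes a non-empty list of equal-length numeric rows and contrasts growing a matrix one row at a time with allocating it once and filling it in place.

```python
import numpy as np


def build_matrix_slow(rows):
    """Hypothetical slow pattern: grow the matrix one row at a time.

    Every np.vstack call allocates a new array and copies all rows
    accumulated so far, so the total work grows quadratically with
    the number of rows.
    """
    matrix = np.empty((0, len(rows[0])), dtype=float)
    for row in rows:
        matrix = np.vstack([matrix, row])
    return matrix


def build_matrix_fast(rows):
    """Hypothetical faster pattern: allocate once, then fill in place.

    The full array is allocated up front and each row is written
    directly into its slot, so no intermediate copies are made.
    """
    matrix = np.zeros((len(rows), len(rows[0])), dtype=float)
    for i, row in enumerate(rows):
        matrix[i, :] = row
    return matrix


# Made-up sample data: 3 rows of 4 values each.
rows = [[i + j for j in range(4)] for i in range(3)]
slow = build_matrix_slow(rows)
fast = build_matrix_fast(rows)
```

Both functions produce the same array. Only the second avoids recopying the accumulated rows on every iteration, which is why a pattern like the first can become very slow as the number of input rows grows.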