IEDB / PEPMatch


Batch kmer generation #17

Open · InfiniGeorges opened this issue 3 days ago

InfiniGeorges commented 3 days ago

https://github.com/IEDB/PEPMatch/blob/70d4ba9c7adb6d3dd5e20dfa55961f3299d26a7d/pepmatch/preprocessor.py#L186C1-L189C78

Dear Developer(s),

PEPMatch is an incredibly useful and well-written tool; thank you for your hard work!

I've noticed that k-mer generation can lead to high memory usage with large databases. To reduce this, I suggest inserting the k-mer rows in batches and clearing the buffer after each insert, as shown below. This keeps peak memory bounded and helps avoid out-of-memory errors:

batch_size = 10000  # Adjust the batch size as needed
kmer_rows = []

for protein_count, seq in enumerate(self.all_seqs):
    for j, kmer in enumerate(split_sequence(seq, k)):
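        # Pack the protein number and k-mer position into a single integer index (same encoding as the original loop)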
        kmer_rows.append((kmer, (protein_count + 1) * 1000000 + j))
        if len(kmer_rows) >= batch_size:
            cursor.executemany(f'INSERT INTO "{kmers_table}" VALUES (?, ?)', kmer_rows)
            kmer_rows.clear()  # Clear the list to start a new batch

# Insert any remaining rows that didn't make up a full batch
if kmer_rows:
    cursor.executemany(f'INSERT INTO "{kmers_table}" VALUES (?, ?)', kmer_rows)
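
For context, here is a minimal, self-contained sketch of the same batched executemany pattern against a throwaway in-memory SQLite database. The table name, toy sequences, split_sequence helper, and batch size are placeholders for illustration only, not PEPMatch's actual preprocessing code:

import sqlite3

def split_sequence(seq, k):
    # Yield every k-mer (substring of length k) of the sequence.
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

# Toy inputs standing in for self.all_seqs and k in preprocessor.py.
all_seqs = ["MKTAYIAKQR", "GAVLIPFMWS"]
k = 5
batch_size = 3  # small on purpose so the batching path is exercised

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute('CREATE TABLE "kmers" (kmer TEXT, idx INTEGER)')

kmer_rows = []
for protein_count, seq in enumerate(all_seqs):
    for j, kmer in enumerate(split_sequence(seq, k)):
        # Pack the protein number and k-mer position into one integer index.
        kmer_rows.append((kmer, (protein_count + 1) * 1000000 + j))
        if len(kmer_rows) >= batch_size:
            cursor.executemany('INSERT INTO "kmers" VALUES (?, ?)', kmer_rows)
            kmer_rows.clear()

# Flush any remaining rows that did not fill a batch.
if kmer_rows:
    cursor.executemany('INSERT INTO "kmers" VALUES (?, ?)', kmer_rows)

conn.commit()
print(cursor.execute('SELECT COUNT(*) FROM "kmers"').fetchone()[0])  # 12 k-mers total

A larger batch_size trades a bigger in-memory buffer for fewer executemany calls; the 10,000 used above is just a reasonable starting point to adjust as needed.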
danielmarrama commented 3 days ago

Hello, @InfiniGeorges, thank you very much for pointing this out! This definitely squashes the memory footprint and it runs in about the same amount of time. It passes my tests as well.

Since you suggested it, would you like to submit a PR?