IEDB / PEPMatch


Batch kmer generation #17

Open · InfiniGeorges opened this issue 3 days ago

InfiniGeorges commented 3 days ago

https://github.com/IEDB/PEPMatch/blob/70d4ba9c7adb6d3dd5e20dfa55961f3299d26a7d/pepmatch/preprocessor.py#L186C1-L189C78

Dear Developer(s),

PEPMatch is an incredibly useful and well-written tool; thank you for your hard work!

I've noticed that k-mer generation can lead to high memory usage with large databases. To reduce this, I suggest inserting the k-mer rows in batches and clearing the buffer after each insert, as shown below. This keeps peak memory bounded and helps avoid out-of-memory errors:

batch_size = 10000  # Adjust the batch size as needed
kmer_rows = []

for protein_count, seq in enumerate(self.all_seqs):
    for j, kmer in enumerate(split_sequence(seq, k)):
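        # Pack the protein number and k-mer position into a single integer index (same encoding as the original loop)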
        kmer_rows.append((kmer, (protein_count + 1) * 1000000 + j))
        if len(kmer_rows) >= batch_size:
            cursor.executemany(f'INSERT INTO "{kmers_table}" VALUES (?, ?)', kmer_rows)
            kmer_rows.clear()  # Clear the list to start a new batch

# Insert any remaining rows that didn't make up a full batch
if kmer_rows:
    cursor.executemany(f'INSERT INTO "{kmers_table}" VALUES (?, ?)', kmer_rows)
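
For context, here is a minimal, self-contained sketch of the same batched executemany pattern against a throwaway in-memory SQLite database. The table name, toy sequences, split_sequence helper, and batch size are placeholders for illustration only, not PEPMatch's actual preprocessing code:

import sqlite3

def split_sequence(seq, k):
    # Yield every k-mer (substring of length k) of the sequence.
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

# Toy inputs standing in for self.all_seqs and k in preprocessor.py.
all_seqs = ["MKTAYIAKQR", "GAVLIPFMWS"]
k = 5
batch_size = 3  # small on purpose so the batching path is exercised

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute('CREATE TABLE "kmers" (kmer TEXT, idx INTEGER)')

kmer_rows = []
for protein_count, seq in enumerate(all_seqs):
    for j, kmer in enumerate(split_sequence(seq, k)):
        # Pack the protein number and k-mer position into one integer index.
        kmer_rows.append((kmer, (protein_count + 1) * 1000000 + j))
        if len(kmer_rows) >= batch_size:
            cursor.executemany('INSERT INTO "kmers" VALUES (?, ?)', kmer_rows)
            kmer_rows.clear()

# Flush any remaining rows that did not fill a batch.
if kmer_rows:
    cursor.executemany('INSERT INTO "kmers" VALUES (?, ?)', kmer_rows)

conn.commit()
print(cursor.execute('SELECT COUNT(*) FROM "kmers"').fetchone()[0])  # 12 k-mers total

A larger batch_size trades a bigger in-memory buffer for fewer executemany calls; the 10,000 used above is just a reasonable starting point to adjust as needed.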
danielmarrama commented 3 days ago

Hello, @InfiniGeorges, thank you very much for pointing this out! This definitely squashes the memory footprint and it runs in about the same amount of time. It passes my tests as well.

Since you suggested it, would you like to submit a PR?