althonos / pyrodigal

Cython bindings and Python interface to Prodigal, an ORF finder for genomes and metagenomes. Now with SIMD!
https://pyrodigal.readthedocs.org
GNU General Public License v3.0
138 stars 5 forks source link

Sequence index is not preserved in multithread mode #20

Closed zdk123 closed 1 year ago

zdk123 commented 1 year ago

Hit another bug in the output data.

Compare

records = SeqIO.parse(fasta_file, "fasta")
orf_finder = pyrodigal.OrfFinder(meta=True)
predictions = [orf_finder.find_genes(bytes(record.seq)) for record in records]
[p.__getstate__()['_num_seq'] for p in predictions]

output:

[1, 2, 3, 4 , ... ]
records = SeqIO.parse(fasta_file, "fasta")
orf_finder = pyrodigal.OrfFinder(meta=True)
with pool.ThreadPool() as p:
    predictions = p.map(lambda r: orf_finder.find_genes(bytes(r.seq)), records)
[p.__getstate__()['_num_seq'] for p in predictions]

output:

[2, 54, 66, 72, ...]

This causes a mismatch in the input sequence order and the ID in the resulting GFF / stats file. Anyway of fixing this rather than modifying the state?

althonos commented 1 year ago

This is not so much of a bug than it is an issue with how Python redirects the iterable inside ThreadPool.map i think; there is no guarantee in the order in which the threads receive the items. I think ThreadPool.imap might guarantee iteration order, but it's often slower. There's also the possibility that a first sequence takes longer than a second sequence, causing sequence number 2 to be returned first.

Since I'm going to change the ID field of the GFF output to become the gene identifier as you suggested in #18 anyway, I don't think there will be a reason to allow changing the _num_seq attribute manually. It shouldn't be used for anything else...

zdk123 commented 1 year ago

I agree with your interpretation, and the other fix will definitely solve this for us too. My only point I guess is that _num_seq doesn't get used until the results are written so it could be corrected manually.