Closed taylorreiter closed 5 months ago
We chatted offline, but just for the record: I looked into this bug using the files attached to this issue in bug.tar.gz, and I'm almost certain that the problem is due to an order-preservation bug in plm-utils. The order of the rows of the embeddings matrix generated by the plmutils embed command does not match the order of the sequences in the input FASTA file. Since the embeddings matrix is used to generate the predictions, the predictions are not matched to the correct sequence IDs in the plmutils_predictions.csv file output by the plmutils_predict rule of the peptigate snakefile.
This bug is fixed in a PR in the plm-utils repo here.
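To illustrate the failure mode described in this comment, here is a minimal sketch in plain Python. It is not the plm-utils code or the actual fix. The function names (`read_fasta_ids`, `realign_rows`), the toy sequence IDs, and the toy embedding values are all hypothetical, and it assumes the embedding step can report which sequence ID each row belongs to.

```python
from io import StringIO


def read_fasta_ids(handle):
    """Return sequence IDs in the order they appear in a FASTA file.

    Assumes each header line looks like '>ID optional description'.
    """
    return [line[1:].split()[0] for line in handle if line.startswith(">")]


def realign_rows(rows, row_ids, fasta_ids):
    """Reorder `rows` so that row i corresponds to fasta_ids[i]."""
    index = {seq_id: i for i, seq_id in enumerate(row_ids)}
    if len(index) != len(row_ids):
        raise ValueError("duplicate sequence IDs in embedding rows")
    if set(index) != set(fasta_ids) or len(fasta_ids) != len(row_ids):
        raise ValueError("embedding IDs and FASTA IDs differ")
    return [rows[index[seq_id]] for seq_id in fasta_ids]


# Toy input: three sequences in FASTA order a, b, c (hypothetical data).
fasta = StringIO(">seq_a some description\nMKV\n>seq_b\nMLL\n>seq_c\nMAA\n")
fasta_ids = read_fasta_ids(fasta)

# Pretend the embedding step emitted its rows in a different order.
row_ids = ["seq_c", "seq_a", "seq_b"]
rows = [[0.3], [0.1], [0.2]]

# Pairing rows with FASTA IDs by position mislabels every row here.
naive = dict(zip(fasta_ids, rows))

# Pairing by ID restores the intended mapping.
fixed = dict(zip(fasta_ids, realign_rows(rows, row_ids, fasta_ids)))

print("naive:", naive)
print("fixed:", fixed)
```

Pairing by position silently produces a table with the right shape and the wrong labels, which is why the predictions file looked plausible. Keying on the sequence ID, or at least checking that the two ID lists are identical before pairing, turns that silent mismatch into either a correct mapping or a loud error.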
Description of the bug
Over in #47, I noticed that if I ran
I got different sORF predictions than if I ran
I only got different results for sORF peptides. Below I've included the full output files, but here is a snippet of the differences (nt seqs are truncated):
Short first:
Long first:
Command used and terminal output
Relevant files
bug.tar.gz
In this tar archive, I include the demo outputs for both cases: the "correct" outputs (when the short contigs are concatenated first) and the "wrong" outputs (when the long contigs are concatenated first).
These are the GitHub links to the short and long contig files used as inputs for this run:
https://github.com/Arcadia-Science/peptigate/blob/main/demo/contigs_longer_than_r2t_minimum_length.fa
https://github.com/Arcadia-Science/peptigate/blob/main/demo/contigs_shorter_than_r2t_minimum_length.fa
System information
I ran peptigate on a Linux EC2 instance. Compute specifications are reported here: https://github.com/Arcadia-Science/peptigate?tab=readme-ov-file#compute-specifications