Arcadia-Science / peptigate

Peptigate ("peptide" + "investigate") predicts bioactive peptides from transcriptome assemblies or sets of proteins.
MIT License

Weird sORF results when changing the order of the input contigs file #50

Closed taylorreiter closed 5 months ago

taylorreiter commented 5 months ago

Description of the bug

Over in #47, I noticed that if I ran

cat shorter_contigs.fa longer_contigs.fa > contigs.fa

I got different sORF predictions than if I ran

cat longer_contigs.fa shorter_contigs.fa > contigs.fa

I only got different results for sORF peptides. Below I've included the full output files, but here is a snippet of the differences (nt seqs are truncated).

Short first:

peptide_id      start   end     peptide_type    peptide_class   prediction_tool nlpprecursor_class_score        nlpprecursor_cleavage_score     protein_sequence        nucleotide_sequence
petx0wholefemale_NODE_618917_length_73_cov_2.401639_g519643_i0  NA      NA      sORF    NA      plmutils        NA      NA      MPTLSATAMFTQFTLPL       ATGCCAACCTTGAGTGCTACTGCCATGTTTACACAGTTTACCTTGCCA>
petx0wholefemale_NODE_618920_length_72_cov_3.090909_g519646_i0  NA      NA      sORF    NA      plmutils        NA      NA      LTLRAKRPKDTTNQTS        CTGACTTTGCGAGCCAAACGACCAAAAGACACAACTAACCAAACAAGC
Transcript_10   NA      NA      sORF    NA      plmutils        NA      NA      VSGEPVAKHKGLASFLELYCENCAFPEKVISRAYTSWRVTAGKDESKSAARAYDSGSSCESFTVNVKAVVVARSFGIRYQQLMVQEVSGDSGFFYG        GTGTCTGGGGAACCGG>
Transcript_30   NA      NA      sORF    NA      plmutils        NA      NA      MGSSAVSLAGRRRRMKAILYSSIALSSLLLMLSQSNLQNRVRLLYFFSELCSVGYFLLVGSTVKKMH     ATGGGAAGTTCTGCCGTGAGCCTCGCCGGCAGAAGGCGTCGGATGAAA>
Transcript_31   NA      NA      sORF    NA      plmutils        NA      NA      MFVQWRTQNSNNSSSVDCCDVWDVNLSRKCVIYACKYFKNRLRTWFSAMKMFNVVAGGQVSKLLEIIHYYSRLQLRNEAANVPQRSPSSQWS    ATGTTTGTGCAGTGGCGGACACAA>
Transcript_42   NA      NA      sORF    NA      plmutils        NA      NA      MHFYTGRVDLALVFGVDFSHSEMPKFPSWLGAGLEVVGELCRRCVPMARLAFGNRAGFSSPVCWAASMSVQSAM      ATGCACTTTTATACAGGGAGAGTAGATCTTGCTTTAGTCT>
Transcript_45   NA      NA      sORF    NA      plmutils        NA      NA      LGCILLLLLTLLLFLCDYTRECVVDINMCIKDTARTMAVVCSCFENAVWCAVLVECFWYVLDICGVLCARSGIATEGSRLAGWLEDEDDVSSFSWKNS      CTGGGATGTATCTTGT>
Transcript_53   NA      NA      sORF    NA      plmutils        NA      NA      LHGGACKPEILLTQYGPFMSYQGNTKTESRLIRMFCVGACGSGNCKNKEIAPKCCCVPLLHKSLATNFFPATVCVRLLLVLRLFFAVYFLLTNFL TTGCATGGCGGCGCCTGCAAGCCG>
Transcript_56   NA      NA      sORF    NA      plmutils        NA      NA      VSQRCVLCLLFFVSSVALLWVMISETKVVVSAGYCNLVRSTYTILLAPCSLRHLLRTTFRRTSPQRTLHSLKNAVAVTT GTGAGTCAGCGCTGTGTGTTGTGTCTCCTCTTCTTTGTCT>
Transcript_57   NA      NA      sORF    NA      plmutils        NA      NA      LSAEMNGPNLSDEYAASVLPLFPTGTAFKNSSLLRVGRSIELYVSSLGAPEFFVSARISFLRAFVIEGFKCSELVTVTSAK       TTGTCGGCTGAAATGAACGGCCCCAACTTGTC>
Transcript_58   NA      NA      sORF    NA      plmutils        NA      NA      LGAMRLLLTDDSHHYRTLEPILLFRHIAACFRGSFDPFYTHFTPILPKGPSLCALAMGMATTKPLQ      TTGGGAGCGATGCGCCTGCTGCTTACCGATGATTCGCATCACTACAGA>
Transcript_66   NA      NA      sORF    NA      plmutils        NA      NA      LPHFFFATEAEGANQERRHCHSHAIYSYRARLLVKHKSSLVVPSSRIKKLGIPLCHA       CTGCCACATTTTTTTTTTGCAACTGAAGCGGAAGGTGCCAACCAGGAGCGCCGTCA>
Transcript_67   NA      NA      sORF    NA      plmutils        NA      NA      MGYLTAQCAIWLEICVQFFTQASCSMNMLECDCFSYAFEDPSKHTCTLYDVKQHTKGHMLALSLLMYTCVSAISSLLSILWLPSIT  ATGGGATACTTAACTGCACAGTGCGCTATTTG>
Transcript_80   NA      NA      sORF    NA      plmutils        NA      NA      LIQCLRTYSVWTHGRKARPYLEERNSYMRMSKLNASCFIILRHTVVMETRKLSLHLQRGTKSTKP       TTGATACAATGTCTCAGAACATACTCCGTCTGGACGCACGGTCGGAAG>
Transcript_81   NA      NA      sORF    NA      plmutils        NA      NA      LVQRTNNSQLNSRHCCLSCTFTQVQGLHSSFHAQPFLFGQMDKNAVTLINRQALYIKEVFFK  CTGGTTCAAAGGACAAATAACAGTCAGCTAAACAGCCGGCACTGCTGTCTGTCGTG

Long first:

peptide_id      start   end     peptide_type    peptide_class   prediction_tool nlpprecursor_class_score        nlpprecursor_cleavage_score     protein_sequence        nucleotide_sequence
Transcript_53   NA      NA      sORF    NA      plmutils        NA      NA      LHGGACKPEILLTQYGPFMSYQGNTKTESRLIRMFCVGACGSGNCKNKEIAPKCCCVPLLHKSLATNFFPATVCVRLLLVLRLFFAVYFLLTNFL TTGCATGGCGGCGCCTGCAAGCCG>
Transcript_54   NA      NA      sORF    NA      plmutils        NA      NA      VRHEKTEISSPLLHSLSFWLLRKAGFSPIMNNNHEAVVISAFLHASHDRKTHRPSQPSFSY   GTGAGGCATGAAAAAACTGAAATATCATCTCCGCTTCTACATTCGTTGTCATTCTG>
Transcript_67   NA      NA      sORF    NA      plmutils        NA      NA      MGYLTAQCAIWLEICVQFFTQASCSMNMLECDCFSYAFEDPSKHTCTLYDVKQHTKGHMLALSLLMYTCVSAISSLLSILWLPSIT  ATGGGATACTTAACTGCACAGTGCGCTATTTG>
Transcript_79   NA      NA      sORF    NA      plmutils        NA      NA      LWRIIIAAQFSLKSGDHCLVFHQLRLLRCETVPEFFFFAVIQLFVLRIVQIFFSVLEVLINVVSVH      TTGTGGAGAATCATCATTGCTGCACAGTTTTCTCTTAAATCAGGCGAC>
Transcript_80   NA      NA      sORF    NA      plmutils        NA      NA      LIQCLRTYSVWTHGRKARPYLEERNSYMRMSKLNASCFIILRHTVVMETRKLSLHLQRGTKSTKP       TTGATACAATGTCTCAGAACATACTCCGTCTGGACGCACGGTCGGAAG>
Transcript_83   NA      NA      sORF    NA      plmutils        NA      NA      MWKLNNTLLRDDVYYRAVKDEIGKINPCKNLKIWQQWELSKESLKIKAIERATCIRYKEKNEAELRALLETLLKQECKEPRKWI    ATGTGGAAGCTAAACAACACGCTTCTTCGCGA>
Transcript_85   NA      NA      sORF    NA      plmutils        NA      NA      LVLRLRGGAKKRKKKNYSTPKKIKHKRKKVKLAVLKYYKVDENGKIHRLRRECTSESCGAGVFMAAHEDRHYCGKCHLTLVYSKQEDK        CTGGTGCTTCGCCTGCGCGGTGGC>
Transcript_88   NA      NA      sORF    NA      plmutils        NA      NA      VPLFKAPSDNVVLEKWRRAIPRADRTLMPTDHVCAKHFAEDAISRAYYAELDKSATLRGRNARAFQRCSSYITVADG   GTGCCATTATTCAAAGCTCCGTCCGACAATGTTGTTTTGG>
Transcript_90   NA      NA      sORF    NA      plmutils        NA      NA      LTVALPTSHLLNGILCLLSSLAGVGKQPSEVYHICHLSRLQHRVFSTVTPT     TTGACAGTAGCATTACCCACTTCTCATTTATTAAACGGCATTCTGTGCCTTCTTAGTTCTCTTG>
Transcript_91   NA      NA      sORF    NA      plmutils        NA      NA      TRTNGSPSSLKPRIIGRNFRYSIYTLQLKLHAVTSAALKTITHG    ACGCGTACTAATGGAAGCCCGAGTTCACTGAAACCCCGCATAATTGGGAGGAACTTTCGGTATTCCATTTAT>
Transcript_92   NA      NA      sORF    NA      plmutils        NA      NA      MLSNRKCVYTNMFTADGIYLQPVPLLSIRGACCTTGDCSISDVWAAYHHSVLAVCITQLTHILRPANHLNPILHNGPTRSFAAVYNR ATGCTGAGTAACAGAAAATGCGTTTATACAAA>
Transcript_96   NA      NA      sORF    NA      plmutils        NA      NA      LKCATSAHLKLKKRNIADACFPHALKKGFLEKYNDNVNLQAVRLQSSGYPILFFGSVVENRLQHIV      TTGAAATGCGCTACCTCAGCGCATTTAAAGCTGAAAAAACGCAACATT>
Transcript_98   NA      NA      sORF    NA      plmutils        NA      NA      LIAHSRDPPCSRSRSFKQRSDQCRCVRMTKVFHKPRFSHISRPLRCSLLN      CTGATAGCGCACTCCAGGGATCCCCCGTGCTCTCGAAGCCGTTCATTCAAACAGCGCTCCGACC>
petx0wholefemale_NODE_618928_length_70_cov_1.512605_g519654_i0  NA      NA      sORF    NA      plmutils        NA      NA      THLWSISSYRCHTTTRQYF     ACGCACCTTTGGTCAATTTCATCCTATCGCTGTCACACGACGACACGA>
petx0wholefemale_NODE_618932_length_68_cov_3.153846_g519658_i0  NA      NA      sORF    NA      plmutils        NA      NA      LPPCSAFFLSLFNCVVNY      TTGCCGCCATGCTCGGCTTTTTTTTTATCTTTGTTTAACTGTGTGGT

Command used and terminal output

cat contigs_shorter_than_r2t_minimum_length.fa contigs_longer_than_r2t_minimum_length.fa > contigs.fa
snakemake --software-deployment-method conda -j 8 -k --configfile demo/config.yml

cat contigs_longer_than_r2t_minimum_length.fa contigs_shorter_than_r2t_minimum_length.fa > contigs.fa
snakemake --software-deployment-method conda -j 8 -k --configfile demo/config.yml

Relevant files

bug.tar.gz

This tar archive contains the demo outputs from both runs: the "correct" outputs (short contigs concatenated first) and the "wrong" outputs (long contigs concatenated first).
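For anyone who wants to compare the two runs by peptide ID rather than by eye, here is a minimal sketch. It assumes the tab-separated layout shown in the snippets above (the `peptide_id`, `peptide_type`, and `protein_sequence` columns). The helper names and the tiny in-memory tables are hypothetical stand-ins, and the sequences are shortened placeholders, not the real outputs from the archive.

```python
import csv
import io


def load_sorf_peptides(handle):
    """Map peptide_id -> protein_sequence for rows whose peptide_type is sORF."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["peptide_id"]: row["protein_sequence"]
        for row in reader
        if row["peptide_type"] == "sORF"
    }


def compare_runs(short_first, long_first):
    """Return IDs unique to each run and the IDs both runs share."""
    only_short = sorted(set(short_first) - set(long_first))
    only_long = sorted(set(long_first) - set(short_first))
    shared = sorted(set(short_first) & set(long_first))
    return only_short, only_long, shared


# Placeholder tables standing in for the two output files (hypothetical data).
HEADER = "peptide_id\tpeptide_type\tprotein_sequence\n"
short_first_tsv = HEADER + "Transcript_10\tsORF\tVSGEPV\n" + "Transcript_53\tsORF\tLHGGAC\n"
long_first_tsv = HEADER + "Transcript_53\tsORF\tLHGGAC\n" + "Transcript_54\tsORF\tVRHEKT\n"

short_first = load_sorf_peptides(io.StringIO(short_first_tsv))
long_first = load_sorf_peptides(io.StringIO(long_first_tsv))

only_short, only_long, shared = compare_runs(short_first, long_first)
print("only in short-first run:", only_short)
print("only in long-first run:", only_long)
print("in both runs:", shared)
```

To run it on real outputs, replace the `io.StringIO(...)` objects with file handles opened on the two peptide tables from the archive.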

These are the GitHub links to the long and short contig files used as inputs for this run:
https://github.com/Arcadia-Science/peptigate/blob/main/demo/contigs_longer_than_r2t_minimum_length.fa
https://github.com/Arcadia-Science/peptigate/blob/main/demo/contigs_shorter_than_r2t_minimum_length.fa

System information

I ran peptigate on a Linux EC2. Compute specifications are reported here: https://github.com/Arcadia-Science/peptigate?tab=readme-ov-file#compute-specifications

keithchev commented 5 months ago

We chatted offline, but just for the record: I looked into this bug using the files attached above in bug.tar.gz, and I'm almost certain that the problem is an order-preservation bug in plm-utils. The order of the rows of the embeddings matrix generated by the plmutils embed command does not match the order of the sequences in the input FASTA file. Since the embeddings matrix is used to generate the predictions, the predictions are not matched to the correct sequence IDs in the plmutils_predictions.csv file output by the plmutils_predict rule of the peptigate snakefile.

This bug is fixed in a PR in the plm-utils repo here.
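To illustrate this class of bug, here is a toy sketch. It is not the plm-utils code. It assumes, purely for illustration, that the embedding step processes sequences longest-first and returns rows in that processing order; the actual cause of the reordering in plm-utils is not described in this thread. All function names and the length-based "classifier" are hypothetical.

```python
def toy_embed(sequences):
    """Buggy stand-in: embeds longest-first and returns rows in processing order."""
    ordered = sorted(sequences, key=len, reverse=True)
    return [float(len(seq)) for seq in ordered]


def toy_embed_order_preserving(sequences):
    """Same processing order, but each row is written back to its input position."""
    indexed = sorted(enumerate(sequences), key=lambda pair: len(pair[1]), reverse=True)
    rows = [None] * len(sequences)
    for original_index, seq in indexed:
        rows[original_index] = float(len(seq))
    return rows


def toy_predict(embedding):
    """Hypothetical classifier: calls a sequence positive if its 'embedding' >= 9."""
    return embedding >= 9.0


# Input order: short, long, medium (placeholder IDs and sequences).
records = [("short_1", "MPTLS"), ("long_1", "MGSSAVSLAGRRRR"), ("mid_1", "LTLRAKRPK")]
ids = [record_id for record_id, _ in records]
seqs = [seq for _, seq in records]

# Predictions are zipped back onto IDs assuming row i belongs to sequence i.
buggy = dict(zip(ids, map(toy_predict, toy_embed(seqs))))
fixed = dict(zip(ids, map(toy_predict, toy_embed_order_preserving(seqs))))

print("buggy:", buggy)
print("fixed:", fixed)
```

In the buggy version the prediction for the longest sequence lands on whichever ID happens to be first in the input, so reordering the input file changes which IDs are reported. That matches the symptom in this issue, though the mechanism shown here is only an assumed example.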