MiPepid

MiPepid is a software tool designed specifically to predict the coding capability of sORFs.

The corresponding paper, "MiPepid: Micropeptide identification tool using machine learning"[*], has been published in BMC Bioinformatics.

sORFs (also called smORFs) are short open reading frames of at most 303 bp, including the stop codon. If translated, they encode micropeptides of at most 100 amino acids.
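The two limits agree with each other: 303 bp is 101 codons, and the last codon is the stop codon, which is not translated. The sketch below only illustrates that arithmetic. The helper name `micropeptide_length` is hypothetical and is not part of MiPepid.

```python
MAX_SORF_BP = 303  # upper length limit stated above, stop codon included


def micropeptide_length(sorf_bp: int) -> int:
    """Return the number of amino acids encoded by an sORF of sorf_bp bases.

    Assumes the length includes the stop codon and is a multiple of 3.
    """
    if sorf_bp < 6 or sorf_bp % 3 != 0:
        raise ValueError("expected a multiple of 3 covering at least a start and a stop codon")
    return sorf_bp // 3 - 1  # the stop codon is not translated


print(micropeptide_length(MAX_SORF_BP))  # 100
```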

Micropeptides were traditionally ignored because of their small size, but they are now attracting increasing attention because they have been shown to play critical roles in many vital biological activities.

What does MiPepid do?

Given a FASTA file of DNA sequences, MiPepid finds every sORF (length <= 303 bp) in all 3 translation frames of each sequence. For each sORF it returns the predicted class label (coding or noncoding) and the probability of belonging to that class. All results are written to an output .csv file.
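The scanning step described above can be sketched as follows. This is an illustration only, not MiPepid's own code. It assumes that an sORF starts at ATG, ends at the first in-frame stop codon (TAA, TAG or TGA), and that only the 3 forward frames are scanned. The function name `find_sorfs` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
MAX_SORF_BP = 303  # length limit from the text, stop codon included


def find_sorfs(seq, max_len=MAX_SORF_BP):
    """Yield (frame, start, end, orf) for every ATG...stop ORF of at most max_len bp.

    frame is the 0-based frame offset; start and end are 0-based, end exclusive.
    Assumption: each in-frame ATG gives its own candidate, so nested ORFs
    that share a stop codon are all reported.
    """
    seq = seq.upper()
    for frame in range(3):
        for start in range(frame, len(seq) - 2, 3):
            if seq[start:start + 3] != "ATG":
                continue
            # Walk codon by codon until the first in-frame stop codon.
            for pos in range(start + 3, len(seq) - 2, 3):
                if seq[pos:pos + 3] in STOP_CODONS:
                    end = pos + 3
                    if end - start <= max_len:
                        yield frame, start, end, seq[start:end]
                    break


for hit in find_sorfs("CCATGAAATAGGG"):
    print(hit)  # (2, 2, 11, 'ATGAAATAG')
```

An ATG with no downstream in-frame stop codon yields nothing here, because an open reading frame needs both ends. The classification step (coding or noncoding, with a probability) is not sketched.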

Dependencies:

Language dependency: Python 3 (Please do not use Python 2 to run the code.)

Library dependency:

How to use:

cd MiPepid
python3 ./src/mipepid.py input_fasta_file_path output_csv_file_path
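To call MiPepid from another Python script, for example when processing several FASTA files, the command above can be wrapped with `subprocess`. This is a minimal sketch. The helper names `build_command` and `run_mipepid` are hypothetical, and the only interface assumed is the two positional arguments shown in the usage line.

```python
import subprocess


def build_command(input_fasta, output_csv):
    """Return the argument list for one MiPepid run, mirroring the usage line."""
    return ["python3", "./src/mipepid.py", str(input_fasta), str(output_csv)]


def run_mipepid(input_fasta, output_csv, repo_dir="MiPepid"):
    """Run MiPepid from inside repo_dir.

    Relative paths are resolved against repo_dir because the process
    starts there. Raises subprocess.CalledProcessError on a nonzero exit.
    """
    subprocess.run(build_command(input_fasta, output_csv), cwd=repo_dir, check=True)


print(build_command("input.fasta", "results.csv"))
```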

Note:

The output .csv file contains the following columns:

How to run a demo

The directory ./demo_files/ contains a sample DNA sequence file, sample_seqs.fasta. To run MiPepid on this file:

cd MiPepid
python3 ./src/mipepid.py ./demo_files/sample_seqs.fasta ./demo_files/MiPepid_results_on_sample_seqs.csv

This will output a file MiPepid_results_on_sample_seqs.csv under the same directory (./demo_files/).
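One way to take a first look at the results file is with Python's standard `csv` module. The sketch below makes no assumption about the column names and simply reports whatever header the file contains. The function name `preview_csv` is hypothetical.

```python
import csv
from itertools import islice
from pathlib import Path


def preview_csv(path, n_rows=5):
    """Return (column_names, first n_rows rows as dicts) of a CSV file."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        rows = list(islice(reader, n_rows))
        return list(reader.fieldnames or []), rows


results = Path("demo_files/MiPepid_results_on_sample_seqs.csv")
if results.exists():  # only present after the demo has been run
    columns, rows = preview_csv(results)
    print(columns)
    for row in rows:
        print(row)
```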

Regarding the datasets

datasets.tar.gz contains all major datasets used in the paper.
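The archive can be inspected before unpacking with Python's standard `tarfile` module. This is a small sketch that assumes only that datasets.tar.gz is a gzip-compressed tar archive, as its extension suggests. The function name `list_archive` is hypothetical.

```python
import tarfile
from pathlib import Path


def list_archive(path):
    """Return the member names of a gzip-compressed tar archive."""
    with tarfile.open(path, "r:gz") as archive:
        return archive.getnames()


archive_path = Path("datasets.tar.gz")
if archive_path.exists():  # skip quietly when run outside the repository
    for name in list_archive(archive_path):
        print(name)
```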



[*]: Mengmeng Zhu, Michael Gribskov. MiPepid: Micropeptide identification tool using machine learning. BMC Bioinformatics 20, 559 (2019). doi:10.1186/s12859-019-3033-9