Classifying transcripts as lncRNAs is difficult because there is no consensus on what a lncRNA truly is. Instead of applying fixed thresholds or rules to identify lncRNAs, this tool uses an ensemble stacking method of 8 different gradient boosting models to predict them. Trained only on true, validated lncRNAs, this method has been tested on plant transcriptomes with a high success rate.
See our publication at: Prediction of plant lncRNA by ensemble machine learning classifiers
If you use CREMA, please cite: Simopoulos CMA, Weretilnyk EA, Golding GB. "Prediction of plant lncRNA by ensemble machine learning classifiers." BMC Genomics (2018) doi:10.1186/s12864-018-4665-2
To use this tool, clone this repository to your machine:
git clone https://github.com/gbgolding/crema.git
To use this tool you will need:
Before you can run the tool, you'll need to remove all rRNAs and tRNAs from your input data.
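One way to do this filtering is sketched below. It assumes you already have a list of rRNA and tRNA transcript IDs (for example, from your genome annotation). The function name `drop_records`, the example IDs, and the file names in the comments are hypothetical and not part of CREMA.

```python
def drop_records(fasta_lines, ids_to_remove):
    """Yield FASTA lines, skipping every record whose ID is in ids_to_remove.

    The record ID is taken as the text between '>' and the first whitespace.
    """
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].split()
            record_id = header[0] if header else ""
            keep = record_id not in ids_to_remove
        if keep:
            yield line


# Made-up records for illustration: "rRNA_1" stands in for an annotated rRNA.
example = [
    ">tx1 some description\n", "ACGT\n",
    ">rRNA_1\n", "GGCC\n", "GGAA\n",
    ">tx2\n", "TTTT\n",
]
filtered = list(drop_records(example, {"rRNA_1"}))

# With real files (hypothetical names), the same function can stream line by line:
# with open("transcripts.fa") as src, open("transcripts.filtered.fa", "w") as dst:
#     dst.writelines(drop_records(src, ids_to_remove))
```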
Then, you will need to run cpat.py. An example:
cpat.py -g your_transcript_fasta_file.fa -o cpat_output.txt -x ./cpat_models/ath_hexamer -d ./cpat_models/ath_logit.RData
First, create the DIAMOND database from the SwissProt protein database:
diamond makedb --in uniprot_sprot.fasta -d swissprot.dmnd
Run DIAMOND:
diamond blastx -d swissprot.dmnd -q your_transcript_fasta_file.fa -o diamond_output.txt \
-e 0.001 -k 5 --matrix BLOSUM62 --gapopen 11 --gapextend 1 --more-sensitive \
-f 6 qseqid pident length qframe qstart qend sstart send evalue bitscore
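If you want to inspect the DIAMOND hits before running the tool, the tab-separated output can be read with the column names given to `-f 6` in the command above. This is only a minimal sketch: the helper name `read_diamond_hits` and the sample line are made up for illustration, and CREMA itself does not require this step.

```python
import csv
import io

# Column order matches the fields listed after "-f 6" in the DIAMOND command.
DIAMOND_COLUMNS = [
    "qseqid", "pident", "length", "qframe", "qstart",
    "qend", "sstart", "send", "evalue", "bitscore",
]


def read_diamond_hits(handle):
    """Return one dict per hit from a tab-separated DIAMOND output handle."""
    reader = csv.DictReader(handle, fieldnames=DIAMOND_COLUMNS, delimiter="\t")
    return list(reader)


# Made-up hit line for illustration only; real values come from diamond_output.txt.
sample = io.StringIO("tx1\t87.5\t120\t2\t5\t364\t10\t129\t1e-40\t210.3\n")
hits = read_diamond_hits(sample)

# With a real file:
# with open("diamond_output.txt", newline="") as handle:
#     hits = read_diamond_hits(handle)
```

All values are returned as strings, so convert fields such as `evalue` or `bitscore` with `float()` before comparing them.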
Once you have identified your transcript features using CPAT and DIAMOND, you can run the tool!
python3 bin/predict.py -f your_transcript_fasta_file.fa -c cpat_output.txt -d diamond_output.txt
Note: if the script cannot find logit_models.RData, please run predict.py using its full file path. This is a known issue that we are working on solving.
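As a small illustration of calling predict.py by its full path, the sketch below builds the command with an absolute script path. The helper name `build_predict_command` and the `"crema"` directory are assumptions for this example; adjust the directory to wherever you cloned the repository.

```python
import os


def build_predict_command(repo_dir, fasta, cpat_out, diamond_out):
    """Build an argument list that calls predict.py by its absolute path."""
    script = os.path.abspath(os.path.join(repo_dir, "bin", "predict.py"))
    return ["python3", script, "-f", fasta, "-c", cpat_out, "-d", diamond_out]


# "crema" is assumed to be the directory created by git clone.
cmd = build_predict_command(
    "crema",
    "your_transcript_fasta_file.fa",
    "cpat_output.txt",
    "diamond_output.txt",
)

# To actually run it:
# import subprocess
# subprocess.run(cmd, check=True)
```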
All output files are written to your working directory. Custom output directories to come...
The most helpful output file is final_ensemble_predictions.csv.
The CSV contains the features used in prediction as well as the lncRNA prediction score and final decision.
The columns describe:
The other files may be less useful to you, depending on what you're looking at.
all_model_predictions.csv: how each base model predicted the transcript (1 == lncRNA).
all_model_scores.csv: the lncRNA prediction scores of each transcript for each base model.
ensemble_logreg_pred.csv: the raw output of the final logistic regression stacking classifier.
Required arguments:
-f input fasta file
-c output file from CPAT run
-d output file from Diamond blastx
Optional arguments:
-s minimum lncRNA prediction score (Default: 0.5)
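To illustrate what the -s option controls, the sketch below turns prediction scores into lncRNA calls at two different minimum scores. It is not CREMA's code: the function name `call_lncrna`, the example scores, and the inclusive comparison (`>=`) are assumptions made for illustration.

```python
def call_lncrna(score, minimum_score=0.5):
    """Return 1 (lncRNA) if the score reaches the minimum, otherwise 0.

    Assumption: the comparison is inclusive (>=); the tool itself may differ.
    """
    return 1 if score >= minimum_score else 0


# Made-up prediction scores for three transcripts.
scores = {"tx1": 0.91, "tx2": 0.42, "tx3": 0.66}

# Default minimum score of 0.5, as in the -s option.
default_calls = {tx: call_lncrna(s) for tx, s in scores.items()}

# A stricter minimum score, comparable to passing -s 0.8.
strict_calls = {tx: call_lncrna(s, minimum_score=0.8) for tx, s in scores.items()}
```

Raising the minimum score trades sensitivity for fewer false positives: here tx3 is called a lncRNA at 0.5 but not at 0.8.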