gbgolding / crema

Classifying RNAs as lncRNAs by ensemble machine learning algorithms

Transcript features annotation using CPAT and DIAMOND #8

Open sebel76 opened 5 years ago

sebel76 commented 5 years ago

Hi,

First of all, thank you for the nice program you designed and wrote!

Observation: It seems that the transcript feature annotation step uses CPAT to identify ORFs and DIAMOND to annotate the transcripts.

Questions:

  1. Can we retrieve (or extract) the coordinates of the identified ORFs, for example as a GTF/GFF annotation file?
  2. Can we retrieve the reference protein for each of the annotated ORFs?
    • I can change the DIAMOND -f option (-f 6 qseqid pident length qframe qstart qend sstart send evalue bitscore) to get the homologous reference protein.
    • Would this change interfere with the next analysis step (predict.py)?

Question/Suggestion: Would it be possible to add the full annotation of the submitted transcripts?

Thank you for your support!

sebel76 commented 5 years ago

Clarification:

My goal is to annotate all unknown transcripts from my RNA-Seq experiments. I want to know how I can use the data generated by your program to achieve this.

Thanks, Caitlin

caitsimop commented 5 years ago

Hi @sebel76,

Feel free to submit a pull request if you'd like to implement this :) It seems like a useful addition, but it wasn't something I had originally intended to implement.

If you'd like to annotate your unknown transcripts, I would suggest filtering the final_ensemble_predictions.csv file to pull out the transcripts that you are interested in.
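A minimal sketch of that filtering step, in case it helps. The layout of final_ensemble_predictions.csv is not described in this thread, so the column names and label value are passed in as arguments and are assumptions to check against the real header. `select_ids` and `subset_fasta` are hypothetical helper names, not part of CREMA.

```python
import csv


def select_ids(csv_lines, id_column, label_column, wanted_label):
    """Return the transcript IDs whose label column equals wanted_label.

    csv_lines is any iterable of text lines (an open file works).
    id_column and label_column are assumptions: check the real header
    of final_ensemble_predictions.csv before using them.
    """
    reader = csv.DictReader(csv_lines)
    return [row[id_column] for row in reader if row[label_column] == wanted_label]


def subset_fasta(fasta_lines, keep_ids):
    """Yield the FASTA lines of records whose ID is in keep_ids.

    The record ID is taken to be the first whitespace-delimited word
    after '>' in the header line.
    """
    keep = set(keep_ids)
    writing = False
    for line in fasta_lines:
        if line.startswith(">"):
            header_words = line[1:].split()
            writing = bool(header_words) and header_words[0] in keep
        if writing:
            yield line
```

With real files, both functions can be given open file handles, and the lines yielded by `subset_fasta` can be written straight to a new, smaller FASTA file.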

I'm afraid that CREMA will not work correctly if the DIAMOND options are changed. However, it should be very quick to re-do the DIAMOND analysis with a substantially smaller list of transcripts. This way you can change the DIAMOND options to fit the needs of your project.
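A rough sketch of such a separate re-run, kept outside the CREMA pipeline. It assumes a blastx search (the qframe field in the option quoted above suggests one) and adds sseqid to that field list so the reference protein is reported. The file names in the usage note and the helper names `diamond_command` and `hit_to_gff3` are hypothetical. The coordinates written are those of the aligned region on the transcript, which is not necessarily the ORF that CPAT reports.

```python
# Output fields: the list quoted earlier in the thread, plus sseqid.
FIELDS = ["qseqid", "sseqid", "pident", "length", "qframe",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def diamond_command(query_fasta, database, out_tsv):
    """Build the argument list for a stand-alone DIAMOND blastx run."""
    return ["diamond", "blastx",
            "--db", database,
            "--query", query_fasta,
            "--out", out_tsv,
            "--outfmt", "6", *FIELDS]


def hit_to_gff3(line):
    """Convert one tab-separated DIAMOND hit (in FIELDS order) to a GFF3 line."""
    hit = dict(zip(FIELDS, line.rstrip("\n").split("\t")))
    qstart, qend = int(hit["qstart"]), int(hit["qend"])
    # GFF3 requires start <= end; the strand is taken from the frame sign.
    start, end = min(qstart, qend), max(qstart, qend)
    strand = "-" if int(hit["qframe"]) < 0 else "+"
    attributes = ";".join([
        f"ID={hit['qseqid']}.hit",
        f"Target={hit['sseqid']} {hit['sstart']} {hit['send']}",
        f"pident={hit['pident']}",
        f"bitscore={hit['bitscore']}",
    ])
    return "\t".join([hit["qseqid"], "DIAMOND", "protein_match",
                      str(start), str(end), hit["evalue"],
                      strand, ".", attributes])
```

The command list can be passed to `subprocess.run(..., check=True)`, for example `diamond_command("subset.fa", "reference.dmnd", "hits.tsv")`, and each line of the resulting table can then be fed to `hit_to_gff3`. Subject IDs containing characters that are reserved in GFF3 would need escaping, which this sketch does not do.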

sebel76 commented 5 years ago

Thank you, Caitlin!

I will follow your recommendation to filter the transcripts and re-do the DIAMOND search. Again, thanks for your support!

Sébastien