Evolinc / Evolinc-I


Error in parsing transcripts #20

Closed KristinaGagalova closed 3 years ago

KristinaGagalova commented 3 years ago

Hi, I was able to run Evolinc with the test data, but now I am getting an error when using it on BRAKER genome annotations. This is the error message:

Tue Mar 9 17:39:39 UTC 2021
No fasta index found for referencegenome.fa. Rebuilding, please wait..
Fasta index rebuilt.
Generating Number of transcripts
##################################
grep: transcripts.*.fa: No such file or directory
transcripts.*.fa 
##################################
cat: transcripts.*.filter.fa: No such file or directory
[INFO] read file 'transcripts.all.overlapping.filter.fa'
[INFO] Predicting coding potential, please wait ...
[INFO] Running Done!
[INFO] cost time: 0s
[ERROR] putative_intergenic.genes.fa is not a file
grep: putative_intergenic.genes_cpc2.txt: No such file or directory
Can't open putative_intergenic.genes.fa: No such file or directory.
Generating Number of coding and noncoding
##################################
grep: putative_intergenic.genes_cpc2.txt: No such file or directory
putative_intergenic_coding_transcripts
grep: putative_intergenic.genes_cpc2.txt: No such file or directory
putative_intergenic_noncoding_transcripts
overlapping_coding_transcripts 1
overlapping_coding_transcripts 0

Looks like it's not able to extract the transcript sequences and run TransDecoder correctly? This is the format of my GTF file:

CsWA_scaf115    AUGUSTUS        gene    1563351 1564313 .       -       .       jg29579
CsWA_scaf115    AUGUSTUS        transcript      1563351 1564313 .       -       .       transcript_id "jg29579.t1"; gene_id "jg29579"
CsWA_scaf115    AUGUSTUS        stop_codon      1563351 1563353 .       -       0       transcript_id "jg29579.t1"; gene_id "jg29579";
CsWA_scaf115    AUGUSTUS        CDS     1563351 1564313 0.88    -       0       transcript_id "jg29579.t1"; gene_id "jg29579";
CsWA_scaf115    AUGUSTUS        exon    1563351 1564313 .       -       .       transcript_id "jg29579.t1"; gene_id "jg29579";
CsWA_scaf115    AUGUSTUS        start_codon     1564311 1564313 .       -       0       transcript_id "jg29579.t1"; gene_id "jg29579";
CsWA_chr04      AUGUSTUS        gene    6431667 6433016 .       +       .       jg761
CsWA_chr04      AUGUSTUS        transcript      6431667 6433016 .       +       .       transcript_id "jg761.t1"; gene_id "jg761"
CsWA_chr04      AUGUSTUS        start_codon     6431667 6431669 .       +       0       transcript_id "jg761.t1"; gene_id "jg761";
CsWA_chr04      AUGUSTUS        CDS     6431667 6433016 0.94    +       0       transcript_id "jg761.t1"; gene_id "jg761";
CsWA_chr04      AUGUSTUS        exon    6431667 6433016 .       +       .       transcript_id "jg761.t1"; gene_id "jg761";
CsWA_chr04      AUGUSTUS        stop_codon      6433014 6433016 .       +       0       transcript_id "jg761.t1"; gene_id "jg761";
CsWA_scaf115    AUGUSTUS        gene    4180987 4181720 .       +       .       jg31437
CsWA_scaf115    AUGUSTUS        transcript      4180987 4181720 .       +       .       transcript_id "jg31437.t1"; gene_id "jg31437"
CsWA_scaf115    AUGUSTUS        start_codon     4180987 4180989 .       +       0       transcript_id "jg31437.t1"; gene_id "jg31437";
CsWA_scaf115    AUGUSTUS        CDS     4180987 4181063 0.59    +       0       transcript_id "jg31437.t1"; gene_id "jg31437";
CsWA_scaf115    AUGUSTUS        exon    4180987 4181063 .       +       .       transcript_id "jg31437.t1"; gene_id "jg31437";
CsWA_scaf115    AUGUSTUS        intron  4181064 4181137 .       +       .       transcript_id "jg31437.t1"; gene_id "jg31437";
CsWA_scaf115    AUGUSTUS        CDS     4181138 4181720 0.54    +       1       transcript_id "jg31437.t1"; gene_id "jg31437";
CsWA_scaf115    AUGUSTUS        exon    4181138 4181720 .       +       .       transcript_id "jg31437.t1"; gene_id "jg31437";
CsWA_scaf115    AUGUSTUS        stop_codon      4181718 4181720 .       +       0       transcript_id "jg31437.t1"; gene_id "jg31437";

Is there anything wrong with that? Thank you in advance

KristinaGagalova commented 3 years ago

I know what the issue is: my input files do not have the "u" flag. How exactly do you generate the GTF files to use in cuffcompare/cuffmerge?
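
For readers hitting the same error, a quick way to confirm this diagnosis is to tally the class codes present in the input GTF. The sketch below assumes the "u" flag refers to the `class_code "u"` attribute that cuffcompare writes into its combined GTF; the second example record and the function name `count_class_codes` are hypothetical and only serve as illustration.

```python
import re
from collections import Counter

# Matches a cuffcompare-style attribute such as: class_code "u";
# (assumption: this is the "u" flag discussed in the thread)
CLASS_CODE = re.compile(r'class_code "([^"]+)"')


def count_class_codes(gtf_lines):
    """Tally class_code values over GTF records; records without one count as 'missing'."""
    counts = Counter()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a 9-column GTF record
        match = CLASS_CODE.search(fields[8])
        counts[match.group(1) if match else "missing"] += 1
    return counts


example = [
    # Record in the style of the BRAKER/AUGUSTUS output above: no class_code attribute.
    'CsWA_chr04\tAUGUSTUS\texon\t6431667\t6433016\t.\t+\t.\t'
    'transcript_id "jg761.t1"; gene_id "jg761";',
    # Hypothetical cuffcompare-style record carrying class_code "u".
    'CsWA_chr04\tCufflinks\texon\t100\t900\t.\t+\t.\t'
    'gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; class_code "u";',
]
print(count_class_codes(example))
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly. A GTF where every record lands in the "missing" bucket, like the AUGUSTUS records above, would be consistent with the empty `transcripts.*.fa` files in the log.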

andrew-d-l-nelson commented 3 years ago

Hi Kristina, typically we encourage users to use cuffmerge to generate the final GTF that goes into Evolinc. If you have multiple GTFs, you can use the "merge_multiple_gtfs" shell script in our accessory scripts. Are you analyzing RNA-seq data, or are you running this from a genome annotation prediction pipeline?

KristinaGagalova commented 3 years ago

Hi Andrew, I am running the pipeline on my genome annotation (reference) using RNA-seq data that I included as evidence. This is the approach I worked out (let me know if there is something wrong with it):

  1. I run cuffcompare with my reference annotation (-r) and the list of RNA-seq samples that I have.
  2. That gives me the regions that are annotated as "untranslated".
  3. After that I run Evolinc as normal.

Just one question: I've noticed that you have a separate output for the "other" lincRNAs (AOT and SOT). Are those more difficult to predict, and is that why you've separated them? Any considerations on this dataset? Thank you in advance
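
The cuffcompare step in the list above can be sketched as a small helper that assembles the command line. This is a minimal sketch, not the Evolinc authors' own procedure: the file names and the helper `cuffcompare_command` are hypothetical, and it uses only the `-r` (reference annotation) and `-o` (output prefix) options of cuffcompare.

```python
import shlex


def cuffcompare_command(reference_gtf, sample_gtfs, out_prefix="cuffcmp"):
    """Build the argument list for comparing sample GTFs against a reference annotation."""
    if not sample_gtfs:
        raise ValueError("at least one sample GTF is required")
    # -r: reference annotation, -o: prefix for the output files
    return ["cuffcompare", "-r", reference_gtf, "-o", out_prefix, *sample_gtfs]


# Hypothetical file names, for illustration only.
cmd = cuffcompare_command("braker_annotation.gtf", ["sample1.gtf", "sample2.gtf"])
print(shlex.join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` on a machine where cuffcompare is installed. The resulting combined GTF (expected at `<out_prefix>.combined.gtf`) is the file whose class codes are worth checking before it goes into Evolinc.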

andrew-d-l-nelson commented 3 years ago

This looks good Kristina. Let me know if you still have issues after running this pipeline. FYI, the sister workflow, RMTA, can help resolve many of the common issues getting prerequisites in the right format.

Regarding the AOT/SOT transcripts, we pull them out for a few reasons: 1) lack of strandedness support in the RNA-seq data itself, 2) how often antisense transcripts perfectly overlap (in the antisense direction) with the exons of their overlapping protein-coding gene (which often indicates poor strand prediction by the assembly algorithm), and 3) our inability to distinguish sequence conservation of the lncRNA from that of its overlapping gene.

That doesn't mean that those "lncRNAs" are uninformative or not real. They may well be real; I just typically have less confidence in them (unless they are coming from IsoSeq or ONT sequencing).

KristinaGagalova commented 3 years ago

Thank you for the help and the feedback! I have successfully run Evolinc and included the annotation in my pipeline. I am a happy user of the software