Xinglab / espresso

Handling ‘NA’ Gene IDs and Annotation of GTF Files in ESPRESSO #42

Open yycc9897 opened 10 months ago

yycc9897 commented 10 months ago

Hello, thank you for developing ESPRESSO! As a newcomer to this field, I am seeking some guidance. I am working with nanopore sequencing data from a local pig breed. The reference genome annotation for this breed is not complete. I ran ESPRESSO and it reported a total of 29,494 transcripts, of which 4,924 have gene IDs labeled as ‘NA’. I am unsure how to proceed with these ‘NA’ gene ID transcripts. Could you provide some advice on this? Additionally, I am considering using software like StringTie2 or FLAIR to annotate the GTF file prior to running ESPRESSO. Would this be a beneficial step, or is it unnecessary? I greatly appreciate any advice or suggestions you can provide. Best wishes.

EricKutschera commented 10 months ago

For transcripts that are not in the GTF, ESPRESSO tries to assign a gene ID by looking for splice junctions in that transcript that also appear in some transcript from the GTF. If ESPRESSO doesn't find a shared splice junction, it uses NA as the gene ID. For transcripts without a gene ID, you can check whether their coordinates are near anything in the GTF.
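The coordinate check described above could be scripted roughly as follows. This is a minimal sketch and not part of ESPRESSO: it assumes both files are standard 9-column GTFs whose attribute column carries `gene_id "..."` and `transcript_id "..."` pairs, and the helper names (`parse_gtf_line`, `transcript_spans`, `nearby_genes`), the 10 kb default window, and the same-strand restriction are all illustrative choices.

```python
import re
from collections import defaultdict

# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_gtf_line(line):
    """Return a dict for one GTF feature line, or None for comments, blanks, or short lines."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return None
    return {
        "chrom": fields[0],
        "start": int(fields[3]),
        "end": int(fields[4]),
        "strand": fields[6],
        "attrs": dict(ATTR_RE.findall(fields[8])),
    }


def transcript_spans(lines, keep=lambda attrs: True):
    """Collapse all rows of a transcript into one (chrom, strand, start, end, gene_id) span."""
    spans = {}
    for line in lines:
        rec = parse_gtf_line(line)
        if rec is None or "transcript_id" not in rec["attrs"] or not keep(rec["attrs"]):
            continue
        tid = rec["attrs"]["transcript_id"]
        gene = rec["attrs"].get("gene_id", "NA")
        if tid in spans:
            chrom, strand, start, end, known_gene = spans[tid]
            spans[tid] = (chrom, strand, min(start, rec["start"]), max(end, rec["end"]), known_gene)
        else:
            spans[tid] = (rec["chrom"], rec["strand"], rec["start"], rec["end"], gene)
    return spans


def nearby_genes(na_spans, ref_spans, window=10_000):
    """For each NA transcript, list reference gene IDs within `window` bp on the same chromosome and strand."""
    by_locus = defaultdict(list)
    for chrom, strand, start, end, gene in ref_spans.values():
        by_locus[(chrom, strand)].append((start, end, gene))
    result = {}
    for tid, (chrom, strand, start, end, _) in na_spans.items():
        hits = {
            gene
            for r_start, r_end, gene in by_locus[(chrom, strand)]
            if r_start <= end + window and r_end >= start - window
        }
        result[tid] = sorted(hits)
    return result
```

To use it, read the lines of the ESPRESSO output GTF and of the reference GTF, build the NA spans with `transcript_spans(espresso_lines, keep=lambda a: a.get("gene_id") == "NA")` and the reference spans with `transcript_spans(reference_lines)`, then pass both to `nearby_genes`. Transcripts that come back with an empty list have nothing annotated within the window and may be candidates for novel loci, while those with a single nearby gene are worth inspecting by eye in a genome browser.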

Generating a GTF with another tool and then giving that GTF to ESPRESSO might help. ESPRESSO doesn't require a GTF file, but if one is provided, it treats the transcripts and splice junctions in that file as high confidence.