ababaian / LIONS

LIONS is a bioinformatic analysis pipeline which brings together a few pieces of software and some home-brewed scripts to annotate a paired-end RNAseq library to detect TE-initiated transcripts.
GNU General Public License v3.0

custom annotation library #11

Closed Alex-Nesta closed 5 years ago

Alex-Nesta commented 5 years ago

Is it possible to use a custom annotation library, e.g. GENCODE v29 instead of the UCSC RefSeq table?

Can I convert it using something like gtfToGenePred? http://hgdownload.soe.ucsc.edu/admin/exe/

looking forward to testing your tool on my own data!

ababaian commented 5 years ago

Hey @Alex-Nesta,

You can use any annotation library. The only hitch is that it needs to be formatted as a UCSC table output. If you're interested in using GENCODE, the easy solution is that it is already available through the UCSC Table Browser.

For GENCODE v29 on hg38, you'll want to download the genome table using the "all fields from selected table" output format.

Originally this format had these headers: #bin name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds score name2 cdsStartStat cdsEndStat exonFrames

but it has now been updated to: #name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds proteinID alignID

If you add an empty first column to the gencodev29.ucsc table, using echo empty_col | paste - gencodev29.ucsc > gencodev29.fix.ucsc, this should make the annotation compatible.
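
For illustration, here is a minimal sketch of that column fix on a toy table. The file names toy.ucsc and toy.fix.ucsc and the two transcript rows are made up for the example; only the echo | paste step comes from the instructions above. It relies on the standard behavior of paste, which pads with empty fields once one of its inputs runs out of lines.

```shell
# Toy stand-in for the downloaded table (hypothetical rows, tab-separated).
printf '#name\tchrom\tstrand\ttxStart\ttxEnd\n' >  toy.ucsc
printf 'ENST0001.1\tchr1\t+\t1000\t5000\n'      >> toy.ucsc
printf 'ENST0002.1\tchr2\t-\t2000\t8000\n'      >> toy.ucsc

# echo supplies a single line, so only the header row gets the 'empty_col'
# label; paste pads every remaining row with an empty first field.
echo empty_col | paste - toy.ucsc > toy.fix.ucsc

# List the distinct field counts: every row should have gained one column.
awk -F'\t' '{ print NF }' toy.fix.ucsc | sort -u
```

If the field count printed at the end is a single number, one more than in the input, every row has been shifted consistently and the old first column now sits in second position.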

Alex-Nesta commented 5 years ago

ok, great, thanks for the detailed instructions.

I have two more questions:

1) Does the annotation library determine where TE-driven transcripts start? For example, if an RNA-seq read maps outside of the specified txStart and txEnd of the annotation, is it filtered out? Does the annotation strictly assign IDs to the transcripts, and provide no further function?

2) What columns exactly are required from the UCSC table file? If I want to build my own annotation library that is cell type specific, what information is needed?

ababaian commented 5 years ago
  1. No, a transcriptome assembly is made per library and TE-initiations are searched in addition to that assembly. You can feed each library/run of LIONS with a particular GTF file for an annotation if you have a custom one. This UCSC file is strictly used for finding intersections between the assembly and known protein-coding genes. The name field here will be used as the 'protein coding' intersection, which is why the example UCSC file contains only protein-coding genes.

  2. I'm not 100% certain what is strictly required, but I believe it would be name chrom strand txStart txEnd to define 'genic' regions. Try it out, and if that doesn't work I can do a dive into the code and give you an explicit definition : )
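
As a rough sanity check before trying a custom table, the sketch below tests whether a header line carries the five columns named above. This is an assumption-laden example: custom.ucsc and its contents are hypothetical, and the list of required columns is only the guess given in this reply, not something confirmed from the LIONS code.

```shell
# Hypothetical custom table; only the header line matters for this check.
printf '#bin\tname\tchrom\tstrand\ttxStart\ttxEnd\tcdsStart\tcdsEnd\n' >  custom.ucsc
printf '0\tMYGENE\tchr1\t+\t1000\t5000\t1200\t4800\n'                   >> custom.ucsc

# Collect any of the five presumed-required columns missing from the header.
missing=$(head -n 1 custom.ucsc | awk -F'\t' '
  BEGIN { n = split("name chrom strand txStart txEnd", want, " ") }
  {
    sub(/^#/, "", $1)                        # drop the leading "#" of the header
    for (i = 1; i <= NF; i++) have[$i] = 1   # record every column name present
    for (j = 1; j <= n; j++)
      if (!(want[j] in have)) printf "%s ", want[j]
  }')

if [ -z "$missing" ]; then
  echo "header OK"
else
  echo "missing columns: $missing"
fi
```

The check only looks at column names, not at their order or contents, so a table that passes may still need the first-column adjustment described earlier in the thread.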

Lynuxoo commented 1 year ago

Hello @ababaian ,

I also have a question regarding the annotation files. I'm a bit confused by your previous response. Does it mean that the annotation files in the annotation folder are not used for transcriptome assembly? However, I noticed that eastLion.sh contains a reference to Cufflinks using the annotation files from this folder. I would appreciate your clarification on this matter.

Thank you in advance for your response.