Closed Alex-Nesta closed 5 years ago
Hey @Alex-Nesta,
You can use any annotation library; the only hitch is that it needs to be formatted as UCSC table output. If you're interested in using GENCODE, the easy solution is that it is already available in the UCSC Table Browser.
For GENCODE v29 on hg38, you'll want to download the table using the "all fields from selected table" output format.
Originally this format had these headers:
#bin name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds score name2 cdsStartStat cdsEndStat exonFrames
but it has now been updated to:
#name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds proteinID alignID
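The updated format no longer has the leading bin column, so every field is shifted one position to the left. If you add an empty first column to the gencode.ucsc table using
echo empty_col | paste - gencodev29.ucsc > gencodev29.fix.ucsc
the name, chrom, strand, txStart and txEnd fields line up with the original layout again, which should make the annotation compatible.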
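To see what the paste one-liner does, here is a minimal sketch on a made-up two-line table. The file names toy.ucsc and toy.fix.ucsc and the transcript row are hypothetical stand-ins for gencodev29.ucsc; only the echo | paste step comes from the reply above.

```shell
# Toy stand-in for gencodev29.ucsc (tab-separated; the values are made up).
printf '#name\tchrom\tstrand\ttxStart\ttxEnd\n' > toy.ucsc
printf 'ENST0001.1\tchr1\t+\t100\t900\n' >> toy.ucsc

# stdin supplies a single line, so paste labels the header row with
# "empty_col" and pads every later row with an empty first field.
echo empty_col | paste - toy.ucsc > toy.fix.ucsc

# Count tab-separated fields per row: each row should have gone from 5 to 6.
awk -F'\t' '{ print NF }' toy.fix.ucsc | sort -u
```

If the last command prints a single number that is one more than the field count of the input, every row gained exactly one leading column.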
ok, great, thanks for the detailed instructions.
I have two more questions:
1) Does the annotation library determine where TE-driven transcripts start? For example, if an RNA-seq read maps outside the txStart and txEnd specified in the annotation, is it filtered out? Or does the annotation strictly assign IDs to the transcripts and serve no further function?
2) What columns exactly are required from the UCSC table file? If I want to build my own annotation library that is cell type specific, what information is needed?
- No, a transcriptome assembly is made per library, and TE-initiations are searched for in addition to that assembly. You can feed each library/run of LIONS a particular GTF file as its annotation if you have a custom one. This UCSC file is strictly used for finding intersections between the assembly and known protein coding genes. The name field here will be used as the 'protein coding' intersection, which is why the example UCSC file contains only protein coding genes.
- I'm not 100% certain what is strictly required, but I believe it would be
name chrom strand txStart txEnd
to define 'genic' regions. Try it out, and if that doesn't work I can do a dive into the code and give you an explicit definition : )
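Before trying a custom table, it may help to confirm that those five columns are present in its header. The sketch below is a hypothetical helper, not part of LIONS, and the five-column list is the maintainer's tentative guess above rather than a confirmed requirement. The file names toy_check.ucsc and toy_check.report are made up for illustration.

```shell
# Toy table with a leading bin column (tab-separated; the values are made up).
printf '#bin\tname\tchrom\tstrand\ttxStart\ttxEnd\tcdsStart\n' > toy_check.ucsc
printf '0\tENST0001.1\tchr1\t+\t100\t900\t150\n' >> toy_check.ucsc

# Read only the header line, strip its leading '#', and report the
# 1-based position of each tentatively required column, or MISSING.
awk -F'\t' '
NR == 1 {
    sub(/^#/, "", $1)
    for (i = 1; i <= NF; i++) pos[$i] = i
    n = split("name chrom strand txStart txEnd", want, " ")
    for (j = 1; j <= n; j++) {
        if (want[j] in pos) print want[j] "\t" pos[want[j]]
        else { print want[j] "\tMISSING"; missing = 1 }
    }
    exit (missing ? 1 : 0)
}' toy_check.ucsc > toy_check.report

cat toy_check.report
```

The awk command exits with a non-zero status if any of the five names is absent, so it can be used as a quick guard in front of a run.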
Hello @ababaian ,
I also have a question regarding the annotation files. I'm a bit confused by your previous response. Does it mean that the annotation files in the annotation folder are not used for transcriptome assembly? However, I noticed that eastLion.sh contains a reference to Cufflinks using the annotation files from this folder. I would appreciate your clarification on this matter.
Thank you in advance for your response.
Is it possible to use a custom annotation library, i.e. GENCODE v29 instead of the UCSC RefSeq table?
Can I convert using something like gtfToGenePred? http://hgdownload.soe.ucsc.edu/admin/exe/
Looking forward to testing your tool on my own data!