Sung-Huan / ANNOgesic

ANNOgesic - A Swiss army knife for the RNA-Seq based annotation of bacterial/archaeal genomes
http://annogesic.readthedocs.io/en/latest/index.html

Manually annotated GFF #5

Closed apredeus closed 5 years ago

apredeus commented 5 years ago

Hello Sung-Huan,

I hope you are not tired of all the questions. The ANNOgesic paper didn't seem to address this issue in much detail. I was wondering how exactly you select the "manually" validated TSSs for the training set.

Thank you!

apredeus commented 5 years ago

For example, why is this TSS "Primary, Antisense"? The second promoter on the + strand is a separate one, located 5 bp to the right of TSS 25.

[image: screenshot of the TSS in question, not preserved]

Thanks!

Sung-Huan commented 5 years ago

For the definition of the type of TSS, please check the paper - https://www.nature.com/articles/nature08756

Thank you again for your questions and comments.

apredeus commented 5 years ago

Thank you for these clarifications, this is very helpful!

apredeus commented 5 years ago

Hello again!

I have a few more questions. When I made the manually curated TSS GFF file, I first ran the TSS predictor with default parameters, took the sites in the first 300 kb, and filtered out the ones that didn't seem reliable.

However, by building the file this way, I introduce a bias: if a TSS was not detected with default parameters, it would not be used in training.

Do you think this is a reasonable approach? I have not noticed the default predictor missing too many transcript starts, but I guess there are a few that don't have a very good dRNA/RNA-seq ratio (and still seemed like legit TSSs). Should I just go ahead and add them manually?

On a separate note, I have a GFF file that only had CDS features in it. However, that didn't work with TSSpredator because it apparently needs "gene" features to be able to classify TSSs, which is quite useful. What sort of GFF format do you expect with ANNOgesic, i.e. which features are OK to have?

Thank you again for all the answers.

Sung-Huan commented 5 years ago

Indeed, some TSSs are missed when using the default parameters of TSSpredator. So I would suggest checking the whole 200 or 300 kb region. Actually, we did the same thing as you did, but we also added some missing TSSs to the GFF file.
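The curation steps discussed here (take the predicted TSSs in the first 300 kb, then add the missed ones by hand) could be sketched roughly as follows. This is only an illustration in plain Python, not part of ANNOgesic: the function names `subset_gff`, `tss_key` and `merge_tss` are made up for this example, and it assumes standard 9-column, tab-separated GFF3 rows with the start coordinate in column 4 and the strand in column 7.

```python
def subset_gff(lines, max_pos=300_000):
    """Keep header lines and features starting at or before max_pos."""
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            kept.append(line)  # keep GFF header/comment lines
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # skip malformed rows
        if int(cols[3]) <= max_pos:
            kept.append(line)
    return kept


def tss_key(line):
    """Identify a TSS by (seqid, start, strand)."""
    cols = line.split("\t")
    return (cols[0], int(cols[3]), cols[6])


def merge_tss(predicted, manual):
    """Combine predicted and hand-added TSS rows.

    Duplicates (same seqid, start and strand) are dropped, keeping the
    first occurrence; header lines are skipped; the result is sorted.
    """
    seen = {}
    for line in predicted + manual:
        if line.startswith("#"):
            continue
        seen.setdefault(tss_key(line), line)
    return [seen[key] for key in sorted(seen)]
```

Unreliable predictions would still have to be removed by visual inspection; the sketch only handles the bookkeeping of subsetting and merging.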

For the format of the GFF file, I would suggest that you at least put gene and CDS features in it. I think most annotation files contain these two features. Of course, many GFF files also contain other features such as tRNA, rRNA, ncRNA, etc. If you have other annotations, you can still include them; it should not cause problems. We always use the GFF files from NCBI. These files contain very diverse features, and they can still be used for ANNOgesic prediction.
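A quick way to see whether an annotation file has both feature types is to count the values in the third GFF column. The following is a minimal sketch in plain Python, not an ANNOgesic function: `feature_counts` and `missing_features` are hypothetical names, and the input is assumed to be standard tab-separated GFF3 with 9 columns.

```python
from collections import Counter


def feature_counts(lines):
    """Count feature types (column 3) in GFF3 lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip headers, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9:
            counts[cols[2]] += 1
    return counts


def missing_features(lines, required=("gene", "CDS")):
    """Return the required feature types that do not occur in the file."""
    counts = feature_counts(lines)
    return [feature for feature in required if counts[feature] == 0]
```

For a CDS-only file like the one described above, `missing_features` would return `["gene"]`, so the gene features would need to be added before running the TSS classification.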