Manually annotated GFF - GitHub issues

apredeus commented 5 years ago

Hello Sung-Huan,

Hopefully you are not tired of all the questions. The ANNOgesic paper didn't seem to address this issue in much detail. I was wondering how exactly you select the "manually" validated TSSs for the training set?

What sort of requirements do you impose on dRNAseq and RNAseq coverage?
How do you assign a TSS to be primary or secondary? For a few "secondary" TSSs from the training set provided with the tutorial, I could not identify a "primary" one.

Thank you!

apredeus commented 5 years ago

For example, why is this TSS "Primary, Antisense"? The second promoter on the + strand is a separate one, with a position 5 bp to the right from TSS 25.

Thanks!

Sung-Huan commented 5 years ago

"Hopefully you are not tired of all the questions."
Absolutely not!! :)

"What sort of requirements do you impose on dRNAseq and RNAseq coverage?"
I am sorry for the confusion. The manually validated set originally comes from this paper - https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1003495. They provide a TSS set, but it is a predicted one. So we manually checked and edited the TSSs based on their prediction. Because not all of the predicted TSSs are correct, some of them were removed: TSSs with low expression, TSSs where the TEX+ coverage is not higher than the TEX- coverage, and multiple TSSs predicted within 3 nt of each other. Therefore, the TSS types in the file are no longer correct.

"How do you assign a TSS to be primary or secondary? For a few "secondary" TSSs from the training set provided with the tutorial, I could not identify a "primary" one."
As I mentioned above, the "type" information of the manually detected TSSs is not correct. The "type" information is also not used for training. Thus, I will remove it later.

For the definition of the type of TSS, please check the paper - https://www.nature.com/articles/nature08756

Thank you again for your questions and comments.
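The three removal criteria mentioned above (low expression, TEX+ coverage not higher than TEX-, several TSSs within 3 nt) can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used for the training set, which was curated by hand. The `TssCandidate` class, the `min_cov` threshold of 10, and the choice to keep the candidate with the highest TEX+ coverage in each cluster are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class TssCandidate:
    """Hypothetical record for one predicted TSS (not an ANNOgesic type)."""
    pos: int          # 1-based genomic position
    strand: str       # "+" or "-"
    tex_plus: float   # coverage in the TEX+ (treated) library
    tex_minus: float  # coverage in the TEX- (untreated) library


def filter_candidates(cands, min_cov=10.0, cluster_dist=3):
    """Apply the three criteria; min_cov and cluster_dist are assumed values."""
    # 1) drop lowly expressed sites, 2) drop sites not enriched in TEX+
    kept = [c for c in cands
            if c.tex_plus >= min_cov and c.tex_plus > c.tex_minus]
    kept.sort(key=lambda c: (c.strand, c.pos))

    result = []
    for c in kept:
        prev = result[-1] if result else None
        same_cluster = (prev is not None
                        and prev.strand == c.strand
                        and c.pos - prev.pos <= cluster_dist)
        if same_cluster:
            # 3) assumption: keep only the strongest site of a cluster
            if c.tex_plus > prev.tex_plus:
                result[-1] = c
        else:
            result.append(c)
    return result
```

Candidates that survive such a filter would still need to be inspected in a genome browser before going into a training set.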

apredeus commented 5 years ago

Thank you for these clarifications, this is very helpful!

apredeus commented 5 years ago

Hello again!

I have a few more questions. When I made the manually curated TSS GFF file, I first ran the TSS predictor with default parameters, took the sites in the first 300 kb, and filtered out the ones that didn't seem reliable.

However, by taking the files generated like this, I have a bias - if the TSS was not detected with default parameters, it would not be used in training.

Do you think it's a reasonable approach? I have not noticed the default predictor missing too many transcript starts, but I guess there are a few that don't have a very good dRNAseq/RNAseq ratio (and still seemed like legitimate TSSs). Should I just go ahead and add them manually?
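For reference, the region restriction described above (keeping only the predicted sites in the first 300 kb before manual review) could look roughly like this. It is a minimal sketch that assumes a standard 9-column, tab-separated GFF; the function name `features_in_region` is made up for the example and is not part of ANNOgesic or TSSpredator.

```python
def features_in_region(gff_lines, end=300_000, start=1):
    """Return GFF feature lines lying entirely within [start, end]."""
    kept = []
    for line in gff_lines:
        # skip comments, directives and blank lines
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a regular feature line
        feat_start, feat_end = int(fields[3]), int(fields[4])
        if feat_start >= start and feat_end <= end:
            kept.append(line.rstrip("\n"))
    return kept
```

With a file on disk this would be called as `features_in_region(open("predicted_tss.gff"))`, where the file name is again just a placeholder.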

On a separate note, I have a GFF file that only had CDS features in it. However, that didn't work with TSSpredator because it apparently needed "gene" features to be able to classify TSSs, which is quite useful. What sort of GFF format do you expect with ANNOgesic, i.e. which features are OK to have?

Thank you again for all the answers.

Sung-Huan commented 5 years ago

Indeed, some TSSs are missed when using the default parameters of TSSpredator. So, I would suggest that you check the whole region of 200 or 300 kb. Actually, we did the same thing as you did. But we also added some missing TSSs to the gff file.

For the format of the gff file, I would suggest that you put at least gene and CDS features in it. I think most annotation files contain these two features. Of course, there are other features in many gff files, like tRNA, rRNA, ncRNA, etc. If you have other annotations, you can still put them in. It should not cause problems. We always use the gff files from NCBI. These gff files contain very diverse features. They can still be used for ANNOgesic prediction.
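A quick way to see which feature types a gff file contains, and whether gene and CDS are both present, is to count the third column. The snippet below is only a sketch under the assumption of a tab-separated GFF; `feature_counts` and `missing_features` are hypothetical helper names, not ANNOgesic functions.

```python
from collections import Counter


def feature_counts(gff_lines):
    """Count feature types (column 3) in a tab-separated GFF."""
    counts = Counter()
    for line in gff_lines:
        # skip comments, directives and blank lines
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


def missing_features(gff_lines, required=("gene", "CDS")):
    """List required feature types that do not occur in the file."""
    counts = feature_counts(gff_lines)
    return [feat for feat in required if counts[feat] == 0]
```

A CDS-only file like the one mentioned earlier in the thread would give `["gene"]` from `missing_features`.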

Sung-Huan / ANNOgesic

Manually annotated GFF #5