Gaius-Augustus / Augustus

Genome annotation with AUGUSTUS
http://bioinf.uni-greifswald.de/webaugustus/
279 stars 108 forks source link

Size of Flanking region genes during AUGUSTUS training #103

Open ConcettaDe4 opened 4 years ago

ConcettaDe4 commented 4 years ago

Hi! I am using AUGUSTUS to predict genes in a new species. I am following the current protocol of AUGUSTUS and I extracted the flanking regions of the genes of my training and test set. I was wondering how the flanking regions affect the training and the prediction accuracy. Is it better to give a short region, as 100 nt or a larger region as 1000 nt?

Thank you for your help!

Concetta

KatharinaHoff commented 4 years ago

The flanking region should not be too short. I would consider 100 nt very short. Also 1000 nt may be quite short, depending on the genome. You want to capture information about how the intergenic region looks like, but without accidentally including coding region in the flanking region.

If you know nothing about a genome, it's hard to estimate where coding sequences are located adjacent to your training genes. You can try with something like half the mRNA size (computeFlankingRegion.pl does that). In many cases, that works well.

Otherwise, you can also run GeneMark-ES/ET/EP... as it is done in BRAKER. GeneMark does not always get the gene structure right, but it gives you a pretty good idea about areas that carry genes. You can exclude these from your flanking region of genes that were generated in a different way.

Don't choose too large of a flanking region. Assume you have evidence that the flanking region is about 50000 nt long safely, you still don't want to have such a huge number. It will increase computational time significantly. I usually put an upper limit at 10000. Rule of thumb....

Katharina

ConcettaDe4 commented 4 years ago

Hi! Thank you for your help! I have one more question to understand the workflow to choose the right size of the flanking region. When you said "You can exclude these from your flanking region of genes that were generated in a different way." Do you mean to run GeneMark-ES/ET/EP to have an idea about regions that carry genes and then exclude from my training set genes that have in their flanking regions genes predicted by GeneMark?

Thank you,

Concetta

KatharinaHoff commented 4 years ago

Yes, that was the line of thought. Within BRAKER, we already do this: only "evidence anchored genes" from GeneMark are used for training AUGUSTUS, but the non-anchored genes are excluded from flanking regions.

ConcettaDe4 commented 4 years ago

Thank you! I was wondering if the value of sensitivity of AUGUSTUS decrease when the flanking regions include accidentally coding regions.

KatharinaHoff commented 4 years ago

Yes. That may cause problems with accuracy. Problems will be particularly severe for CRF training.

ConcettaDe4 commented 4 years ago

OK. Thank you!

I tested AUGUSTUS performing the training with the same genes training set but using different flanking regions size (100 bp, 1000 bp and 2,157 bp). I notice that when I increased the size of the flanking region the value of sensitivity and specificity decreased especially the specificity.

I was wondering why the sensitivity and specificity decreased with increasing flanking regions size and how the different size of the flanking regions will affect the training of the hmm model?

Thank you,

Concetta

KatharinaHoff commented 4 years ago

If you have a very short flanking region, the probability predicting a false positive (according to training set, not necessarily according to truth) is very low. If you increase the flanking region, that probability not only increases by chance, but if you actually have coding regions in the flanking region, also by the given sequence properties. That's why specificity can decrease with increasing flanking region size.

A model trained with a very short flanking region may not perform well a whole genome level, though! Because intergenic region will be underrepresented in your training set.

ConcettaDe4 commented 4 years ago

Thank you for your explanation, it was very useful!

fanhuan commented 6 months ago

Dear @KatharinaHoff,

Thank you very much for the explanation. I have a further question. So I have a gff from a closely related species (actually same species but different subspecies) that I want to use to train for my genome. Therefore I'd like to convert the gff file into gb file using gff2gbSmallDNA.pl. However I am also stuck with the idea of flanking DNA value.

I understand that "If you know nothing about a genome, it's hard to estimate where coding sequences are located adjacent to your training genes. You can try with something like half the mRNA size (computeFlankingRegion.pl does that). In many cases, that works well." But in my case, I actually know a lot of this genome and do not want to just use an arbitrary number for every gene.

Then you suggested that "Otherwise you can also run GeneMark-ES/ET/EP... as it is done in BRAKER. GeneMark does not always get the gene structure right, but it gives you a pretty good idea about areas that carry genes. You can exclude these from your flanking region of genes that were generated in a different way." In my case I already have a good gff suggesting where genes are. Is there another perl script that can generate the gb file other than gff2gbSmallDNA.pl that does not just add the same size of flanking region to each gene? I might have missed it from your package. Or, how do I ask augustus to 'exclude these from your flanking region', preferably by supplying my gff file to the program?

Also when I tried to use computeFlankingRegion.pl, it always gives me the warning "Warning: transcript_id was not detected in the last column. Will assume that the entire last column serves as transcript identifier!". I took a look at this script, it seems like it only looks at lines in the gff file where the third column is 'CDS'. However when I checked all the gff files I downloaded from NCBI, when the third column is 'CDS', the last column usually (in my case exclusively) only contains 'protein_id" instead of 'transcript_id'. Also if we are only looking at 'CDS' region, then I don't think it is calculating the average mRNA size. Please kindly instruct where I understood wrongly.

I hope I am explaining myself clear enough. Please kindly let me know if not.

Thank you very much.

Best, Huan

KatharinaHoff commented 6 months ago

Dear Huan,

we mostly continue to operate on gtf format, not on gff3 format. I am not surprised if our scripts to not work with NCBI file formats. We never tested on their formats (lack of manpower on our end).

If you have the gff file, you can compute the distance between all genes and take e.g. the mean. It's an operation on the transcript or mRNA lines of the gff3 table.

You find command lines on excluding CDS from neighboring genes from flanking region at https://math-inf.uni-greifswald.de/storages/uni-greifswald/fakultaet/mnf/mathinf/stanke/augustus_wrp.pdf ( the region is not maxed out if there is a neighboring gene, automatically, so you don't need to ake a special postprocessing step if you e.g. convert a full genome scale annotation to generate the genbank file for training augustus).

Best,

Katharina

On Wed, Feb 28, 2024 at 10:35 AM Huan Fan @.***> wrote:

Dear @KatharinaHoff https://github.com/KatharinaHoff,

Thank you very much for the explanation. I have a further question. So I have a gff from a closely related species (actually same species but different subspecies) that I want to use to train for my genome. Therefore I'd like to convert the gff file into gb file using gff2gbSmallDNA.pl. However I am also stuck with the idea of flanking DNA value.

I understand that "If you know nothing about a genome, it's hard to estimate where coding sequences are located adjacent to your training genes. You can try with something like half the mRNA size (computeFlankingRegion.pl does that). In many cases, that works well." But in my case, I actually know a lot of this genome and do not want to just use an arbitrary number for every gene.

Then you suggested that "Otherwise you can also run GeneMark-ES/ET/EP... as it is done in BRAKER. GeneMark does not always get the gene structure right, but it gives you a pretty good idea about areas that carry genes. You can exclude these from your flanking region of genes that were generated in a different way." In my case I already have a good gff suggesting where genes are. Is there another perl script that can generate the gb file other than gff2gbSmallDNA.pl that does not just add the same size of flanking region to each gene? I might have missed it from your package. Or, how do I ask augustus to 'exclude these from your flanking region', preferably by supplying my gff file to the program?

Also when I tried to use computeFlankingRegion.pl, it always gives me the warning "Warning: transcript_id was not detected in the last column. Will assume that the entire last column serves as transcript identifier!". I took a look at this script, it seems like it only looks at lines in the gff file where the third column is 'CDS'. However when I checked all the gff files I downloaded from NCBI, when the third column is 'CDS', the last column usually (in my case exclusively) only contains 'protein_id" instead of 'transcript_id'. Also if we are only looking at 'CDS' region, then I don't think it is calculating the average mRNA size. Please kindly instruct where I understood wrongly.

I hope I am explaining myself clear enough. Please kindly let me know if not.

Thank you very much.

Best, Huan

— Reply to this email directly, view it on GitHub https://github.com/Gaius-Augustus/Augustus/issues/103#issuecomment-1968580195, or unsubscribe https://github.com/notifications/unsubscribe-auth/AJMC6JHWHKG6FVSBHFYTG73YV324NAVCNFSM4JZQDUCKU5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TCOJWHA2TQMBRHE2Q . You are receiving this because you were mentioned.Message ID: @.***>

fanhuan commented 6 months ago

Hi @KatharinaHoff,

Thank you so much for the prompt reply! Yes I've been referring to the pdf you mentioned and it has been so helpful! I would say it is very rare to see such a detailed tutorial for a tool and it definitely made Augustus more approachable and accurate from the end-users' side.

Using gtf (instead of gff) is not a problem. I also have gtf. Thanks for the solution on using mRNA line only.

I was trying to follow Basic Protocol 7 for my purpose (train my genome with a well annotated genome from the same species). It needs gb format and directed me to follow Alternative Protocol 2 to generate the gb, where it seems like I still need to use gff2gbSmallDNA.pl, which still works on the DNA_flanking_size. I must have missed how to 'convert a full genome scale annotation to generate the genbank' without using gff2gbSmallDNA.pl from the pdf. Will you please be so kind and direct me to which part of the pdf I should be looking at?

I was also thinking about using tbl2asn to convert gtf to gb. But this will not give me the flanking regions I believe. I am really out of my depth now instead of re-inventing the wheel (write a script to add flanking regions without stepping into neighbouring genes and not exceeding 10,000 nt). Any help would be appreciate!

Best, Huan

lalalagartija commented 5 months ago

I think it is already the default behavior. The --overlap option ignores adjacent genes whhile by default, gff2gbSmallDNA.pl will take care of not including CDS as intergenic (I think...)

KatharinaHoff commented 5 months ago

This is correct. The overlap option turns off the filter. You do not want to use the overlap option.

On Wed, Mar 13, 2024 at 6:20 PM lalalagartija @.***> wrote:

I think it is already the default behavior. The --overlap option ignores adjacent genes whhile by default, gff2gbSmallDNA.pl will take care of not including CDS as intergenic (I think...)

— Reply to this email directly, view it on GitHub https://github.com/Gaius-Augustus/Augustus/issues/103#issuecomment-1995063159, or unsubscribe https://github.com/notifications/unsubscribe-auth/AJMC6JGKRCVGMKFY4E5BMXDYYCDFBAVCNFSM4JZQDUCKU5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TCOJZGUYDMMZRGU4Q . You are receiving this because you were mentioned.Message ID: @.***>

fanhuan commented 5 months ago

@lalalagartija oh that is great to know! Thank you for pointing that out. Then I will just use 10,000 since I am not too worried about computation load.