NBISweden / pipelines-nextflow

A set of workflows written in Nextflow for Genome Annotation.
GNU General Public License v3.0

AED filter not working? #50

Closed ViriatoII closed 10 months ago

ViriatoII commented 3 years ago

Hello,

I'm running the AbinitioTraining pipeline and noticed that there are still genes with high AEDs (including some AEDs of 1.0!) in this file: codingGeneFeatures.filter.longest_cds.complete.good_distance.gff

I think this file is produced after AED filtering. Is this a bug? And by the way, isn't your default AED threshold of 0.3 too low?

(I guess I'm not using the most recent version; it's from April last year.)

Kind regards, Ricardo

Juke34 commented 3 years ago

There was an error in how we set the filter in a previous version. You should update the pipeline to its latest version and let us know if the problem remains.

ViriatoII commented 3 years ago

Great, thank you. Could you please also answer my question about the 0.3 AED threshold? I expected something like 0.5; is that not conservative enough?

I'm reinstalling and will close the issue if it's solved.

Juke34 commented 3 years ago

From experience we found this result satisfying; it is better to have fewer models of better quality. As a general rule, you can create a good ab initio model from about 700 genes; you gain very little up to 1000, and then you more or less reach a plateau. So if you get fewer than 700 genes with the default value, you can be less stringent and use 0.5.
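The rule of thumb above could be sketched roughly as follows. This is only an illustration, not the pipeline's code: the function name `pick_aed_threshold`, its parameters, and the choice of `<=` for the comparison are all assumptions made for the example.

```python
def pick_aed_threshold(aed_values, default=0.3, relaxed=0.5, min_genes=700):
    """Return the default AED cutoff unless it keeps fewer than min_genes models.

    Hypothetical helper: the names and the inclusive comparison are
    assumptions, not taken from the pipeline.
    """
    # Count the gene models that would survive the stricter default cutoff.
    kept = sum(1 for aed in aed_values if aed <= default)
    # Fall back to the relaxed cutoff only when too few models remain.
    return default if kept >= min_genes else relaxed


if __name__ == "__main__":
    # 500 well-supported models plus 400 weaker ones: too few pass 0.3.
    scores = [0.1] * 500 + [0.4] * 400
    print(pick_aed_threshold(scores))
```

With 800 models at or below 0.3 the function keeps the default; with only 500 it suggests 0.5 instead.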

ViriatoII commented 3 years ago

Thank you for the kind answer. Unfortunately, after updating, the problem persists. I updated Nextflow with conda and replaced AbinitioTraining.nf and the other files in the AbinitioTraining/ folder with the newest ones.

ViriatoII commented 3 years ago

Hi again,

Sorry, I made a mistake. It's actually the eAEDs that reach those values. I was doing sed 's/.*AED//' and forgot that eAED would also match that expression.

That means your filter is working as intended. Does it make sense to keep proteins with eAED=1.00, though?
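A small sketch of why that expression is misleading, and one way to read the two scores separately. It assumes the GFF attribute column carries both values in the form `_AED=...;_eAED=...`, with `_eAED` after `_AED`; the helper name `extract_scores` and the example attribute string are made up for illustration.

```python
import re

# Assumed attribute layout (GFF column 9): "...;_AED=0.12;_eAED=1.00"
# The lookbehind rejects a preceding letter, so "AED=" inside "eAED=" is skipped.
AED_RE = re.compile(r"(?<![A-Za-z])AED=([0-9.]+)")
EAED_RE = re.compile(r"eAED=([0-9.]+)")


def extract_scores(attributes):
    """Return (AED, eAED) as floats, with None for a score that is absent."""
    aed = AED_RE.search(attributes)
    eaed = EAED_RE.search(attributes)
    return (
        float(aed.group(1)) if aed else None,
        float(eaed.group(1)) if eaed else None,
    )


# Hypothetical attribute string, for illustration only.
example = "ID=mrna1;Parent=gene1;_AED=0.12;_eAED=1.00"

# Same pattern as the sed call: the greedy ".*" runs to the LAST "AED",
# which here sits inside "_eAED", so the eAED value is what remains.
print(re.sub(r".*AED", "", example))   # prints "=1.00"
print(extract_scores(example))         # prints (0.12, 1.0)
```

Anchoring on the character before `AED` keeps the two scores apart even when both appear on the same line.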

Thank you for the attention, Ricardo

Juke34 commented 3 years ago

True, the eAED has a different meaning; we never checked how filtering on those values would affect the results. If you have any information showing that it would benefit the pipeline, you are welcome to let us know.

ViriatoII commented 3 years ago

I only have this information:

"eAED is an extended AED calculation that does some inference about the evidence (i.e. checks reading frame and not just overlap, and may infer support for an exon if by splice sites are confirmed etc.). If eAED is 1 that means that while there is evidence supporting the model, the evidence is more likely to be spurious, so it may be a false model."

https://groups.google.com/g/maker-devel/c/cZaztT1XPjE/m/nFzwyC9JCAAJ?pli=1

"eAED is a little more than that. It is still at a base pair level, but it uses exon inference from mRNAseq and in addition has a protein evidence overlap correction for each exon. Sometimes the overlapping protein reading frame doesn't perfectly match the entire model (exonerate for example can shift the reading frames by 1 in the middle to get a better alignment and BLAST can't). This means that just because you have physical overlap it may be meaningless overlap. eAED corrects for this. Also it can use inferred evidence from mRNAseq. For example if you have mRNAseq data that confirm both ends of an exon around the splice site, but not the middle of the exon, then there is probably enough correlated data for MAKER to confidently infer the middle of the exon. It then counts those base pairs as confirmed even though they don't physically overlap. This is common with mRNAseq data as it can taper off in the middle of long exons or won't align correctly around short exons. In both cases the middle gets left out. MAKER uses information from the reading frame and the ab initio predictors to infer if those regions.

Most of the time eAED and AED will be identical. eAED can sometimes be higher for certain mRNAseq alignments and lower for what are apparent spurious protein alignments."

https://groups.google.com/g/maker-devel/c/wtmNRtRa-ko/m/iC4KTuIitGEJ

I would say eAED=1.0 is not so great then; maybe requiring an eAED under 0.8 would be good.

Juke34 commented 3 years ago

OK, sounds interesting. However, the pipeline is intended to be general, and using eAED for filtering sounds useful only when mRNAseq data is used. So it might be implemented as an option in the pipeline.
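One way such an option might look, purely as a sketch: the eAED cutoff is off by default and only applied when the user supplies it. The function `keep_model` and its parameters are hypothetical and do not exist in the pipeline; the 0.8 value in the usage lines is just the figure suggested earlier in this thread.

```python
def keep_model(aed, eaed=None, max_aed=0.3, max_eaed=None):
    """Decide whether a gene model passes filtering.

    Hypothetical sketch: the AED cutoff always applies, while the eAED
    cutoff applies only when max_eaed is given and the model has an eAED.
    """
    if aed > max_aed:
        return False
    if max_eaed is not None and eaed is not None and eaed > max_eaed:
        return False
    return True


if __name__ == "__main__":
    # Default behaviour: eAED is ignored, so this model is kept.
    print(keep_model(0.1, eaed=1.0))
    # Opt-in eAED cutoff: the same model is now rejected.
    print(keep_model(0.1, eaed=1.0, max_eaed=0.8))
```

Leaving `max_eaed` unset preserves the current AED-only behaviour, which fits runs without mRNAseq evidence.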

mahesh-panchal commented 10 months ago

Closing as the original issue was solved. Please add a new issue if the enhancement is still desired.