NBISweden / pipelines-nextflow

A set of workflows written in Nextflow for Genome Annotation.
GNU General Public License v3.0

Augustus training parameter tuning #57

Closed: royfrancis closed this issue 9 months ago

royfrancis commented 3 years ago

Modify or add a wrapper around AbinitioTraining to execute parallel training jobs over a range of values for the parameters params.model_selection_value and params.locus_distance. Finally, traverse the results directories and create a summary table like the one below (a rough sketch of the sweep is shown after the table):

| locus_distance | model_selection_value | exon_sensitivity | exon_specificity | nucleotide_sensitivity | nucleotide_specificity | gene_sensitivity | gene_specificity | genes |
|---|---|---|---|---|---|---|---|---|
| 1000 | 0.01 | 0.412 | 0.556 | 0.855 | 0.988 | 0.44 | 0.458 | 479 |
| 1000 | 0.02 | 0.474 | 0.622 | 0.826 | 0.966 | 0.33 | 0.34 | 752 |
| 2000 | 0.01 | 0.503 | 0.591 | 0.846 | 0.983 | 0.49 | 0.5 | 474 |
| 2000 | 0.02 | 0.531 | 0.641 | 0.863 | 0.977 | 0.37 | 0.407 | 745 |

The table is to be sorted by genes (low to high).
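A minimal Nextflow DSL2 sketch of such a sweep, assuming hypothetical parameter names and grids (the actual call into the AbinitioTraining subworkflow is only hinted at via `view`; its real interface may differ):

```nextflow
// Sketch: one training job per (locus_distance, model_selection_value) pair.
nextflow.enable.dsl = 2

// Hypothetical parameter grids for the sweep
params.locus_distances        = [1000, 2000]
params.model_selection_values = [0.01, 0.02]

workflow {
    // Cartesian product of the two grids -> one combination per job
    combos = Channel.fromList(params.locus_distances)
                    .combine(Channel.fromList(params.model_selection_values))

    // Placeholder for invoking the training subworkflow once per combination
    combos.view { ld, msv ->
        "would run AbinitioTraining with locus_distance=${ld}, model_selection_value=${msv}"
    }
}
```

Each combination's results directory would then be parsed for the sensitivity/specificity values and gene counts to build the summary table.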

I am not sure about automatically selecting the best run yet. I would leave it to manual selection for now. (The general idea is to have the largest number of genes while maintaining high values for gene_sensitivity and gene_specificity. The exon-level metrics shouldn't be too low either. In the table below, row 43 would be one good choice.)

[attached image: training_summary_as]

[attached image: subalaris-metrics]

Juke34 commented 3 years ago

Excellent. For locus distance I would center the search around the mean intron length obtained after running agat_sp_add_introns.pl + agat_sp_manage_introns.pl. That way the value is automatically adapted to the species investigated (e.g. ~1,000 bp for fungi, ~10,000 bp for birds). For the model selection value I would suggest using bigger steps, like 0.05 from 0 to 0.5, and skipping the test if the number of selected genes is below 750.
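A small Groovy sketch of that grid; the multipliers around the mean intron length are assumptions for illustration, and the mean intron length itself would come from the AGAT output:

```groovy
// Hypothetical grid construction following the suggestion above
def mean_intron_length = 1000   // e.g. fungi-like; ~10000 for birds

// locus_distance values centred around the mean intron length
def ld_values  = [0.5, 1.0, 1.5, 2.0].collect { (it * mean_intron_length) as int }

// model_selection_value stepped by 0.05 from 0.05 to 0.5
def msv_values = (1..10).collect { it * 0.05 }

println "locus_distance grid:        ${ld_values}"
println "model_selection_value grid: ${msv_values}"

// A combination would be skipped (or dropped from the summary) if its
// training run selects fewer than 750 genes.
```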

Juke34 commented 3 years ago

By the way, when selecting the best model automatically (and manually too), you should not forget to put a higher weight on sensitivity than on specificity, because we know that the prediction made from evidence is most of the time incomplete. Ab initio prediction will annotate genes missing from the evidence-based annotation, e.g. from RNA-seq (lowly expressed genes, tissue-specific genes, or developmental-stage-specific genes). There is rarely RNA-seq data that covers all possibilities...
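One way to encode that preference when ranking runs is a weighted score; here is a Groovy sketch using two rows from the table above (the 2:1 weighting is an arbitrary assumption, not a value suggested in this thread):

```groovy
// Rank runs by a score that weights gene sensitivity higher than specificity
def score = { run -> 2 * run.gene_sensitivity + run.gene_specificity }

def runs = [
    [locus_distance: 1000, model_selection_value: 0.01, gene_sensitivity: 0.44, gene_specificity: 0.458],
    [locus_distance: 2000, model_selection_value: 0.01, gene_sensitivity: 0.49, gene_specificity: 0.5],
]

// Sort best-first and print each run with its score
runs.sort { -score(it) }.each { run ->
    println "ld=${run.locus_distance} msv=${run.model_selection_value} score=${score(run)}"
}
```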

royfrancis commented 3 years ago

Very good points, Jacques! At the moment I was only thinking of parallel runs over a pre-defined range of ld and msv values. That is a reasonable first step, and a table like the one shown above is already very helpful for manually picking good runs.

As a next step, this can of course be extended to automated parameter tuning. But I don't think that will be trivial. There are so many questions. Default starting values are one thing (mean intron length for LD sounds like a good idea), how to traverse the parameter grid space is another (how many steps, what kind of grid... regular, random, max_entropy, latin_hypercube, etc.), and, most importantly, what counts as a good run? What metric to use? So far we are relating ld/msv to specificity/sensitivity/genes, but perhaps it should extend further downstream. How does ld/msv affect the end-point annotations? That would be quite compute-intensive. Imagine the table above: 59 runs of not just training, but training plus ab initio prediction. And then you would have BUSCO metrics for every combination of ld/msv. Maybe it's overkill! But I am curious to see some curves. I don't even know what a good metric for evaluating annotations is. AED? Something else?