Gaius-Augustus / Augustus

Genome annotation with AUGUSTUS
http://bioinf.uni-greifswald.de/webaugustus/

Fix for individual liability setting for extrinsic evidence #385

Closed gmanthey closed 1 year ago

gmanthey commented 1 year ago

Hi,

I've found that the individual liability setting for hints does not seem to work as intended. As long as it is not set, all hints of the source are discarded, rendering all hints without this setting useless. This can also be seen in many of the examples (such as docs/tutorial/results/augustus.hints.gff), where hints are deleted without any reason being given for the deletion.

I think what I've written here should fix the issue, and I've tested it on my own data. Basically, as soon as a hint is found that is not ok and the individual liability setting is not set for its source, the source's liability is set to false. Then, after all hints have been processed, all hints belonging to a source that had an unreliable hint are deleted.

From what I can tell, this should have quite a large performance impact especially on BRAKER, as so far many hints might have incorrectly been ignored.

Please let me know if this doesn't produce the desired behavior. I'm also happy to adjust some things if necessary.

Cheers!

KatharinaHoff commented 1 year ago

Thank you for your effort!

I can confirm that it compiles and runs. I cannot confirm that it makes a difference in accuracy (yet); however, my test data set (Drosophila melanogaster with protein input from D. ananassae via miniprot) might not have been ideal for checking this.

Mario needs to further review the pull request.

gmanthey commented 1 year ago

Thanks for the quick reply! I've looked a bit into the BRAKER source files and checked which extrinsicCfgFiles it uses, and it looks like by default it does not set the individual_liability setting, which would imply that hints are never used in BRAKER.

I don't know how much hints will matter for Augustus performance, but according to your paper, it should make quite a difference. I can also imagine that it matters less when Augustus has been trained for the species compared to non-model species without specific training.

gmanthey commented 1 year ago

After looking through the code and the paper a bit more, I'm not too sure if what I've implemented is quite what was intended. At least it doesn't quite fit the vocabulary used most of the time. The way I understand the vocabulary around the hints, there is the individual hint (such as a single intron), denoted as one line in the hint GFF; the hint group (such as a whole gene), denoted with the grp attribute in the GFF; and the hint source (such as proteins from related species), denoted with the src attribute in the GFF.
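
To make the three levels concrete, here is a minimal Python sketch of how one hint line maps onto them. It is only an illustration, not code from AUGUSTUS: the `Hint` class and `parse_hint_line` function are hypothetical names, the example line is made up, and the sketch assumes a tab-separated, nine-column GFF line whose last column holds `key=value` pairs separated by semicolons.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hint:
    """One line of a hints GFF (hypothetical container, not an AUGUSTUS type)."""
    seqname: str
    feature: str            # e.g. "intron"
    start: int
    end: int
    strand: str
    group: Optional[str]    # value of the grp attribute, if present
    source: Optional[str]   # value of the src attribute, if present


def parse_hint_line(line: str) -> Hint:
    """Split a tab-separated GFF line and pull grp/src out of column 9."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    attrs = {}
    for field in cols[8].split(";"):
        field = field.strip()
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return Hint(
        seqname=cols[0],
        feature=cols[2],
        start=int(cols[3]),
        end=int(cols[4]),
        strand=cols[6],
        group=attrs.get("grp"),
        source=attrs.get("src"),
    )


# Made-up example line: a single intron hint (individual hint) that belongs
# to group "gene1" and comes from source "P".
example = "chr1\tprot2hints\tintron\t1200\t1500\t0\t+\t.\tgrp=gene1;src=P;pri=4"
hint = parse_hint_line(example)
```

Here `hint` is the individual hint, `hint.group` identifies the hint group, and `hint.source` identifies the hint source.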

What I've implemented now is that if one hint from a whole source is not "ok", all hints and hint groups from that source are disregarded. At first it looked like that was the intention in the original (buggy) implementation.

In most of the documentation, however, it sounds like the group should be disregarded if a hint is not "ok", rather than the whole source. This would also imply that individual hints that are not "ok" would be removed when the individual_liability setting is set.
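
The difference between the readings can be sketched in a few lines of Python. This is only an illustration of the three possible behaviours, not the AUGUSTUS implementation: `FlaggedHint`, its `ok` flag, and the three functions are hypothetical names, and the sketch assumes the "ok" check has already been made for every hint.

```python
from typing import List, NamedTuple


class FlaggedHint(NamedTuple):
    """Hypothetical record: a hint with its group, source and an 'ok' flag."""
    name: str
    group: str
    source: str
    ok: bool


def drop_by_source(hints: List[FlaggedHint]) -> List[FlaggedHint]:
    """Source-level reading: one bad hint discards every hint of its source."""
    bad_sources = {h.source for h in hints if not h.ok}
    return [h for h in hints if h.source not in bad_sources]


def drop_by_group(hints: List[FlaggedHint]) -> List[FlaggedHint]:
    """Group-level reading: one bad hint discards only its own group."""
    bad_groups = {(h.source, h.group) for h in hints if not h.ok}
    return [h for h in hints if (h.source, h.group) not in bad_groups]


def drop_individual(hints: List[FlaggedHint]) -> List[FlaggedHint]:
    """Individual reading: only the bad hints themselves are discarded."""
    return [h for h in hints if h.ok]


# Made-up data: source "P" has two groups, and one hint in g1 is not ok.
hints = [
    FlaggedHint("i1", "g1", "P", True),
    FlaggedHint("i2", "g1", "P", False),
    FlaggedHint("i3", "g2", "P", True),
    FlaggedHint("i4", "g3", "E", True),
]
```

With this made-up data, `drop_by_source` keeps only `i4`, `drop_by_group` keeps `i3` and `i4`, and `drop_individual` keeps `i1`, `i3` and `i4`.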

I have a good idea of what to implement for the second option, so if that would be the desired behaviour instead of what I've implemented now, please let me know :)

gmanthey commented 1 year ago

So sorry for bothering you with this. I seem to have completely overlooked the if statement in original line 541, where individual_liability is only checked if both strands aren't possible. It seems that in the example I had, the introns aren't valid splice patterns, so removing the group was correct behavior. To be fair, the indentation in that file is not helping readability (tab size of 8 leads to indentation of 4??), and the setting is applied somewhat ambiguously, as it's not used for short introns/CDS with negative lengths. From the description I would also expect individual hints to be deleted if they are unsatisfiable; I might have a look at implementing this if I have time and it is something that you would want. I will close this PR for now, however, as those issues with the setting are not as urgent as hints not being used at all (which is what I initially thought). Again, sorry for the false alarm.