matentzn opened this issue 2 years ago
I would suggest using some log transforms on values that could potentially get very high to keep them all in a reasonable range, then picking a multiplier that seems appropriate.
A technical implementation note: all of these constants should ultimately be stored in a data file that is separate from the code but easily accessible and viewable, ideally with an explanation of how each parameter is used in the same file (e.g. YAML).
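As a rough illustration of that note (the `weights.yaml` file name and structure here are hypothetical, not the actual dashboard layout):

```python
# Rough sketch (not the actual dashboard code): keep the scoring constants in
# a hypothetical weights.yaml, with an explanation next to each parameter.
#
# weights.yaml might look like:
#   no_base:
#     penalty: 10
#     description: Ontology does not provide a base file (0 or 1).
#   overall_error:
#     penalty: 40
#     description: Number of ERRORs (red boxes) on the dashboard.
import yaml  # pip install pyyaml

with open("weights.yaml") as f:
    weights = yaml.safe_load(f)

# Simple mapping from variable name to its maximum penalty.
max_penalty = {name: entry["penalty"] for name, entry in weights.items()}
```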
Yes, it would be added to:
https://github.com/OBOFoundry/obo-dash.github.io/blob/gh-pages/dashboard/dashboard-results.yml#L87
> I would suggest using some log transforms on values that could potentially get very high to keep them all in a reasonable range, then picking a multiplier that seems appropriate.
hmm, why would that be better than providing a ceiling? I want something that easily adds up to 100, and using log transforms, which is the first thing I tried, resulted in either no single ontology reaching 1.0 or some of them going over 1.0.
It's not necessarily better, but it allows you to better reflect the heterogeneity in the size of these errors across ontologies. In the end you could also use whatever scoring you want and then just divide by the maximum raw score across all ontologies to get something between zero and one (or min-max normalise if you want to assign the worst ontology a score of zero :p)
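To make this concrete, here is a rough sketch of log-transformed penalties with min-max normalisation; the weights, metric names, and ontology numbers are entirely made up for illustration and are not a proposed formula:

```python
import math

# Made-up per-metric weights, for illustration only.
WEIGHTS = {"report_errors": 3.0, "report_warning": 1.0, "report_info": 0.5}

def raw_penalty(metrics):
    # log1p maps a count of 0 to 0 and compresses very large counts,
    # so e.g. 2500 report errors does not dominate the whole score.
    return sum(w * math.log1p(metrics.get(name, 0)) for name, w in WEIGHTS.items())

def normalised_scores(all_metrics):
    raw = {ont: raw_penalty(m) for ont, m in all_metrics.items()}
    lo, hi = min(raw.values()), max(raw.values())
    # Min-max normalisation: the worst ontology gets 0, the best gets 1.
    return {ont: 1.0 if hi == lo else 1 - (p - lo) / (hi - lo)
            for ont, p in raw.items()}

example = {
    "nco":  {"report_errors": 2500, "report_warning": 200, "report_info": 100},
    "tidy": {"report_errors": 1, "report_warning": 5, "report_info": 0},
}
print(normalised_scores(example))  # {'nco': 0.0, 'tidy': 1.0} with only two ontologies
```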
My one concern, as always, is that some of the dashboard evaluations are imprecise/incomplete/inaccurate (the EWG is in the process of evaluating this). That is, an ontology could "pass" on some particular aspect according to the dashboard evaluation, but not actually adhere to the principle, and the opposite is also possible (non-green when the principle is fully adhered to).
This is not an argument against the scoring system as given, which doesn't concern itself with individual principles. But I'm wondering if the imprecision needs to be taken into account in some way; for example, by downplaying the impact of those principles that aren't fully and accurately assessed. Obviously the first step is to try to align the assessment with what should be assessed. But there are certainly some principles for which this will be impossible. Not every 'warn' or 'info' is of equal importance.
So the question I'm asking is: should imprecision in assessment (and possible imbalance in importance) be accounted for in some way, in the step leading to the input to the above scoring system?
What will be the governance for deciding that the scoring method is fair and/or for updating it over time? Will the ontologies be displayed by score, alphabetically, or in some other order? One concern I have is that this score is mostly about adherence to the evaluation metrics and might not value utility or usage enough. Will an ontology such as the HPO, for example, be poorly scored because we can't license it with a standard license, even though it is one of the most well-adopted OBO ontologies? One of the more common criticisms I hear about the OBO Foundry continues to be the fact that there are only a small number of approved ontologies (Foundry ones, listed at the top). This does the ontologies that have not undergone the manual review process a disservice, as well as the whole of OBO imho. We don't want the score to do the same thing. (Otherwise I like the idea of this score quite a lot.)
I agree that the order in which ontologies are displayed by default is important, and so is the information we provide to users for sorting and filtering. These scores are meant to help with that, but given the diversity of OBO projects it's hard to find a single approach that everyone sees as fair. We can (and will) provide multiple ways to sort, but multiple sorting options can also be confusing.
@matentzn has helpfully split this discussion into two issues: this Dashboard score, and Impact score (#65). @mellybelly's points about usage are a closer fit with the Impact score, and that seems like the harder question to me. This Dashboard score is closely tied to the OBO Principles as metrics for conformance, for which our governance (such as it is) is most clear. It's still difficult to find a fair weighting formula for the Dashboard score.
The formula proposed here using ceilings tends to favour smaller ontologies, I think. Maybe larger ontologies already have enough of an advantage?
Thank you all. @mellybelly good question about governance - this issue is about collecting opinions, which I will then distill into a single issue with all options to be voted on. We will use GitHub for the voting process. Much of our decision-making will be done this way moving forward: GitHub issues for transparent open discussions, then a GitHub issue comment calling for a vote (with an expiry date). We should probably capture that in our SOPs!
I suspect we need to think about the different types of users who would come to OBO Foundry and order the ontologies according to their needs. For me, and I suspect all of us here, the order doesn't matter because I already know which one I want to use and I can just CTRL-F to go straight there. Newer users will need more info. As far as the scores go, let's think about what we want to achieve with the scores. Are we trying to direct new users to the most used ontologies? Are we trying to reward people for maintaining their ontologies? Once we articulate our goals, we can use that as a way to ground our decision-making.
> The formula proposed here using ceilings tends to favour smaller ontologies, I think. Maybe larger ontologies already have enough of an advantage?
One thought that came to me to simplify the score was to count the kinds of ROBOT report errors rather than the number of ROBOT report errors. This removes the bias against big ontologies (which I do not think really have an inherent advantage due to their size) and focuses the score on the binary "conforms to this check: yes/no" logic we have for the other OBO principle checks.
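For concreteness, a minimal sketch (not existing dashboard code) of counting the kinds of violations rather than their number; it assumes ROBOT report's TSV output with its "Level" and "Rule Name" columns:

```python
import csv
from collections import defaultdict

def distinct_violation_kinds(report_tsv_path):
    """Count distinct rule names per level in a ROBOT report TSV."""
    kinds = defaultdict(set)
    with open(report_tsv_path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            kinds[row["Level"]].add(row["Rule Name"])
    # e.g. {"ERROR": 2, "WARN": 5, "INFO": 3}, no matter how many
    # individual terms trip each check.
    return {level: len(rules) for level, rules in kinds.items()}
```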
> As far as the scores go, let's think about what we want to achieve with the scores.
Great point, @diatomsRcool, to clarify this. This is my take on the dashboard score (NOT the impact score #65). For me, the dashboard score inherently conveys the degree of FAIRness: the more an ontology's content is standardised according to the common principles of the OBO Foundry, the easier it will be to combine it with other ontologies and to use standard tooling.
All that said, I think the score should be primarily a motivator to fix your dashboard situation. In fact, I will suggest, if nothing else, to sort the dashboard by dashboard score once this is all decided.
Ok, I have revised the complicated score to something much simpler:
For every principle, if it is in ERROR you get 0 points, if it is WARN you get 66% of the maximal points, and if it is INFO you get 90% of the maximal points (PASS is 100%).
This does not take into account the gravity of the ROBOT report (say, 1000 errors vs. 1 error in the ROBOT report will result in the same penalty), but it also avoids the disadvantage imposed on large ontologies.
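A minimal sketch of how that could be computed; the principle names and per-principle maxima below are placeholders, not agreed weights:

```python
# Fraction of the maximal points awarded for each dashboard status.
STATUS_FACTOR = {"PASS": 1.0, "INFO": 0.9, "WARN": 0.66, "ERROR": 0.0}

def simple_dashboard_score(status_by_principle, max_points_by_principle):
    total = sum(max_points_by_principle.values())
    earned = sum(max_points_by_principle[p] * STATUS_FACTOR[status]
                 for p, status in status_by_principle.items())
    return earned / total

# Example with three equally weighted (placeholder) principles:
status = {"open": "PASS", "versioning": "WARN", "naming-conventions": "INFO"}
points = {"open": 1, "versioning": 1, "naming-conventions": 1}
print(round(simple_dashboard_score(status, points), 3))  # (1 + 0.66 + 0.9) / 3 ≈ 0.853
```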
The OBO Dashboard score is a numeric value between 0 and 1 that indicates how well an ontology does on the OBO dashboard. It forms a central part of the OBO score (#26), which itself combines, as far as we are thinking right now, a notion of quality (OBO Dashboard score) and a notion of impact (Impact score). There are considerable concerns about the Impact score, so please let us not raise them here; this ticket is purely to define the OBO Dashboard score. This is my current proposal, please let me know how you feel about it:
Variables from which the Dashboard score is computed, along with their maximum penalty. Note that the maximum penalties sum to 100.

- no_base (ontology does not provide a base file, 0 or 1): 10
- overall_error (number of ERRORs on the dashboard, the red boxes): 40
- overall_warning (number of WARNs on the dashboard, the yellow boxes): 20
- overall_info (number of INFO messages on the dashboard, the blue boxes): 10
- report_errors (number of ROBOT report errors): 15
- report_warning (number of ROBOT report warnings): 4
- report_info (number of ROBOT report info messages): 1

Example:
The NCO (ontology of all things related to Nico) has the following metrics:
- no_base = 1 (NCO does not provide a base file!)
- overall_error = 4
- overall_warning = 6
- overall_info = 1
- report_errors = 2500
- report_warning = 200
- report_info = 100

penalty = 10(1) + 5(4) + 0.5(6) + 0.1(1) + 0.05(2500) + 0.01(200) + 0.005(100)
penalty = 10 + 20 + 3 + 0.1 + 125 + 2 + 0.5

All contributions are within their limits except for report_errors, so let's cap it at its maximum penalty of 15:

penalty = 10 + 20 + 3 + 0.1 + 15 + 2 + 0.5
penalty = 50.6
score = 1 - (50.6 / 100) = 0.494
So NCO gets a miserable 0.494 Dashboard score!
Note that I played with this formula quite a lot. The ceiling is necessary because for some values, like the report ones, there is no notion of 100% that I would comfortably apply, unless we use something fairly random like "divided by number of terms". Let me know what you think.
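For reference, a small sketch of the ceiling-based formula described above (not an actual dashboard implementation); the per-unit weights are those used in the NCO example and the ceilings are the maximum penalties listed above:

```python
# metric: (penalty per unit, ceiling / maximum penalty); the ceilings sum to 100.
WEIGHTS = {
    "no_base":         (10.0,  10),
    "overall_error":   (5.0,   40),
    "overall_warning": (0.5,   20),
    "overall_info":    (0.1,   10),
    "report_errors":   (0.05,  15),
    "report_warning":  (0.01,   4),
    "report_info":     (0.005,  1),
}

def ceiling_dashboard_score(metrics):
    # Each metric contributes its weighted count, capped at its ceiling.
    penalty = sum(min(per_unit * metrics.get(name, 0), ceiling)
                  for name, (per_unit, ceiling) in WEIGHTS.items())
    return 1 - penalty / 100

nco = {"no_base": 1, "overall_error": 4, "overall_warning": 6, "overall_info": 1,
       "report_errors": 2500, "report_warning": 200, "report_info": 100}
print(round(ceiling_dashboard_score(nco), 3))  # 0.494, matching the worked example
```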