NCEAS / metadig-engine

MetaDig Engine: multi-dialect metadata assessment engine

Determine best method to calculate and display aggregated quality statistics #205

Closed: gothub closed this issue 4 years ago

gothub commented 5 years ago

The current implementation for calculating and displaying aggregated metadata quality statistics groups quality scores by time period (month), based on dateUploaded, and calculates the average score of all quality reports in that period. The scores contributing to the average can come from all metadata documents updated in that period, or only from the 'un-obsoleted' documents in that period, depending on the Solr query used. These averages can then be displayed, for example, as a line graph.

Additionally, the median, 25th & 75th percentiles, min and max can be calculated for each time period to support displaying these data as box plots.
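
For reference, a minimal sketch of the two query variants, written in Python against the quality index's select endpoint. This sketch uses Solr's JSON Facet API rather than the Analytics component shown later in this thread, and the obsoletedBy filter field is an assumption about the quality index schema, not a confirmed field name:

import requests

SOLR = "https://docker-ucsb-4.dataone.org:8983/solr/quality/select"

def monthly_mean_scores(un_obsoleted_only=False):
    # Bucket scoreOverall by month of dateUploaded and average each bucket.
    params = {
        "q": 'suiteId:"FAIR.suite.1"',
        "rows": 0,
        "wt": "json",
        "json.facet": '''{
            byMonth: {
                type: range, field: dateUploaded,
                start: "2012-01-01T00:00:00Z", end: "2020-01-01T00:00:00Z",
                gap: "+1MONTH",
                facet: { mean_score: "avg(scoreOverall)" }
            }
        }''',
    }
    if un_obsoleted_only:
        # Hypothetical filter: keep only documents not superseded by a newer version.
        params["fq"] = "-obsoletedBy:*"
    return requests.get(SOLR, params=params).json()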

Do these methods accurately depict the quality of the corpus over time?

Comments from @mbjones via email regarding this question:

The way I interpret your current stat is that each month is the average of all EML docs updated that month, regardless of their obsoletion status. So a doc revised 10 times that month would count 10 times, and one revised once in a month counts once. Documents not revised that month don't get counted. So in a typical editing scenario, this number could vary a lot based on the quality of the highly edited doc. And this reflects changes that month, but not really the corpus overall. I would suggest two possibilities:

  • average/median of all docs updated in a month, excluding obsoleted docs (i.e., only count the current version in each series each month)
  • Let's also think a bit more deeply about what a time-series stat should reflect, as I mentioned before: that month's changes, or the cumulative, collection-wide state including changes up to that time. The latter would better reflect the collection status at that time and be far less variable. That would mean including all current docs in the collection at each month.
amoeba commented 5 years ago

I feel like we must have had some past discussions on this, but I'm not seeing them over on https://github.com/NCEAS/metadig.

Re: how to calculate:

I think counting all of the mid-month updates in a version chain doesn't make sense, especially now that the new MetacatUI editor makes it easier to save versions, and with arcticdata.io's moderation workflow, where a first version comes in without basic metadata and is quickly updated to be more complete.

What makes the most sense to me is to make the calculation behave as if it had actually been run at the end of the time bin. E.g., a historical calculation for September 2015 would include the same data as if we had a time machine, went back to September 30th at 23:59:59, and made the calculation then. And that calculation would consider only the object at the head of each version chain (I think this is a bit tricky, or at least requires some post-query data massaging).
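
Here is a rough sketch of that post-query massaging, assuming each quality document carries a seriesId and dateUploaded (illustrative field names, not the confirmed quality-index schema):

from datetime import datetime, timezone

def heads_at(docs, cutoff):
    """Return the latest revision per series with dateUploaded <= cutoff."""
    heads = {}
    for doc in docs:
        uploaded = datetime.fromisoformat(doc["dateUploaded"].replace("Z", "+00:00"))
        if uploaded > cutoff:
            continue  # this revision didn't exist yet at the cutoff
        series = doc["seriesId"]
        prev = heads.get(series)
        if prev is None or uploaded > prev[0]:
            heads[series] = (uploaded, doc)
    return [d for _, d in heads.values()]

# The corpus as it stood at the end of September 2015:
# snapshot = heads_at(all_docs, datetime(2015, 9, 30, 23, 59, 59, tzinfo=timezone.utc))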

Re: how to display:

This is tough. Maybe the most helpful thing would be to have the score data plotted for us, shown with means, medians, boxplots, etc. Averages are really deceiving when scores aren't bell-curve-shaped. Medians, box plots, and eventually just showing the raw data (i.e., beeswarm plots) are often better at showing what's going on.

Also, I think it may depend on how the quality suites are constructed. Do most records get a fair bit of points or are there lots of low or zero scores? Do records get scores evenly across the entire range from 0-100% or is it really patchy because scores for particular checks are correlated (e.g., a check for an ORCID being passed also means a check for a creator and/or contact is likely to pass).

Do you think we could see the plots a few ways and compare interpretation?

mbjones commented 5 years ago

@amoeba I'm with ya on how to calculate, and that's what I said in my last email to Peter. There are two metrics of interest each month: 1) the first reflects the cumulative state of the records at the end of each month, which as you indicate would be best represented by the version at the head of each extant version chain that month; 2) the second reflects the changes that occurred in just that month, and would incorporate only the most recent version of the documents that changed that month.

In terms of how to display, it might be fastest to prototype some graphs in R that pull data from the Solr queries and show the data in multiple ways. Once we have the data in R, we can create many graphs really quickly, so we can then make decisions on what to reify in D3 graphs.
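
As a starting point, something like the sketch below could put the mean line next to box plots for the same monthly bins (shown in Python/matplotlib here, but the same idea carries straight over to R). scores_by_month is a hypothetical dict mapping a month label to the list of scoreOverall values pulled from Solr:

import matplotlib.pyplot as plt

def compare_views(scores_by_month):
    # scores_by_month: {"2015-09": [71.2, 68.0, ...], ...}; lists assumed non-empty
    months = sorted(scores_by_month)
    data = [scores_by_month[m] for m in months]
    fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(10, 6))
    # Mean line: the current display, sensitive to outliers
    ax1.plot(range(1, len(months) + 1), [sum(v) / len(v) for v in data], marker="o")
    ax1.set_ylabel("mean scoreOverall")
    # Box plots: medians and quartiles show the spread the mean hides
    ax2.boxplot(data, labels=months)
    ax2.set_ylabel("scoreOverall")
    plt.setp(ax2.get_xticklabels(), rotation=90)
    plt.tight_layout()
    plt.show()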

gothub commented 5 years ago

I'm in full agreement that quantities 1) and 2) should be calculated. How to derive these quantities from a Solr analytics query isn't clear. One of the hurdles is that the Solr index doesn't have the ability to find the head of a version chain for a time period, as Solr can't compare one document to another. It may be possible to index a 'pseudo series id' that would help. @rushirajnenuji and I thought a bit about this a few weeks ago, but didn't find a solution. Any solution that requires post-processing after the query (e.g., processing facet results) would, I think, be way too slow. Here is a query that is close to the one used by MetacatUI to create the graph shown in InVision, if you are interested:

curl -v --data-urlencode analytics@FAIR-score-range-facet-query.json -X POST 'https://docker-ucsb-4.dataone.org:8983/solr/quality/select?q=suiteId:%22FAIR.suite.1%22&sort=dateUploaded%20asc&wt=json&rows=0'

FAIR-score-range-facet-query.json:

{
    "groupings" : {
        "scores" : {
            "expressions" : {
                "min_score"        : "min(scoreOverall)",
                "25-percentile"    : "percentile(25.0, scoreOverall)",
                "mean"             : "mean(scoreOverall)",
                "median"           : "median(scoreOverall)",
                "75-percentile"    : "percentile(75.0, scoreOverall)",
                "max_score"        : "max(scoreOverall)",
                "count"            : "count(scoreOverall)",
                "FinaableAve"      : "mean(scoreByType_Findable_f)",
                "AccessibleAve"    : "mean(scoreByType_Accessible_f)",
                "InteroperableAve" : "mean(scoreByType_Interoperable_f)",
                "ReuseableAve"     : "mean(scoreByType_Reusable_f)"
            },
            "facets" : {
                "scoresByDateRange" : {
                    "type" : "range",
                    "field" : "dateUploaded",
                    "start" : "2012-01-01T00:00:00.000Z",
                    "end" : "2020-01-01T00:00:00.000Z",
                    "gaps" : [
                        "+1MONTH"
                    ],
                    "hardend" : true,
                    "include" : [
                        "lower",
                        "upper"
                    ],
                    "others" : [
                        "none"
                    ]
                }
            }
        }
    }
}
gothub commented 5 years ago

The quality chart at https://search-dev.test.dataone.org/profile shows the running average across all data holdings for EML and ISO. I have not found an efficient way to have Solr Analytics calculate a running average, so I'm deriving it from the average score for each time period (month), the number of documents in that period, and the running total number of documents. This summarizes the corpus through time.

As you have mentioned, this underestimates the quality score, since the mean is sensitive to outliers and includes all revisions of a document that fall within a time period. The only solution to this problem that I see is to manually prune the quality index so that only the latest revision of a document is retained per time period, which is what y'all suggested, although via the calculation itself rather than by removing unwanted data.

Also, if you have any ideas on how to calculate accumulated values for any other stat, please let me know. I believe that any other stat would require a separate query for each time period, as values such as the median can't be accumulated the way the mean can.
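
To make the accumulation explicit, the running average amounts to carrying a running sum of mean * count alongside the running document total; a minimal sketch, mirroring the "current ave" and "sum of sums" columns in the sample below:

def running_average(periods):
    """periods: (date, mean, count) tuples in time order, one per month."""
    total_count = 0
    sum_of_sums = 0.0
    rows = []
    for date, mean, count in periods:
        total_count += count
        sum_of_sums += mean * count  # recover the month's total score from its mean
        current_ave = sum_of_sums / total_count if total_count else 0.0
        rows.append((date, mean, count, total_count, current_ave, sum_of_sums))
    return rows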

Here is a sample of the data returned as reference:

date  i  mean  count  total_count  current_ave  sum_of_sums
2012-01-01T00:00:00Z 0 71.22400820787463 448 448 71.22400820787463 31908.355677127834
2012-02-01T00:00:00Z 1 69.7113092830431 353 801 70.55736311366049 56516.447854042046
2012-03-01T00:00:00Z 2 67.79572350137374 544 1345 69.44038768683224 93397.32143878937
2012-04-01T00:00:00Z 3 71.93399270301435 1340 2685 70.68486840254323 189788.8716608286
2012-05-01T00:00:00Z 4 61.605597496032715 100 2785 70.35886226586422 195949.43141043186
2012-06-01T00:00:00Z 5 65.94583405785157 284 3069 69.9504881990426 214678.0482828617
2012-07-01T00:00:00Z 6 63.93587236051206 135 3204 69.69706337438541 223309.39105153084
2012-08-01T00:00:00Z 7 63.0581355528023 289 3493 69.14777905132857 241533.1922262907
2012-09-01T00:00:00Z 8 55.307986345260765 157 3650 68.55247837876621 250216.54608249664
2012-10-01T00:00:00Z 9 62.38511401940795 68 3718 68.43968096713728 254458.73383581638
2012-11-01T00:00:00Z 10 69.68388757650456 259 3977 68.52070925776492 272506.86071813107
2012-12-01T00:00:00Z 11 68.40451964229908 212 4189 68.51482904805407 287008.61888229847
2013-01-01T00:00:00Z 12 74.06856403695589 491 4680 69.09749654368457 323376.2838244438
2013-02-01T00:00:00Z 13 69.81545891401902 265 4945 69.13597177687743 341877.38043665886
2013-03-01T00:00:00Z 14 71.60937518574471 215 5160 69.23903025224689 357273.396101594
2013-04-01T00:00:00Z 15 72.56786800765744 384 5544 69.4695991011065 385139.4574165344
2013-05-01T00:00:00Z 16 73.71576308830404 319 5863 69.70062866138554 408654.7858417034
2013-06-01T00:00:00Z 17 71.01209929092325 278 6141 69.75999828115617 428396.1494445801
2013-07-01T00:00:00Z 18 68.01092402759569 231 6372 69.69659022205818 444106.6728949547
2013-08-01T00:00:00Z 19 65.29851676348378 136 6508 69.6046821104469 452987.2711747885
2013-09-01T00:00:00Z 20 71.94657893065947 415 6923 69.74506737411703 482845.10143101215
2013-10-01T00:00:00Z 21 75.35117375631481 707 7630 70.2645322774216 536118.3812767267
2013-11-01T00:00:00Z 22 56.011340695307595 363 7993 69.61722731754327 556450.4979491234
2013-12-01T00:00:00Z 23 65.78483858685584 471 8464 69.4039646648786 587435.1569235325
2014-01-01T00:00:00Z 24 65.07807651036222 213 8677 69.2977742561069 601296.7872202396
2014-02-01T00:00:00Z 25 59.28512658621814 148 8825 69.12985676543909 610070.9859549999
2014-03-01T00:00:00Z 26 66.25545567058654 166 8991 69.07678696433292 621069.3915963173
2014-04-01T00:00:00Z 27 69.01962909292666 94 9085 69.07619556753467 627557.2367310524
2014-05-01T00:00:00Z 28 70.27567192912102 60 9145 69.08406528669214 631773.7770467997
2014-06-01T00:00:00Z 29 64.20215602537886 201 9346 68.97907237405316 644678.4104079008
2014-07-01T00:00:00Z 30 59.14567720007013 81 9427 68.8945804880775 649469.2102611065
2014-08-01T00:00:00Z 31 58.78187952533601 63 9490 68.82744664606983 653172.4686712027
2014-09-01T00:00:00Z 32 60.7411953608195 75 9565 68.76404164383315 657728.0583232641
2014-10-01T00:00:00Z 33 62.70779197866266 110 9675 68.69518505849271 664625.915440917
2014-11-01T00:00:00Z 34 50.79220417691941 1891 11566 65.7681111481473 760673.9735394716
2014-12-01T00:00:00Z 35 54.937959769192865 34 11600 65.73636760100209 762541.8641716242
2015-01-01T00:00:00Z 36 54.237933716466344 310 11910 65.43708006916279 779355.6236237288
2015-02-01T00:00:00Z 37 76.50980782667473 23147 35057 72.74804305524627 2550328.1453877687
2015-03-01T00:00:00Z 38 73.42902213743288 584 35641 72.7592013219615 2593210.6943160295
2015-04-01T00:00:00Z 39 71.54253431117337 905 36546 72.72907261718495 2657956.6878676414
2015-05-01T00:00:00Z 40 67.98811806424538 6030 42576 72.05761555324692 3067925.039795041
2015-06-01T00:00:00Z 41 72.43940128114423 4060 46636 72.09085275316251 3362029.0089964867
2015-07-01T00:00:00Z 42 68.21016816700339 5454 52090 71.68453189056102 3734047.266179323
2015-08-01T00:00:00Z 43 75.98384206872721 5753 57843 72.11213646596323 4171182.309600711
2015-09-01T00:00:00Z 44 75.55922065295425 2601 60444 72.26046989807169 4367711.842519045
2015-10-01T00:00:00Z 45 69.89829821377685 194 60638 72.25291256922256 4381272.112372518
2015-11-01T00:00:00Z 46 63.83493549310335 404 61042 72.19719908115283 4407061.426311731
2015-12-01T00:00:00Z 47 73.89861470435089 462 61504 72.20997961604353 4441202.586305141
2016-01-01T00:00:00Z 48 63.57361947576078 313 61817 72.16625085657755 4461101.129201055
2016-02-01T00:00:00Z 49 61.968103314505406 1958 63775 71.85315053690093 4582434.675490856
2016-03-01T00:00:00Z 50 65.01158682734302 5188 68963 71.33846827938332 4919714.787951112
2016-04-01T00:00:00Z 51 51.699070325237905 4472 73435 70.14248015858345 5150913.030445576
2016-05-01T00:00:00Z 52 63.73085790441697 4150 77585 69.79952427336349 5415396.090748906
2016-06-01T00:00:00Z 53 72.01544312261493 732 78317 69.82023564634319 5468111.39511466
2016-07-01T00:00:00Z 54 62.31370594274873 2304 80621 69.6057128242859 5611682.173606753
2016-08-01T00:00:00Z 55 68.64902702560575 3175 83796 69.56946434690262 5829642.834413052
2016-09-01T00:00:00Z 56 64.39625198642412 396 84192 69.54513196265293 5855143.750199676
2016-10-01T00:00:00Z 57 68.04778620995894 641 84833 69.53381798545684 5898762.381160259
2016-11-01T00:00:00Z 58 66.65090660345719 651 85484 69.5118632885582 5942152.12135911
2016-12-01T00:00:00Z 59 63.95358727724329 2154 87638 69.37524987282106 6079908.148354292
2017-01-01T00:00:00Z 60 45.13043258459337 1240 88878 69.03699323521218 6135869.884759188
2017-02-01T00:00:00Z 61 68.79433032399031 449 89327 69.03577349597165 6166758.539074659
2017-03-01T00:00:00Z 62 62.55121668958036 5387 94714 68.66695465698237 6503721.943381429
2017-04-01T00:00:00Z 63 63.38804684265664 3800 98514 68.46333030212482 6744596.521383524
2017-05-01T00:00:00Z 64 61.92175865074609 4226 102740 68.19425611681504 7006277.873441577
2017-06-01T00:00:00Z 65 68.93183645747956 735 103475 68.19949527168711 7056942.773237824
2017-07-01T00:00:00Z 66 72.3269909786469 538 104013 68.22084445583087 7095854.6943843365
2017-08-01T00:00:00Z 67 73.76819189334633 263538 367551 72.19835192823048 26536576.44957304
2017-09-01T00:00:00Z 68 71.93193154694933 7936 375487 72.19272107510947 27107428.25832963
2017-10-01T00:00:00Z 69 70.6596018876235 14285 389772 72.13653282250735 28116800.67129433
2017-11-01T00:00:00Z 70 72.11861787527612 5859 395631 72.13626751550201 28539343.653425574
2017-12-01T00:00:00Z 71 74.11054880479976 28698 424329 72.26979108905051 30666168.183025718
2018-01-01T00:00:00Z 72 73.53472170694081 24801 449130 72.33964067436946 32489902.816079557
2018-02-01T00:00:00Z 73 72.85776487916279 32883 482013 72.37498718623993 34885684.69860107
2018-03-01T00:00:00Z 74 72.07776158924419 9640 491653 72.36915938745697 35580514.32032138
2018-04-01T00:00:00Z 75 74.19451257570762 85137 576790 72.6385903664755 41897212.5374794
2018-05-01T00:00:00Z 76 66.27328427252817 816 577606 72.62959792219226 41951291.53744578
2018-06-01T00:00:00Z 77 69.24037970920914 468 578074 72.62685406219566 41983696.03514969
2018-07-01T00:00:00Z 78 61.44924833168906 744 578818 72.61248661221397 42029414.27590847
2018-08-01T00:00:00Z 79 59.7685020170574 1527 580345 72.57869160325068 42120680.77848852
2018-09-01T00:00:00Z 80 64.27697700278536 724 581069 72.56834783793066 42167217.30983853
2018-10-01T00:00:00Z 81 63.55417472845596 917 581986 72.55414475273379 42225496.48806453
2018-11-01T00:00:00Z 82 69.26219816561098 918 582904 72.54896035364409 42289079.18598056
2018-12-01T00:00:00Z 83 73.09224300113013 281 583185 72.54922212722185 42309618.106263876
2019-01-01T00:00:00Z 84 69.17755623169133 1015 584200 72.543364131871 42379833.32583904
2019-02-01T00:00:00Z 85 62.15831431531414 970 585170 72.52614947916827 42440126.8907249
2019-03-01T00:00:00Z 86 50.96019907395897 12136 597306 72.0879747845936 43058579.86668646
2019-04-01T00:00:00Z 87 73.28154847997686 184 597490 72.08834235151514 43072063.67160678
2019-05-01T00:00:00Z 88 0 0 597490 72.08834235151514 43072063.67160678
gothub commented 4 years ago

The graphics are currently being generated with the R scripts; these scripts create the graphics as outlined above.