dataset metrics - Githubissues

MortenHofft commented 7 years ago

We show very few metrics for datasets at this point, and only for occurrence datasets.

There is probably a bunch of interesting stuff we could do. But what and for whom?

On the current site we show checklist stats. That could be a candidate. More usage stats could be another, along the lines of: how often the dataset was seen, from which countries, altmetrics, how many times the download link was followed, and other download stats (how often, with what filters, downloaded from which countries, cited how often, articles making use of the data, ...).

I imagine that many people will have many wonderful ideas of what this area could be. So bring your ideas forward if you have them please.

MortenHofft commented 7 years ago

@peterdesmet and @mdoering I imagine that the two of you have ideas galore for this area.

I suggest we insert metrics asynchronously, and if there are many calls transferring a lot of data, then create a proxy that returns only the minimum. I would think that makes for the better user experience. Avoid reflows as much as possible by keeping the size of the async metrics containers fixed. I suggest client-side templating, as they will likely have to re-render when the browser is resized.
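A minimal sketch of the "proxy that returns only the minimum" idea, assuming the upstream service returns a JSON object and the page needs only a few of its fields. The function name `trim_metrics` and every field name in the sample payload are hypothetical, chosen for illustration and not taken from the portal code.

```python
from typing import Any, Iterable, Mapping


def trim_metrics(payload: Mapping[str, Any], keep: Iterable[str]) -> dict[str, Any]:
    """Return only the requested fields, skipping any the upstream did not send."""
    return {key: payload[key] for key in keep if key in payload}


# Hypothetical upstream response; the field names are illustrative only.
full_response = {
    "occurrenceCount": 123456,
    "georeferencedCount": 100000,
    "countByCountry": {"DK": 60000, "BE": 63456},
    "debugInfo": "not needed by the page",
}

# The proxy would forward just this reduced object to the browser.
summary = trim_metrics(full_response, ["occurrenceCount", "georeferencedCount"])
print(summary)
```

Keeping the allow-list on the proxy side means the browser never pays for fields the template does not render.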

peterdesmet commented 7 years ago

3 quick ideas:

1. I would keep all metrics on a separate tab to avoid cluttering the main page of a dataset
2. I would show some key metrics on the dataset page though and choose wisely which ones. :-)
3. Maybe the metrics page should be mock-uped (is that a verb?) and developed in a hackathon?

kcopas commented 7 years ago

'mocked up' is, @peterdesmet!

MortenHofft commented 7 years ago

Just to be sure: ideas 1 and 2 are contradictory, right? Either way, the idea, and the way it currently works, is exactly that: a few selected metrics on the main page, additional ones on the metrics tab.

It is another discussion whether the summary metrics are the ones most people are after. I have simply assumed that they are. Those are currently: occurrence count and things related to completeness (georeferenced, dated, known taxon), plus images, because they entice exploration. And for checklists: taxa and species count (similar to what we do on the current site).
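The completeness figures listed above could be derived from records roughly as follows. This is a sketch under assumptions: the record keys (`decimalLatitude`, `decimalLongitude`, `eventDate`, `taxonKey`, `media`) are assumed field names, and the portal may well compute these numbers differently (for example from search facets).

```python
from typing import Any, Callable, Mapping, Sequence

Record = Mapping[str, Any]


def _present(record: Record, field: str) -> bool:
    """True when the field exists and is neither None nor an empty string."""
    value = record.get(field)
    return value is not None and value != ""


# Assumed field names; each check answers "does this record count as complete?"
CHECKS: dict[str, Callable[[Record], bool]] = {
    "georeferenced": lambda r: _present(r, "decimalLatitude") and _present(r, "decimalLongitude"),
    "dated": lambda r: _present(r, "eventDate"),
    "known_taxon": lambda r: _present(r, "taxonKey"),
    "with_images": lambda r: bool(r.get("media")),
}


def completeness(records: Sequence[Record]) -> dict[str, float]:
    """Occurrence count plus the percentage of records passing each check."""
    total = len(records)
    result: dict[str, float] = {"count": total}
    for name, check in CHECKS.items():
        hits = sum(1 for record in records if check(record))
        result[name] = round(100 * hits / total, 1) if total else 0.0
    return result


sample = [
    {"decimalLatitude": 55.7, "decimalLongitude": 12.6, "eventDate": "2016-05-01",
     "taxonKey": 1, "media": ["img.jpg"]},
    {"decimalLatitude": 0.0, "decimalLongitude": 10.0, "taxonKey": 2},
    {"eventDate": "2015-01-01", "taxonKey": 3},
    {"eventDate": ""},
]
print(completeness(sample))
```

Note that a latitude of `0.0` still counts as georeferenced, which is why the check tests for `None` and the empty string instead of truthiness.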

Mockups, hackathons etc could all be fine - but I would be happy just to get suggestions and ideas as they come to mind. I have a feeling that @kcopas, Tim Hirsch and others also have suggestions for this. Mock pictures, references etc would also be nice to have.

peterdesmet commented 7 years ago

Yeah, that's what I meant with 1 & 2: extensive metrics page + selected metrics on main page. Clicking on one from the main page would open the metrics page with more info on those metrics. Will think more about which metrics we could show, but two categories are:

Completeness / data quality metrics
Usage metrics

mdoering commented 7 years ago

Is there room for showing the major taxonomic distribution for each dataset? At least for the backbone I would like to see some exclusive stats like this: https://github.com/mdoering/backbone-stats/blob/master/README.md

Could also be the header of the taxonomy tab

peterdesmet commented 7 years ago

Or a taxonomic partition: http://datafable.com/gbif-dataset-metrics/ It is similar to a sunburst diagram (i.e. it shows percentage AND hierarchy), but with the advantage that all text is horizontal.
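To illustrate what "percentage AND hierarchy" means for such a partition, here is a small sketch that rolls leaf counts up the taxonomic path and expresses every node as a share of the dataset. The input shape (a mapping from taxon path to record count) is an assumption for illustration; it is not how the linked visualisation is implemented.

```python
from collections import defaultdict


def partition(counts: dict[tuple[str, ...], int]) -> dict[tuple[str, ...], float]:
    """Percentage of all records under each node of the taxonomic hierarchy."""
    total = sum(counts.values())
    if total == 0:
        return {}
    node_totals: dict[tuple[str, ...], int] = defaultdict(int)
    for path, n in counts.items():
        # Credit the count to the leaf and to every ancestor on its path.
        for depth in range(1, len(path) + 1):
            node_totals[path[:depth]] += n
    return {path: round(100 * n / total, 1) for path, n in node_totals.items()}


# Hypothetical record counts per (kingdom, phylum).
leaf_counts = {
    ("Animalia", "Chordata"): 60,
    ("Animalia", "Arthropoda"): 20,
    ("Plantae", "Tracheophyta"): 20,
}
shares = partition(leaf_counts)
for path, pct in sorted(shares.items()):
    print(" > ".join(path), f"{pct}%")
```

Each child's share is contained in its parent's share, which is what lets the diagram nest rectangles (or rings) by rank.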

kbraak commented 7 years ago

Below are some suggestions on how to improve the display of usage metrics and quality metrics on the dataset page (not what specific metrics are important to show):

The metrics tab on the new dataset page is not shown for checklists, e.g. https://demo.gbif.org/dataset/8d431c96-9e2f-4249-8b0a-d875e3273908 (screenshot below). Of course, checklists can have associated occurrences and contribute records to downloads from GBIF.org. Note this has been flagged in issue https://github.com/gbif/portal-feedback/issues/107
The metrics tab on the new dataset page only displays usage metrics so far. In my opinion, usage metrics and quality metrics should be kept separate. Note this idea has been captured in issue https://github.com/gbif/portal-feedback/issues/108. There are plenty of good ideas building up on what usage/alt metrics to show in issue https://github.com/gbif/portal16/issues/245
When a checklist has associated occurrences, it is easier to understand which quality metrics relate to which types of records when the metrics are separated by record type. The current portal's checklist metrics page does this nicely, e.g. http://www.gbif.org/dataset/8d431c96-9e2f-4249-8b0a-d875e3273908/stats (screenshot below). The new portal mixes quality metrics from different types of records on the dataset homepage, e.g. https://demo.gbif.org/dataset/8d431c96-9e2f-4249-8b0a-d875e3273908

I have already highlighted in issue https://github.com/gbif/portal16/issues/300 that interpretation issues are missing from the new dataset page, and why they are important to retain.

Screenshots:

[screenshot: 2017-03-02 at 13:48:22]

@dschigel

mdoering commented 7 years ago

In addition to the backbone overlap percentage shown in our current portal it would be nice to also mention the number of names used from a checklist as a primary source to build the backbone.

Our API returns for checklists some richer metrics than currently shown on the portal that we might want to surface: http://api.gbif.org/v1/dataset/8d431c96-9e2f-4249-8b0a-d875e3273908/metrics

The count by extensions is interesting for any kind of dataset.
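A sketch of how those richer metrics could be pulled and reduced to a "count by extension" list. The URL pattern is the one quoted above; the response field name `countExtRecordsByExtension` and the sample values are assumptions that should be checked against the actual API response before relying on them.

```python
import json
import uuid
from typing import Any, Mapping
from urllib.request import urlopen

API_ROOT = "http://api.gbif.org/v1"


def metrics_url(dataset_key: str) -> str:
    """Build the metrics URL; uuid.UUID raises ValueError on a malformed key."""
    return f"{API_ROOT}/dataset/{uuid.UUID(dataset_key)}/metrics"


def fetch_metrics(dataset_key: str, timeout: float = 10.0) -> dict[str, Any]:
    """Download and parse the metrics document (performs a network call)."""
    with urlopen(metrics_url(dataset_key), timeout=timeout) as response:
        return json.load(response)


def top_counts(metrics: Mapping[str, Any],
               field: str = "countExtRecordsByExtension",  # assumed field name
               limit: int = 5) -> list[tuple[str, int]]:
    """Largest entries of one count-by-something map, ties broken by name."""
    counts = metrics.get(field) or {}
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:limit]


# Offline example with a made-up response fragment.
sample_metrics = {
    "countExtRecordsByExtension": {
        "VERNACULAR_NAME": 120,
        "DISTRIBUTION": 300,
        "DESCRIPTION": 120,
    }
}
print(metrics_url("8d431c96-9e2f-4249-8b0a-d875e3273908"))
print(top_counts(sample_metrics, limit=2))
```

Passing the field name as a parameter lets the same helper rank any other count map in the response, should the names differ from what is assumed here.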

DMNHverts commented 7 years ago

As a data provider, we're interested in stats that would help support requests to administrators/grantors for additional resources (or continuation of existing support) for collections care and digitization of collections data. So we're interested in how many people are retrieving data from our datasets (# searches, # unique IP addresses; these will all be an index, of course, not a true count, as one individual may refine a search multiple times or interact via different IP addresses) and also how much data they are viewing or downloading. Include data accessed by API or other non-traditional means. We have multiple datasets on GBIF (different collections), and stats need to be by dataset as support is by collection.

Vertnet supplies nice reports http://tools-usagestats.vertnet-portal.appspot.com/reports/c21cd435-718a-4069-b503-776bf0e22b96/201610/. They also include the country of origin of the request and a list of the search terms used and the number of records returned for each. This could be helpful in focusing efforts to improve data quality (e.g. if Colombia is in demand, focus georeferencing efforts there). BISON breaks their numbers down by how the interaction occurs (screen, download, API).

It would be helpful to break stats down by month and also have an option to do year to date and full year. We're looking for summary stats: the report to the administrator might say "123 searches on GBIF resulted in 123,456 records viewed or downloaded for May 2017." Stats should remain available for previous months and years. Depending on the complexity of the statistics, a download mechanism might be helpful.

I also heard some interesting ideas about statistics related to publication of data (or use of data in published analyses); that would be fabulous. End users are supposed to acknowledge the data source (collection) when publishing and let the collection know about the publication, but this often doesn't happen. Published science is, of course, a great measure of the value of all the effort we are putting into digitization, but also probably one of the hardest to track.
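The per-dataset, per-month and year-to-date figures requested here could be produced from a usage log along these lines. This is only a sketch: the `UsageEvent` shape, the channel names and the sample numbers are hypothetical, and no such log format is described in this thread.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class UsageEvent:
    dataset: str   # dataset identifier, since support is per collection
    day: date      # when the interaction happened
    channel: str   # e.g. "screen", "download" or "api"
    records: int   # number of records viewed or downloaded


def monthly_summary(events):
    """Count events and sum records per (dataset, year, month)."""
    summary = defaultdict(lambda: {"events": 0, "records": 0})
    for event in events:
        bucket = summary[(event.dataset, event.day.year, event.day.month)]
        bucket["events"] += 1
        bucket["records"] += event.records
    return dict(summary)


def year_to_date(events, dataset, today):
    """Totals for one dataset from 1 January of today's year up to today."""
    selected = [e for e in events
                if e.dataset == dataset and e.day.year == today.year and e.day <= today]
    return {"events": len(selected), "records": sum(e.records for e in selected)}


log = [
    UsageEvent("birds", date(2017, 5, 2), "download", 100000),
    UsageEvent("birds", date(2017, 5, 20), "api", 23456),
    UsageEvent("birds", date(2017, 4, 11), "screen", 10),
    UsageEvent("fish", date(2017, 5, 3), "download", 5),
]
print(monthly_summary(log)[("birds", 2017, 5)])
print(year_to_date(log, "birds", date(2017, 5, 31)))
```

Because the monthly buckets are derived from the raw log, figures for previous months and years stay reproducible as long as the log is retained.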

jlegind commented 7 years ago

I want to echo what's already been mentioned about a high-level view of the taxonomy in a dataset. This can help users/publishers discover irregularities or interpretation issues with the data. This could perhaps be under the stats tab to avoid clutter on the dataset page.

The occurrence dataset top panel (under the header) contains data similar to what could be in a 'certificate of quality', in that it shows completeness. Now there could be more measures inserted into this, like basis-of-record interpreted %, country explicitly stated %, recorded_by %, or others to be determined.

Regarding the stats tab, this could be the place for publisher-facing statistics such as downloads over a time period, which could be user-defined, along with the number of records downloaded (over the same period) and the number of distinct users.

jlegind commented 7 years ago

Also, where possible, please add a link to the cached EML document.

acbentley commented 7 years ago

Usage statistics are of great importance to providers/collections, as these are used to show use of collections and advocate for collections funding and further use. The statistics that we get from Vertnet through GitHub are the "poster child" for what we would like to see: not only individual stats but cumulative stats by month and year for yearly reporting (number of searches, number of downloads, number of records downloaded, from what countries, unique search parameters, etc.). But even more important is that these stats be comparable across the aggregators, so that they can be summed for use in annual reports where a single figure is needed. At present it is virtually impossible to get a single figure, as the aggregators are all showing different metrics.

I also believe that measures of uniqueness for providers would be a great advocacy tool, allowing collections to show what, if anything, they are uniquely contributing to the aggregator on a taxonomic and/or geographic level. What do I have in my collection that no one else has that I can crow about in my annual report or outward-facing profile of my collection that would attract use and funding? It would also allow for highlighting of irregularities in taxonomy or geography or possible errors that could be fixed.

While I am at it, I think that aggregators could and should do a better job of porting data cleanup metrics back to the collections so that we can fix these in the original data source. They need to be provided through a clear and simple mechanism that allows me to see what has been changed, why it has been changed and what it has been changed to, so that I can bring up records in my database and fix the errors, or do so in some sort of batch method.

onco-p53 commented 7 years ago

As a collection curator, I support retaining the "INTERPRETATION ISSUES" metric, which helps find potential data entry errors and correct them.

mdoering commented 7 years ago

A metric for checklist datasets I have often been asked for is the number of species that are uniquely present in this dataset and are therefore added to the GBIF backbone because of it. It shows value.
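Conceptually, this "uniquely present" count is a set difference between one checklist and all the others. The sketch below assumes names have already been normalised to comparable strings; the dataset labels and names are made up, and the real backbone build attributes names to sources by its own rules, which this does not reproduce.

```python
def unique_contribution(names_by_dataset: dict[str, set[str]], dataset: str) -> set[str]:
    """Names found in `dataset` and in no other dataset of the mapping."""
    others = set().union(
        *(names for key, names in names_by_dataset.items() if key != dataset)
    )
    return names_by_dataset[dataset] - others


# Made-up checklists for illustration.
checklists = {
    "fossils": {"Tyrannosaurus rex", "Triceratops horridus", "Puma concolor"},
    "mammals": {"Puma concolor", "Lynx lynx"},
    "birds": {"Parus major"},
}
only_in_fossils = unique_contribution(checklists, "fossils")
print(len(only_in_fossils), sorted(only_in_fossils))
```

The same function applied to occurrence datasets would give the "what do I have that no one else has" figure raised earlier in the thread.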

E.g. this shows the names that the PBDB contributes to the GBIF Backbone: https://www.gbif.org/species/search?dataset_key=d7dddbf4-2cf0-4f39-9b2a-bb099caae36c&constituent_key=c33ce2f2-c3cc-43a5-a380-fe4526d63650&advanced=1

MortenHofft commented 6 years ago

from http://dev.gbif.org/issues/browse/POR-1539?jql=project%20%3D%20POR%20AND%20status%20%3D%20Open

For Dataset Metrics I would say very useful additional fields would be:

Number of species (or terminal taxa)
Date range
Countries of origin (and/or geographic range by lat/long)

I know this is information that may be included in the metadata fields, but an automated view, if easily generated, would add valuable context to the dataset overview.

From Jira: Add multimedia metrics to dataset page

From Jira: Dataset page stat overview does not count all kingdom 'unknown' records

onco-p53 commented 6 years ago

I see that www-old.gbif.org/ is redirecting to www.gbif.org/ now, so we can no longer get to the old stats. Is there an alternative way to access these?

MortenHofft commented 6 years ago

@onco-p53 No, there isn't. And that is a good argument for keeping www-old a little longer, until the stats are there. We will look into that in the new year. Thanks for reporting.

qgroom commented 6 years ago

We are looking for summary stats that we can use in our individual and institutional reports, but we need both total and annual figures. So simple numbers such as the number of downloads would be among the most useful. More complicated statistics that might be useful are ways to demonstrate our geographic reach. It might be hard to pin down the location of people downloading data, but data related to the location of the data they are downloading might be enough.

acbentley commented 6 years ago

I would second this call, with the added request for information regarding the countries that searches were initiated from. Essentially, what we would be looking for is monthly aggregations of statistics, much like the Vertnet model that has yet to come back online. As such, at present NONE of the aggregators are displaying aggregated usage statistics for collections data, which once again leaves our annual reports short of any data to show our reach and advocate for our collections.

Andy

kcopas commented 6 years ago

It's no standalone substitute for the statistics under discussion here, but publishers can view the number of DOI-based citations of their datasets—usually as part of a multi-dataset search result—on both publisher pages and dataset pages. They show totals, but using the date selectors, you can see annual counts, e.g. 26 of the 55 citations for KU's datasets are from 2017. There’s more about the whole literature-tracking/DOI-citation-and-linking system here, here and here, with more to follow soon.

Again, pointing this new feature out is not intended to stand in the way of restoring the old ones.

qgroom commented 6 years ago

I happened to discuss this with our director and an important point he raised is that the use metrics should be stable, because he wants KPIs he can report on every year.

gbif / portal16

dataset metrics #138
