BiologicalRecordsCentre / irecord-extras

A small Drupal module to provide iRecord-specific functionality

Recorder metrics #10

Open johnvanbreda opened 3 years ago

johnvanbreda commented 3 years ago

Initial request discussion:

This relates to: https://github.com/BiologicalRecordsCentre/ABLE/issues/275

and this paper: https://www.nature.com/articles/s41598-020-67658-3

The question is whether we would need the metric pre-calculated and cached or whether live ElasticSearch reports would be good enough.

Priority metrics from the paper are:

Activity ratio: “The proportion of days on which the volunteer was active in relation to the total days he/she remained linked to the project” [35]. The time a participant is linked to the project is taken as the number of days between the first and last observation, once days outside of the summer periods are excluded.

Active area size: This is the area of a 95% kernel density polygon fitted to a participant’s observations. This describes the spatial extent of the majority of a participant’s recording activity.

Proportion of taxa recorded: This is the number of unique taxa recorded by an individual as a proportion of the total number of taxa recorded by all participants.

Rarity recording: Taxa are ranked according to the number of records in the entire dataset from highest to lowest and scaled to 100. The most-rarely reported species has the value 100 and the most-commonly reported a value of 1: this is the species’ rarity value. The Rarity Recording metric is the median rarity value, across all records for the participant, minus the median rarity value across all observations in the dataset. Negative values of this metric show that the participant submits records of common taxa more frequently than expected, while positive values mean that the participant submits records of rarer taxa more frequently than expected.

And a response:

This will take a bit of thinking through. I think that pre-calculation will be necessary, as it will then allow you to integrate the values in other reports. Some elements of pre-calculation will be essential anyway – e.g. for rarity recording we’ll need to prepare the rarity values for species. We may also find that pre-calculation allows us to perform some calculations against Elasticsearch and some against PostgreSQL or using R. For example, the proportion of taxa recorded is best done in Elasticsearch using a request to count total recorded species concepts, then a request to count the same broken down by user. The active area size may require a combination of PostgreSQL and R.

I think the solution needs a custom module which combines PHP code, SQL queries and Elasticsearch, then calculates the value and stores it in a recorder_metrics data table, keyed by website_id/user_id.

Activity ratio – an Elasticsearch request can retrieve the number of recording days within the summer season, plus the min and the max recording day for each user. Then PHP code will need to calculate the number of days between the min and max excluding the out of season days and calculate the ratio to store for each user.
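As a rough illustration of the calculation step described above, here is a minimal Python sketch (the module itself would do this in PHP). It assumes the "summer" season is April to September inclusive, which is not stated in this thread, and that the ratio is expressed as a percentage. The function names are hypothetical.

```python
from datetime import date, timedelta
from typing import Iterable, Optional

# Assumption: the summer season runs April to September inclusive.
SEASON_MONTHS = frozenset({4, 5, 6, 7, 8, 9})


def in_season(day: date) -> bool:
    """True if the day falls inside the assumed recording season."""
    return day.month in SEASON_MONTHS


def activity_ratio(recording_days: Iterable[date]) -> Optional[float]:
    """Percentage of in-season days between the first and last
    in-season record on which the recorder was active."""
    active = {d for d in recording_days if in_season(d)}
    if not active:
        return None  # no in-season records, so the ratio is undefined
    first, last = min(active), max(active)
    # Count the days from first to last, skipping out-of-season days.
    linked_days = sum(
        1
        for offset in range((last - first).days + 1)
        if in_season(first + timedelta(days=offset))
    )
    return round(100 * len(active) / linked_days, 1)
```

In practice the set of active days and the min/max dates would come from the Elasticsearch request rather than a Python list, but the arithmetic is the same.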

Active area size – can the module just provide a space to store this information, with an R script run separately to calculate it?

Proportion of taxa recorded – an Elasticsearch request to calculate the total number of taxa recorded by the project, followed by a separate request to get the unique species count by user.
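The two requests could look something like the sketch below, written as Python dicts for readability. The field names (`metadata.survey.id`, `taxon.accepted_taxon_id`, `metadata.created_by_id`) are assumptions about the index mapping and would need checking against the real index; the helper names are hypothetical. Note that the Elasticsearch `cardinality` aggregation returns an approximate distinct count.

```python
# Assumed field names; verify against the actual index mapping.
SURVEY_FIELD = "metadata.survey.id"
TAXON_FIELD = "taxon.accepted_taxon_id"
USER_FIELD = "metadata.created_by_id"


def total_species_request(survey_id):
    """Request body counting distinct taxa recorded across the project."""
    return {
        "size": 0,
        "query": {"term": {SURVEY_FIELD: survey_id}},
        "aggs": {"species_count": {"cardinality": {"field": TAXON_FIELD}}},
    }


def species_by_user_request(survey_id, max_users=1000):
    """Request body counting distinct taxa per user within the project."""
    return {
        "size": 0,
        "query": {"term": {SURVEY_FIELD: survey_id}},
        "aggs": {
            "by_user": {
                "terms": {"field": USER_FIELD, "size": max_users},
                "aggs": {
                    "species_count": {"cardinality": {"field": TAXON_FIELD}}
                },
            }
        },
    }


def species_ratios(total_response, by_user_response):
    """Combine the two responses into a percentage per user id."""
    total = total_response["aggregations"]["species_count"]["value"]
    if not total:
        return {}
    buckets = by_user_response["aggregations"]["by_user"]["buckets"]
    return {
        bucket["key"]: round(100 * bucket["species_count"]["value"] / total, 1)
        for bucket in buckets
    }
```

A `terms` aggregation only returns the top `size` buckets, so a full run across all recorders would likely need paging (for example with a `composite` aggregation) rather than one large request.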

Rarity recording – an Elasticsearch request to list species in order of record count. PHP can then assign a score from 100 to 1 for each. This will need to be pushed into the Elasticsearch index. Then a request which calculates the median for the dataset. Finally a request which calculates the median rarity for the records belonging to each user.
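To make the scoring concrete, here is a small Python sketch of the rarity steps using in-memory data in place of Elasticsearch. The linear rescaling of rank onto 1 to 100 and the alphabetical tie-break are assumptions; the paper's exact scaling may differ. Function names are hypothetical.

```python
from statistics import median
from typing import Dict, Iterable


def rarity_values(record_counts: Dict[str, int]) -> Dict[str, float]:
    """Rank taxa from most to least recorded and rescale the rank
    linearly so the most common taxon scores 1 and the rarest 100."""
    # Ties are broken alphabetically (an arbitrary, assumed choice).
    ranked = sorted(record_counts, key=lambda t: (-record_counts[t], t))
    n = len(ranked)
    if n == 1:
        return {ranked[0]: 1.0}
    return {taxon: 1 + 99 * i / (n - 1) for i, taxon in enumerate(ranked)}


def rarity_metric(
    user_taxa: Iterable[str],
    all_taxa: Iterable[str],
    rarity: Dict[str, float],
) -> float:
    """Median rarity of the user's records minus the median rarity of
    all records. Each iterable holds one taxon name per record."""
    user_median = median(rarity[t] for t in user_taxa)
    dataset_median = median(rarity[t] for t in all_taxa)
    return user_median - dataset_median


# Toy dataset: 6 records of A, 3 of B, 1 of C.
counts = {"A": 6, "B": 3, "C": 1}
all_records = ["A"] * 6 + ["B"] * 3 + ["C"]
scores = rarity_values(counts)  # A -> 1.0, B -> 50.5, C -> 100.0
```

With this toy data, a recorder who only submits taxon A gets a metric of 0, while one who submits mostly B and C gets a large positive value.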

johnvanbreda commented 3 years ago

@DavidRoy do you have an example project (e.g. an iRecord activity, or an app) that you'd like to try this against? I suggest as a starting point I just write some example queries and Elasticsearch requests so we can "play" with the outputs before formalising it into a module.

DavidRoy commented 3 years ago

@johnvanbreda Survey 101 please as a test of this

DavidRoy commented 3 years ago

@johnvanbreda would also be good to enable this to work at the level of an individual's recording for a recording scheme (or taxon_group), e.g. John's metrics for recording butterflies, ladybirds, plants....

johnvanbreda commented 3 years ago

Thanks @DavidRoy. Is the intention to be able to show a complete dataset of all recorders for a project (like a league table), or is the intention just for each recorder to be able to view their own statistics?

DavidRoy commented 3 years ago

@johnvanbreda Good question. This is mostly about the individual's metrics but the logical extension is to see how an individual compares with everyone else. So both please. Could be tackled in stages, with individual metrics done first? Then work out how to process all recorders? Maybe needs some cache tables.

johnvanbreda commented 3 years ago

@DavidRoy I've managed to write code which generates metrics for a list of individuals on the fly from Elasticsearch. Calculating for individual users, or a short list of users, is very fast as long as we pre-calculate and cache the species list and associated record counts for the project (in this case survey 101). We may of course need to pre-calculate the results if comparing across all the users of the app, for example.

The only metric I've not tackled is active area size. We could try to use R to do these calculations but that will require an offline dataset for R to work against. If the calculation method can be easily described I could look at the possibility of calculating from within PostGIS but I suspect it will be slow.

Given the above, what would be a suitable output for this project? I.e. how would the user access this information and from where?

DavidRoy commented 3 years ago

Thanks @johnvanbreda. The initial use case is a richer summary report within the iRecord butterflies app, e.g. an extension of https://github.com/NERC-CEH/irecord-butterflies-app/issues/18

johnvanbreda commented 3 years ago

@DavidRoy @kazlauskis I've now added a new end-point to the iRecord Indicia API at /api/v1/advanced_reports/user-stats. The advanced_reports path is intended to group together reports that have custom processing included, i.e. not just an Elasticsearch or PostgreSQL request. The user-stats end-point specifically returns recorder metrics information designed to replace the existing butterfly app user metrics (https://github.com/NERC-CEH/irecord-butterflies-app/issues/18) as well as include the metrics described here.

The end-point should be accessible to the app in the same way that it uses the indicia_api module (with the same authentication). You can pass a survey_id or group_id GET parameter to limit the results - but note this is designed for surveys with a limited set of species rather than a general recording survey (due to the need to calculate rarity data across the entire dataset). E.g. https://www.brc.ac.uk/irecord/api/v1/advanced_reports/user-stats?survey_id=101 which gives the following response:

{
  "myTotalRecords":2529,
  "projectRecordsCount":510584,
  "projectSpeciesCount":111,
  "myProjectRecords":2475,
  "myProjectSpecies":52,
  "myProjectRecordsThisYear":3,
  "myProjectSpeciesThisYear":3,
  "myProjectSpeciesRatio":46.8,
  "myProjectActivityRatio":38.9,
  "myProjectRarityMetric":0
}

In this context, "project" means the filter you applied via the survey_id or group_id parameter. "myTotal" means the user's records within the entire set of reporting data for iRecord.
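For anyone consuming the end-point, here is a short Python sketch that builds the request URL and reads the sample response shown above. Authentication is left out because it depends on how the app already talks to indicia_api. The check that `myProjectSpeciesRatio` equals `myProjectSpecies / projectSpeciesCount` as a percentage is inferred from the sample numbers, not confirmed; the helper name is hypothetical.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://www.brc.ac.uk/irecord/api/v1/advanced_reports/user-stats"


def user_stats_url(survey_id=None, group_id=None):
    """Build the end-point URL with an optional survey or group filter."""
    params = {}
    if survey_id is not None:
        params["survey_id"] = survey_id
    if group_id is not None:
        params["group_id"] = group_id
    return f"{BASE_URL}?{urlencode(params)}" if params else BASE_URL


# The sample response from this thread, pasted in as a string.
SAMPLE_RESPONSE = """
{
  "myTotalRecords": 2529,
  "projectRecordsCount": 510584,
  "projectSpeciesCount": 111,
  "myProjectRecords": 2475,
  "myProjectSpecies": 52,
  "myProjectRecordsThisYear": 3,
  "myProjectSpeciesThisYear": 3,
  "myProjectSpeciesRatio": 46.8,
  "myProjectActivityRatio": 38.9,
  "myProjectRarityMetric": 0
}
"""

stats = json.loads(SAMPLE_RESPONSE)

# Inferred relationship: species ratio is the user's species count as a
# percentage of the project's species count (52 of 111 here).
species_ratio = round(
    100 * stats["myProjectSpecies"] / stats["projectSpeciesCount"], 1
)
```

The recomputed `species_ratio` matches the `myProjectSpeciesRatio` value in the sample, which suggests the ratio fields are percentages rather than fractions.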

@BirenRathod presumably this new code should be added to the Drupal 8 version of the module as well?

BirenRathod commented 3 years ago

@johnvanbreda Yes, we need it on Drupal 8/9 too.

johnvanbreda commented 3 years ago

Ok, @BirenRathod we'll need to do this when we get back onto the Drupal 9 migration task.