Change country score into open data indicators: open, restricted, closed, unknown

pzwsk commented 6 years ago

THIS IS A PROPOSAL UNDER DISCUSSION [TO BE AGREED ON]

What are the main questions this proposal is addressing?

What is the status of open data for resilience country by country?
Which country has the most open datasets? The most closed datasets?
How is my country ranking compared to neighbors in the Region?
What are the countries where we still need to do research on datasets?

For each dataset submitted, evaluate its open data status according to the following indicators:

Criteria	Open Data	Restricted	Closed
Does the data exist?	YES	YES	Y/N
Is the data publicly available?	YES	YES
Is the data available in digital form?	YES
Is the data available online?	YES
Is the metadata available online?	YES
Is the data available in bulk?	YES
Is the data machine-readable?	YES
Is the data available for free?	YES
Is the data openly licensed?	YES
Is the data provided on a timely and up to date basis?
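
As a rough illustration of how the table could be applied to a single submission, here is a minimal Python sketch. The criterion keys (`exists`, `public`, and so on) are hypothetical short names invented for this example, not identifiers from the project. It assumes that a missing answer counts as NO and that the timeliness question does not affect the classification, since the table gives no value for it.

```python
from typing import Mapping, Optional

# Criteria taken from the table, in order. The keys are hypothetical
# short names chosen for this sketch, not project identifiers.
OPEN_CRITERIA = (
    "exists", "public", "digital", "online", "metadata_online",
    "bulk", "machine_readable", "free", "open_license",
)


def classify(answers: Optional[Mapping[str, bool]]) -> str:
    """Return 'open', 'restricted', 'closed' or 'unknown' for one dataset."""
    if answers is None:
        # No submission at all for this dataset.
        return "unknown"
    if all(answers.get(name, False) for name in OPEN_CRITERIA):
        # Every criterion of the "Open Data" column is YES.
        return "open"
    if answers.get("exists", False) and answers.get("public", False):
        # Exists and is publicly available, but fails another criterion.
        return "restricted"
    # Not publicly available, whether or not the data exists.
    return "closed"
```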

Then for each country:

number of open data = number of datasets submitted being classified as open data
percentage of open data = number of open data /JOINT(number of datasets submitted, total number of datasets considered)
same for closed and restricted
number of unknown = number of datasets without any dataset submission
percentage of unknown = number of datasets without any dataset submission / total number of datasets considered
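
A minimal sketch of the per-country counting, assuming each submitted dataset has already been labelled `open`, `restricted` or `closed`. The function names are hypothetical. Because `JOINT(...)` is not defined in this proposal, the denominator is left as a parameter rather than guessed.

```python
from collections import Counter
from typing import Dict, Iterable


def country_counts(statuses: Iterable[str], without_submission: int) -> Dict[str, int]:
    """Count submitted datasets per status, plus datasets with no submission."""
    counts = Counter(statuses)
    return {
        "open": counts["open"],
        "restricted": counts["restricted"],
        "closed": counts["closed"],
        "unknown": without_submission,
    }


def percentage(count: int, denominator: int) -> float:
    """Share of `count` in `denominator`, as a percentage (0 if empty)."""
    return 100.0 * count / denominator if denominator else 0.0
```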

Then category and hazard filters apply

Default sorting works in the following order: number of open data THEN number of restricted THEN number of closed THEN number of unknown
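
The default ordering could be expressed as a multi-key sort. This sketch assumes each level is sorted in descending order, which the proposal does not state explicitly, and the row keys are hypothetical names.

```python
from typing import Dict, List, Union

Row = Dict[str, Union[str, int]]


def default_sort(rows: List[Row]) -> List[Row]:
    """Sort by open, then restricted, then closed, then unknown (all descending)."""
    return sorted(
        rows,
        key=lambda r: (-r["open"], -r["restricted"], -r["closed"], -r["unknown"]),
    )


# Made-up rows, for illustration only.
rows = [
    {"country": "A", "open": 2, "restricted": 5, "closed": 0, "unknown": 1},
    {"country": "B", "open": 2, "restricted": 7, "closed": 0, "unknown": 1},
    {"country": "C", "open": 3, "restricted": 0, "closed": 0, "unknown": 1},
]
ordered = [r["country"] for r in default_sort(rows)]
```

Ties on the number of open datasets fall through to the next key, so B is placed before A in this made-up sample.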

Option 1 with Split Bars

Option 2 with Stacked Bars

See #264 #271 and #270 for background

pzwsk commented 6 years ago

From @oncletom in #221 : what to think about a 95% score with 0% open data?

thom4parisot commented 5 years ago

Here is a sample of the API response for /api/country_scoring/ at the moment:

{
  "keydatasets_count": 36,
  "fullscores_count": 81,
  "datasets_count": 176,
  "countries_count": 19,
  "countries": [
    {
      "score": "32.4",
      "fullscores_count": 9,
      "datasets_count": 12,
      "country": "AU",
      "rank": 1
    },
    // ...
  ]
}

This is how I understand we provide the values for each column, per country:

Open Data: fullscores_count
Restricted: datasets_count
Closed: the information is not exposed yet and cannot yet be calculated on the client side based on the above sample
Unknown: datasets_count - keydatasets_count

Is that it?

pzwsk commented 5 years ago

Feedback from @vdeparday

Split bars might confuse users as they may think it is possible to have 100% for all bars

Then, stacked bars might be better

thom4parisot commented 5 years ago

👍 understood.

If there is a sense of progression, I'd use the colour contrast to convey this feeling. If it's about openness: more contrast = open, less contrast = not open.

vdeparday commented 5 years ago

Option 2 looks much better, I think. It is easier to understand and compare. I am just wondering about the visibility of the legend as you scroll down: will you keep it above? And maybe we can add tooltips on hover of the stacked bar.

pzwsk commented 5 years ago

I suggest putting some text below each indicator to explain it. @gracedoherty, can you review the language especially? Thanks

pzwsk commented 5 years ago

Hi @oncletom re your comment made on Nov 21, we need to discuss with @nastasi-oq

I am going to open a new issue as this may involve some important changes in BE and API.

We will continue to use this issue as the main umbrella issue.

gracedoherty commented 5 years ago

Open Data: free to access, use and share

Restricted: technical, legal or cost restrictions

Closed: access, use and sharing not permitted

Unknown: more information needed

pzwsk commented 5 years ago

Thanks, I am revising a bit based on the latest modifications to the definition of closed:

Open Data: free to access, use and share

Restricted: technical, legal or cost restrictions

Closed: access not permitted or does not exist

Unknown: missing information, submission needed

pzwsk commented 5 years ago

Feedback from @vdeparday

Split bars might confuse users as they may think it is possible to have 100% for all bars

Then, stacked bars might be better

This also needs to be decided in terms of ease of implementation. We may also have the Unknown column independent from the others.

pzwsk commented 5 years ago

Last proposal

Open Data, Restricted and Closed in one stacked progress bar:

Percentage (Open Data) = number of open data / 100
Percentage (Restricted) = number of restricted / 100
Percentage (Closed) = number of closed / 100

A more rigorous option would be to replace 100 by the maximum number of datasets for a country (number of datasets submitted + number of key datasets without submission).

Unknown in another progress bar:

Percentage (Unknown) = number of key datasets without any submission / total number of key datasets
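
A minimal sketch of the two progress bars, using the more rigorous denominator. It assumes the number of datasets submitted equals open + restricted + closed. The function and parameter names are hypothetical.

```python
from typing import Dict, Union

Bars = Dict[str, Union[float, Dict[str, float]]]


def progress_bars(n_open: int, n_restricted: int, n_closed: int,
                  n_unknown: int, n_key_datasets: int) -> Bars:
    """Percentages for the stacked bar and for the separate Unknown bar."""
    # Datasets submitted + key datasets without any submission.
    denominator = n_open + n_restricted + n_closed + n_unknown

    def pct(count: int, total: int) -> float:
        return 100.0 * count / total if total else 0.0

    return {
        "stacked": {
            "open": pct(n_open, denominator),
            "restricted": pct(n_restricted, denominator),
            "closed": pct(n_closed, denominator),
        },
        "unknown": pct(n_unknown, n_key_datasets),
    }
```

With this denominator the three stacked segments add up to 100% only when no key dataset is missing a submission. Otherwise the remainder of the bar corresponds to the unknown share.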

nastasi-oq commented 5 years ago

Not a blocking thought, just a consideration: if we stop taking ThinkHazard! into account, there will be a bias for countries with fewer perils compared with others (because there is no data for perils that are not relevant there, and so not needed).

pzwsk commented 5 years ago

Yes, true, but:

a low or very low level in TH! does not always mean zero possibility of a hazard in the country;
hazard and category filters should help to refine the analysis; we need to rethink the UX/UI here, for instance by better communicating the TH! level instead of removing some hazards;
it is very difficult to avoid some bias in comparisons at country level, but for the majority of the data it is OK.

nastasi-oq commented 5 years ago

@pzwsk, @oncletom, @CIMAManuel UPDATED WITH SORT

On the be_scoring-new branch we have a working experimental version of the new scoring system (without ranking). It is already installed on our experimental instance (currently without any kind of filters) at: https://exp.riskopendata.org/api/scoring_new/

Format Description

[
...
["AU", 17, 4, 0, 25],
["GN", 0, 0, 0, 36],
["JM", 0, 0, 0, 36],
...
]

Where the first element is the World Bank ID of the country and the four following columns are:

number of Open Data
number of Restricted
number of Closed
number of Unknown

The sum of all 4 values produces the denominator described by pzwsk

More rigorous option would to replace ...

NOTE: like the old version, this one also pre-computes its quantities when a dataset is saved, and the computation can be forced using the already working Scoring Update button.
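
As a sketch of how a client could consume the array format described above, each row can be unpacked into named fields and the denominator derived as the sum of the four counts. The field names and the `name_row` helper are illustrative assumptions, not part of the experimental endpoint.

```python
from typing import Dict, Sequence, Union

# Illustrative field names, in the column order described above.
FIELDS = (
    "datasets_open_count",
    "datasets_restricted_count",
    "datasets_closed_count",
    "datasets_unknown_count",
)


def name_row(row: Sequence[Union[str, int]]) -> Dict[str, Union[str, int]]:
    """Turn ["AU", 17, 4, 0, 25] into a dict with explicit keys."""
    country, *counts = row
    if len(counts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} counts, got {len(counts)}")
    record: Dict[str, Union[str, int]] = {"country": country}
    record.update(zip(FIELDS, counts))
    # Sum of the four counts: the denominator mentioned above.
    record["denominator"] = sum(counts)
    return record


record = name_row(["AU", 17, 4, 0, 25])
```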

thom4parisot commented 5 years ago

Thanks @nastasi-oq, it's great to have the value.

I was somewhat expecting to have these values as part of /api/country_scoring/ and /api/country_scoring/:country. I don't see the value in doing another API call for basic information related to a country.

A second point is related to the data format. Unnamed fields sound fragile and not very explicit. I prefer explicit names, also because they avoid having to write code to consume the data.

Would it be possible to have an output which looks like this for /api/country_scoring/…

{
    "keydatasets_count": 36,
    "fullscores_count": 81,
    "countries_count": 20,
    "datasets_count": 180,
    "countries": [{
            "rank": 1,
            "fullscores_count": 7,
            "score": 31.8,
            "datasets_open_count": 7,
            "datasets_count": 12,
            "datasets_restricted_count": 3,
            "datasets_closed_count": 2,
            "datasets_unknown_count": 22,
            "country": "AU"
        },
        {
            // ...
        }
    ]
}

… and like this for /api/country_scoring/:country?

{
    "keydatasets_count": 36,
    "fullscores_count": 7,
    "scores": [ ... ],
    "datasets_count": 12,
    "score": 31.8,
    "datasets_open_count": 7,
    "datasets_restricted_count": 3,
    "datasets_closed_count": 2,
    "datasets_unknown_count": 22
}

It's exactly the same data, but in existing routes.

nastasi-oq commented 5 years ago

@oncletom it was just a preview to understand whether the data is consistent with what we want. About the previous syntax: are the fullscores_*, score and datasets_open_count fields still used? I started from scratch to check performance too, but if I must also include expensive data gathering we fall back to slow queries. I am starting to rearrange the output into a more proper structure.

thom4parisot commented 5 years ago

OK, understood 👍 (I thought it was the final proposal).

I suspect fullscores_count can be replaced by datasets_count on the frontend side, to represent submitted datasets (unless there is a meaningful difference with fullscores_count).

What I understand is that score will remain but will be computed differently, as per https://github.com/GFDRR/open-risk-data-dashboard/issues/415#issuecomment-458580393.

I will adjust #424 to follow the rework of the API, when you next update exp.riskopendata.org.

When we're both okay, then your work can go to dev and #424 can be merged
Then when we feel everything is well wired, it can go to production

What do you think?

nastasi-oq commented 5 years ago

@oncletom this is the current outcome from api/scoring/ (the old one is accessible at api/scoring_old/) on exp.:


{
    "datasets_count": 473,
    "keydatasets_count": 36,
    "countries_count": 247,
    "countries": [
        {
            "datasets_closed_count": 5,
            "datasets_restricted_count": 29,
            "datasets_unknown_count": 12,
            "rank": 1,
            "datasets_open_count": 5,
            "datasets_count": 39,
            "score": 33.5,
            "country": "YF"
        },
        {
            "datasets_closed_count": 3,
            "datasets_restricted_count": 32,
            "datasets_unknown_count": 11,
            "rank": 2,
            "datasets_open_count": 2,
            "datasets_count": 37,
            "score": 31.7,
            "country": "AL"
        },
        ....
    ]
}

thom4parisot commented 5 years ago

Amazing, it looks good, thank you!

I will have a look at it tonight so that you can have feedback by tomorrow, although I can't see what would need to change at this stage.

pzwsk commented 5 years ago

Hi, great to see we are converging on this

My comments:

the proposal is to get rid of the score and rank fields in the end;
I would also remove countries_count (I think we still use it for home page indicator though?);
would be good to have a look at how indicators are computed, could you point us to current code?

nastasi-oq commented 5 years ago

the proposal is to get rid of the score and rank fields in the end;

Already there; we can use them as is (they are currently consistent with the score) and change to a final version later.

I would also remove countries_count (I think we still use it for home page indicator though?);

As you prefer; @oncletom, give me an LGTM and I will proceed.

would be good to have a look at how indicators are computed, could you point us to current code?

The pre-computed part, instead, is done in the get_score_calculate_new class method: https://github.com/GFDRR/open-risk-data-dashboard/compare/master...be_scoring-new#diff-358ba6dc1c7f31b62296c6c484e774e7R352

The biggest part of the job is done in the all_countries_new class method. https://github.com/GFDRR/open-risk-data-dashboard/compare/master...be_scoring-new#diff-2fc7c76ad15b7f095e4e9b3cf2aeafbfR895

thom4parisot commented 5 years ago

countries_count is still in use, but only via the /api/stats route

rank is not used anymore

score can be replaced by a custom sorting method, client-side (at the moment, it is in use to sort the "Open / Restricted / Closed" column).

GFDRR / open-risk-data-dashboard

Change country score into open data indicators: open, restricted, closed, unknown #305