Incorrect classification and licensing for some Data DPGs

ldodds commented 6 months ago

Apologies if this is not the correct place to report issues with entries in the registry. If so, please point me in the right direction.

I've recently taken a look at the Data category of the DPG Registry to understand what datasets have been classified as DPGs.

I think there are some misclassified entries and at least one licensing mistake. Some notes in my blog post linked above, but:

Three of the Data entries are, I believe, actually software rather than data. These are Crosscut, Dicra, and Doptor Open Data. They all provide some way to organise, access or work with third-party datasets but they don’t provide any original datasets.

The ability for Crosscut to licence data as CC0 may also merit some review as all of their source datasets require attribution and one (OSM) requires a share-alike licence.

The Open Terms Archive submission indicates that the license for the data is ODC-BY. But this looks like a mistake: reviewing each of the downloads, I found that the licenses are ODbL.

Project AEDES is classified as an AI Model, Open Content and Open Data. However, it again doesn't seem to provide any original data and is a project for building a predictive model.

Hope this feedback is useful!

ricardomiron commented 6 months ago

Hi @ldodds ,

This repo is now mainly a public archive and is no longer in use since we have moved to a different system to review DPG applications. Our main place for discussions is the DPG Standard repo.

This is a very interesting perspective, and you have highlighted some of the challenges around the open data category that we are sorting through as we rethink how the DPG Standard applies to data. Until now we have allowed projects to be classified under multiple categories, which in some cases has also made it difficult for end users to understand whether the documentation refers to one category or the other. We are soon to change this, likely for some projects such as Project AEDES.

Addressing some of the comments: Crosscut, Dicra, and Doptor Open Data provide both software functionalities/ services related to data AND access to data.

Crosscut's commercial and free tier services were not reviewed as part of their open data submission. We mainly focused on the 12 publicly accessible datasets that claim to be relevant to Global Health. The datasets show some settlements in Africa and catchment area boundaries for each settlement, and provide a name, a population estimate and a building estimate for each settlement. As far as they explained, this data is owned by them, was produced in collaboration with WHO, and is licensed appropriately.
The implementation of Dicra (https://dicra.undp.org.in) allows you to visualize and download climate and agriculture data collected in different regions in India. This data was collected by UNDP India, the owners of this DPG.
Doptor Open Data is owned by a2i (Government of Bangladesh), who are also the producers of the data, as documented under indicator 3, "Clear ownership". The data is accessible through their API.
For Open Terms Archive, can you clarify which downloads you are referring to? As far as I can tell, all the repos containing terms/policies are under the ODC-BY 1.0 license (for example their main collection). They might have added additional collections under ODbL, which is also an approved open data license, so their information can be updated to include it if that's the case.

One caveat: all this information is provided by the product owners themselves and, although verified to some degree by the DPGA, they remain responsible for any claims or misrepresentations.

But happy to connect and chat more about this!

ldodds commented 5 months ago

Hi @ricardomiron

Thanks for following up on my comments. I'll take a look at the DPG Standard repo. I'm really interested in how the DPG registry evolves and in assessing datasets against standards & guidelines. Happy to chat further; you can reach me at leigh dot dodds at gmail.

Firstly, my apologies: I've clearly made mistakes in my review. I will update my blog post with corrections.

Dicra, Doptor and Crosscut do clearly make data available. At least part of my mistake was focusing too much on the GitHub repos associated with each project. My assumption was that those repos would hold both the code and the datasets associated with those public goods and that the websites were public deployments of that code.

That's obviously not the case:

CrossCut is not an open source service and (despite the labelling of the DPG) CrossCut is not the DPG itself. The DPG is actually the "Crosscut Example Datasets". While I had seen those, I'd assumed they were just examples and not "production" datasets, and wrongly focused on trying to evaluate the CrossCut service.
Dicra and Doptor both provide source code for at least some of their infrastructure, but that is separate from the data services they provide which, while based on that software, provide access to datasets that are not available elsewhere (e.g. as raw data files in the GitHub repos). So the datasets and the code feel like distinct things.

Being able to more clearly distinguish between some software that may be used to host and serve datasets, and the DPGs that might be provided using that software would be helpful.

My question re: the CrossCut licensing arises because the ODbL has some provisions which allow partial extracts to be freely licensed, but derived data and larger extracts trigger the share-alike provision. We explored a similar use case to CrossCut in a project I led a few years ago. I understand it's not your role to police this though!

In the case of Open Terms Archive, I went to their website, clicked on the "Datasets" link in the navigation bar and then checked the "Download dataset" link for each of them. These all refer to ODbL and not ODC-BY.

Confusingly, the "main collection" you linked to is in a repo called contrib-versions. The licence file available from your link does refer to ODC-BY, but from the website you're taken to the latest release page for the same dataset, which says ODbL. As a user, that's how I'd expect to discover the datasets.

I hope the feedback is useful and apologies again for the mistakes.

DPGAlliance / publicgoods-candidates, issue #1712