Sources of Open AI models must also be open?

iperdomo commented 1 year ago

Do the sources of an Open AI model also need to be open in order to comply with the standard?

A couple of real examples:

Voice recordings from children used for training a speech-to-text (STT) model. In this case, a laborious and complex consent process with parents (and/or legal guardians) must be in place to release the sources used in the training.
Models where several datasets, including Google Trends (search) and NASA datasets, are used and there are no open alternatives to those.

christer-io commented 1 year ago

We need to be very clear on this. Models must be under an open license.

jwflory commented 1 year ago

Just curious. Are we talking about data or are we talking about models? I don't think we can have a conversation about models being under an open license if the data they are trained on is proprietary or limited access. I think AI models fall into a gray area with the current DPG Standard because models are, in many ways, an intersection of software and data. You can't have an AI model with just one; you need both ingredients.

jstclair2019 commented 1 year ago

I think from the software side AI/ML still fits within the DPG Standard. While I agree with @jwflory about the "grey area", I envision a submitter presenting a candidate DPG that is an open source software library. Even if it's trained on proprietary data sets, HOW the software libraries work and how they are updated (especially hard dependencies like another third-party library) can still be open.

iperdomo commented 1 year ago

We have another candidate that uses datasets behind a login wall and a custom dataset license

https://github.com/DPGAlliance/publicgoods-candidates/pull/1363

We use MIMIC-III. As MIMIC-III requires the CITI training program in order to use it, we refer users to the link

Source: https://github.com/onefact/ClinicalBERT/blob/main/README.md#datasets

The license for MIMIC-III can be found at: https://physionet.org/content/mimiciii/1.4/ - PhysioNet Credentialed Health Data License 1.5.0

jaanli commented 1 year ago

Thank you for this discussion! Echoing @iperdomo -- there is a moral and ethical dilemma here.

For example, I have Estonian citizenship, and am a visiting professor at University of Tartu, where I teach students and faculty how to use ClinicalBERT (https://arxiv.org/abs/1904.05342). The course I teach: https://courses.cs.ut.ee/2023/chatGPT/spring

Some students and faculty have re-trained this ClinicalBERT in the Estonian language to help the department of health move to value-based care for the health system there.

I will try to articulate the moral and ethical issue that follows from requiring that protected health information be open source (e.g. using these license types: https://github.com/DPGAlliance/publicgoods-candidates/blob/main/help-center/licenses.md#data):

(1) countries like Estonia have homogeneous genetic ancestry

(2) countries like Estonia also have few people of color -- e.g. this reference states there were 414 people of African descent or Black Europeans in 2011 (https://www.enar-eu.org/wp-content/uploads/estonia_fact_sheet_briefing_final.pdf).

(3) Doctors not being aware of diseases with genetic etiology, such as sickle cell anemia or polycystic ovarian syndrome, can cause death.

(4) Doctors CAN use tools like ClinicalBERT (https://arxiv.org/abs/1904.05342) to help make decisions about patients for whom they cannot access data due to legal, moral, and ethical standards (for example, a doctor in Estonia cannot expect a hospital in New York City to share data on Black, Black European, or African-American patients in the electronic health record -- and this could violate several laws).

(5) We have support from the National Institutes of Health, who are conducting the largest longitudinal study in the history of the United States, to solve this problem: https://drive.google.com/open?id=1Si323cuMQp68ilgsymi_KZu1FXqXFT3c&authuser=jaan%40onefact.org&usp=drive_fs -- this letter states that ClinicalBERT can be trained on the entirety of the researchallofus.org electronic health record. This health record contains up to 78,040 people who self-report their race/ethnicity as Black, African American or African. This is over 100 times more Black people than are in Estonia.

Not using this data risks leaving under-represented groups in many countries in a situation where medical decisions made by them or by their care teams, hospital systems, health systems, and governments are not based on the most information possible, delivered by open source software and AI (ClinicalBERT is Apache 2.0-licensed).

In my experience working across several large academic medical centers and hospitals in the United States and Europe, scenarios like this are exceedingly common, and open source AI tools like ClinicalBERT are one way to share knowledge for medical professionals and health systems to deliver valuable care to their population.

Does this use case make sense? And is it clear how the requirement that datasets be open source can lead to a situation where people do not have access to potentially life-saving care algorithms, in the case that they happen to be under-represented in the health system and clinical data repository their care team uses to inform decision-making?

Happy to clarify, cite, discuss any of the above. My email is jaan@onefact.org if easier.

kjetilk commented 1 year ago

While admittedly having not had time to read recent discussions in detail, I just wanted to point out the work that is going on within the Debian Project on this topic: https://salsa.debian.org/deeplearning-team/ml-policy/-/blob/master/ML-Policy.rst

ricardomiron commented 1 year ago

At the DPGA we are currently hosting a community of practice (CoP) on AI systems as DPGs. This CoP will run for the next couple of months and one of its focus areas will be to inform the DPG standard to enable a well-grounded assessment and vetting process of AI-based solutions that are submitted to the DPG Registry.

Some of the things we will be discussing include, but are not limited to:

What an open AI model/system means as a stand-alone DPG category, and what its core components are.
The requirement to have the sources and/or training datasets open, and the challenges this raises.
Best practices, open standards, and required documentation for AI.
Other privacy, security, and ethical considerations.

We hope that at the end of this CoP, we can reach a final conclusion on this issue (#148) and also have a stronger understanding of the requirements for AI models/systems as DPGs, so feel free to share any thoughts or resources that can help guide this conversation.

jstclair2019 commented 1 year ago

Thanks @ricardomiron I'd like to know if CoP participation is still open. I've been involved in open source efforts for vulnerability management in AI that I think need to be included for consideration as well.

ricardomiron commented 1 year ago

@jstclair2019 the CoP has already started but I'll send you a DM with more details.

Also, to be more specific these are the current requirements to consider open AI models for the DPG vetting process:

The model itself (source code for model creation, training, optimization, etc.) must be under an OSI-approved open-source software license.
The data sources should be explicitly mentioned, and the training datasets must be publicly available under an approved open data license.
Training of the model is expected to be carried out using only these open datasets, alongside proper documentation for reproducibility.
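
As a rough illustration only (not an official DPGA tool), the three requirements above could be sketched as a checklist. The `Dataset` and `Candidate` classes, the `check_candidate` function, and the two license sets are hypothetical names made up for this example; the license sets are deliberately incomplete placeholders, not the approved lists.

```python
from dataclasses import dataclass, field

# Hypothetical, incomplete placeholders. The real approved lists are
# maintained elsewhere (OSI for software, the DPG Standard for data).
EXAMPLE_SOFTWARE_LICENSES = {"Apache-2.0", "MIT", "GPL-3.0-only"}
EXAMPLE_DATA_LICENSES = {"CC-BY-4.0", "CC0-1.0", "ODbL-1.0"}


@dataclass
class Dataset:
    name: str
    license: str
    publicly_available: bool  # False if behind a login wall or credentialing


@dataclass
class Candidate:
    name: str
    code_license: str
    datasets: list[Dataset] = field(default_factory=list)
    has_reproducibility_docs: bool = False


def check_candidate(c: Candidate) -> list[str]:
    """Return a list of unmet requirements (empty if none were found)."""
    problems = []
    # Requirement 1: model source code under an approved software license.
    if c.code_license not in EXAMPLE_SOFTWARE_LICENSES:
        problems.append(f"code license {c.code_license!r} is not in the example list")
    # Requirement 2: data sources named, public, and openly licensed.
    if not c.datasets:
        problems.append("no training data sources are listed")
    for d in c.datasets:
        if not d.publicly_available:
            problems.append(f"dataset {d.name!r} is not publicly available")
        if d.license not in EXAMPLE_DATA_LICENSES:
            problems.append(f"dataset {d.name!r} license {d.license!r} is not in the example list")
    # Requirement 3: documentation sufficient to reproduce the training.
    if not c.has_reproducibility_docs:
        problems.append("reproducibility documentation is missing")
    return problems


# Mirrors the case discussed earlier in this thread: Apache-2.0 code trained
# on MIMIC-III, which needs credentialed access. The documentation flag is
# an assumption made purely for illustration.
example = Candidate(
    name="clinical-model-example",
    code_license="Apache-2.0",
    datasets=[
        Dataset(
            name="MIMIC-III",
            license="PhysioNet Credentialed Health Data License 1.5.0",
            publicly_available=False,
        )
    ],
    has_reproducibility_docs=True,
)

for problem in check_candidate(example):
    print(problem)
```

Under these assumptions, a candidate like the one above passes the software-license check but is flagged on the dataset requirement, which is the tension this issue is about.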

DPGAlliance / DPG-Standard

Sources of Open AI models must also be open? #148