airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International

Create a `Contributor` object to use throughout the schema #552

Closed (bussec closed this issue 8 months ago)

bussec commented 3 years ago

The following point was moved here from #530:

Currently we have multiple data structures in the schema that refer to people contributing to a study (ex. study subjects):

bussec commented 3 years ago

We could use the following taxonomy to describe a person's role in a project:

https://casrai.org/credit/

bcorrie commented 2 years ago

So each of the above is replaced with Contributor.

bcorrie commented 2 years ago

We could use the following taxonomy to describe a person's role in a project:

https://casrai.org/credit/

How would you see this working?

Contributor : {
    person: URI {
        label: string, (e.g. "Brian Corrie")
        id: string (e.g. "ORCID:0000-0003-3888-6495")
    },
    institution: URI {
        label: string, (e.g. "Simon Fraser University")
        id: string (e.g. "ROR:0213rcc28")
    },
    credit: URI {
        label: string, (e.g. "Data curation")
        id: string (e.g. "CRT:f93e0f44-f2a4-4ea1-824a-4e0853b05c9d")
    }
}

Example is from: https://credit.niso.org/contributor-roles/data-curation/

bcorrie commented 2 years ago

This would be minimal, and complete, but would require lookup every time one wanted to actually use anything.

Contributor : {
    orcid_uri: string, (e.g. "ORCID:0000-0003-3888-6495")
    credit: string (e.g. "CRT:f93e0f44-f2a4-4ea1-824a-4e0853b05c9d")
}

Not a fan of this...
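To make the lookup cost concrete, here is a minimal Python sketch. The two lookup tables are hypothetical stand-ins for whatever would resolve ORCID and CRediT identifiers in practice, and `expand_minimal` is an illustrative name; the IDs and labels are just the examples from this thread.

```python
# Hypothetical in-memory tables standing in for ORCID / CRediT resolution.
# A real consumer would have to query external services or a local cache.
PERSON_LABELS = {"ORCID:0000-0003-3888-6495": "Brian Corrie"}
CREDIT_LABELS = {"CRT:f93e0f44-f2a4-4ea1-824a-4e0853b05c9d": "Data curation"}


def expand_minimal(contributor: dict) -> dict:
    """Turn a minimal (IDs only) record into the id/label form."""
    person_id = contributor["orcid_uri"]
    credit_id = contributor["credit"]
    return {
        # Each label requires one lookup; a missing ID raises KeyError.
        "person": {"id": person_id, "label": PERSON_LABELS[person_id]},
        "credit": {"id": credit_id, "label": CREDIT_LABELS[credit_id]},
    }


minimal = {
    "orcid_uri": "ORCID:0000-0003-3888-6495",
    "credit": "CRT:f93e0f44-f2a4-4ea1-824a-4e0853b05c9d",
}
expanded = expand_minimal(minimal)
```

With the longer id/label form, the labels travel with the record and no such step is needed before displaying it.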

schristley commented 2 years ago

I kinda feel if we're going to go this far in adding a Contributor object, we should go all the way and allow for any number of contributors to be attached to a study. Thus, not restrict ourselves to only those 3 categories (study contact, collected, submitted).

Plus, there's a bit of a disconnect between the contributor roles in CRediT and what MiAIRR wants. CRediT roles indicate what contribution people made, but none of them exactly match the roles we want, i.e. a study contact, the person who collected the data and is legally responsible for it, and the person who submitted the data. The "Data Curation" role in CRediT is roughly similar to the data submitter, but nothing really matches the other two.

bcorrie commented 2 years ago

Yes, the credit and contact purposes are somewhat different. The MiAIRR fields are contact fields more than credit fields. If you want to ask someone about sample prep, ask the collected_by person; about curation, ask the submitted_by person; and about the study in general, ask the study_contact.

These aren't really providing credit - and maybe MiAIRR doesn't need to?

schristley commented 2 years ago

These aren't really providing credit - and maybe MiAIRR doesn't need to?

I think not. The Contributor object is there to standardize person information. That makes sense, but I don't think CRediT roles are really necessary.

However, one limitation of the current design is that only one person can be assigned. There are sometimes multiple study contacts, and there are often multiple data submitters. For example, when Kira did metadata curation, and I did the data processing, I'd like to list both of us as the data submitters so we both get credit...

bussec commented 2 years ago

@bcorrie The "long" version of the Contributor record looks good to me. I also agree that we should have an array of Contributor objects in a study, so that proper credit can be provided.

Regarding the "contact" roles defined in MiAIRR: I also would consider them to be only weakly correlated with the credit information. I could think of two ways to combine them:

  1. A Contributor record contains study_contact, collected_by and submitted_by as boolean fields, so you can flag the respective person.
  2. We keep study_contact, collected_by and submitted_by as properties of Study, but they contain an index to the respective record in the Contributors array.
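
The two options can be sketched in Python as follows. This is only an illustration: the `contributors` key, the `name` field, the placeholder people and the helper names are assumptions, and only study_contact, collected_by and submitted_by come from MiAIRR.

```python
# Option 1: each Contributor record carries boolean flags.
option1_contributors = [
    {"name": "Contributor A", "study_contact": True,
     "collected_by": True, "submitted_by": False},
    {"name": "Contributor B", "study_contact": False,
     "collected_by": False, "submitted_by": True},
]


def names_with_flag(contributors: list, flag: str) -> list:
    """Return the names of all contributors that have `flag` set."""
    return [c["name"] for c in contributors if c.get(flag, False)]


# Option 2: the Study keeps the three fields, each holding an index
# into its array of Contributor records.
option2_study = {
    "contributors": [{"name": "Contributor A"}, {"name": "Contributor B"}],
    "study_contact": 0,
    "collected_by": 0,
    "submitted_by": 1,
}


def name_by_index(study: dict, role: str) -> str:
    """Resolve a role field of the Study to the contributor's name."""
    return study["contributors"][study[role]]["name"]
```

In this sketch, option 1 allows several people to carry the same flag without further changes, whereas option 2 would need the index fields to become arrays to support that.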

schristley commented 2 years ago

This ontology has a more complete set of roles, though I'm not sure if anything actually matches as contact (supervisor role?). Though maybe we can request a term...

schristley commented 2 years ago

@bcorrie The "long" version of the Contributor record looks good to me. I also agree that we should have an array of Contributor objects in a study, so that proper credit can be provided.

Agreed, also credit should be an array so that multiple roles can be assigned.

  1. A Contributor record contains study_contact, collected_by and submitted_by as boolean fields, so you can flag the respective person.
  2. We keep study_contact, collected_by and submitted_by as properties of Study, but they contain an index to the respective record in the Contributors array.

or 3. We designate/document 3 terms from this ontology for those roles, for example:

bussec commented 2 years ago

@schristley I have now looked at this in more detail: CRO is a complete superset of the CRediT taxonomy; the terms even have the same names and definitions. So we get the additional term we need for free when using it, and it is simpler to integrate because it is an OBO Foundry ontology as well.

bussec commented 2 years ago

Where should the array containing the Contributor records be located in the schema? Is it rather

  1. an independent top-level object, or
  2. a property of the Study object?

@williamdlees Would either of these work for the Germline Acknowledgements? Or would the Contributor records need internal IDs for referencing?

williamdlees commented 2 years ago

(1) should work. At the moment there is a top level Acknowledgement object which is used by AlleleDescription and GermlineSet. It has an acknowledgement_id but it's not used.

schristley commented 2 years ago

  2. a property of the Study object?

I'd probably prefer this to keep it simple, without needing to create identifiers, worrying about uniqueness and so forth. This also allows the contributor role to be different for study versus germline.

schristley commented 2 years ago

Potential structure...

Contributor : {
    contributor_pid: string, (e.g. "ORCID:0000-0003-3888-6495")
    name: string,
    email: string,
    affiliation_pid: string,
    affiliation_name: string,
    affiliation_address: string,
    roles: array
}

do we need to allow multiple affiliations?
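
For discussion, the structure above could be sketched as a Python dataclass. Which fields are optional is an assumption made here for illustration (only `name` is treated as required), and `roles` is assumed to hold plain role labels; the example values are the ones used earlier in this thread.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Contributor:
    """Illustrative mirror of the proposed fields, not a final schema."""
    name: str
    contributor_pid: Optional[str] = None      # e.g. an ORCID
    email: Optional[str] = None
    affiliation_pid: Optional[str] = None      # e.g. a ROR ID
    affiliation_name: Optional[str] = None
    affiliation_address: Optional[str] = None
    roles: List[str] = field(default_factory=list)  # several roles allowed


example = Contributor(
    name="Brian Corrie",
    contributor_pid="ORCID:0000-0003-3888-6495",
    affiliation_pid="ROR:0213rcc28",
    affiliation_name="Simon Fraser University",
    roles=["Data curation"],
)
record = asdict(example)  # plain dict, ready for JSON/YAML serialization
```

A single affiliation per record keeps this flat; supporting several would mean turning the three affiliation fields into an array of small objects.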

bcorrie commented 2 years ago

do we need to allow multiple affiliations?

ORCID handles those complexities, maybe we can state that this is the primary affiliation that one has with this study and defer to ORCID for complex relationships and affiliations.

bussec commented 2 years ago

Regarding affiliation_address: Are we ok with <city>,[state,]<country> ? This information is already in ROR, so we could just pull it from there. Or is anyone still receiving physical mail these days? :-)
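
As a rough sketch of what that format implies for consumers, the following Python function splits such a string into its parts. The function name, the returned keys and the example addresses are illustrative assumptions, and it assumes that none of the components contains a comma itself.

```python
from typing import Dict, Optional


def parse_affiliation_address(address: str) -> Dict[str, Optional[str]]:
    """Split '<city>,[state,]<country>' into city, state and country."""
    parts = [p.strip() for p in address.split(",")]
    if len(parts) not in (2, 3) or not all(parts):
        raise ValueError(f"expected '<city>,[state,]<country>': {address!r}")
    if len(parts) == 2:
        city, country = parts
        state = None          # the state component is optional
    else:
        city, state, country = parts
    return {"city": city, "state": state, "country": country}


with_state = parse_affiliation_address("Burnaby, British Columbia, Canada")
without_state = parse_affiliation_address("Heidelberg,Germany")
```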

bcorrie commented 2 years ago

Does it make sense to use PID id/label pairs as I did in: https://github.com/airr-community/airr-standards/issues/552#issuecomment-1016809146

This provides more consistent use with other URI based PID objects in the standard. We might do this for the person and institution (e.g. affiliation.id, affiliation.label) rather than have custom fields (affiliation_pid, affiliation_name) for those objects.

schristley commented 2 years ago

Does it make sense to use PID id/label pairs as I did in: #552 (comment)

This provides more consistent use with other URI based PID objects in the standard. We might do this for the person and institution (e.g. affiliation.id, affiliation.label) rather than have custom fields (affiliation_pid, affiliation_name) for those objects.

Are we requiring that all contributors have an ORCID? And if they don't, then how should their information be recorded?

schristley commented 2 years ago

Regarding affiliation_address: Are we ok with <city>,[state,]<country> ? This information is already in ROR, so we could just pull it from there. Or is anyone still receiving physical mail these days? :-)

Yes, that is probably okay. What about department name(s)?

I think it's reasonable that we don't expect this to be complete and thorough contact information, but enough information that somebody could find the person with some extra googling? On the other hand, if it's a legal contact then it likely needs to be as specific as possible.

bussec commented 2 years ago

What about department name(s)?

This is currently beyond the scope of ROR. They are working on this, but I don't expect this to happen any time soon. So either we have a free text field for this or skip the information altogether.

Are we requiring that all contributors have an ORCID?

If we are using these fields as ID/label pairs this would be a consequence of it (as the IDs must resolve to the label or a synonym for it).

schristley commented 2 years ago

Are we requiring that all contributors have an ORCID?

If we are using these fields as ID/label pairs this would be a consequence of it (as the IDs must resolve to the label or a synonym for it).

Right. That's why I avoided the Ontology ID/label with my suggested structure. That is, the ORCID (or other PID) can be provided if it's available but it isn't required.
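
A minimal Python sketch of how such an optional field could be checked: a missing PID is accepted, while a provided ORCID is verified with the check-digit scheme that ORCID documents (ISO 7064 MOD 11-2). The function names and the "ORCID:" prefix convention are assumptions taken from the examples in this thread, and other PID types are simply rejected here for brevity.

```python
import re
from typing import Optional

# Assumed CURIE-like form used in this thread: "ORCID:0000-0003-3888-6495"
_ORCID_RE = re.compile(r"^ORCID:(\d{4})-(\d{4})-(\d{4})-(\d{3}[\dX])$")


def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_contributor_pid(pid: Optional[str]) -> bool:
    """Accept a missing PID; otherwise require a well-formed ORCID."""
    if pid is None:
        return True  # the PID is optional in this design
    match = _ORCID_RE.match(pid)
    if match is None:
        return False
    digits = "".join(match.groups())
    return orcid_check_digit(digits[:15]) == digits[15]
```

A check like this only confirms that the identifier is well-formed; it does not confirm that the ORCID record exists or that it belongs to the named person.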

bussec commented 2 years ago

Some further points, based on a couple of RDA-DE talks today and yesterday:

  1. ORCID has started to support ROR IDs (actually, they want to make ROR their primary institutional identifier in the near future).
  2. ORCID also supports CRediT, which apparently has seen quite good uptake, especially among publishers.
  3. CRediT is deliberately designed to describe contributor roles, not administrative roles (i.e. contact person).
  4. There is a third vocabulary for contributor roles in the DataCite metadata schema (see page 20), which also contains a "ContactPerson" role.