IQSS / dataverse

Open source research data repository software
http://dataverse.org

Allow multiple depositors for datasets #5164

Open rmo-cdsp opened 5 years ago

rmo-cdsp commented 5 years ago

Hello,

While handling an issue (#4593), I had trouble importing a DDI file with multiple depositors in it, because Dataverse allows only ONE depositor per dataset. I talked with @pdurbin on the Dataverse IRC chat about this (http://irclog.iq.harvard.edu/dataverse/2018-10-11#i_75331), and we agreed that handling multiple depositors should be supported.

So, I opened this issue to deal with it :)

pdurbin commented 5 years ago

@rmo-cdsp thanks for opening this issue. My understanding from @jggautier and from looking at http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/field_level_documentation_files/schemas/codebook_xsd/schema-overview.html#xml_source (screenshot below) is that we should allow multiple depositors because of this line:

<xs:element maxOccurs="unbounded" minOccurs="0" ref="depositr"/>
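To see what "maxOccurs unbounded" means in practice, here is a minimal Python sketch that collects every depositr element from a DDI document. The sample XML, the organization names, and the `find_depositors` helper are made up for illustration. The code only parses the XML; it does not validate it against the DDI schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily abridged DDI Codebook fragment (not a valid full codebook).
SAMPLE_DDI = """<codeBook xmlns="ddi:codebook:2_5">
  <stdyDscr><citation><distStmt>
    <depositr>Organization A</depositr>
    <depositr>Organization B</depositr>
  </distStmt></citation></stdyDscr>
</codeBook>"""


def find_depositors(ddi_xml):
    """Return the text of every depositr element, ignoring XML namespaces."""
    root = ET.fromstring(ddi_xml)
    names = []
    for element in root.iter():
        # Tags look like "{namespace}depositr"; keep only the local name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "depositr" and element.text:
            names.append(element.text.strip())
    return names


depositors = find_depositors(SAMPLE_DDI)
if len(depositors) > 1:
    print("File has multiple depositors:", depositors)
```

A file like this sample is allowed by the schema line above, but it cannot be mapped onto a single-valued depositor field without dropping or merging values.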

[Screenshot, 2018-10-11 at 9:31:11 am]

The change would need to be made to scripts/api/data/metadatablocks/citation.tsv

Note how in the screenshot below "allowmultiples" is set to false for "depositor":

[Screenshot, 2018-10-11 at 9:34:11 am]
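As a rough sketch of the tsv change, the Python below flips "allowmultiples" for one field in a tab-separated block. The sample is abridged and hypothetical: it assumes a header row with "name" and "allowmultiples" columns and TRUE/FALSE values. The real citation.tsv has more columns and more than one section, so this would need adapting. `set_allow_multiples` is an illustrative helper, not part of Dataverse.

```python
import csv
import io

# Hypothetical, abridged stand-in for the #datasetField section of a metadata block tsv.
SAMPLE_TSV = (
    "#datasetField\tname\ttitle\tfieldType\tallowmultiples\n"
    "\tdepositor\tDepositor\ttext\tFALSE\n"
    "\tdateOfDeposit\tDeposit Date\tdate\tFALSE\n"
)


def set_allow_multiples(tsv_text, field_name, allow=True):
    """Return tsv_text with the allowmultiples cell of field_name updated."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    header = rows[0]
    name_col = header.index("name")
    multi_col = header.index("allowmultiples")
    for row in rows[1:]:
        # Skip short or unrelated rows; only touch the requested field.
        if len(row) > multi_col and row[name_col] == field_name:
            row[multi_col] = "TRUE" if allow else "FALSE"
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()


updated = set_allow_multiples(SAMPLE_TSV, "depositor")
print(updated)
```

Only the depositor row changes. Every other row is written back as it was read.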

jggautier commented 5 years ago

Hi @rmo-cdsp. From the irc chat, it sounded like you might look into why there are multiple depositors. Were you able to look into those use cases?

Also, because Dataverse pre-populates the depositor field with the name of the account used to create the first dataset version, and continues to pre-populate that field with that first account name every time a new dataset version is created, and I suspect that few people creating new versions remove that pre-populated name and add their own, the field acts as more of a record of the account used to deposit the dataset's first version (I see examples of this in Harvard Dataverse and UNC Dataverse).

So I'm wondering:

rmo-cdsp commented 5 years ago

@jggautier I work for the CDSP (https://cdsp.sciences-po.fr/fr/ , yes, not translated yet, but lots of info!), which is an entity that doesn't produce much data itself, but instead handles the data of other entities. So, in that case (and from what I understood from my co-workers handling this), the depositor is the entity that gave us the data. The owners, I might say. So, when the DDI files end up on the Dataverse, the depositors of the data are the entities that asked for it, not really us. Your definition is "The person (Family Name, Given Name) or the name of the organization that deposited this Dataset to the repository.". But in that case, the repository is "us", in some way, and the entities deposited their data with us. So, where should the depositors end up? I don't know if we can consider them "creators" of the data, because sometimes it's an administrative hell to understand who really did what.

Let's end with my thoughts and answer your questions:

Those are my thoughts. I will organize a short meeting with my data managers to get their precise point of view on your questions. I hope you understood my point of view and were brave enough to read it all :D

jggautier commented 5 years ago

Thanks! I think I understand:

Dataverse's definition and DDI's definition seem to be in conflict in cases where (1) the person/organization who owns the account used to physically deposit the data and (2) the person/organization that owns the data and gave it to the archive storing it are different.

The depositor field being pre-populated is a convenience when the Dataverse account owner who is depositing the data also owns the data. But the pre-populated value needs to be changed when those two "entities" are different, which will happen a lot in cases when data is being migrated to Dataverse.

The depositor field is editable for these cases - that is, so that if the person/organization who owns the data isn't the person physically uploading the data, that person should change the pre-populated value (the account name) to the name of the person/organization that owns the data.

This makes the depositor field less useful for tracking who physically deposited each dataset version, and instead records who owns the data. (The names of accounts that make changes to each saved and published dataset version are displayed in the contributor column of the dataset versions table.)

I agree that multiple depositors makes more sense when following the DDI definition, and I'm thinking the definition displayed in the UI and elsewhere should be changed to the DDI definition you quoted.

Thanks for getting into the weeds with us! Looking forward to learning more after your meeting with your data managers.

rmo-cdsp commented 5 years ago

I just ran into another similar case (I just got new data to import from my data managers): samplingProcedure (ddi:sampProc) can also occur multiple times. The Dataverse schema allows only one. Should I merge every sampProc into one field (given that it is a textarea in the app), or should the field allow multiples in Dataverse? Multiple fields would be cleaner IMHO, as with depositors.

EDIT: I just got a new case with the Dataverse dataCollectionSituation field (ddi:collSitu): there were 2 collSitu for one dataColl, which seems to be OK on the DDI side :S

jggautier commented 5 years ago

Looks like most of the fields Dataverse puts in its social science metadata block don't allow multiples, while DDI does allow multiples for many of the fields.

@janetm, when ADA migrated to Dataverse, were there instances like this, where there were problems with importing metadata because of multiple depositors or multiple instances of other fields that Dataverse allows only one instance of by default?

rmo-cdsp commented 5 years ago

I changed the value of allowMultiple to "True" for some fields (Citation:depositor, socialScience:samplingProcedure, socialScience:dataCollector, socialScience:dataColSitu), dropped my db, and reran the startup script. Now I can't create a new dataset with my DDI import API; I get an error like this:

"Error parsing datas as Json: incorrect multiple for field samplingProcedure" The same thing happens for each field I edited in the tsv files. Also, even if there is only one such tag in my xml, the error still occurs. I have to remove from my xml files every tag whose field I edited.

Now, on the gui side, I can't add fields to multiple fields (with the "+" button). Nothing happens in the logs. I checked in the network logs, and saw that the query response is only a CDATA tag, something like this:

<partial-response id="j_id1"><changes><update id="j_id1:javax.faces.ViewState:0"><![CDATA[1366648494339466728:-3140437538934724148]]></update></changes></partial-response>

I checked on other Dataverse apps and, from what I understand, the response should contain the HTML code to add the field to the page in the CDATA tag. I checked my tsv files and didn't see anything wrong ... so I'm kinda lost.
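One possible reading of the JSON error quoted above, not confirmed in this thread, is that the JSON built from the DDI file still describes these fields as single-valued while the field definitions now allow multiples. The Python sketch below illustrates that kind of consistency check. The `check_multiple` helper and the `ALLOW_MULTIPLES` table are hypothetical, and the field layout (typeName, typeClass, multiple, value) is assumed to follow the Dataverse dataset JSON format. It is not the actual Dataverse parser.

```python
# Hypothetical lookup of which fields are defined as allowing multiples.
ALLOW_MULTIPLES = {"depositor": True}


def check_multiple(field, allow_multiples):
    """Raise ValueError if a field's 'multiple' flag or value shape is inconsistent."""
    name = field["typeName"]
    expected = allow_multiples[name]
    if field.get("multiple") != expected:
        raise ValueError(f"incorrect multiple for field {name}")
    # A multi-valued field should carry a list; a single-valued one should not.
    if expected != isinstance(field["value"], list):
        raise ValueError(f"value shape does not match 'multiple' for field {name}")


# Still encoded as a single value: rejected once the field allows multiples.
single = {"typeName": "depositor", "typeClass": "primitive",
          "multiple": False, "value": "Organization A"}

# Encoded as a list with multiple=True: accepted.
multi = {"typeName": "depositor", "typeClass": "primitive",
         "multiple": True, "value": ["Organization A", "Organization B"]}

check_multiple(multi, ALLOW_MULTIPLES)
try:
    check_multiple(single, ALLOW_MULTIPLES)
except ValueError as error:
    print(error)
```

If this reading is right, it would also explain why the error appears even when the XML contains only one tag: the mismatch is in the flag, not in the number of values.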

pdurbin commented 5 years ago

@rmo-cdsp hi, can you please go ahead and push the changes you've made to a branch so we can take a look?

rmo-cdsp commented 5 years ago

@pdurbin here is the branch https://github.com/rmo-cdsp/dataverse/tree/5164-from-4593-test

pdurbin commented 5 years ago

@rmo-cdsp thanks. I ran ec2-create-instance.sh -b 5164-from-4593-test -r https://github.com/rmo-cdsp/dataverse.git using our new script at http://guides.dataverse.org/en/4.9.4/developers/deployment.html to deploy your branch to http://ec2-52-90-11-27.compute-1.amazonaws.com:8080/

(The password for dataverseAdmin is "admin1" if anyone wants to poke around with it.)

The odd thing is that "Kind of Data" is also broken in the same way even though you didn't touch it. Clicking the "plus" sign (+) just gives a spinner and doesn't work. I thought maybe this is a bug but "Kind of Data" works fine on https://demo.dataverse.org

I took a quick look at your changes using "compare" at https://github.com/IQSS/dataverse/compare/develop...rmo-cdsp:5164-from-4593-test and I think you're making the correct changes to the tsv file to make depositors allow multiple. I'm at a loss.

Also, I wanted to note that #5205 was discussed in sprint planning yesterday and is in the next sprint. I'll link back to this issue as an example of the struggles people have.

pdurbin commented 5 years ago

@rmo-cdsp thanks for opening #5212 about the "plus" button not working in the "develop" branch.

rmo-cdsp commented 5 years ago

Another thing had to be changed in order to allow multiple depositors: change the schema.xml file for solr. Here it is: https://github.com/IQSS/dataverse/blob/develop/conf/solr/7.3.0/schema.xml

You have to edit the <field> tag whose "name" attribute equals the field you want to change. Set the "multiValued" attribute to true if you want multiple values. In my case, the previous tag was this: <field name="depositor" type="text_en" multiValued="false" stored="true" indexed="true"/> And became this: <field name="depositor" type="text_en" multiValued="true" stored="true" indexed="true"/>

You then have to replace the schema.xml your Solr uses with the edited one (no need if you changed it directly in your Solr installation), then stop and start Solr, and everything should be OK.

In my case, I was able to create a dataset with multiple depositors after that.
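For anyone who wants to script the schema.xml edit described above, here is a minimal Python sketch. It works on an abridged, made-up schema string. The `make_multivalued` helper is illustrative, and only the depositor field line is taken from this comment. ElementTree drops XML comments when it re-serializes, so for the real file a manual edit may be the safer route.

```python
import xml.etree.ElementTree as ET

# Hypothetical, abridged schema; the dateOfDeposit entry is only there for contrast.
SAMPLE_SCHEMA = """<schema name="sample" version="1.5">
  <field name="depositor" type="text_en" multiValued="false" stored="true" indexed="true"/>
  <field name="dateOfDeposit" type="text_en" multiValued="false" stored="true" indexed="true"/>
</schema>"""


def make_multivalued(schema_xml, field_name):
    """Return schema_xml with multiValued="true" on the named <field>."""
    root = ET.fromstring(schema_xml)
    changed = 0
    for field in root.iter("field"):
        if field.get("name") == field_name:
            field.set("multiValued", "true")
            changed += 1
    if changed == 0:
        raise ValueError(f"no <field> named {field_name!r} in schema")
    return ET.tostring(root, encoding="unicode")


updated_schema = make_multivalued(SAMPLE_SCHEMA, "depositor")
print(updated_schema)
```

Solr still needs to be stopped and started afterwards, as noted above, so that it picks up the edited schema.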

pdurbin commented 5 years ago

@rmo-cdsp I took a quick look at pull request #5223 and a few things stand out to me:

Basically, above I'm wondering about places in the code where there are hard-coded assumptions about depositors being single rather than multiple.

jggautier commented 3 years ago

Hi @rmo-cdsp. Before the conversation in this issue was expanded to other DDI Codebook fields that technically allow multiple when Dataverse allows only one instance, it sounded like your group agreed that only one instance of depositor was necessary. Is that correct?

I'm in a group reviewing updates to DDI Codebook and I'm wondering if, based on our discussion in this GitHub issue, I should propose that in the next version of DDI Codebook the depositor field be restricted to one instance and/or the documentation clarify that only one value should be used for depositor. What do you think?

About the other fields that DDI Codebook technically allows to be repeated when Dataverse allows only one instance, I think changes to each field should be addressed individually. Perhaps each field should get its own GitHub issue, since each case might be different. For example, if there are multiple sampling procedures, doesn't that indicate that there are multiple surveys, and each survey should really be its own dataset? I don't mean to discuss this argument in this GitHub issue; just trying to make the case that changes to each field should be addressed individually. Recently a similar discussion about another DDI field, geospatial bounding box, has led to documentation changes being proposed for the next version of DDI Codebook.

So what I'm asking is:

More generally, DDI Codebook allows for a lot of flexibility, I think to increase adoption, and I think sometimes that flexibility can be taken advantage of at the expense of usability/interoperability. There's also been hesitation to change what's technically allowed in the DDI Codebook schema because they want to maintain backwards compatibility. This is happening with the geospatial bounding box situation, where the solution has been to clarify the documentation instead of adding restrictions in the XML schema. So I think it's important to consider the merits of using a field a certain way, like allowing multiple instances of it even when the schema technically allows for it.

pdurbin commented 1 year ago

@rmo-cdsp hi! Are you still interested in this? You created this PR but now it has merge conflicts:

@jggautier if you could look at that PR and bless (or not!) that change, I'd appreciate it. No rush! 😄 It amounts to making depositor in citation.tsv allow multiples.

jggautier commented 1 year ago

@rmo-cdsp wrote earlier in this issue that they don't need this field to allow multiples, since the metadata of each of their datasets has only one depositor, but they're in favor of this change anyway and I agree.

The change to make depositor in citation.tsv allow multiples is tied to the definition of the Depositor field, and I think that definition should be the focus of this GitHub issue (or a new issue with that focus should be opened instead).

I think the Depositor field has allowed only one entry because we thought it would record only the name of the person or organization that created the dataset. But as @rmo-cdsp pointed out, that's not always true. Sometimes it should be the names of the people or organizations that gave the data to the repository, and in that case there are two different groups of people or organizations to consider:

  1. the people or organizations that gave the data to the person or organization that uploads the data
  2. the person or organization that uploads the data

At the very least, I think the description/tooltip of the Depositor field should also be improved to make this dual purpose of the field more clear. As part of https://github.com/IQSS/dataverse/issues/8127 it was changed to be:

The entity, such as a person or organization, that deposited the Dataset in the repository

Is that definition too vague? The name of the term is repeated in its definition, and sometimes that's a sign that the definition could be improved.

Another GitHub issue could be opened to explore improving the use of this field. Since the software already records the account used to upload the data, maybe this field should be used to record only the people or organizations that gave the data to the Dataverse repository. And in that case, the Dataverse software may not need to pre-fill that field with information from the account that was used to create the dataset.

Maybe this could be a topic in a future meeting of the Dataverse Metadata Interest Group.