Correct Software Name metadata in some created datasets

jggautier commented 4 years ago

Many datasets in Harvard Dataverse, including 352 of the 430 datasets in the unpublished dataverse "American Mass Public Opinion in the 1930s and 1940s Dataverse" (https://dataverse.harvard.edu/dataverse/WWII) have a number in their Software Name metadata field.


In the case of "American Mass Public Opinion in the 1930s and 1940s Dataverse", I think Stata was meant to go in the software name (looks like most of the files are Stata files).

It looks to me like there was some problem with the way the metadata was imported using some script or API.

I think we'll need to figure out the scope of this problem:

- Which datasets have incorrect software name metadata because of a bug during Harvard Dataverse Repository's migration from Dataverse version 3 to 4? Which datasets have incorrect software name metadata because of a bug in an API that a depositor used to deposit datasets (anytime after the migration to Dataverse 4)?
- If the problem happened during the Harvard Dataverse Repository migration from Dataverse version 3 to 4 or due to a bug with a Dataverse API, should we correct the mistakes ourselves if we're able to (e.g. when it's easy to figure out what the software name should be), and do we need to contact the depositors in the process? Or should we always contact the depositors to ask them to update their dataset metadata?

jggautier commented 3 years ago

I just updated the first comment I left in this GitHub issue to clarify the problem and questions I think we'll need to answer to resolve this problem. Also queried the database again to get the current number of datasets affected, the collections they're in and when they were created and published (if published): 544 datasets in 26 Dataverse collections (none of the datasets are in the "Root" collection).

To get this info I queried the database for all software name metadata in the most recent versions of all datasets, exported the results to a csv file and in Excel filtered the list down to metadata that contained only numbers or things that were obviously not the names of software.

Query

```
select
  dataverse.name as dataversename,
  dvobject.identifier,
  datasetfield1.parentdatasetfieldcompoundvalue_id,
  datasetfieldvalue.value as softwarename,
  dvobject.createdate,
  dvobject.publicationdate
from datasetfield datasetfield1
join datasetfieldvalue on datasetfieldvalue.datasetfield_id = datasetfield1.id
join datasetfieldcompoundvalue on datasetfieldcompoundvalue.id = datasetfield1.parentdatasetfieldcompoundvalue_id
join datasetfield datasetfield2 on datasetfield2.id = datasetfieldcompoundvalue.parentdatasetfield_id
join datasetversion on datasetversion.id = datasetfield2.datasetversion_id
join datasetfieldtype on datasetfieldtype.id = datasetfield2.datasetfieldtype_id
join dataset on dataset.id = datasetversion.dataset_id
join dvobject on dvobject.id = dataset.id
join dataverse on dataverse.id = dvobject.owner_id
where datasetfield1.datasetfieldtype_id = 69
  and datasetfield1.template_id is null
  and harvestingclient_id is null
  and datasetversion.createtime in (
    select max(datasetversion.createtime) as max
    from datasetversion
    group by datasetversion.dataset_id
  )
```
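The manual Excel filtering described above could also be scripted. Below is a minimal Python sketch, assuming the query results were exported to a CSV with a `softwarename` column, as in the query's aliases. It only catches values made up of digits, dots and spaces; values that are "obviously not the names of software" for other reasons would still need a manual look. The identifiers in the sample data are made up for illustration.

```python
import csv
import io
import re

# Values made up only of digits, dots, and whitespace, e.g. "10" or "9.2".
NUMERIC_ONLY = re.compile(r"[\d.\s]+")


def suspicious_rows(csv_text, column="softwarename"):
    """Return rows whose software name looks like a number or version string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if NUMERIC_ONLY.fullmatch(row[column].strip())]


# Made-up sample rows standing in for the exported query results.
sample = "identifier,softwarename\nDVN/AAA,10\nDVN/BBB,Stata\nDVN/CCC,9.2\n"

for row in suspicious_rows(sample):
    print(row["identifier"], row["softwarename"])
```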

pdurbin commented 3 years ago

To me it looks like the fix from issue 50 above has already been added in https://github.com/IQSS/dataverse/commit/a53c857

jggautier commented 3 years ago

Awesome! So the problem shouldn't happen again.

sbarbosadataverse commented 2 years ago

@jggautier, please get a list of all impacted datasets so we can see the scope of the work to clean this up without help from development.

jggautier commented 2 years ago

The list is in a Google sheet in the Dataverse Google Drive.

There are 527 datasets. 422 of those are unpublished datasets in the "American Mass Public Opinion in the 1930s and 1940s Dataverse" collection at https://dataverse.harvard.edu/dataverse/WWII. "10" is in both the Software name and Software version fields of those 422 datasets. Stata should be in the Software name field instead. It wouldn't be technically hard to use APIs to update the metadata for those unpublished datasets.
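As a rough sketch of what an API update might look like, the Python below builds (but does not send) a request against what I understand to be the native API's `editMetadata` endpoint with `replace=true`. The endpoint path, the `software`/`softwareName`/`softwareVersion` field names, the server URL, the persistent ID and the token are all assumptions here and should be checked against the API guide and the installation's metadata blocks before running anything on real datasets.

```python
import json
import urllib.parse
import urllib.request


def build_software_update(server, pid, api_token, name, version):
    """Build (but do not send) a PUT request replacing a dataset's software field."""
    # Assumed payload shape for the compound "software" field.
    payload = {
        "fields": [
            {
                "typeName": "software",
                "value": [
                    {
                        "softwareName": {"typeName": "softwareName", "value": name},
                        "softwareVersion": {"typeName": "softwareVersion", "value": version},
                    }
                ],
            }
        ]
    }
    query = urllib.parse.urlencode({"persistentId": pid, "replace": "true"})
    # Assumed endpoint; verify against the Dataverse API guide.
    url = f"{server}/api/datasets/:persistentId/editMetadata/?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-Dataverse-key": api_token, "Content-Type": "application/json"},
        method="PUT",
    )


# Placeholder server, DOI, and token, for illustration only.
request = build_software_update(
    "https://dataverse.example.edu", "doi:10.5072/FK2/EXAMPLE", "xxxx", "Stata", "10"
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to review the payloads for all 422 datasets before changing anything.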

The Google sheet includes a table with the counts of datasets by Dataverse collection. To figure out what should go in the software name fields of the other datasets:

- We could explore the types of files in those datasets. The database could be queried for this, too.
- Or if we need/plan to contact the admins of those collections, we could ask them what should go in those fields and even ask them to update the datasets. There are 3 collections with between 20 and 27 datasets each. The other 22 collections have 9 or fewer datasets with a version number/string in both fields.
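For the first option, here is a minimal sketch of guessing the software from a dataset's file names, assuming we already have a list of file names per dataset (e.g. from a database query). The extension-to-software mapping is hypothetical and would need extending for whatever formats actually turn up. Ingested tabular files may also show a different extension than the file that was originally uploaded, so the original format would be worth checking too.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical mapping; extend for the formats actually found in the datasets.
EXTENSION_TO_SOFTWARE = {".dta": "Stata", ".do": "Stata", ".sav": "SPSS", ".r": "R"}


def guess_software(filenames):
    """Return the software most often implied by file extensions, or None."""
    counts = Counter(
        EXTENSION_TO_SOFTWARE[ext]
        for ext in (PurePosixPath(name).suffix.lower() for name in filenames)
        if ext in EXTENSION_TO_SOFTWARE
    )
    return counts.most_common(1)[0][0] if counts else None


# Made-up file names for illustration.
print(guess_software(["survey1.dta", "survey2.dta", "codebook.pdf"]))
```

A guess like this is only a starting point for deciding which datasets can be fixed without contacting the depositor.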

IQSS / dataverse.harvard.edu

Correct Software Name metadata in some created datasets #30