IQSS / dataverse.harvard.edu

Custom code for dataverse.harvard.edu and an issue tracker for the IQSS Dataverse team's operational work, for better tracking on https://github.com/orgs/IQSS/projects/34

Correct Software Name metadata in some created datasets #30

Open jggautier opened 4 years ago

jggautier commented 4 years ago

Many datasets in Harvard Dataverse, including 352 of the 430 datasets in the unpublished dataverse "American Mass Public Opinion in the 1930s and 1940s Dataverse" (https://dataverse.harvard.edu/dataverse/WWII) have a number in their Software Name metadata field.

In the case of "American Mass Public Opinion in the 1930s and 1940s Dataverse", I think Stata was meant to go in the software name (looks like most of the files are Stata files).

It looks to me like there was some problem with the way the metadata was imported using some script or API.

I think we'll need to figure out the scope of this problem:

jggautier commented 4 years ago

Related https://github.com/ualbertalib/dataverse/issues/50

jggautier commented 3 years ago

I just updated the first comment I left in this GitHub issue to clarify the problem and the questions I think we'll need to answer to resolve it. I also queried the database again to get the current number of datasets affected, the collections they're in, and when they were created and published (if published): 544 datasets in 26 Dataverse collections (none of the datasets are in the "Root" collection).

To get this info, I queried the database for all software name metadata in the most recent versions of all datasets, exported the results to a CSV file, and in Excel filtered the list down to values that contained only numbers or were obviously not the names of software.
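A minimal sketch of the numbers-only part of that filtering step, in Python rather than Excel. It assumes the export keeps the `softwarename` column alias from the query; the sample rows and identifiers are made up for illustration, and values that are "obviously not software" for other reasons would still need a manual review.

```python
import csv
import io
import re

# Assumption: a value made up only of digits, dots, and whitespace
# is not a real software name (for example "10" or "12.1").
NUMERIC_ONLY = re.compile(r"[\d.\s]+")


def is_suspect(value: str) -> bool:
    """Return True when the stripped value contains only digits, dots, and spaces."""
    return NUMERIC_ONLY.fullmatch(value.strip()) is not None


def suspect_rows(csv_text: str) -> list[dict]:
    """Return the rows of the exported CSV whose softwarename looks numeric."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if is_suspect(row["softwarename"])]


# Hypothetical sample data standing in for the real export.
SAMPLE_CSV = """dataversename,identifier,softwarename
Example Collection,FK2/AAAAAA,10
Example Collection,FK2/BBBBBB,Stata
Example Collection,FK2/CCCCCC,12.1
"""

suspects = suspect_rows(SAMPLE_CSV)
for row in suspects:
    print(row["dataversename"], row["identifier"], row["softwarename"])
```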

Query
```
select
    dataverse.name as dataversename,
    dvobject.identifier,
    datasetfield1.parentdatasetfieldcompoundvalue_id,
    datasetfieldvalue.value as softwarename,
    dvobject.createdate,
    dvobject.publicationdate
from datasetfield datasetfield1
join datasetfieldvalue on datasetfieldvalue.datasetfield_id = datasetfield1.id
join datasetfieldcompoundvalue on datasetfieldcompoundvalue.id = datasetfield1.parentdatasetfieldcompoundvalue_id
join datasetfield datasetfield2 on datasetfield2.id = datasetfieldcompoundvalue.parentdatasetfield_id
join datasetversion on datasetversion.id = datasetfield2.datasetversion_id
join datasetfieldtype on datasetfieldtype.id = datasetfield2.datasetfieldtype_id
join dataset on dataset.id = datasetversion.dataset_id
join dvobject on dvobject.id = dataset.id
join dataverse on dataverse.id = dvobject.owner_id
where datasetfield1.datasetfieldtype_id = 69
    and datasetfield1.template_id is null
    and harvestingclient_id is null
    and datasetversion.createtime in (
        select max(datasetversion.createtime) as max
        from datasetversion
        group by datasetversion.dataset_id
    )
```

pdurbin commented 3 years ago

To me it looks like the fix from issue 50 above has already been added in https://github.com/IQSS/dataverse/commit/a53c857

jggautier commented 3 years ago

Awesome! So the problem shouldn't happen again.

sbarbosadataverse commented 2 years ago

@jggautier, please get a list of all datasets impacted so we can see the scope of the work to clean this up without help from development.

jggautier commented 2 years ago

The list is in a Google sheet in the Dataverse Google Drive.

There are 527 datasets. 422 of those are unpublished datasets in the "American Mass Public Opinion in the 1930s and 1940s Dataverse" collection at https://dataverse.harvard.edu/dataverse/WWII. "10" is in both the Software name and Software version fields of those 422 datasets. Stata should be in the Software name field instead. It wouldn't be technically hard to use APIs to update the metadata for those unpublished datasets.
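A rough sketch of what such an API update might look like, assuming the Dataverse native API's `editMetadata` endpoint with `replace=true` and assuming the citation block field names `software`, `softwareName`, and `softwareVersion`. The persistent ID and token below are placeholders, the helper names are hypothetical, and how `replace=true` behaves on a compound field that allows multiple values should be checked against a test dataset first. The code only builds the request and does not send it.

```python
import json
import urllib.parse
import urllib.request


def build_software_payload(name, version=None):
    """Build the JSON body for one software entry (field names are assumptions)."""
    entry = {"softwareName": {"typeName": "softwareName", "value": name}}
    if version is not None:
        entry["softwareVersion"] = {"typeName": "softwareVersion", "value": version}
    return {"fields": [{"typeName": "software", "value": [entry]}]}


def build_request(server_url, persistent_id, api_token, payload):
    """Prepare (but do not send) a PUT request to the editMetadata endpoint."""
    query = urllib.parse.urlencode({"persistentId": persistent_id, "replace": "true"})
    url = f"{server_url}/api/datasets/:persistentId/editMetadata?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={"X-Dataverse-key": api_token, "Content-Type": "application/json"},
    )


# Placeholder values; a real run would loop over the identifiers in the sheet.
payload = build_software_payload("Stata")
request = build_request(
    "https://dataverse.harvard.edu",
    "doi:10.5072/FK2/EXAMPLE",  # hypothetical persistent ID
    "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",  # placeholder API token
    payload,
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Keeping the payload builder separate from the request makes it easy to review the JSON for a few datasets before anything is written to the installation.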

The Google sheet includes a table with the counts of datasets by Dataverse collection. To figure out what should go in the software name fields of the other datasets: