Open jggautier opened 4 years ago
I just updated the first comment I left in this GitHub issue to clarify the problem and the questions I think we'll need to answer to resolve it. I also queried the database again to get the current number of datasets affected, the collections they're in, and when they were created and published (if published): 544 datasets in 26 Dataverse collections (none of the datasets are in the "Root" collection).
To get this info, I queried the database for all software name metadata in the most recent versions of all datasets, exported the results to a CSV file, and in Excel filtered the list down to metadata that contained only numbers or things that were obviously not the names of software.
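For anyone who wants to repeat the filtering step without Excel, here is a rough Python sketch. It is not what was actually run. The column names `persistent_id` and `software_name` and the sample rows are made up for illustration, and it only catches values made of digits, dots, and whitespace; entries that are "obviously not the names of software" for other reasons would still need a manual look.

```python
import csv
import io
import re

# Values made only of digits, dots, and whitespace (e.g. "10", "9.4") are suspect.
NUMERIC_ONLY = re.compile(r"^[\d.\s]+$")


def suspicious_rows(csv_text, name_column="software_name"):
    """Return rows whose software name contains only digits, dots, or whitespace.

    `name_column` is a hypothetical column name; adjust it to match the export.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if NUMERIC_ONLY.match(row[name_column].strip())]


# Made-up sample standing in for the exported CSV file.
sample = """persistent_id,software_name
doi:10.0000/EXAMPLE1,10
doi:10.0000/EXAMPLE2,Stata
doi:10.0000/EXAMPLE3,9.4
"""

flagged = suspicious_rows(sample)
for row in flagged:
    print(row["persistent_id"], row["software_name"])
```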
To me it looks like the fix from issue 50 above has already been added in https://github.com/IQSS/dataverse/commit/a53c857
Awesome! So the problem shouldn't happen again.
@jggautier if you get a list of all datasets impacted, we can see the scope of the work to clean this up without help from development.
The list is in a Google sheet in the Dataverse Google Drive.
There are 527 datasets. 422 of those are unpublished datasets in the "American Mass Public Opinion in the 1930s and 1940s Dataverse" collection at https://dataverse.harvard.edu/dataverse/WWII. "10" is in the Software name and version fields of those 422 datasets. Stata should be in the Software name field, instead. It wouldn't be technically hard to use APIs to update the metadata for those unpublished datasets.
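As a minimal sketch of what that API update might look like, assuming the native API's `editMetadata` endpoint with `replace=true` and assuming the citation block's compound `software` field with `softwareName` and `softwareVersion` children: the helper name `build_edit_request`, the DOI, and the server URL below are placeholders, and the exact payload should be checked against the API guide before running anything against real datasets. The code only builds the request; sending it (with an `X-Dataverse-key` header) is left out on purpose.

```python
import json
from urllib.parse import urlencode


def build_edit_request(server_url, persistent_id, software_name, software_version):
    """Build the URL and JSON body for a metadata edit (hypothetical helper).

    Assumes PUT {server}/api/datasets/:persistentId/editMetadata with replace=true.
    Nothing is sent over the network here.
    """
    query = urlencode({"persistentId": persistent_id, "replace": "true"})
    url = f"{server_url}/api/datasets/:persistentId/editMetadata?{query}"
    body = {
        "fields": [
            {
                "typeName": "software",  # assumed compound field name
                "value": [
                    {
                        "softwareName": {
                            "typeName": "softwareName",
                            "value": software_name,
                        },
                        "softwareVersion": {
                            "typeName": "softwareVersion",
                            "value": software_version,
                        },
                    }
                ],
            }
        ]
    }
    return url, json.dumps(body)


# Placeholder DOI; the real ones are in the Google sheet.
url, body = build_edit_request(
    "https://dataverse.harvard.edu", "doi:10.0000/EXAMPLE1", "Stata", "10"
)
print(url)
print(body)
```

Since the 422 datasets are unpublished, trying this on one draft first and checking the result in the UI would be a cheap way to confirm the payload shape before looping over the rest.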
The Google sheet includes a table with the counts of datasets by Dataverse collection. To figure out what should go in the software name fields of the other datasets:
Many datasets in Harvard Dataverse, including 352 of the 430 datasets in the unpublished dataverse "American Mass Public Opinion in the 1930s and 1940s Dataverse" (https://dataverse.harvard.edu/dataverse/WWII) have a number in their Software Name metadata field.
In the case of "American Mass Public Opinion in the 1930s and 1940s Dataverse", I think Stata was meant to go in the software name (looks like most of the files are Stata files).
It looks to me like there was some problem with the way the metadata was imported using some script or API.
I think we'll need to figure out the scope of this problem: