IQSS / dataverse.harvard.edu

Custom code for dataverse.harvard.edu and an issue tracker for the IQSS Dataverse team's operational work, for better tracking on https://github.com/orgs/IQSS/projects/34
5 stars 1 forks source link

Dataset Metadata - Notes showing "nullVersion Text" (fix current data) #27

Open suenjedt opened 9 years ago

suenjedt commented 9 years ago

Splitting this into two tickets

  1. This ticket: Fix current data
  2. New ticket: Update migration script #2179

Should not show "null" like that in the "notes" field. See example here https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/19465 screen shot 2015-04-23 at 12 20 25

Reported by @pdurbin

raprasad commented 9 years ago

May have been part of migration process. Found in staging data:

No. string # instances
1 null 16,942
2 Version Date: null 2,162
3 Subject: null 15,671
3a Subject: null and Type: null 12,671
4 /faces/study/EditStudyPage.xhtml?null 118
5 &studyId=null 58

1 - Select count(id) from datasetfieldvalue where value like '%null%'; 2 - Select count(id) from datasetfieldvalue where value like '%Version Date: null%'; 3a - Select count(id) from datasetfieldvalue where value like '%Subject: null%' and value like '%Type: null%'; 4 - fyi: the links work 5 - fyi: the links work. Full link: http://thedata.harvard.edu/dvn/faces/study/TermsOfUsePage.xhtml?tou=deposit&studyId=null&redirectPage=%2Ffaces%2Fstudy%2FEditStudyPage.xhtml%3Fnull&vdcId=1244

query to help find nulls

SELECT count(id) FROM datasetfieldvalue 
WHERE VALUE LIKE '%null%' 
AND NOT VALUE LIKE '%Version Date: null%' 
AND NOT  VALUE LIKE '%Subject: null%' 
AND NOT VALUE LIKE '%Type: null%'
AND NOT VALUE LIKE '%EditStudyPage.xhtml?null%';
sbarbosadataverse commented 9 years ago

@scolapasta @eaquigley can this be closed? is any of this related to the "null" search values reported by Liz today?

pdurbin commented 9 years ago

is any of this related to the "null" search values reported by Liz today?

No, that's #1338

pdurbin commented 7 years ago

@raprasad it's fantastic that you included the SQL queries you used to fix this bug (find the nulls so we can clean up the data) at https://github.com/IQSS/dataverse/issues/2075#issuecomment-100995851 and I'd like to find a good home for them. Is this out of scope for https://github.com/IQSS/miniverse/tree/master/dv_apps/metrics ? If not, can we copy the SQL queries into a new issue there saying "Detect nulls in dataset metadata" or something? My other thought is that this could be fall under "other administrative tasks" that we add to the Administrative Dashboard mentioned at http://dataverse.org/goals-roadmap-and-releases

pdurbin commented 7 years ago

@raprasad do you feel like adding your scripts to pull request #3877?

djbrooke commented 5 years ago

@jggautier - I moved this into dataverse.harvard.edu because there's still a cleanup task here (see the link in the first comment). If you think this is worth our time, let me know.

jggautier commented 5 years ago

Thanks @djbrooke. It looks like migration code used to move DDI metadata into Dataverse 4 - maybe during the migration from Dataverse 3 to 4, during some batch uploading of DDI metadata - and harvesting from repositories like ICPSR collapsed the following four metadata fields into Dataverse 4's single Notes field: Version date, Notes Subject, Notes Type, and Notes Text

When the code didn't find a value for one of those four fields, like Notes Subject, it added "null". That the code was doing this was considered a bug and the bug was fixed (https://github.com/IQSS/dataverse/issues/2179).

There are now 274 non-harvested datasets whose latest dataset versions have ": null" in their Notes field:

SQL query

``` select identifier, createdate, publicationdate, value, dataverse.name from datasetfieldvalue left outer join datasetfield on datasetfieldvalue.datasetfield_id = datasetfield.id left outer join datasetfieldtype on datasetfield.datasetfieldtype_id = datasetfieldtype.id left outer join datasetversion on datasetfield.datasetversion_id = datasetversion.id join dataset on dataset.id = datasetversion.dataset_id join dvobject on dvobject.id = dataset.id join dataverse on dataverse.id = dvobject.owner_id where harvestingclient_id is null and value like '%: null%' and datasetfieldtype.name = 'notesText' and datasetversion.createtime in (select max(datasetversion.createtime) as max from datasetversion group by datasetversion.dataset_id) ```

There are 13,597 harvested datasets in Harvard Dataverse that have ": null" in their Notes fields. 17 datasets harvested from UNC Dataverse in 2016 still have "null" in their Notes fields (e.g. https://hdl.handle.net/1902.29/10984). The rest were imported or harvested from non-Dataverse repositories in 2015 or earlier.

SQL query

``` select dataverse.name, count(*) from datasetfieldvalue left outer join datasetfield on datasetfieldvalue.datasetfield_id = datasetfield.id left outer join datasetfieldtype on datasetfield.datasetfieldtype_id = datasetfieldtype.id left outer join datasetversion on datasetfield.datasetversion_id = datasetversion.id join dataset on dataset.id = datasetversion.dataset_id join dvobject on dvobject.id = dataset.id join dataverse on dataverse.id = dvobject.owner_id where harvestingclient_id is not null and value like '%null%' and datasetfieldtype.name = 'notesText' group by dataverse.name order by count(*) desc ```

If it's correct that "null" values really used to be blank values, then I think it's noise that might be degrading search results a little and might confuse people who are trying to use the metadata to learn more about the data. Here are a few datasets with "null" in their Notes fields:

Depending on how much time it would take, it might be worth it to remove every instance of "Subject: null", "Version Date: null", "Type: null", and "Notes: ;" from the Notes fields of the 149 published and 125 unpublished datasets in Harvard Dataverse. I wonder if the number of datasets is also small enough, and concentrated in around a dozen dataverses/subdataverses (most are from IFPRI and IRRI dataverses), that we could let the dataverse owners know, and they could edit the fields themselves (as long as owners of the published datasets don't mind creating new minor versions).

This might involve more dev work, but would it be possible to move the Notes Type, Notes Subject, and Notes Text metadata into the Notes fields in the Social Sciences metadata block:

Screen Shot 2019-07-23 at 2 31 08 PM

Then we would have to figure out what to do with the Version Dates. I looked at a few datasets and saw that those dates are the same dates already stored in the dataset version table, so they might be redundant, but we would need to check more comprehensively to be sure, or talk to someone who knows or can figure out how the migration code retrieved those Version Dates.