Open suenjedt opened 9 years ago
May have been part of migration process. Found in staging data:
No. | string | # instances |
---|---|---|
1 | null |
16,942 |
2 | Version Date: null |
2,162 |
3 | Subject: null |
15,671 |
3a | Subject: null and Type: null |
12,671 |
4 | /faces/study/EditStudyPage.xhtml?null |
118 |
5 | &studyId=null |
58 |
1 - Select count(id) from datasetfieldvalue where value like '%null%';
2 - Select count(id) from datasetfieldvalue where value like '%Version Date: null%';
3a - Select count(id) from datasetfieldvalue where value like '%Subject: null%' and value like '%Type: null%';
4 - fyi: the links work
5 - fyi: the links work. Full link: http://thedata.harvard.edu/dvn/faces/study/TermsOfUsePage.xhtml?tou=deposit&studyId=null&redirectPage=%2Ffaces%2Fstudy%2FEditStudyPage.xhtml%3Fnull&vdcId=1244
query to help find nulls
SELECT count(id) FROM datasetfieldvalue
WHERE VALUE LIKE '%null%'
AND NOT VALUE LIKE '%Version Date: null%'
AND NOT VALUE LIKE '%Subject: null%'
AND NOT VALUE LIKE '%Type: null%'
AND NOT VALUE LIKE '%EditStudyPage.xhtml?null%';
@scolapasta @eaquigley can this be closed? is any of this related to the "null" search values reported by Liz today?
is any of this related to the "null" search values reported by Liz today?
No, that's #1338
@raprasad it's fantastic that you included the SQL queries you used to fix this bug (find the nulls so we can clean up the data) at https://github.com/IQSS/dataverse/issues/2075#issuecomment-100995851 and I'd like to find a good home for them. Is this out of scope for https://github.com/IQSS/miniverse/tree/master/dv_apps/metrics ? If not, can we copy the SQL queries into a new issue there saying "Detect nulls in dataset metadata" or something? My other thought is that this could be fall under "other administrative tasks" that we add to the Administrative Dashboard mentioned at http://dataverse.org/goals-roadmap-and-releases
@raprasad do you feel like adding your scripts to pull request #3877?
@jggautier - I moved this into dataverse.harvard.edu because there's still a cleanup task here (see the link in the first comment). If you think this is worth our time, let me know.
Thanks @djbrooke. It looks like migration code used to move DDI metadata into Dataverse 4 - maybe during the migration from Dataverse 3 to 4, during some batch uploading of DDI metadata - and harvesting from repositories like ICPSR collapsed the following four metadata fields into Dataverse 4's single Notes field: Version date, Notes Subject, Notes Type, and Notes Text
When the code didn't find a value for one of those four fields, like Notes Subject, it added "null". That the code was doing this was considered a bug and the bug was fixed (https://github.com/IQSS/dataverse/issues/2179).
There are now 274 non-harvested datasets whose latest dataset versions have ": null" in their Notes field:
``` select identifier, createdate, publicationdate, value, dataverse.name from datasetfieldvalue left outer join datasetfield on datasetfieldvalue.datasetfield_id = datasetfield.id left outer join datasetfieldtype on datasetfield.datasetfieldtype_id = datasetfieldtype.id left outer join datasetversion on datasetfield.datasetversion_id = datasetversion.id join dataset on dataset.id = datasetversion.dataset_id join dvobject on dvobject.id = dataset.id join dataverse on dataverse.id = dvobject.owner_id where harvestingclient_id is null and value like '%: null%' and datasetfieldtype.name = 'notesText' and datasetversion.createtime in (select max(datasetversion.createtime) as max from datasetversion group by datasetversion.dataset_id) ```
There are 13,597 harvested datasets in Harvard Dataverse that have ": null" in their Notes fields. 17 datasets harvested from UNC Dataverse in 2016 still have "null" in their Notes fields (e.g. https://hdl.handle.net/1902.29/10984). The rest were imported or harvested from non-Dataverse repositories in 2015 or earlier.
``` select dataverse.name, count(*) from datasetfieldvalue left outer join datasetfield on datasetfieldvalue.datasetfield_id = datasetfield.id left outer join datasetfieldtype on datasetfield.datasetfieldtype_id = datasetfieldtype.id left outer join datasetversion on datasetfield.datasetversion_id = datasetversion.id join dataset on dataset.id = datasetversion.dataset_id join dvobject on dvobject.id = dataset.id join dataverse on dataverse.id = dvobject.owner_id where harvestingclient_id is not null and value like '%null%' and datasetfieldtype.name = 'notesText' group by dataverse.name order by count(*) desc ```
If it's correct that "null" values really used to be blank values, then I think it's noise that might be degrading search results a little and might confuse people who are trying to use the metadata to learn more about the data. Here are a few datasets with "null" in their Notes fields:
Depending on how much time it would take, it might be worth it to remove every instance of "Subject: null", "Version Date: null", "Type: null", and "Notes: ;" from the Notes fields of the 149 published and 125 unpublished datasets in Harvard Dataverse. I wonder if the number of datasets is also small enough, and concentrated in around a dozen dataverses/subdataverses (most are from IFPRI and IRRI dataverses), that we could let the dataverse owners know, and they could edit the fields themselves (as long as owners of the published datasets don't mind creating new minor versions).
This might involve more dev work, but would it be possible to move the Notes Type, Notes Subject, and Notes Text metadata into the Notes fields in the Social Sciences metadata block:
Then we would have to figure out what to do with the Version Dates. I looked at a few datasets and saw that those dates are the same dates already stored in the dataset version table, so they might be redundant, but we would need to check more comprehensively to be sure, or talk to someone who knows or can figure out how the migration code retrieved those Version Dates.
Splitting this into two tickets
Should not show "null" like that in the "notes" field. See example here https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/19465
Reported by @pdurbin