Closed mbjones closed 9 years ago
This is now working to eliminate corrupted MN entries from SOLR, but for some reason a stray directory is being created for a subset of the Dryad documents. The result/datadryad.org subtree should be under the result/DRYAD tree in the output below, and it should not have the profile/v3.1 subdirectories. It appears to be parsing the formatId somehow. Once this is fixed, this bug can be closed.
result/DRYAD
├── Dryad_Metadata_Application_Profile_Version_3.1
│   └── xml
│       ├── 00021-metadata.xml
│       ├── 00022-metadata.xml
│       ├── 00023-metadata.xml
│       ├── 00024-metadata.xml
│       └── 00025-metadata.xml
└── sysmeta
    └── xml
        ├── 00021-sysmeta.xml
        ├── 00022-sysmeta.xml
        ├── 00023-sysmeta.xml
        ├── 00024-sysmeta.xml
        └── 00025-sysmeta.xml
result/datadryad.org
└── profile
    └── v3.1
        ├── Dryad_Metadata_Application_Profile_Version_3.1
        │   └── xml
        │       ├── 00111-metadata.xml
        │       ├── 00112-metadata.xml
        │       └── 00123-metadata.xml
        └── sysmeta
            └── xml
                ├── 00111-sysmeta.xml
                ├── 00112-sysmeta.xml
                └── 00123-sysmeta.xml
Two things:
First, the result/datadryad.org/profile/v3.1 bug:
This looks like it's caused by a typo at https://github.com/NCEAS/metadig/blob/master/sample-metadata.py#L326
I think that line should be changed from:
node_id_element = meta_xml.find("./formatId")
to
node_id_element = meta_xml.find("./authoritativeMN")
so node_identifier will be set to the authoritativeMN instead of the formatId. I'll throw this change up here soon.
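To illustrate why reading the wrong element would produce the stray directory, here is a minimal sketch. The sample record, its values, and the node_dir helper are assumptions for illustration only; the element names follow the snippet above, and the real script may derive the directory differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down record. The values are assumed examples,
# not copied from a real system metadata document.
SAMPLE = """<systemMetadata>
  <formatId>http://datadryad.org/profile/v3.1</formatId>
  <authoritativeMN>urn:node:DRYAD</authoritativeMN>
</systemMetadata>"""


def node_dir(value):
    """Hypothetical helper: strip a known prefix to get a relative directory name."""
    for prefix in ("urn:node:", "http://", "https://"):
        if value.startswith(prefix):
            return value[len(prefix):]
    return value


meta_xml = ET.fromstring(SAMPLE)

# Reading formatId (the suspected typo) yields a nested, URL-shaped path.
wrong = node_dir(meta_xml.find("./formatId").text)

# Reading authoritativeMN yields the node name.
right = node_dir(meta_xml.find("./authoritativeMN").text)

print("result/" + wrong)
print("result/" + right)
```

Under these assumptions, the first path matches the stray result/datadryad.org/profile/v3.1 subtree and the second matches the expected result/DRYAD tree.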
Second, the documents with bad Solr index values for authoritativeMN:
I downloaded the entire production Solr index to a CSV and found that only three documents have bogus authoritativeMN values:
http://dx.doi.org/10.5061/dryad.75n5q/5?ver=2015-06-17T15:50:00.307-04:00
http://dx.doi.org/10.5061/dryad.vq33n/11?ver=2015-07-08T09:38:18.171-04:00
http://dx.doi.org/10.5061/dryad.gp23s/1?ver=2015-07-06T14:05:02.521-04:00
which have system metadata that lives at (in the same order):
https://cn.dataone.org/cn/v1/meta/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.75n5q%2F5%3Fver%3D2015-06-17T15%3A50%3A00.307-04%3A00
https://cn.dataone.org/cn/v1/meta/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.vq33n%2F11%3Fver%3D2015-07-08T09%3A38%3A18.171-04%3A00
https://cn.dataone.org/cn/v1/meta/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.gp23s%2F1%3Fver%3D2015-07-06T14%3A05%3A02.521-04%3A00
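The system metadata URLs above are the identifiers percent-encoded and appended to the CN meta endpoint. A small sketch of that mapping, where the sysmeta_url function name is made up for illustration:

```python
from urllib.parse import quote

# Base taken from the URLs listed above.
CN_META_BASE = "https://cn.dataone.org/cn/v1/meta/"


def sysmeta_url(pid):
    """Build the system metadata URL for an identifier.

    safe="" makes quote() also encode "/", which it leaves alone by default.
    """
    return CN_META_BASE + quote(pid, safe="")


pid = "http://dx.doi.org/10.5061/dryad.gp23s/1?ver=2015-07-06T14:05:02.521-04:00"
print(sysmeta_url(pid))
```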
The common thread is that they're all Dryad documents. Just thought I'd add that so it's documented somewhere.
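For anyone who wants to repeat the check, a rough sketch of scanning a CSV export for suspicious values follows. The column names (id, authoritativeMN) and the rule that a well-formed value starts with urn:node: are assumptions, not something confirmed in this thread, and the rows in the demo are made up.

```python
import csv
import io


def find_bogus_mn(csv_file, prefix="urn:node:"):
    """Return the ids of rows whose authoritativeMN does not start with prefix.

    Assumes the export has "id" and "authoritativeMN" columns.
    """
    reader = csv.DictReader(csv_file)
    return [
        row["id"]
        for row in reader
        if not (row.get("authoritativeMN") or "").startswith(prefix)
    ]


# Made-up rows, only to show the shape of the input.
demo = io.StringIO(
    "id,authoritativeMN\n"
    "doc-1,urn:node:DRYAD\n"
    "doc-2,not-a-node-id\n"
    "doc-3,\n"
)
bogus = find_bogus_mn(demo)
print(bogus)
```

With a real export, pass an open file handle instead of the StringIO demo, for example open("export.csv", newline="").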
We have a known issue in which the DataONE SOLR index mangles some metadata during indexing (see issue 6800) and therefore returns erroneous values. In our metadata sampling, this shows up as erroneous node identifiers. As a workaround, we should use the authoritativeMN field in the system metadata, rather than the value returned from SOLR, when creating directories for serializing a metadata document.
Related to issue #2.