NCEAS / metadig

Approaches and tools for Metadata Improvement and Guidance.
Apache License 2.0

dataone SOLR index gives corrupted metadata #24

Closed mbjones closed 9 years ago

mbjones commented 9 years ago

We have a known issue whereby the DataONE SOLR index is mangling some metadata during the indexing process (see issue 6800), and therefore returns erroneous values. This is manifesting itself in our metadata sampling by producing erroneous node identifiers. As a workaround, we should be using the AuthoritativeMN field in the system metadata rather than what is returned from SOLR when creating directories for serializing a metadata document.

Related to issue #2 .

mbjones commented 9 years ago

This now works to eliminate corrupted MN entries from SOLR, but for some reason a stray directory is being created for a subset of the Dryad documents. The result/datadryad.org subtree should be under the result/DRYAD tree in the output below, and it should not have the profile/v3.1 subdirectories. The code appears to be parsing the formatId somehow. Once this is fixed, this bug can be closed.

result/DRYAD
├── Dryad_Metadata_Application_Profile_Version_3.1
│   └── xml
│       ├── 00021-metadata.xml
│       ├── 00022-metadata.xml
│       ├── 00023-metadata.xml
│       ├── 00024-metadata.xml
│       └── 00025-metadata.xml
└── sysmeta
    └── xml
        ├── 00021-sysmeta.xml
        ├── 00022-sysmeta.xml
        ├── 00023-sysmeta.xml
        ├── 00024-sysmeta.xml
        └── 00025-sysmeta.xml
result/datadryad.org
└── profile
    └── v3.1
        ├── Dryad_Metadata_Application_Profile_Version_3.1
        │   └── xml
        │       ├── 00111-metadata.xml
        │       ├── 00112-metadata.xml
        │       └── 00123-metadata.xml
        └── sysmeta
            └── xml
                ├── 00111-sysmeta.xml
                ├── 00112-sysmeta.xml
                └── 00123-sysmeta.xml

amoeba commented 9 years ago

Two things:

First, the result/datadryad.org/profile/v3.1 bug:

This looks like it's caused by a typo at https://github.com/NCEAS/metadig/blob/master/sample-metadata.py#L326

I think that line should be changed from:

node_id_element = meta_xml.find("./formatId")

to

node_id_element = meta_xml.find("./authoritativeMN")

so node_identifier will be set to the authoritativeMN instead of formatId. I'll throw this change up here soon.
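To illustrate the proposed change, here is a minimal sketch of reading the node identifier from a system metadata document with ElementTree. It is not the code from sample-metadata.py: the `node_directory` helper, the sample XML, and the rule of keeping only the last `:`-separated segment as the directory name are all assumptions made for illustration. The element name follows the suggestion above and is left as a parameter in case the actual system metadata uses a different name.

```python
import xml.etree.ElementTree as ET

# Hypothetical system metadata fragment; real documents carry many more
# fields, and the values here are illustrative only.
SAMPLE_SYSMETA = """<systemMetadata>
  <identifier>example-pid</identifier>
  <formatId>http://datadryad.org/profile/v3.1</formatId>
  <authoritativeMN>urn:node:DRYAD</authoritativeMN>
</systemMetadata>"""


def node_directory(sysmeta_xml, tag="authoritativeMN"):
    """Return a directory name derived from the member node element."""
    root = ET.fromstring(sysmeta_xml)
    element = root.find("./" + tag)
    if element is None or not (element.text or "").strip():
        raise ValueError("no <%s> element in system metadata" % tag)
    node_id = element.text.strip()
    # Assumption: identifiers look like "urn:node:DRYAD" and only the last
    # segment is used as the directory name.
    return node_id.rsplit(":", 1)[-1]


print(node_directory(SAMPLE_SYSMETA))  # DRYAD
```

Raising an error when the element is missing, rather than falling back to another field, keeps a wrong lookup from silently creating a directory under an unexpected name.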

Second, the documents with bad Solr index values for authoritativeMN:

I downloaded the entire production Solr index to a CSV and found that only three documents have bogus authoritativeMN values:

http://dx.doi.org/10.5061/dryad.75n5q/5?ver=2015-06-17T15:50:00.307-04:00
http://dx.doi.org/10.5061/dryad.vq33n/11?ver=2015-07-08T09:38:18.171-04:00
http://dx.doi.org/10.5061/dryad.gp23s/1?ver=2015-07-06T14:05:02.521-04:00

which have system meta that live at (in the same order):

https://cn.dataone.org/cn/v1/meta/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.75n5q%2F5%3Fver%3D2015-06-17T15%3A50%3A00.307-04%3A00
https://cn.dataone.org/cn/v1/meta/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.vq33n%2F11%3Fver%3D2015-07-08T09%3A38%3A18.171-04%3A00
https://cn.dataone.org/cn/v1/meta/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.gp23s%2F1%3Fver%3D2015-07-06T14%3A05%3A02.521-04%3A00
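For reference, the system metadata URLs above are the identifier percent-encoded and appended to the CN `meta` endpoint. A small sketch of that mapping follows; the `sysmeta_url` helper name is made up for this example, while the base URL and the identifier are taken from the links above.

```python
from urllib.parse import quote

# Base URL as it appears in the links above.
CN_META_BASE = "https://cn.dataone.org/cn/v1/meta/"


def sysmeta_url(pid):
    """Build the system metadata URL for an identifier (hypothetical helper)."""
    # safe="" makes quote() also encode "/" so the whole identifier stays
    # in a single path segment.
    return CN_META_BASE + quote(pid, safe="")


pid = "http://dx.doi.org/10.5061/dryad.gp23s/1?ver=2015-07-06T14:05:02.521-04:00"
print(sysmeta_url(pid))
```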

The common thread is that they're all Dryad documents. Just thought I'd add that so it's documented somewhere.
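In case anyone wants to repeat the check against a CSV export of the index, something along these lines should work. This is a sketch under stated assumptions, not the script actually used: the column names `id` and `authoritativeMN`, the `find_bad_rows` helper, and the rule that a valid value looks like `urn:node:NAME` are all guesses, and the sample rows are invented.

```python
import csv
import io
import re

# Assumption: a well-formed value looks like "urn:node:NAME".
NODE_ID_PATTERN = re.compile(r"^urn:node:[A-Za-z0-9_]+$")


def find_bad_rows(csv_file, id_column="id", node_column="authoritativeMN"):
    """Return (identifier, value) pairs whose node value looks malformed."""
    reader = csv.DictReader(csv_file)
    return [
        (row[id_column], row[node_column])
        for row in reader
        if not NODE_ID_PATTERN.match(row[node_column] or "")
    ]


# Invented sample rows standing in for the real export.
sample = io.StringIO(
    "id,authoritativeMN\n"
    "doc-1,urn:node:DRYAD\n"
    "doc-2,urn:node:KNB\n"
    "doc-3,not a node id\n"
)
for pid, value in find_bad_rows(sample):
    print(pid, repr(value))
```

For a real export, pass an open file handle instead of the `io.StringIO` sample and adjust the column names to match the CSV header.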