NCEAS / metacat

Data repository software that helps researchers preserve, share, and discover data
https://knb.ecoinformatics.org/software/metacat
GNU General Public License v2.0

Solr Index processor doesn't parse the attributes on the otherEntity on an EML object #1361

Closed by taojing2002 3 years ago

taojing2002 commented 5 years ago

Eric from GRIL reported that the attributes of some EML objects can't be indexed. It turns out that those attributes are under the otherEntity element, while our processor only parses attributes under the dataTable element.

taojing2002 commented 5 years ago

Here is the current XPath for attributeName:

//dataTable/attributeList/attribute/attributeName/text()

Chris suggests it could be:

//attributeList/attribute/attributeName/text()
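To illustrate the difference between the two paths, here is a minimal sketch in Python. The EML fragment and its attribute names (`site_id`, `depth`) are made up for this example, and the code is not Metacat's indexer. Python's `xml.etree.ElementTree` supports only a subset of XPath, so the paths are written relative to the root (`.//`), the trailing `text()` step is dropped, and each element's `.text` is read instead.

```python
import xml.etree.ElementTree as ET

# Made-up EML fragment: one attribute under dataTable, one under otherEntity.
SAMPLE_EML = """
<eml>
  <dataset>
    <dataTable>
      <attributeList>
        <attribute>
          <attributeName>site_id</attributeName>
        </attribute>
      </attributeList>
    </dataTable>
    <otherEntity>
      <attributeList>
        <attribute>
          <attributeName>depth</attributeName>
        </attribute>
      </attributeList>
    </otherEntity>
  </dataset>
</eml>
""".strip()


def names(root, path):
    """Return the text of every element matched by an ElementTree path."""
    return [el.text for el in root.findall(path)]


root = ET.fromstring(SAMPLE_EML)

# Current path: restricted to attributes under dataTable.
current = names(root, ".//dataTable/attributeList/attribute/attributeName")
# Suggested path: any attributeList, whatever entity contains it.
proposed = names(root, ".//attributeList/attribute/attributeName")

print(current)   # ['site_id']
print(proposed)  # ['site_id', 'depth']
```

In this fragment the `//dataTable` prefix misses the attribute under otherEntity, while the shorter path picks up both.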

Currently we have this XPath for attributeUnit:

//dataTable//standardUnit/text() | //dataTable//customUnit/text()

I propose changing it to:

//attributeList/attribute//standardUnit/text() | //attributeList/attribute//customUnit/text()
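The same comparison can be sketched for the unit paths. This is a rough Python illustration on a made-up EML fragment (the unit value `myCustomUnit` is invented), not the indexer's own code. `xml.etree.ElementTree` has no `|` union operator and no `text()` step, so the union is emulated by running each branch separately and concatenating the results; unlike a real XPath union, the combined list is grouped by branch rather than sorted in document order.

```python
import xml.etree.ElementTree as ET

# Made-up EML fragment: a standardUnit under dataTable, a customUnit under otherEntity.
SAMPLE_EML = """
<eml>
  <dataset>
    <dataTable>
      <attributeList>
        <attribute>
          <attributeName>length</attributeName>
          <measurementScale><ratio><unit>
            <standardUnit>meter</standardUnit>
          </unit></ratio></measurementScale>
        </attribute>
      </attributeList>
    </dataTable>
    <otherEntity>
      <attributeList>
        <attribute>
          <attributeName>biomass</attributeName>
          <measurementScale><ratio><unit>
            <customUnit>myCustomUnit</customUnit>
          </unit></ratio></measurementScale>
        </attribute>
      </attributeList>
    </otherEntity>
  </dataset>
</eml>
""".strip()


def union_text(root, *paths):
    """Emulate `a/text() | b/text()` by collecting matches for each path in turn."""
    return [el.text for path in paths for el in root.findall(path)]


root = ET.fromstring(SAMPLE_EML)

# Current expression: units anywhere below a dataTable.
current_units = union_text(
    root, ".//dataTable//standardUnit", ".//dataTable//customUnit"
)
# Proposed expression: units anywhere below any attributeList/attribute.
proposed_units = union_text(
    root,
    ".//attributeList/attribute//standardUnit",
    ".//attributeList/attribute//customUnit",
)

print(current_units)   # ['meter']
print(proposed_units)  # ['meter', 'myCustomUnit']
```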

taojing2002 commented 5 years ago

The full list of fields I think should be modified, with their current values:

eml.attributeName                //dataTable/attributeList/attribute/attributeName/text()
eml.attributeLabel               //dataTable/attributeList/attribute/attributeLabel/text()
eml.attributeDescription         //dataTable/attributeList/attribute/attributeDefinition/text()
eml.attributeUnit                //dataTable//standardUnit/text() | //dataTable//customUnit/text()
eml.attributeTextRoot            //dataTable/attributeList/attribute
eml.attributeName.noDupe         //dataTable/attributeList/attribute/attributeName/text()
eml.attributeLabel.noDupe        //dataTable/attributeList/attribute/attributeLabel/text()
eml.attributeDescription.noDupe  //dataTable/attributeList/attribute/attributeDefinition/text()
eml.attributeUnit.noDupe         //dataTable//standardUnit/text() | //dataTable//customUnit/text()

The new values I propose are:

eml.attributeName                //attributeList/attribute/attributeName/text()
eml.attributeLabel               //attributeList/attribute/attributeLabel/text()
eml.attributeDescription         //attributeList/attribute/attributeDefinition/text()
eml.attributeUnit                //attributeList/attribute//standardUnit/text() | //attributeList/attribute//customUnit/text()
eml.attributeTextRoot            //attributeList/attribute
eml.attributeName.noDupe         //attributeList/attribute/attributeName/text()
eml.attributeLabel.noDupe        //attributeList/attribute/attributeLabel/text()
eml.attributeDescription.noDupe  //attributeList/attribute/attributeDefinition/text()
eml.attributeUnit.noDupe         //attributeList/attribute//standardUnit/text() | //attributeList/attribute//customUnit/text()
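
One way to double-check the two lists against each other is to derive the proposed values mechanically from the current ones. The sketch below is in Python and assumes that two plain string substitutions capture the intended change; the `widen` helper is hypothetical and is not part of Metacat or the DataONE index configuration.

```python
# Current values for the base fields, copied from the first list.
CURRENT = {
    "eml.attributeName": "//dataTable/attributeList/attribute/attributeName/text()",
    "eml.attributeLabel": "//dataTable/attributeList/attribute/attributeLabel/text()",
    "eml.attributeDescription": "//dataTable/attributeList/attribute/attributeDefinition/text()",
    "eml.attributeUnit": "//dataTable//standardUnit/text() | //dataTable//customUnit/text()",
    "eml.attributeTextRoot": "//dataTable/attributeList/attribute",
}

# Proposed values for the same fields, copied from the second list.
PROPOSED = {
    "eml.attributeName": "//attributeList/attribute/attributeName/text()",
    "eml.attributeLabel": "//attributeList/attribute/attributeLabel/text()",
    "eml.attributeDescription": "//attributeList/attribute/attributeDefinition/text()",
    "eml.attributeUnit": "//attributeList/attribute//standardUnit/text() | //attributeList/attribute//customUnit/text()",
    "eml.attributeTextRoot": "//attributeList/attribute",
}

# In both lists each .noDupe field uses the same XPath as its base field.
for table in (CURRENT, PROPOSED):
    for field in ("eml.attributeName", "eml.attributeLabel",
                  "eml.attributeDescription", "eml.attributeUnit"):
        table[field + ".noDupe"] = table[field]


def widen(xpath):
    """Hypothetical helper: drop the dataTable restriction from an XPath."""
    # Descendant form: anchor unit lookups under attributeList/attribute.
    xpath = xpath.replace("//dataTable//", "//attributeList/attribute//")
    # Child form: start directly at attributeList.
    return xpath.replace("//dataTable/attributeList", "//attributeList")


# Fields whose proposed value is not what the substitutions produce.
mismatches = [f for f, x in CURRENT.items() if widen(x) != PROPOSED[f]]
print(mismatches)  # []
```

An empty list means every proposed value is exactly the current value with the `dataTable` step removed, so no field was changed in any other way.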

Please review the change, particularly eml.attributeTextRoot, eml.attributeUnit, and eml.attributeUnit.noDupe.

The full list of fields can be found at: https://repository.dataone.org/software/cicore/trunk/cn-buildout/dataone-cn-index/usr/share/dataone-cn-index/debian/index-generation-context/application-context-eml-base.xml

csjx commented 5 years ago

Hi @taojing2002 - This looks correct to me. Thanks for evaluating the XPATHs.

taojing2002 commented 5 years ago

@csjx Thanks for reviewing.

datadavev commented 5 years ago

Looks correct to me as well.

taojing2002 commented 5 years ago

@datadavev thanks!