DataONEorg / d1_cn_index_processor

The CN index processor component
0 stars 1 forks source link

Re-apply previously overwritten changes to XPath for ISOTC211 origin field #7

Closed amoeba closed 3 years ago

amoeba commented 3 years ago

In late 2017, Axiom pointed out what ended up being a bug in the XPath we use to populate the origin field in our ISOTC211 (ISO19115) indexing component. See https://redmine.dataone.org/issues/8165. The gist of the problem is that we were pulling any ResponsibleParty from the document and treating them as creators/authors which was too broad.

A fix was applied at some point and I can even see in Redmine 8504 that it was still present in 2018 but it ended up reverted at some point and the current value in the 2.3 branch is the original XPath.

I'm really not sure what happened but I think we should:

The fixed XPath was:

//gmd:identificationInfo/gmd:MD_DataIdentification/gmd:citation/gmd:CI_Citation/gmd:citedResponsibleParty/gmd:CI_ResponsibleParty[gmd:role/gmd:CI_RoleCode/text() = "owner" or gmd:role/gmd:CI_RoleCode/text() = "originator" or gmd:role/gmd:CI_RoleCode/text() = "principalInvestigator" or gmd:role/gmd:CI_RoleCode/text() = "author"]/gmd:individualName/*/text() | //gmd:identificationInfo/gmd:MD_DataIdentification/gmd:citation/gmd:CI_Citation/gmd:citedResponsibleParty/gmd:CI_ResponsibleParty[(gmd:role/gmd:CI_RoleCode/text() = "owner" or gmd:role/gmd:CI_RoleCode/text() = "originator" or gmd:role/gmd:CI_RoleCode/text() ="principalInvestigator" or gmd:role/gmd:CI_RoleCode/text() = "author") and (not(gmd:individualName) or gmd:individualName[@gco:nilReason = "missing"])]/gmd:organisationName/*/text()

amoeba commented 3 years ago

I think things are a bit wonky with the svn->git migration @datadavev. It looks like you already changed the XPath in https://github.com/DataONEorg/d1_cn_index_processor/commit/7d30965e0796f84f520a4b2d7a00c8cd1d1c41cf which is on develop (though the tests weren't updated). And I see new work has been going on develop_2.3. The fix is only on develop but not on main or develop_2.3. It looks like main and develop are fairly out of sync.

Would you mind taking a look?

datadavev commented 3 years ago

develop is set for 2.4-SNAPSHOT and includes a number of changes made by Rob that were not merged into main or the 2.3 stream. The develop_2.3 branch is the branch that is to be merged into main when a release is ready (next one is 2.3.13) until we integrate changes from 2.4 and move on to that stream for further development.

So... 2.3-develop should be up to date with respect to changes to be made for the next patch release of 2.3.x. I don't recall the specifics of 7d30965e0796f84f520a4b2d7a00c8cd1d1c41cf but that is consistent with the history in subversion, with a change being made there on trunk only.

One option for this fix is to create a bug fix branch at 01b053b3506ec0fe6465eba8a72a11630f54217e and merge that to develop_2.3 and main for a bug fix release. That way it will not be dependent on getting the schema.org indexing completed at the same time.

amoeba commented 3 years ago

Thanks @datadavev , that makes sense.

One option for this fix is to create a bug fix branch at 01b053b and merge that to develop_2.3 and main for a bug fix release. That way it will not be dependent on getting the schema.org indexing completed at the same time.

Seems fine to me. I'll get that done shortly.

amoeba commented 3 years ago

Okay, branch made and both merges made. I'm going to leave this open since we still need to release, re-index, and coordinate with Axiom.

amoeba commented 3 years ago

Hey @datadavev, is there anything I can do to help move this along at this point? I'd like to be able to get back to Axiom with a time estimate.

amoeba commented 3 years ago

We talked over this today and @taojing2002 can help patching and reindexing content since I don't have privs. We're going to start with me getting a list of affected content so we know how much reindexing is needed. Since this has been in the air for a while, I'd really like to get this done in the next two weeks. @taojing2002 would that timeline work for you?

amoeba commented 3 years ago

Before patching and reindexing, I wanted to make sure this wouldn't cause any havoc with any existing RW docs. To do that, I ran the head versions of every ResearchWorkspace metadata record through indexing using the updated/correct XPath for the origin field and produced a set of deltas for Chris Turner at Axiom to look over. A lot of content didn't change and the tens (~25%) had fewer creators than before and I didn't find anything that looked odd.

@taojing2002 and I hot-patched stage and production last week and reindexed the above PIDs. I found one difference between my test setup and the result we saw on the CN that resulted in three of the reindexed RW docs being creator-less. I confirmed this was the correct indexing behavior and that my test setup must be a bit wonky. I let Chris Turner know about the three docs and he said he can push updates. Research Workspace should be all set for now.

I also added a test specifically with a Research Workspace doc since it's another variant of ISO we have a lot of content for. See https://github.com/DataONEorg/d1_cn_index_processor/commit/d87c6a709808eb95daae1de9b3ddd334234a441c.