DataONEorg / dataone

DataONE information and general-purpose issue tracking
Apache License 2.0

Investigate EML semantic annotation indexing issues on the CNs #15

Open amoeba opened 2 years ago

amoeba commented 2 years ago

@mbjones and @taojing2002 saw some errors reported in the CN indexing logs. @taojing2002 and I looked and couldn't find the reported errors, but I decided to go ahead and verify that there weren't any issues.

I started by assuming the ADC indexing was working and queried the ADC Solr index for documents with semantic annotations (n=1351). I then checked, for each object, whether it (1) existed on the CN, (2) was indexed on the CN, and (3) if indexed, had annotations.

| Status | Count |
|---|---|
| Not found | 1 |
| Not indexed | 58 |
| Indexed, but missing annotations | 11 |
| Indexed, w/ annotations | 1281 |

We should manually harvest the one "Not found" object and manually reindex the 58 and 11 above.
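The three-step check described above can be sketched roughly as follows. This is a minimal illustration, not the script that was actually run: the CN base URL, the `sem_annotation` field name, and the helper names `meta_url`, `classify`, and `summarize` are all assumptions, and the lookups themselves (does the object exist on the CN, what does its CN Solr document look like) are left to the caller.

```python
from collections import Counter
from typing import Iterable, Mapping, Optional, Tuple
from urllib.parse import quote

# Assumptions for illustration only: adjust to the environment being checked.
CN_BASE = "https://cn.dataone.org/cn/v2"
ANNOTATION_FIELD = "sem_annotation"


def meta_url(pid: str, base: str = CN_BASE) -> str:
    """Build a system metadata URL for a PID (hypothetical helper).

    PIDs such as DOIs contain ':' and '/', so they are percent-encoded
    before being placed in the URL path.
    """
    return f"{base}/meta/{quote(pid, safe='')}"


def classify(exists_on_cn: bool, cn_index_doc: Optional[Mapping]) -> str:
    """Map the outcome of the three checks onto one of the four statuses."""
    if not exists_on_cn:
        return "Not found"
    if cn_index_doc is None:
        return "Not indexed"
    # A missing or empty annotation field both count as "no annotations".
    if not cn_index_doc.get(ANNOTATION_FIELD):
        return "Indexed, but missing annotations"
    return "Indexed, w/ annotations"


def summarize(results: Iterable[Tuple[bool, Optional[Mapping]]]) -> Counter:
    """Count statuses over (exists_on_cn, cn_index_doc) pairs."""
    return Counter(classify(exists, doc) for exists, doc in results)
```

Keeping the classification separate from the network lookups means the same function can be reused to verify objects again after they have been reindexed.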

Full details, with PIDs

```
NOT FOUND
These need to get harvested

urn:uuid:a6bbe9d0-c281-4402-bf88-4f3c52c66fda

NOT INDEXED
These need to get indexed

doi:10.18739/A2VM42X50
doi:10.18739/A2P26Q37R
doi:10.18739/A2DN3ZW19
doi:10.18739/A2513TW18
doi:10.18739/A23B5W79P
doi:10.18739/A27S7HS4C
doi:10.18739/A2MC8RG5C
doi:10.18739/A2WH2DF2X
doi:10.18739/A2J960930
doi:10.18739/A25717N6T
doi:10.18739/A2416T00Z
doi:10.18739/A21834279
doi:10.18739/A2S46H60V
doi:10.18739/A2ZP3W08H
doi:10.18739/A2P55DG9K
doi:10.18739/A2CV4BR73
doi:10.18739/A20G3GZ2F
doi:10.18739/A24746R6K
doi:10.18739/A2VX06340
doi:10.18739/A2M61BQ0R
doi:10.18739/A2CJ87K8H
doi:10.18739/A2DJ58G97
doi:10.18739/A2086356K
doi:10.18739/A2930NV40
doi:10.18739/A24B2X49X
doi:10.18739/A2SX64931
doi:10.18739/A2599Z20Q
doi:10.18739/A2M32N96B
doi:10.18739/A2RX93D3S
doi:10.18739/A2FJ29C9G
doi:10.18739/A23775V6B
doi:10.18739/A2QZ22H33
doi:10.18739/A2QV3C418
doi:10.18739/A2WP9T67J
doi:10.18739/A2DV1CN8T
doi:10.18739/A21G0HV2P
doi:10.18739/A2BV79V7V
urn:uuid:69a40625-277a-4793-aa10-f148332d2456
doi:10.18739/A2N58CK9B
doi:10.18739/A2WS8HM1Z
doi:10.18739/A2707WP0Q
doi:10.18739/A21J9776B
doi:10.18739/A2CN6Z02Z
doi:10.18739/A2NC5SC63
doi:10.18739/A2HD7NS6P
doi:10.18739/A2GM81P1M
doi:10.18739/A24B2X59C
doi:10.18739/A2VQ2S97V
doi:10.1594/PANGAEA.779181
urn:uuid:29fbd2eb-3319-46ed-b416-3638ec020571
urn:uuid:40b5819c-a8d8-4f82-a9c4-ce2ec6cec1f0
urn:uuid:4cc06919-9562-4b4b-af99-fe524f118181
urn:uuid:d59a7b20-5704-4d37-9ee1-78e7a2e78982
urn:uuid:445deff9-b8cb-4023-8d7e-52802e429358
urn:uuid:b9b256da-0a15-459c-9b5a-36195e0dbb59
urn:uuid:b59de2d0-8531-456f-ab1a-dd009df9c844
urn:uuid:1c6521de-e47e-46ca-b9c8-d3910fe1fa9c
urn:uuid:02022a31-97b5-4178-b692-6d2a77c120eb

INDEXED, BUT NO ANNOTATIONS
These need reindexing and verification after they're reindexed

doi:10.18739/A28W3827B
doi:10.18739/A2319S30Q
doi:10.18739/A2N29P67H
doi:10.18739/A2DF6K36X
doi:10.18739/A2ZG6G71C
doi:10.18739/A20000081
doi:10.18739/A2VM42Z20
urn:uuid:72088082-251e-48f7-be83-9dd7508177e1
urn:uuid:09e1cd68-2209-4f42-ab41-291db507effa
doi:10.18739/A2T14TQ57
doi:10.18739/A2445HD27
```

amoeba commented 2 years ago

@taojing2002 can we work together to reharvest and reindex the PIDs above?

amoeba commented 2 years ago

We're still at:

| Status | Count |
|---|---|
| Not indexed | 58 |
| Indexed, but missing annotations | 11 |

@taojing2002 and I talked about this, and it isn't a quick fix: we don't actually have a true reindex operation on the CNs, in the sense that we can't trigger the CN to perform the usual index processing we do when an object is created or updated. We have a separate tool (d1_index_build_tool.jar) that's run out of band and submits the update directly to Solr; it just happens to use most of the same code and data as the actual CN index processor.

Unfortunately, d1_index_build_tool.jar is having an issue specifically with the semantic annotations so we'll need to track that down before we can move forward here. If we can't do that, a workaround would be to update sysmeta on the Member Node for each of the above objects, which would trigger a full sync->harvest->index cycle.
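If we end up going with the sysmeta workaround, the loop could look something like the sketch below. It is only an outline under stated assumptions: `update_sysmeta` is a hypothetical caller-supplied function that performs the actual system metadata update on the Member Node for one PID (how that call is made is deliberately not shown here), and `trigger_resync` is an illustrative name, not an existing tool.

```python
from typing import Callable, Dict, Iterable, List


def trigger_resync(
    pids: Iterable[str],
    update_sysmeta: Callable[[str], None],
) -> Dict[str, List[str]]:
    """Request a sysmeta update for each PID and record the outcome.

    update_sysmeta is a hypothetical callable supplied by the caller; it is
    expected to raise an exception if the update for a PID fails.
    """
    outcome: Dict[str, List[str]] = {"submitted": [], "failed": []}
    for pid in pids:
        try:
            update_sysmeta(pid)
        except Exception:
            # Keep going so one bad PID does not block the rest of the batch.
            outcome["failed"].append(pid)
        else:
            outcome["submitted"].append(pid)
    return outcome
```

Anything that ends up under `failed` would need a manual look, and everything under `submitted` still has to be checked against the CN index once the sync->harvest->index cycle has had time to run.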

I think d1_index_build_tool is part of https://github.com/DataONEorg/cn-buildout, though I'm not sure at this point.