4Science / DSpace

This repository contains the 4Science optimized DSpace & DSpace-CRIS distribution.
https://wiki.lyrasis.org/display/DSPACECRIS/
BSD 3-Clause "New" or "Revised" License
42 stars 61 forks source link

bug in handling of optional fields in nested metadata migration (Pentaho transformation step "entities nested placeholder") #373

Open saschaszott opened 1 year ago

saschaszott commented 1 year ago

This bug affects the migration of nested metadata of CRIS entites in DS CRIS 2023.01, especially the handling of optional nested metadata fields, e.g. affiliationstartdate or affiliationenddate.

The step entities nested placeholder in Pentaho transformation entity_migration.ktr is responsible for handling optional nested metadata fields.

image

The result of the underlying SQL query

SELECT DISTINCT
nested_object.parent_id,
nested_object.positiondef,
properties.parent_id as nested_object_id,
--property_def.id AS pdef_id,
--original pdef from where the value is caluclated. Remove because of DISTINCT clause
--property_def2.id AS pdef_id_missing,
properties.visibility,
property_def2.shortname,
property_def2.value AS textvalue
FROM
(
(SELECT * FROM "public".${NESTED_PROP}) AS properties
Left JOIN
(SELECT * FROM "public".${NESTED_TABLE}) AS nested_object ON nested_object.id = properties.parent_id
Left JOIN
(SELECT * FROM "public".${NESTED_PROP_DEF}) AS property_def ON property_def.id = properties.typo_id
Left JOIN
(SELECT * FROM "public".${NESTED_TABLE}_tp2pdef) AS tp2 ON tp2.${PLACEHOLDER_TP} = nested_object.typo_id AND tp2.${PLACEHOLDER_PDEF} != property_def.id
-- CAN't use variables because they contain old_, thus introducing Placeholde variables
--join with non-existing values where not exist
Left JOIN
(SELECT '#PLACEHOLDER_PARENT_METADATA_VALUE#' AS value, shortname, id FROM "public".${NESTED_PROP_DEF}) AS property_def2 ON property_def2.id = tp2.${PLACEHOLDER_PDEF} AND NOT EXISTS(SELECT * from "public".${NESTED_PROP} AS pdef3 WHERE pdef3.parent_id = properties.parent_id AND pdef3.typo_id = property_def2.id)
)

WHERE property_def2.id IS NOT NULL
ORDER BY parent_id ASC;

contains unexpected rows (too many) which subsequently lead to an incorrect migration of RP's affiliations.

The bug arises if a RP has at least one affiliation which does not contain values in all nested metadata fields.

The example below shows what is going wrong in detail.

3 affiliations of a RP in our DS CRIS 5 instance (two of them do not provide values in all fields)

image

Expected behaviour

image

Actual (incorrect) behaviour

image

Behind the scenes the Pentaho transformation generates two additional rows in imp_metadatavalue which can be found in the "technical view":

image

As you can see there are 4 fields of type oairecerif.affiliation.startDate and oairecerif.affiliation.endDate although there are 3 affiliations only.

saschaszott commented 1 year ago

A deeper analysis shows that this bug appears if an affiliation (one row in the UI's affiliations table) has nested metadata fields with different visibilities.