4Science / DSpace

This repository contains the 4Science optimized DSpace & DSpace-CRIS distribution.
https://wiki.lyrasis.org/display/DSPACECRIS/
BSD 3-Clause "New" or "Revised" License

migration of nested objects in CRIS entities: improper handling of optional nested fields #398

Open saschaszott opened 10 months ago

saschaszott commented 10 months ago

In DSC5 we use nested objects to model affiliations in researcher profiles (RPs). Each affiliation consists of 4 fields: org unit (OU pointer; mandatory), role (mandatory), start date (optional), and end date (optional).

Currently, we have several affiliations without a start date and/or an end date.

For example, in DSC5 we have one RP with 2 affiliations (screenshot):

image

Currently, the CRIS migration procedure (Pentaho transformation) inverts the order of affiliations. This is due to the step Sort position in entity_migration.ktr (ascending = N).
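The effect of the sort direction can be sketched in a few lines of Python. This is only an illustration, not the Pentaho step itself: the row layout, the `position` key, and the org unit names are hypothetical.

```python
# Hypothetical affiliation rows; "position" stands in for the sort field
# used by the "Sort position" step.
affiliations = [
    {"position": 0, "orgunit": "OU A"},
    {"position": 1, "orgunit": "OU B"},
]


def sort_by_position(rows, ascending):
    """Return the rows ordered by position, as a sort step would."""
    return sorted(rows, key=lambda row: row["position"], reverse=not ascending)


# ascending = N (descending) reverses the original order.
inverted = [row["orgunit"] for row in sort_by_position(affiliations, ascending=False)]
# ascending = Y keeps the order in which the affiliations were stored.
preserved = [row["orgunit"] for row in sort_by_position(affiliations, ascending=True)]

print(inverted)
print(preserved)
```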

The expected migration result of the given example RP is:

image

Currently, DSC7 produces an invalid migration result in case of affiliations with optional fields (as in the given example):

image

In this example, the end date is not assigned correctly.

This bug is caused by the Pentaho migration step Select values 4, which removes nested_object_id and positiondef from each row of the stream. As a result, subsequent migration steps cannot determine which nested fields belong to which affiliation (nested object).

This bug affects the migration of RPs that have at least 2 affiliations with missing nested fields.

We'll provide a bugfix (an adaptation of the Pentaho migration).
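To make the failure mode concrete, here is a minimal Python sketch of the idea. It does not reproduce the actual Pentaho steps: the row layout, the field names (`orgunit`, `role`, `startdate`, `enddate`), the sample values, and both helper functions are assumptions for illustration. Only nested_object_id and positiondef are taken from the transformation, and the assumption that positiondef gives the order of the nested objects is mine.

```python
from collections import defaultdict

# Hypothetical stream rows: one row per nested field value.
# Affiliation 1 has no end date; affiliation 2 has both dates.
rows = [
    {"nested_object_id": 1, "positiondef": 0, "field": "orgunit", "value": "OU A"},
    {"nested_object_id": 1, "positiondef": 0, "field": "role", "value": "Professor"},
    {"nested_object_id": 1, "positiondef": 0, "field": "startdate", "value": "2010"},
    {"nested_object_id": 2, "positiondef": 1, "field": "orgunit", "value": "OU B"},
    {"nested_object_id": 2, "positiondef": 1, "field": "role", "value": "Researcher"},
    {"nested_object_id": 2, "positiondef": 1, "field": "startdate", "value": "2005"},
    {"nested_object_id": 2, "positiondef": 1, "field": "enddate", "value": "2009"},
]


def regroup_without_keys(rows):
    """Rebuild affiliations when the grouping keys are gone: the n-th value
    of each field is simply assigned to the n-th affiliation."""
    columns = defaultdict(list)
    for row in rows:
        columns[row["field"]].append(row["value"])
    size = max(len(values) for values in columns.values())
    return [
        {field: values[i] for field, values in columns.items() if i < len(values)}
        for i in range(size)
    ]


def regroup_with_keys(rows):
    """Rebuild affiliations using the keys, so that a missing optional field
    stays missing on the affiliation it belongs to."""
    objects = defaultdict(dict)
    for row in rows:
        key = (row["positiondef"], row["nested_object_id"])
        objects[key][row["field"]] = row["value"]
    return [objects[key] for key in sorted(objects)]


wrong = regroup_without_keys(rows)
right = regroup_with_keys(rows)
print(wrong[0].get("enddate"), right[0].get("enddate"))
```

Without the keys, the single end date slides up to the first affiliation. With the keys, it stays on the second one, which is why the keys need to survive until the nested fields have been assembled.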

atarix83 commented 10 months ago

Hi @saschaszott

Thanks for opening the issue. This should already be resolved by https://github.com/4Science/DSpace/pull/276.

Feel free to re-open the issue if it is not working for you.

saschaszott commented 10 months ago

Hi @atarix83, the PR you mentioned (#276) does not fix the problem. We have integrated #276 into our code base and are still able to reproduce the bug.

saschaszott commented 10 months ago

@atarix83, the problem arises in the Pentaho transformation step named Select values 4. This early transformation step removes the important sorting information in nested_object_id and positiondef. We have fixed the Pentaho transformation locally (the fix requires additional steps). Let me know if you are interested in a PR.

atarix83 commented 10 months ago

@saschaszott

Yes, please open a PR when you can, so we can verify. Thanks.

saschaszott commented 7 months ago

@atarix83, sorry for the long pause. Today I was able to reproduce the problem described above with the latest version of DSC (2023.02.02). To illustrate the problem, here is an example of a nested affiliation object (with 3 entries):

image

The migrated RP ends up in an incorrect state:

image

As you can see in the full metadata view, the numbers of affiliation.startDate and affiliation.endDate fields do not match:

image
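A mismatch like this can be detected with a simple count per field. The sketch below is only an illustration and assumes that a consistent migration result carries the same number of values for each nested field. The `(field, value)` list format, the `affiliation.role` field name, the sample values, and both helper functions are hypothetical; only affiliation.startDate and affiliation.endDate are taken from the metadata view.

```python
from collections import Counter

# Hypothetical flattened metadata of a migrated RP with 3 affiliations,
# one of which lost its end date slot during migration.
metadata = [
    ("affiliation.role", "Professor"),
    ("affiliation.role", "Researcher"),
    ("affiliation.role", "Assistant"),
    ("affiliation.startDate", "2015"),
    ("affiliation.startDate", "2010"),
    ("affiliation.startDate", "2005"),
    ("affiliation.endDate", "2014"),
    ("affiliation.endDate", "2009"),
]

NESTED_FIELDS = ["affiliation.role", "affiliation.startDate", "affiliation.endDate"]


def count_fields(metadata, fields):
    """Count how many values each nested field carries."""
    counts = Counter(field for field, _ in metadata)
    return {field: counts[field] for field in fields}


def is_aligned(metadata, fields):
    """True if every nested field has the same number of values."""
    return len(set(count_fields(metadata, fields).values())) <= 1


print(count_fields(metadata, NESTED_FIELDS))
print(is_aligned(metadata, NESTED_FIELDS))
```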

saschaszott commented 7 months ago

To better illustrate the change to the entity migration transformation, here is a before-and-after comparison of the change to entity-migration.ktr that we propose:

Before

image

After

image

saschaszott commented 7 months ago

You can find our proposed bugfix in PR #425.