Replacing a file with a duplicate breaks DataONEObj-EMLEntity relationship

laurenwalker commented 4 years ago

This may be an edge case, but since it breaks the ability to properly edit the dataset, I think it should be fixed.

Here is the scenario:

I have a dataset with three files, Data1.csv, Data2.csv, Data3.csv
I open the dataset in the editor
I replace Data1.csv with a file that is named Data3.csv (the contents of the file are irrelevant, but the file name has to be a duplicate of another in the package).
So now I have 3 files in my package: Data3.csv, Data2.csv, Data3.csv
Click submit
When I view the dataset, the "More info" link for the replaced Data3.csv (previously Data1.csv) is missing. If I open it back up in the editor, I cannot edit my attributes for this file since the editor has lost the connection to it's EMLEntity.

Expected behavior:

If I replace a file with one that has the same file name as another object in the package, the package should stay intact so that the More Info links and Attribute editor still works for all those objects.
Linkages between DataONEObjects and EMLEntities should exist regardless of file names, checksums, etc. as much as possible. We have been using the entity id attribute to connect objects, so that should be working well enough for duplicate file names to exist without breaking that linkage.

amoeba commented 4 years ago

Super nice catch, thank you! It's not readily apparent what's going on so I'll have a look and update back here.

amoeba commented 4 years ago

Okay so this interacts with something I was already aware of but this breaks it in a funny way because of the shared file name. A few places in MetacatUI have baked in algorithms for matching up items in the package with entity descriptions in the EML. The algorithms include comparing PIDs, checksums, and even file names. When doing the replacement of an item in the package, I didn't update the id attribute on the entity because it may have meaning to the researcher and may be referenced elsewhere and an update may break other things like annotations and references.

Why replacement of a file with another which has a unique name in the package works is because MetacatUI can find the entity without the right id attribute so long as the entityName matches the fileName. Why your example breaks this is because the fileName is used twice in the EML.

Matching package members to EML entities based upon the id attribute, checksums, or filenames is flakey and the better way would be to connect them via a physical section by their PID. We don't currently serialize physical sections so this is a bit of a gap.

If we simply want to always update the id attribute on the entity to match the new item's PID, we need to know the PID of the new item before we serialize the EML when we save it. This is actually also true if we were to generate physical sections too: We need to know the new PID and the new checksum prior to serializing the EML.

To pile it on (sorry), https://github.com/NCEAS/metacatui/issues/1394#issuecomment-631147106 is also relevant here: EML entities are not necessarily 1:1 with DataONE Objects. Multiple entities may have identical physical sections.

I think the quick thing to do here is to pre-determine the next PID for the object being replaced and refactor relevant parts of the various models to use this pre-determined PID for both the new object and in the EML so they can match up. What do you think?

laurenwalker commented 4 years ago

Thanks for the detailed write-up, Bryce.

I think this is the way to go:

I think the quick thing to do here is to pre-determine the next PID for the object being replaced and refactor relevant parts of the various models to use this pre-determined PID for both the new object and in the EML so they can match up.

The flow right now is that when a new file is uploaded, a DataONEObject model is created. The DataONEObject default attribute for id is a generated uuid:

https://github.com/NCEAS/metacatui/blob/92edadcafcbb04dec6ddc8e24b48cc80fff79051/src/js/models/DataONEObject.js#L54

So when the EMLEntity is created later, it uses that pre-generated id. The creation of the EMLEntity is orchestrated by the DataPackage collection in DataPackage.handleAdd():

https://github.com/NCEAS/metacatui/blob/92edadcafcbb04dec6ddc8e24b48cc80fff79051/src/js/collections/DataPackage.js#L2846

I think what we may have to do is call DataONEObject.updateID() as soon as the replace is complete, and update the EMLEntity id attribute with that new id. Then during DataONEObject.save(), we will need a new check to see if the id needs updated or not (and skip the call to DataONEObject.updateID() if it's already been done). We might have to alter DataONEObject.updateID() so that it sets a flag on the model that the id has already been updated (which gets reset after a successful save).

As far as this:

I didn't update the id attribute on the entity because it may have meaning to the researcher and may be referenced elsewhere and an update may break other things like annotations and references.

Before updating the id, we could check that the id attribute matches the current object pid. That way we can keep any custom id attributes intact.

amoeba commented 4 years ago

Cool, I hadn't thought about going about it that way but it sounds better than what I was thinking of doing. I'll see how that works.

Before updating the id, we could check that the id attribute matches the current object pid. That way we can keep any custom id attributes intact.

Sounds good.

Thank you!

amoeba commented 4 years ago

Okay, I implemented the check to see if the ID needs updating. To avoid adding yet another state attribute to the model, I re-used oldPid: If set, don't update again. If not set, update. And then, on save or error, reset oldPid. Let me know what you think.

Re:

Before updating the id, we could check that the id attribute matches the current object pid. That way we can keep any custom id attributes intact.

I found that EML211.getEntity actually already walks over custom ids. And allowing id attributes that can't be matched to an object in the package causes all kinds of havoc anyway (broken prov editor, creation of a new entity in the editor). So I think, for now, we want to always keep the id attribute in sync with the object. When we add in support for physical sections, we might relax this. Sound good?

laurenwalker commented 4 years ago

That all sounds good to me, Bryce Thank you

laurenwalker commented 4 years ago

I tested your feature branch today and encountered a save error:

The requested identifier urn:uuid:495e55ad-ceb4-4fdb-9a81-d341085d5df5 is already used by another object andtherefore can not be used for this object. Clients should choosea new identifier that is unique and retry the operation or use CN.reserveIdentifier() to reserve one.

To reproduce:

Upload a dataset with Data1.csv, Data2.csv. All good so far.
Replace Data2.csv with a file named Data1.csv, so we now have two Data1.csv files in the package. All good so far.
Click "Add file" and add another Data1.csv. Click Submit, and get the above error.

It looks like the editor is trying to reuse an identifier from one of the other Data1.csv files.

amoeba commented 4 years ago

Thanks much for testing. Though, I can't reproduce. I tried a few permutations of your above steps since I wasn't sure exactly what you did and I can't see any issues. I

Go to submit to create a new package
Fill in basic metadata, add files named Data1.csv and Data2.csv in the same Add Files operation
Save the package
Edit the updated package
Replace Data2.csv with a file named Data1.csv
Add a file named Data1.csv to the package
Submit the package

I also tried a variant where I saved between 5 and 6, which also worked. Wanna give it one more shot or hop a video call?

amoeba commented 4 years ago

@laurenwalker and I hopped on a call this morning and she showed me her steps. The key was not routing away from the editor after the a save and the bug isn't specifically to do with replacing an item but does have to do with my recent change to how oldPid and updateID works. In that change, I didn't the propagate the post-save reset (model.set("oldPid", null)) to the EML and PackageModel models too which needed to be done.

I've tested this and it should be good now. Thanks for helping debug, @laurenwalker .

NCEAS / metacatui

Replacing a file with a duplicate breaks DataONEObj-EMLEntity relationship #1397