NCEAS / metacatui

MetacatUI: A client-side web interface for DataONE data repositories
https://nceas.github.io/metacatui
Apache License 2.0
41 stars 26 forks source link

Add "Replace" option in DataItemView to replace DataONEObjects in a package #653

Closed maier-m closed 4 years ago

maier-m commented 6 years ago

It would be very helpful I think to have a feature that would enable users to swap data files without losing attributes (i.e. instead of just having remove have a change option in the describe dropdown).

This way if someone wants to make a change to the data (say they see a small typo right at the last second or they are using a cloned package as a template) they don't have to recreate all the attribute information if all they need to do is update the data.

laurenwalker commented 6 years ago

Totally agree, we have had plans for a "Replace" feature for a while and it should be fairly easy to do.

FWIW, you can remove a data file and upload one of the same file name and the entity and attribute metadata will stay intact for that new file. (The Describe button will turn green and you can click on it to edit that metadata). It's a sorta-feature. :)

csjx commented 6 years ago

Yes, we've discussed this. I think it would be worthwhile to add the use case into the design docs, similar to the Rename A File use case here: https://github.com/NCEAS/metacatui/blob/master/docs/design/editor/use-cases/rename-a-file.rst

maier-m commented 6 years ago

@laurenwalker Hooray for sorta-features

maier-m commented 6 years ago

I created this ticket to aide in a workflow where a user asked to use an old data package to create a new data package (i.e. use the old package as a template).

I cloned the old package (with fresh pids and obsolescence etc.) and sent the user a link to access the clone to serve as the starting point for the new package (the researcher can change what is necessary and keep what is wanted)

However, I realized that if the researcher wanted to change the data but keep data attributes, this would not be possible currently hence the ticket

mbjones commented 5 years ago

I'm raising this issue again of needing to update existing objects in the editor without losing previously entered metadata (and making sure obsoletes is set properly). I marked it as High priority even though it is currently targeting the 3.2.0 milestone. Let's consider moving it to an imminent release, as it really affects productivity.

laurenwalker commented 4 years ago

I'll put it into the next minor release (2.8.0)

mbjones commented 4 years ago

@amoeba This need to be able to replace data files still a high priority issue that arises pretty much weekly. The most recent feedback came from an Arctic Data Center user today:

See: https://support.nceas.ucsb.edu/rt/Ticket/Display.html?id=18566

I talked to you via email before replacing them - I was hoping that you can reattach these two files on your end without reentering attributes again. Each time I update the file during editing, all attributes associated with the file are removed. I could not figure out a way to work around it, except for asking you directly to upload a file! Maybe you have a way to copy it from the previous edition - nothing changed in terms of attributes.

amoeba commented 4 years ago

I'm blocked on https://github.com/NCEAS/metacatui/issues/1242 though it's nearly done and I'd like to take a crack at this now.

amoeba commented 4 years ago

I got this working earlier this week and we discussed the direction on our dev call today. We decided to make some tweaks to my implementation:

amoeba commented 4 years ago

I just spent a few hours trying to find the right state to put the new DataONEObject into such that the request will be the appropriate PUT with the right settings and haven't quite ironed it out between modifying the url and save methods. I'm guessing I'm pretty close. I ran into one bug in the url method which I'll fix at the same time here.

amoeba commented 4 years ago

Figured out my issues from last week and have replacing working again, as we talked about making it work. Ran into another issue: The editor does everything right until it tries to replace the item again and fails by issuing an invalid getSystemMetadata and updateSystemMetadata call and breaks the ORE. I think the model is just getting saved multiple times which is wrong which sounds like a state issue. Will get back to this tomorrow.

amoeba commented 4 years ago

Okay, it looks like I got it. Ended up finding two things that look like bugs in DataONEObject::url and found I needed to set sysMetaUploadStatus to c on the new Object before saving to to avoid an incorrect updateSystemMetadata call after the MNStorage.update() call. I'm going to test a few more test cases tomorrow.

laurenwalker commented 4 years ago

Hi Bryce - I just tested the Replace feature fairly extensively in Chrome and it works great! I'm excited to get this into the next release. Here are two things that I would suggest changing before we merge into a dev branch:

I wasn't able to test the scenario where someone replaces a file that is in a prov trace, since the prov editor is broken in the UI right now. I'm working on that ticket now (#1325), so once that's fixed and merged in, let's test that.

Another note is that we've been using the feature branch naming convention of having a feature- prefix, so this branch should be named feature-i653-replace-item

amoeba commented 4 years ago

Right on, thanks for taking a look at it @laurenwalker. Those suggestions sound smart and I'll do those today. When I add the better visual indication when replacing a file, I'll see how your idea feels and iterate from there, I think.

Re: the branch naming thing, are you okay if I break the "don't delete branches" policy here to get the name right?

amoeba commented 4 years ago

Okay, I changed things so the formatID is guessed again from the file type and did my best at implementing your suggestion of visual feedback.

Kapture 2020-04-13 at 11 34 08

I went with "Ok!" because "Replaced" doesn't fit and I didn't wanna make any bigger change to the layout. I don't love it because only so many countries use that term. Could either remove the text (just use the icon) or try another word like "Done!".

Thoughts from @laurenwalker , or anyone else?

amoeba commented 4 years ago

One remaining thing I'm stuck on and considering passing on is updating the id attribute for the associated EML entity in the EML XML. Right now, the old id value is retained which is good when the entity has a semantic-ish entity id like "mydata" because it should probably stay the same but bad when the id is autogenerated from the PID for the DataONE Object associated with the entity.

Because we can't assume a user hasn't used a more meaningful id attribute value, clobbering it might be undesirable anyway. The value of the id attribute has dependencies on at least three places:

The reason it's a big tricky is that I think it'd require a change in the DataPackage.save() logic. Right now, the models that are saved as part of the collection are independently saved. With replacing items as implemented, whether or not the metadata should save is dependent on the save success of the item we've replaced. I couldn't find a way to pre-allocate a PID before saving but, right now, the PID for the replaced item is dynamically generated at save via DataONEObject.save's call to .updateId() so I'd have to hook the update to the EML into that which feels too complicated.

I'll chew on this a bit more and would love others' thoughts.

gothub commented 4 years ago

For the pid that is being replaced, are any prov relationships that that pid were contained in either replaced with the new pid or a marked as a revision of the original pid? We have talked a bit about this in the past, not sure what the current thinking is on this.

amoeba commented 4 years ago

Good Q @gothub. Not in my current WIP implementation. Replace has a variety of semantics so we couldn't really pick one for the user and I think we'd need to provide a little UI to get the user to pick the meaning of their replacement.

When I was building this out a few weeks ago I did a mock-up for what I wanted this UI to look like but we opted to go for no UI at the the time of discussing.

Screen Shot 2020-04-14 at 12 32 51 PM
gothub commented 4 years ago

Nice answer @amoeba! The mock-up makes it clear what the implications are of either choice.

The complication with "New Version" is that the prov editor doesn't yet follow prov revision relationships from the old version to the to the most recent version of an object, so the old version would show up in the prov icons, not the new one that is the revision.

laurenwalker commented 4 years ago

I would suggest that in this current implementation, we would update prov relationships with the new pid.

Reason why: In our current WIP implementation of "replace", we are assuming the original data file and the replacement data file are the same conceptual object. Therefore, we are keeping all metadata and relationships previously asserted by the data owner intact. The data owner still has the option to update the metadata and relationships in the metadata and prov editors.

Screen Shot 2020-04-14 at 12 32 51 PM

I think the UI @amoeba proposes is a little too complicated for most people. It looks like a great tool for us NCEAS devs who totally understand the obsolescence chain, but I think for most users, if they click "Replace" next to a file name, they simply want to update that object (Option A, New version).

I would guess that if a user is advanced enough to want to update an unrelated file (Option B), they would upload it separately, then copy over the metadata using a future "Copy metadata from..." option.

We could also achieve Option B by adding a "Duplicate" option under Replace, which duplicates the file in the editor file list, that then can be Replaced by the new file.

I haven't looked at @amoeba's changes with the "Ok!" message yet. I will check out that branch now and take a look.

amoeba commented 4 years ago

Thanks for chiming in @laurenwalker. I think I'm in favor of not updating PROV unless we capture the semantics of the replacement. My reasoning is that PROV relationships contained in one version of a package may not be true in the updated (w/ replacement) package and my preference is to avoid at all costs asserting false PROV statements.

I think our most common Replace use case is a package with an Excel file for its data which is probably 95+% of our examples. Say there's also a Python script in there which reads the file and the user asserted that the Python script used the Excel file. Then the ADC team communicates with the user that we'd like to convert the Excel file to a CSV so someone does that and uses the Replace functionality. If we kept the above PROV statement, it'd now be false. If we retained it as-is, it'd still be true.

This puts users in a situation where they Replace and we need to show in the PROV editor that the relationship still exists and let the user find a way to re-point it if it's true.

If we did go with a routine where we update PROV about the old PID with the new PID, I'm not sure I see how we'd come up with a good algorithm. For example, I don't think we'd want to do a blanket replacement of subjects and objects referencing the old PID with the new PID, it'd need to be a set of specific RDF shapes we'd want to match on.

Is this view too hard-line?

laurenwalker commented 4 years ago

I hear your points and I see what you're saying.

Either way we go, we are taking a risk:

  1. By updating the PROV relationships, we risk making false assumptions about the data.
  2. By not updating the PROV relationships, we risk losing PROV assertions already made by the data owner, or at best, burying it in an older version of the package that no human may ever see.

I personally think that Number 1 (updating the prov relationships) is the way to go, because I would assume that users who have already used the provenance editor and have gone into that level of detail into describing their dataset, understand that they are replacing an object wholesale. I think that that person would expect to see their replacement in the dataset file list, the prov charts, and have all the entity metadata copied over. That seems like a true replace functionality to me.

Say there's also a Python script in there which reads the file and the user asserted that the Python script used the Excel file. Then the ADC team communicates with the user that we'd like to convert the Excel file to a CSV so someone does that and uses the Replace functionality. If we kept the above PROV statement, it'd now be false. If we retained it as-is, it'd still be true.

The issue I see with your example is that if the user truly had a Python script that output an Excel file, I would argue that we keep the Excel file in the package, and upload the .csv as a separate entity. That way we could retain the actual output of the script. OR the user should update their Python script to output a .csv file, which means both the Python script AND the Excel file would need to be Replaced, and the updated PROV relationships would be factually true. (i.e. The csv isGeneratedBy the script)

laurenwalker commented 4 years ago

And of course, the least risky path of action is to ask the user exactly what they want their replace to do, such as with the UI mockup you created. I guess I'm just not convinced that that is the best UX.

laurenwalker commented 4 years ago

I just looked at the new visual feedback and it looks great. I just changed it a bit so that we can squeeze the word "Replaced" in there. :)

Screen Shot 2020-04-15 at 12 37 24 PM

amoeba commented 4 years ago

Thanks for hashing this out with me @laurenwalker. I think you make solid points and I also think we're getting somewhere.

Re:

By not updating the PROV relationships, we risk losing PROV assertions already made by the data owner, or at best, burying it in an older version of the package that no human may ever see.

I didn't fully understand this. I'd assume all PROV statements from the old version of the package would be retained in the new version with our Replaced file(s) but they just wouldn't reference any of those new files. Once the PROV editor supports external resources, it could be made clear to the user they had RDF statements to fix, I'd think.

While I'm in favor of (2), I can totally get behind (1) if that's the way we wanna go.

PS: The change from Ok! to Replaced looks great. Thanks for taking a look.

mbjones commented 4 years ago

Really useful discussion, @amoeba and @laurenwalker.

I think whether to update the prov relationships or not is fairly complicated. Here are three prototypical examples in the figure below, with the initial data package highlighted in the green box, and changed under three scenarios. Solid lines are the original prov relations, dashed are new provenance relations. In the first, a new input data file replaces the original, but the same script is run, generating a new output. In the second, the same input is used, but the script is stochastic, so it generates a new output. And in the third, a new script is used, generating a new output. There are many other variations and reasons for "replacing" a data file, and I think its impossible to come up with a universal rule to indicate which provenance relations should be carried forward, and which are new relationships without user input or execution tracking.

provenance-through-time

mbjones commented 4 years ago

Oh, and note that in all three cases, whichever script is used is in a new Execution. So technically the used and generatedBy triples whenS1 is reused are pointing at new executions that also include S1. These diagrams gloss over the whole execution part of the provenance model.

amoeba commented 4 years ago

Thanks for that, @mbjones.

This sounds right on the money:

I think its impossible to come up with a universal rule to indicate which provenance relations should be carried forward, and which are new relationships without user input or execution tracking.

We had an extended and very useful discussion on today's developer call and I think most of us thought the thing to do is remove by default the provenance about the item we're replacing. This implies the scenario:

  1. P1 contains S1 which uses I1 (prov:used, etc.)
  2. The user replaces I1 with I2 for an unknown reason
  3. The user saves the package

Which results in:

At this point, if the user finds their provenance has been wiped out by mistake, they'd have to use the provenance editor to re-add it. Any provenance statements about objects that weren't replaced would still be there.

Note that this route is different than my preference and also @laurenwalker 's above in this thread but, after talking it out, we came up with this new direction to go.

As for logistics on getting this merged, testing how the Replace feature deals with provenance is dependent on fixing a breaker in the prov editor itself in https://github.com/NCEAS/metacatui/issues/1325. I'd be fine merging this now, fixing the prov editor, then updating this implementation to remove prov about replaced objects.

laurenwalker commented 4 years ago

Thanks Bryce. I just closed out #1325 so you should be able to move forward with removing prov about replaced objects. I think the changes from #1325 shouldn't affect your branch too much.

amoeba commented 4 years ago

Great news, thanks for fixing that bug. I'll start back on this today.

mbjones commented 4 years ago

@amoeba Your plan sounds good.

The one modification I might consider (really an enhancement that could be implemented after your fix was in) would be to use the existing provenance graph in P1 to suggest provenance relationships in P2 for the user to confirm. For example, if the user is replacing I1 with I2, then when they do that, MetacatUI might notice the existing prov:used relation between S1 and I1, and prompt the user: "Was this new object also used by script S1?". If the user answers yes, then we can put in a new S1 prov:used I1 triple (actually, it would be a new Execution of S1 but same idea). If no, then we omit the new triple(s). An analogous prompt would be used for the prov:generated side, and maybe even for prov:wasDerivedFrom. Could that work to leverage existing provenance knowledge to create new provenance knowledge rather than losing it?

I have another thought in terms of which provenance relations to keep in the package as well. In general, once a provenance relationship has been asserted for two immutable objects, it probably remains true for those objects, even if they are obsoleted by new versions. So, there's little harm in keeping around the prov triples (unless the conflict with triples in other packages I suppose). However, they could accumulate, so one rule of thumb might be to only keep prov triples in the package that somehow involve a version of an object that is a member of the package. There may be some weird side effects of a policy like this due to provenance chaining issues with provone:Execution instances being a bit removed from direct associations, but something like this might guide which prov triples to keep in the new version of the package.

amoeba commented 4 years ago

Hey @mbjones, thanks.

Re:

For example, if the user is replacing I1 with I2, then when they do that, MetacatUI might notice the existing prov:used relation between S1 and I1, and prompt the user: "Was this new object also used by script S1?".

This sounds very sane to me, at least for simple situations and smaller packages (which is roughly what the editor is for). It also sounds good to push this to the next version and plan to sketch out a UX.

Re: your idea about keeping PROV triples around after a replace.

When we talked about this, we reasoned that any PROV relationships that were true in version 1 of the package that involve replaced are still asserted to be true via version 1 of the package, even if version 2 of the package doesn't re-assert them. i.e., we don't negate. It's my understanding with how we handle ORES w/ PROV now, the PROV statements form a miniature RDF world unto itself so we'd have some tweaking of how we index to handle this differently. As it stands, it's possible to have multiple ORE versions assert the same PROV over and over again (which isn't bad/wrong imo). @gothub does this sound roughly correct?

amoeba commented 4 years ago

Made some progress on the PROV handling portion of this feature and ran into a DataPackage.save issue seemingly related to this feature, https://github.com/NCEAS/metacatui/issues/1355, so I'm going to have to fix that first.

amoeba commented 4 years ago

I took a look at removing PROV about the package member being replaced and found that the current behavior, though maybe unintentional, may be doing just the right thing. This wraps up the TODOs on this feature.

(I'd love another set of eyes on this or at least ideas for other scenarios to test I'm not thinking of)

It turns out that the PackageModel currently removes triples whose subject or object are not members of the aggregation which is how it manages to keep things organize between saves. When an item is replaced, it is removed from the aggregation and so, on serialization, any provenance (or other) statements are removed from the RDF model managed by the PackageModel.

To test how this works, I ran a few different scenarios and verified the behavior I expected was what I observed and things look good. I tested scenarios where either data or scripts are replaced and where data are being used by scripts or scripts are generating data.

Here are two so you can get an idea of what triples get removed and which are kept:

Scenario A

Scenario B

Is retaining and even re-using the execution okay here? I think that might be the intent in some cases, and not in others. Ultimately, after replacing breaks a PROV relationship, re-adding the relationship makes things right again.

PS: I did run into a serialization exception which caused a save failure and the package members to be orphaned and I put in a helper method to prevent the failure. See 39fb6d653f89340a4bd6b594d4becab2f6777846 for a detailed writeup.

amoeba commented 4 years ago

@laurenwalker how would be best to review/test this before merge for you? I'm happy to submit a PR or just let you or any others test before merging.

amoeba commented 4 years ago

While reviewing ZenHub I realized I hadn't closed out #1355 yet. That'll need closing before this merges. Sorry about that. I'll take a look at #1355 now.

amoeba commented 4 years ago

Okay, #1355 is all sorted. I moved this and #1355 to the Review/QA column in ZenHub.

laurenwalker commented 4 years ago

Thank you! I'm going to try to review this and get merged today.

laurenwalker commented 4 years ago

@amoeba - I merged master into your feature branch and did a bunch of testing. It looks great!

I found 3 bugs (#1393, #1396, #1397) and 1 improvement (#1395) that should be made before we put this into production. If you view this ticket in Zenhub, you'll see I added them as blockers: https://app.zenhub.com/workspaces/data-portal-infrastructure-5d057ceca88c9959aefd0159/issues/nceas/metacatui/653

The 1 improvement (#1395) is indirectly related to replacing files, but I think it's important. While I was testing, I found myself accidentally clicking "Remove" instead of "Replace" more than once. This is the first additional option we've had in the file dropdown menu, so it's kind of a new issue. Especially with both options looking similar (i.e. the words "Replace" and "Remove"), I could see us frustrating our users. If that seems to be a really involved project, we could release it in a patch release. But I just don't want us to forget it.

Thanks!

laurenwalker commented 4 years ago

Oh, I also added #1398, which needs to be done before release as well. Thanks