Problems when uploading objects in the SBH namespace

nroehner commented 5 years ago

If I have a ComponentDefinition foo/A/2 and I upload it to a Collection on SynBioHub that contains a ComponentDefinition foo/A/1, then a Collection member foo/A/2 is created, but no ComponentDefinition foo/A/2 is created. It appears that SynBioHub thinks that a foo/A/2 ComponentDefinition already exists.

This can be reproduced by uploading the attached file, which contains the ComponentDefinition https://hub.sd2e.org/user/sd2e/design/Strain_7_pJS007_LALT__backbone/2, to the SD2 Design Collection on the staging instance of the SD2 SynBioHub. This Collection already contains a ComponentDefinition https://hub.sd2e.org/user/sd2e/design/Strain_7_pJS007_LALT__backbone/1.

nroehner commented 5 years ago

Strain_7_pJS007_LALT__backbone_annotated_fixed2.zip

cjmyers commented 5 years ago

So, the actual issue is this:

1) If you try to upload objects in the SBH namespace, these are dropped. The assumption is that anything in the SBH namespace must exist or will exist on SBH in the future.

2) However, the Collection's member references are not dropped, since the assumption is that the referenced objects either exist or will exist in the SBH namespace in the future.
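A minimal sketch of how these two assumptions combine to produce the reported symptom. This is an illustrative model only, not SynBioHub code; `filter_upload` and the prefix constant are hypothetical names, and objects are reduced to bare URIs.

```python
# Hypothetical model of the behaviour described above; not SynBioHub code.
SBH_PREFIX = "https://hub.sd2e.org/user/sd2e/design/"  # assumed SBH-owned namespace


def filter_upload(objects, members, existing):
    """Drop submitted objects in the SBH namespace but keep every member reference."""
    kept = [uri for uri in objects if not uri.startswith(SBH_PREFIX)]
    dropped = [uri for uri in objects if uri.startswith(SBH_PREFIX)]
    # Members are never filtered, so a member can end up pointing at nothing.
    dangling = [m for m in members if m not in existing and m not in kept]
    return kept, dropped, dangling


existing = {SBH_PREFIX + "Strain_7_pJS007_LALT__backbone/1"}
new_uri = SBH_PREFIX + "Strain_7_pJS007_LALT__backbone/2"
kept, dropped, dangling = filter_upload([new_uri], [new_uri], existing)
```

In this model the version 2 object lands in `dropped`, while its member reference lands in `dangling`: the Collection lists a member for which no object was stored.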

cjmyers commented 5 years ago

I would like to capture our discussion on this issue here with the options that we discussed.

Option 1: Create a landing page after submission that reports the numbers of objects uploaded and lists objects that are dropped. Include a link to the collection page to continue to.

Pros: eliminates silent drops.
Cons: still drops objects and requires an extra click to get to the collection.

Option 2: Check existence of objects that are in SynBioHub namespaces. If they exist and are equal to the ones being submitted, drop them from the submission. If they do not exist or are not equal, then include them in the submission. This means that updates of objects from different namespaces or collections become new objects in this private collection. If the object is already in this collection, then the previous one is overwritten if overwrite is selected; otherwise, an error is reported.

Pros: eliminates silent drops and allows the user to work on objects downloaded from SBH without updating the namespace into one of their own.
Cons: allows users to work in SBH namespaces without updating the namespace into one of their own (not good URI practice). Also, fetching and comparing each object within SBH namespaces may add some latency to submission, especially for big collections that include a lot of unchanged SBH objects.
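The decision logic of Option 2 could be sketched roughly as follows. This is one reading of the description above, not an implementation: `plan_submission`, the action labels, and the use of plain dictionaries with `==` for object equality are all assumptions made for illustration. The per-object lookup in `stored` is where the latency mentioned in the cons would come from.

```python
def plan_submission(submitted, stored, sbh_prefix, collection_prefix, overwrite):
    """Decide what to do with each submitted object under Option 2.

    submitted, stored: dicts mapping URI -> object content (any comparable value).
    collection_prefix is assumed to lie inside sbh_prefix.
    """
    plan = {}
    for uri, content in submitted.items():
        if not uri.startswith(sbh_prefix):
            plan[uri] = "mint"        # outside SBH namespaces: new object, as today
        elif uri not in stored:
            plan[uri] = "include"     # SBH-style URI, but nothing stored under it
        elif stored[uri] == content:
            plan[uri] = "drop"        # unchanged copy of an existing object
        elif not uri.startswith(collection_prefix):
            plan[uri] = "copy"        # changed object from elsewhere: new object here
        elif overwrite:
            plan[uri] = "overwrite"   # changed object in this collection
        else:
            plan[uri] = "error"       # changed, in this collection, overwrite not selected
    return plan
```

A real comparison would need a notion of equality on SBOL objects, which this sketch leaves open.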

jakebeal commented 5 years ago

I think that option 2 is the better way to go, since it actually eliminates the drops.

I also think this is an important point for starting a transition toward a more principled way of handling differences, permissions, and versions. Right now, our practices tend to confuse together the question of "what is an object" with "where is the object found" --- an issue that's unsurprising given that we're working with URIs. I think we should consider this question not just from the perspective of SynBioHub but more generally in terms of SBOL documents or, more broadly, SBOL knowledge representations.

Consider that every collection of SBOL objects may be viewed as a knowledge representation. One can subset portions of it, refer to portions stored elsewhere, or even modify it. But what is our model of how two such collections relate? Different repository architectures have different models, but one of the most fundamental questions is this: is the relationship symmetric or asymmetric? For example, SVN has an asymmetric model: there is a "master" source, and one can extract and manipulate fragments of it, but they are always defined in reference to that master. On the other hand, git has a symmetric model: two copies of a repository are equals, and patches can move between them in any order.

So I would suggest considering the following highly constrained question, whose answer implies much: if I copy an SBOL document from my laptop to Nic's laptop, should the copy be considered a fork or a cache?

cjmyers commented 5 years ago

I don’t think it is quite like that. The URI does not need to be dereferenceable, but it does need to be unique. The idea of restricting the use of a namespace is to ensure uniqueness by requiring that a user of SBOL only mint URIs in a namespace they own.

The URI does not need to say where the file is. Therefore, in your copy-a-file analogy, the copy is a cache and not a fork. It only becomes a fork when you start editing it. In good practice, you would not edit it while leaving the URIs unchanged. Ideally, you should bring the URIs into a namespace that you have control over.

The issue is really that for local edits that are not published, the user does not really care so much about URI uniqueness. So, in option 2, the approach is to essentially assume that the URIs are meaningless unless they can be dereferenced to an object in a SynBioHub instance. If they cannot, then the URIs would get minted into your local SBH URI namespace as usual. If they can, then either they should match the existing object OR be in your namespace and you have selected to overwrite.

There is though some risk to this approach. Consider, for example, you download from SBH:

sbh.og/public/A/1
sbh.og/public/B/1

Let’s assume that A references B. Now, assume you open the file and you decide to edit B and you don’t bother to change its URI. Then, you re-upload the file. What should happen?

1) A remains unchanged and continues to reference the original version of B. Namely, only B is uploaded into your local SBH namespace as a new object.

2) A is changed and references the new version of B. Namely, both A and B are uploaded into your local SBH namespace as both being new objects.

It is not clear to me which would be more intuitive. If you open the file, A will be referring to the new B, so an editor would show this change. However, if you consider a file as a set of objects and you did not edit A, it seems odd to me that A ends up getting edited by editing B. Neither one of these scenarios is very nice.
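The two outcomes can be made concrete with a small sketch. Everything here is hypothetical: the function `reupload`, the local namespace `sbh.og/user/me/`, and the reduction of a document to a map from URI to referenced URIs. The flag `repoint_parents` selects between outcome 1 (off) and outcome 2 (on).

```python
def reupload(doc, edited, local_ns, repoint_parents):
    """Return the new objects created in local_ns as {URI: [referenced URIs]}.

    doc: URI -> list of referenced URIs; edited: set of URIs whose content changed.
    """
    def local(uri):
        # Keep the displayId/version tail and move it under the local namespace.
        return local_ns + uri.split("/public/", 1)[1]

    changed = set(edited)
    if repoint_parents:
        # Outcome 2: anything referencing a changed object is itself changed.
        grew = True
        while grew:
            grew = False
            for uri, refs in doc.items():
                if uri not in changed and any(r in changed for r in refs):
                    changed.add(uri)
                    grew = True
    return {
        local(u): [local(r) if r in changed else r for r in doc[u]]
        for u in sorted(changed)
    }


doc = {"sbh.og/public/A/1": ["sbh.og/public/B/1"], "sbh.og/public/B/1": []}
edited = {"sbh.og/public/B/1"}
outcome_1 = reupload(doc, edited, "sbh.og/user/me/", repoint_parents=False)
outcome_2 = reupload(doc, edited, "sbh.og/user/me/", repoint_parents=True)
```

In this sketch `outcome_1` contains only a local copy of B, while `outcome_2` also contains a local copy of A that points at the local B, even though A itself was never edited by the user.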

At the end of the day, the best thing would be if tools managed these issues by pulling things into a local namespace for edits like SBOLDesigner does. However, I agree that this is not an obvious thing for tool developers.

Ok, one more idea: what if SBOL libraries required you to declare your namespace? Then any object outside your namespace would be read-only. Modification procedures could be created that update the namespace when you try to edit.
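A rough sketch of what such a library rule might look like. The `Document` class below is hypothetical and is not the API of any existing SBOL library; objects are plain dictionaries, and URIs are assumed to end in `displayId/version`.

```python
class Document:
    """Toy document with a declared namespace (hypothetical, for illustration)."""

    def __init__(self, namespace):
        self.namespace = namespace
        self.objects = {}  # URI -> dict of properties

    def add(self, uri, props):
        self.objects[uri] = dict(props)

    def set_property(self, uri, key, value):
        # Objects outside the declared namespace are read-only.
        if not uri.startswith(self.namespace):
            raise PermissionError(f"{uri} is outside {self.namespace} and is read-only")
        self.objects[uri][key] = value

    def edit(self, uri, key, value):
        """Copy-on-edit: bring a foreign object into the declared namespace first."""
        if not uri.startswith(self.namespace):
            tail = "/".join(uri.rstrip("/").split("/")[-2:])  # displayId/version
            local_uri = self.namespace + tail
            self.objects[local_uri] = dict(self.objects[uri])
            uri = local_uri
        self.set_property(uri, key, value)
        return uri
```

Calling `set_property` on an object from another namespace raises an error, whereas `edit` copies the object into the declared namespace and modifies the copy. The sketch does not update references held by other objects, which is exactly the A-references-B question raised above.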

jakebeal commented 5 years ago

"The idea of restricting the use of a namespace is to ensure uniqueness by requiring that a user of SBOL only mint URIs in a namespace they own."

I think that this is my fundamental issue and point of concern. Under this model, I do not understand how collaboration works. Let's say that Nic and I are collaborating on a project. I want to pass edits back and forth with Nic, and both of us may be updating pieces of our SBOL document in parallel. If we were working with a version control model, I'd understand how we deal with editing and merges and conflicts and such --- including the question of uniqueness. But how does this "ownership" model of disambiguation handle that?

(PS: maybe we should move this to a new issue, so that the temporary patch of SynBioHub/synbiohub#916 can be finished at least?)

cjmyers commented 5 years ago

What temporary patch are you referring to? I'm not too excited about jumping to Option 2 above. There are risks and likely performance penalties. The temporary patch is for Nic to change namespace before he edits his objects. This is the current model and any other approach taken by SBH represents a significant departure from this model.

As for collaboration, if you are exchanging an SBOL Document back and forth, as you point out, you have the same issues you have in any version control system. If you want to perform these exchanges via SBH, it will not work well, since a triplestore is not well equipped to be a version control system. Currently, you either have to keep replacing the object after each change, losing all your history of edits, OR mint a new version each time you exchange and edit. This is not done efficiently, since you end up with deep copies of the entire object each time you make a change. There is no mechanism for managing diffs. Furthermore, when you edit a child object, you necessarily change the parent object, since its reference needs to change.

SynBioHub is currently best suited as a publication mechanism. Namely, objects are uploaded once and shared to the world. This model is relaxed in the private graphs, since you can edit, replace, and delete objects there. However, it can manage this because it knows that it has exclusive control of its URIs. If we allow users of SynBioHub to edit objects in SBH's namespace or even mint URIs in its namespace, then it will be difficult for SBH to maintain object consistency (see my detailed example above).

SBH needs a way to know whether objects being presented to it are new or are references to existing objects. Currently, this is accomplished simply by assuming that all objects in an SBH namespace are meant to be references to existing objects, and all objects outside SBH namespaces are new to the world of SBH and need new SBH URIs. Resolving this in a more complicated fashion such as Option 2 above could work, but it is clearly going to be much more costly computationally and much more difficult to ensure that expected results are achieved. Finally, it will be a much more difficult model to explain.

jakebeal commented 5 years ago

Somewhere, somehow, we need an SBOL collaborative knowledge curation platform, which will allow people to upload, edit, and access shared data in a safe and effective manner.

Right now, SynBioHub is the closest we've got, and that's why it's getting pushed in this direction. As you point out, its history as a publication platform rather than a collaborative knowledge curation platform has made this difficult. We need to be able to do things like version control.

cjmyers commented 5 years ago

I wonder if there might be some way to integrate with git to track versions. The key thing is that versions that need to be referenced should be published in SynBioHub, but the evolution of the object with each micro change is better maintained in a conventional version control system. I don’t think a triple store is an effective or efficient means of doing version control.

jakebeal commented 5 years ago

One potential kludge would be to simply sort all of the triples in one of the standard terse triple textual formats. With that canonical ordering, text diffs would be identical to triple diffs, and would support git (or any version control).
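A minimal sketch of the sorting idea, assuming the data is serialized as N-Triples (one triple per line), contains no blank nodes (whose labels are not stable across serializations), and is escaped consistently by the serializer. The function names are hypothetical; only the Python standard library is used.

```python
import difflib


def canonical(ntriples_text):
    """Return the unique, non-empty triple lines in sorted order."""
    lines = {line.strip() for line in ntriples_text.splitlines() if line.strip()}
    return sorted(lines)


def triple_diff(old_text, new_text):
    """Line diff of the canonical forms, keeping only added and removed triples."""
    diff = difflib.unified_diff(canonical(old_text), canonical(new_text), lineterm="")
    return [
        d for d in diff
        if d.startswith(("+", "-")) and not d.startswith(("+++", "---"))
    ]
```

Because each line is exactly one triple and the order is canonical, reordering triples in the file produces an empty diff, and every `+` or `-` line corresponds to one added or removed triple. Writing the output of `canonical` to a file before committing would let git's own line diff do the same job.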

cjmyers commented 5 years ago

This approach sounds plausible. Feel free to open a new issue for this, if you like. I think, though, that it will likely be a fairly substantial lift, so we may not be able to get to it very soon, depending on relative priorities of other tasks.

danielfang97 commented 8 months ago

we will need to look closer at this in the future

SynBioHub / synbiohub3

Problems when uploading objects in the SBH namespace #79