SynBioDex / pySBOL3

Native python implementation of SBOL 3.0 specification
MIT License
37 stars 16 forks source link

Stable order in iterators by UID #231

Closed jakebeal closed 2 years ago

jakebeal commented 3 years ago

When generating SBOL material, it's often useful to be able to readily diff an old version and new version of a generated document. Since the ID of children depend on the order in which they are generated, however, changes to the order in which a document is traversed will tend to result in different IDs, which in turn make it much more difficult to compare documents.

This can be greatly simplified if iterators are changed to traverse in a stable order by increasing UID. In this case, any operation in which nothing is changed should also result in an object in which nothing has changed.

tcmitchell commented 3 years ago

More information is needed in order to develop a plan to address this feature request:

jakebeal commented 3 years ago
tcmitchell commented 3 years ago

I think you're going to want to use sorted() to do this yourself. I'm proposing something like the following:

for f in sorted(c.features, key=lambda x: x.identity):
    print(f.identity)

It is best to think of the properties as sets, which are inherently unordered. I understand that there are use cases that want these sets, like Component.features, to be ordered. I do not think the library should guarantee or enforce an ordering though. I think it would be best to allow an application or library to apply an ordering as it sees fit. I think the example above does that.

A wrinkle is that ordering by identity will be a text based sort. SubComponent11 will sort before SubComponent2 in this case. That may or may not be what you want, but the order will be stable, which is clearly desired in your case.

I am open to further discussion. I may be wrong about leaving ordering to code outside of pySBOL3 rather than incorporating ordering within pySBOL3. My inclination, however, is to lean on the graphs of triples that underlie SBOL3 and those are inherently unordered.

jakebeal commented 3 years ago

My past experience with issues like these on other projects is that if we don't have an option for automatic stable ordering, then we are going to end up with difficult-to-resolve heisenbugs because somebody forgot to put the sorting wrapper somewhere and it only activates under very particular circumstances.

Could we have an option that can be turned on for enforcing ordering, like a "safe mode" or "strict mode"?

tcmitchell commented 3 years ago

Can you provide an example of your workflow with a focus on the problematic area? I don't yet understand where things go wrong for you. Pseudocode would be fine.

Object lists should be stable in a given document, but not stable across document read/write loops, or subsequent reads of the same document. Are you serializing a document and then reloading it? Or loading a document a second time into a different Document?

Would the use of sorted N-Triples help you in terms of textual diffing of two files?

jakebeal commented 3 years ago

Here is an example of where the problem comes in. I have created a library of protocol primitives, which I read in in order to build a protocol. When adding an execution of a primitive to a protocol, we go through the inputs in the primitive and add corresponding inputs for the execution.

def make_PrimitiveExecutable(primitive: Primitive, **input_pin_map):
    self = PrimitiveExecutable()
    # link to primitive prototype
    self.instance_of = primitive
    # Instantiate input pins
    for pin_spec in primitive.input:
        # cut/edited some material for brevity
        pin = Pin()
        pin.instance_of = pin_spec
        pin.name = pin_spec.name
        self.input.append(pin)

This means that we end up with different names for the inputs in the executable, because we've created and added the in a different order.

tcmitchell commented 3 years ago

Sorry to be dense, I'm still not seeing it. You don't say where you want the sorting to happen, so I'm guessing you want the primitive.input sorted in the for pin_spec in primitive.input: line.

It looks to me like that loop will create a Pin for every primitive.input, and those Pin instances will each take on the name of the associated primitive.input and have a reference to the primitive.input via the instance_of property. You say you "end up with different names for the inputs on the executable". How are you ending up with different name properties? Is that assignment somehow not happening properly? Or being overwritten somehow?

Or are you trying to get the identity properties of the Pin instances to be constant across reads of your protocol primitives which I'm guessing are stored in SBOL files? And if that's the case, do you really care? Isn't it enough that there is exactly one Pin for each primitive.input, with the correct name property and value_of reference? Do you really care how those Pin instances are labeled?

And does sorted(primitive.input, key=lambda x: x.identity) fix the problem in the short term?

jakebeal commented 3 years ago

I'm guessing you want the primitive.input sorted in the for pin_spec in primitive.input:

Correct.

"end up with different names for the inputs on the executable"

Apologies; that was sloppiness in my writing due to multitasking. I meant the displayId will be different, meaning that the identity will be different, just as you say. This isn't a problem at an abstract level, but it makes writing tests a lot harder.

And yes, applying the sort myself as you suggest will absolutely patch this --- I'm basically hoping that we could pull that pattern up into the library so that I am not vulnerable to bugs due to forgetting to apply the sort pattern.

tcmitchell commented 3 years ago

Maybe we should look at the unit test and see if we can make use of some tooling there to make your unit test robust to the unordered nature of SBOL. I don't see a unit test in https://github.com/SD2E/paml that looks like the code you describe above. Did I miss it? Or is it on an as yet unpushed branch?

I'm thinking of things like unittest.TestCase.assertCountEqual for example. The pySBOL3 make use of that when the members of lists matter and their order does not. Maybe we could structure your test to use that to ensure that there are the right number of Pin instances, their instance_of properties are equivalent to the list of input pin_specs, and so on.


With respect to sorted ordering of iterators in pySBOL3, I recognize that it would be useful for your particular unit test. I'm not so sure how good or bad it would be overall. Would it provide benefit to all or most pySBOL3-based projects? Would it hinder some projects? What is the right sort order? How much code would pySBOL3 have to add relative to the benefit? Should developers really rely on a given sort order in the library, which might or might not be relevant to their application?

In short, I'm still not convinced of the utility.

Do other SBOL libraries enforce a sort order?

jakebeal commented 3 years ago

In the two tests in https://github.com/SD2E/paml/blob/main/test/test_LUDOX_protocol.py, I have a commented out file comparisons at lines 90 and 98. I know that the .ttl file ordering isn't necessarily reliable, but would switch that over to sorted n-triples if the contents were going to be identical.

I'm going to experiment with sorting at some point, and will revisit this with more example information if I make it work on my end.

jakebeal commented 3 years ago

I'm using this workaround code right now, but it's being problematic to make sure that my orderings are actually stable with respect to every set of child objects:

def id_sort(i: iter):
    sortable = list(i)
    sortable.sort(key=lambda x: x.identity if isinstance(x, sbol3.Identified) else x.lookup().identity)
    return sortable
jakebeal commented 3 years ago

I think I've found the problem: the _reparent_child_objects(self) method used by clone walks the child objects in an arbitrary order, and therefore cannot ensure stability.

I need to think about whether the right thing to do now is to keep trying for stability or to make a TopLevel comparator that doesn't care about child object names.

jakebeal commented 3 years ago

I think that we do want to fix this with respect to copy/clone at least, because automatically scrambling the order of child objects internally is something that cannot be worked around.

tcmitchell commented 3 years ago

The library makes no guarantees with respect to order, and the underlying RDF makes no promises with respect to order as far as I know. pySBOL3 is not "automatically scrambling the order of child objects". Rather, it is embracing the inherent unordered nature of the data it is representing. I think your reliance on a particular order is problematic and you will continue to have challenges.

Is the clone operation producing a bad clone of the data? And the root cause is _reparent_child_objects(self)? If so, then that's a bug that should get fixed.

If, on the other hand, clone is producing a valid clone of the data that happens to have different orders on the sets of child objects, and different displayIds as a result, I don't think that's a bug. Does the clone have the same semantic relationships as the original, regardless of displayIds and identities? If so, I think that's reasonable.

jakebeal commented 3 years ago

The semantic relationships are indeed preserved.

I'm worrying about the ability to version control data, however. If I have a workflow for generating a derived SBOL file from a source document, then when I make a change in the source document, I see not only the actual derived changes, but also a large number of changes in the derived document that have nothing to do with the change in the source document.

Here's an example of the challenge that I'm encountering. In this diff: https://github.com/SynBioDex/SBOL-utilities/commit/800bca3d832132c791f8451e73353ca4991178d7, there are 660 lines in the diff (1/3 of a 2097 line file) with no change except for running the utility a second time.

jakebeal commented 3 years ago

This went deeper and more fundamental than I thought. The problem was appearing in cloning and in _reparent_child_objects, but the actual unpredictable renaming happened as a side effect of this line: https://github.com/SynBioDex/pySBOL3/blob/44dea5138b740a7899f1ca856827ea0350958c63/sbol3/toplevel.py#L88

The root problem, then, is the _owned_objects lists being reconstituted in an unpredictable order when the file is loaded. The patch that i'm using to deal with this is that before I clone an object, I sort its _owned_object fields using the following helper code:

def id_sort(i: iter):
    sortable = list(i)
    sortable.sort(key=lambda x: x.identity if isinstance(x, sbol3.Identified) else x)
    return sortable

def sort_owned_objects(self):
    for k in self._owned_objects.keys():
        self._owned_objects[k] = id_sort(self._owned_objects[k])
tcmitchell commented 3 years ago

Yes, file loading happens in an unpredictable order. That's a property of the RDF graphs. I'm not aware of the SBOL specification requiring a specific ordering of objects, and if you were to access the properties in an RDF graph you wouldn't get them in a specific order either. Is your patch sufficient to work around this issue in your code, or are you still requesting a change to pySBOL3? And if so, what kind of change?

jakebeal commented 3 years ago

The patch will do for now, but it feels very dirty and kludgey. I propose that we leave this issue open for a while longer while I see how my usage evolves, and we can figure out from that whether this goes into the library or becomes a utility function.

jakebeal commented 3 years ago

I have come around to the idea that we don't need this for most cases, the sbol-utilities id_sort function is sufficient.

However: we do need stability when copying an object, to make sure that the structure of the object doesn't change when it moved to a new copy/clone.

jakebeal commented 2 years ago

Test case: put a complex object into a document. Copy it to another document and write to n-triples. The triples should be identical for both documents. If this bug is extant, then they will be different due to child objects being added (and named) in a different order than they were originally.

We need to make sure that there are at least 10 child objects of the same type too, to make sure that sorting doesn't go 1, 10, 2...