More information is needed in order to develop a plan to address this feature request: what sort order do you want (`obj.identity`?) when iterating with `for f in component.features:`?

I think you're going to want to use `sorted()` to do this yourself. I'm proposing something like the following:

```python
for f in sorted(c.features, key=lambda x: x.identity):
    print(f.identity)
```
It is best to think of the properties as sets, which are inherently unordered. I understand that there are use cases that want these sets, like `Component.features`, to be ordered. I do not think the library should guarantee or enforce an ordering, though. I think it would be best to allow an application or library to apply an ordering as it sees fit. I think the example above does that.
A wrinkle is that ordering by `identity` will be a text-based sort: `SubComponent11` will sort before `SubComponent2` in this case. That may or may not be what you want, but the order will be stable, which is clearly desired in your case.
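If the numeric ordering matters, an application could layer a natural-sort key on top of the same pattern. Here is a minimal sketch; the `natural_key` helper is hypothetical and not part of pySBOL3:

```python
import re

def natural_key(obj):
    # Split the identity into text and digit runs so that
    # "SubComponent2" sorts before "SubComponent11".
    return [int(part) if part.isdigit() else part
            for part in re.split(r'(\d+)', obj.identity)]

# e.g. for f in sorted(c.features, key=natural_key): ...
```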
I am open to further discussion. I may be wrong about leaving ordering to code outside of pySBOL3 rather than incorporating ordering within pySBOL3. My inclination, however, is to lean on the graphs of triples that underlie SBOL3 and those are inherently unordered.
My past experience with issues like these on other projects is that if we don't have an option for automatic stable ordering, then we are going to end up with difficult-to-resolve heisenbugs because somebody forgot to put the sorting wrapper somewhere and it only activates under very particular circumstances.
Could we have an option that can be turned on for enforcing ordering, like a "safe mode" or "strict mode"?
Can you provide an example of your workflow with a focus on the problematic area? I don't yet understand where things go wrong for you. Pseudocode would be fine.
Object lists should be stable in a given document, but not stable across document read/write loops, or subsequent reads of the same document. Are you serializing a document and then reloading it? Or loading a document a second time into a different `Document`?
Would the use of sorted N-Triples help you in terms of textual diffing of two files?
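For reference, something like the following should produce that serialization, assuming the `SORTED_NTRIPLES` file-format constant available in current pySBOL3 releases (the file names are placeholders):

```python
import sbol3

doc = sbol3.Document()
doc.read('protocol.nt')
# Sorted N-Triples is a canonical, line-oriented serialization,
# so two semantically identical documents should diff cleanly.
doc.write('protocol.sorted.nt', sbol3.SORTED_NTRIPLES)
```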
Here is an example of where the problem comes in. I have created a library of protocol primitives, which I read in to build a protocol. When adding an execution of a primitive to a protocol, we go through the inputs of the primitive and add corresponding inputs for the execution.
```python
def make_PrimitiveExecutable(primitive: Primitive, **input_pin_map):
    self = PrimitiveExecutable()
    # link to primitive prototype
    self.instance_of = primitive
    # Instantiate input pins
    for pin_spec in primitive.input:
        # cut/edited some material for brevity
        pin = Pin()
        pin.instance_of = pin_spec
        pin.name = pin_spec.name
        self.input.append(pin)
```
This means that we end up with different names for the inputs in the executable, because we've created and added them in a different order.
Sorry to be dense, I'm still not seeing it. You don't say where you want the sorting to happen, so I'm guessing you want the `primitive.input` sorted in the `for pin_spec in primitive.input:` line.
It looks to me like that loop will create a `Pin` for every `primitive.input`, and those `Pin` instances will each take on the `name` of the associated `primitive.input` and have a reference to the `primitive.input` via the `instance_of` property. You say you "end up with different names for the inputs on the executable". How are you ending up with different `name` properties? Is that assignment somehow not happening properly? Or being overwritten somehow?
Or are you trying to get the `identity` properties of the `Pin` instances to be constant across reads of your protocol primitives, which I'm guessing are stored in SBOL files? And if that's the case, do you really care? Isn't it enough that there is exactly one `Pin` for each `primitive.input`, with the correct `name` property and `value_of` reference? Do you really care how those `Pin` instances are labeled?
And does `sorted(primitive.input, key=lambda x: x.identity)` fix the problem in the short term?
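Applied to the loop quoted above, that would look something like this (illustrative only, not tested against the PAML code):

```python
for pin_spec in sorted(primitive.input, key=lambda x: x.identity):
    pin = Pin()
    pin.instance_of = pin_spec
    pin.name = pin_spec.name
    self.input.append(pin)
```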
> I'm guessing you want the `primitive.input` sorted in the `for pin_spec in primitive.input:` line.

Correct.
"end up with different names for the inputs on the executable"
Apologies; that was sloppiness in my writing due to multitasking. I meant the displayId
will be different, meaning that the identity
will be different, just as you say. This isn't a problem at an abstract level, but it makes writing tests a lot harder.
And yes, applying the sort myself as you suggest will absolutely patch this --- I'm basically hoping that we could pull that pattern up into the library so that I am not vulnerable to bugs due to forgetting to apply the sort pattern.
Maybe we should look at the unit test and see if we can use some tooling there to make it robust to the unordered nature of SBOL. I don't see a unit test in https://github.com/SD2E/paml that looks like the code you describe above. Did I miss it? Or is it on an as-yet-unpushed branch?
I'm thinking of things like unittest.TestCase.assertCountEqual, for example. The pySBOL3 tests make use of that when the members of lists matter and their order does not. Maybe we could structure your test to use it to ensure that there are the right number of `Pin` instances, that their `instance_of` properties are equivalent to the list of input pin_specs, and so on.
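A sketch of that kind of order-insensitive check, assuming the `make_PrimitiveExecutable` function quoted earlier, a `primitive` fixture set up elsewhere, and that `instance_of` holds the identity of the referenced spec (none of this is taken from the actual PAML test suite):

```python
import unittest

class PinCreationTest(unittest.TestCase):

    def test_pins_match_input_specs(self):
        executable = make_PrimitiveExecutable(primitive)
        # Same number of pins as input specs
        self.assertEqual(len(executable.input), len(primitive.input))
        # Each pin references exactly one input spec; assertCountEqual
        # compares the two lists as unordered collections.
        self.assertCountEqual([pin.instance_of for pin in executable.input],
                              [spec.identity for spec in primitive.input])
```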
With respect to sorted ordering of iterators in pySBOL3, I recognize that it would be useful for your particular unit test. I'm not so sure how good or bad it would be overall. Would it provide benefit to all or most pySBOL3-based projects? Would it hinder some projects? What is the right sort order? How much code would pySBOL3 have to add relative to the benefit? Should developers really rely on a given sort order in the library, which might or might not be relevant to their application?
In short, I'm still not convinced of the utility.
Do other SBOL libraries enforce a sort order?
In the two tests in https://github.com/SD2E/paml/blob/main/test/test_LUDOX_protocol.py, I have commented-out file comparisons at lines 90 and 98. I know that the .ttl file ordering isn't necessarily reliable, but I would switch that over to sorted N-Triples if the contents were going to be identical.
I'm going to experiment with sorting at some point, and will revisit this with more example information if I make it work on my end.
I'm using this workaround code right now, but it is proving difficult to make sure that my orderings are actually stable with respect to every set of child objects:

```python
import sbol3

def id_sort(i: iter):
    sortable = list(i)
    sortable.sort(key=lambda x: x.identity if isinstance(x, sbol3.Identified) else x.lookup().identity)
    return sortable
```
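Every iteration over child objects then has to go through this wrapper, for example:

```python
for pin_spec in id_sort(primitive.input):
    ...  # as in make_PrimitiveExecutable above
```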
I think I've found the problem: the `_reparent_child_objects(self)` method used by `clone` walks the child objects in an arbitrary order, and therefore cannot ensure stability.
I need to think about whether the right thing to do now is to keep trying for stability or to make a TopLevel comparator that doesn't care about child object names.
I think that we do want to fix this with respect to copy/clone at least, because automatically scrambling the order of child objects internally is something that cannot be worked around.
The library makes no guarantees with respect to order, and the underlying RDF makes no promises with respect to order as far as I know. pySBOL3 is not "automatically scrambling the order of child objects". Rather, it is embracing the inherent unordered nature of the data it is representing. I think your reliance on a particular order is problematic and you will continue to have challenges.
Is the `clone` operation producing a bad clone of the data? And is the root cause `_reparent_child_objects(self)`? If so, then that's a bug that should get fixed.

If, on the other hand, `clone` is producing a valid clone of the data that happens to have different orders on the sets of child objects, and different displayIds as a result, I don't think that's a bug. Does the clone have the same semantic relationships as the original, regardless of displayIds and identities? If so, I think that's reasonable.
The semantic relationships are indeed preserved.
I'm worrying about the ability to version control data, however. If I have a workflow for generating a derived SBOL file from a source document, then when I make a change in the source document, I see not only the actual derived changes, but also a large number of changes in the derived document that have nothing to do with the change in the source document.
Here's an example of the challenge that I'm encountering. In this diff: https://github.com/SynBioDex/SBOL-utilities/commit/800bca3d832132c791f8451e73353ca4991178d7, 660 lines differ (a third of a 2097-line file), even though the only change was running the utility a second time.
This went deeper and was more fundamental than I thought. The problem was appearing in `clone` and in `_reparent_child_objects`, but the actual unpredictable renaming happened as a side effect of this line: https://github.com/SynBioDex/pySBOL3/blob/44dea5138b740a7899f1ca856827ea0350958c63/sbol3/toplevel.py#L88

The root problem, then, is the `_owned_objects` lists being reconstituted in an unpredictable order when a file is loaded. The patch I'm using to deal with this is to sort an object's `_owned_objects` fields before cloning it, using the following helper code:
```python
import sbol3

def id_sort(i: iter):
    sortable = list(i)
    sortable.sort(key=lambda x: x.identity if isinstance(x, sbol3.Identified) else x)
    return sortable

def sort_owned_objects(self):
    # Sort every child-object list of an Identified object in place
    for k in self._owned_objects.keys():
        self._owned_objects[k] = id_sort(self._owned_objects[k])
```
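The cloning path then becomes roughly the following; the object and identity names are placeholders, and this assumes `TopLevel.clone` accepts an optional new identity:

```python
sort_owned_objects(original)   # normalize child order in place
stable_copy = original.clone('https://example.org/stable_copy')
```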
Yes, file loading happens in an unpredictable order. That's a property of the RDF graphs. I'm not aware of the SBOL specification requiring a specific ordering of objects, and if you were to access the properties in an RDF graph you wouldn't get them in a specific order either. Is your patch sufficient to work around this issue in your code, or are you still requesting a change to pySBOL3? And if so, what kind of change?
The patch will do for now, but it feels very dirty and kludgey. I propose that we leave this issue open for a while longer while I see how my usage evolves, and we can figure out from that whether this goes into the library or becomes a utility function.
I have come around to the idea that we don't need this for most cases; the sbol-utilities `id_sort` function is sufficient.

However, we do need stability when copying an object, to make sure that the structure of the object doesn't change when it is moved to a new copy/clone.
Test case: put a complex object into a document, copy it to another document, and write both to N-Triples. The triples should be identical for the two documents. If this bug is present, they will differ, because child objects are added (and named) in a different order than they were originally. We also need to make sure that there are at least 10 child objects of the same type, so that we catch sorting that goes 1, 10, 2, ...
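Here is a sketch of that test, assuming `Document.write_string`, the `SORTED_NTRIPLES` format constant, and `TopLevel.clone` for the copy (the namespace and names are placeholders):

```python
import sbol3

def test_copy_preserves_child_order():
    sbol3.set_namespace('https://example.org/test')
    doc1 = sbol3.Document()
    component = sbol3.Component('complex', sbol3.SBO_DNA)
    # Twelve SubComponents of the same type, so an unstable order or a
    # naive text sort (1, 10, 11, 2, ...) would show up in the output.
    for i in range(12):
        component.features.append(
            sbol3.SubComponent(f'https://example.org/test/part{i}'))
    doc1.add(component)

    doc2 = sbol3.Document()
    doc2.add(component.clone())

    assert (doc1.write_string(sbol3.SORTED_NTRIPLES) ==
            doc2.write_string(sbol3.SORTED_NTRIPLES))
```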
When generating SBOL material, it's often useful to be able to readily diff an old version and a new version of a generated document. Since the IDs of child objects depend on the order in which they are generated, however, changes to the order in which a document is traversed tend to result in different IDs, which in turn makes it much more difficult to compare documents.

This would be greatly simplified if iterators were changed to traverse child objects in a stable order of increasing UID. With that change, any operation that changes nothing would also produce an object in which nothing has changed.