Closed tillprochaska closed 1 year ago
@pudo I’ve added this as an issue here mostly to keep track of it as it’s part of a bigger project in Aleph. I’d be happy to put in a PR to resolve this though!
One thing I'd consider before making this guarantee to downstream consumers is how this will intersect with entity fragment aggregation. I've looked at this before a bit in the context of making the pages of a document show up in proper order in the indexText/bodyText props, and could never quite conceptualise all the places in the pipeline that would need to be updated for this guarantee to really be held. It's also something that will be really hard to do wrt. to the statement-based data model I use downstream in nomenklatura, but that's more of a "me" problem :)
One thing I'd consider before making this guarantee to downstream consumers is how this will intersect with entity fragment aggregation. I've looked at this before a bit in the context of making the pages of a document show up in proper order in the indexText/bodyText props, and could never quite conceptualise all the places in the pipeline that would need to be updated for this guarantee to really be held.
Thanks for mentioning this, I wasn’t aware of the (potentially) incorrect order of indexText
property values before. My (maybe naive?) assumption though is that preserving order of property values shouldn’t make anything worse (or might even be necessary in order to be able to tackle them in the future), right?
While it won’t automatically solve the potentially incorrect order of indexText
values (because the values are stored in different fragments), it will solve the issue with previews of multipart emails (because all parts of the original email end up in the same fragment).
@pudo Do you have a preference regarding the alternative data structure?
I have a personal preference for 2 to keep it simple, and I don’t think the perf overhead for O(n) membership tests would be relevant in Aleph. Not sure though about other FtM use cases like OpenSanctions?
The Python library uses sets to store property values. Python sets do not guarantee order:
This can be a problem in some use cases where the order of values is relevant. For example, ingest-file stores the parts of multipart emails as separate values and the order of the parts is relevant.
This could be resolved by using a different data structure to store property values (e.g. using dict keys)