alephdata / followthemoney

Data model and processing tools for investigative entity data
https://followthemoney.tech
MIT License
211 stars 50 forks source link

Maintain insert order for multi-valued properties #1139

Closed tillprochaska closed 1 year ago

tillprochaska commented 1 year ago

The Python library uses sets to store property values. Python sets do not guarantee order:

Python 3.11.3 (main, Apr  7 2023, 21:05:46) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> list({ 3, 2, 1 })
[1, 2, 3]
>>> list({ 1, 2, 3})
[1, 2, 3]

This can be a problem in some use cases where the order of values is relevant. For example, ingest-file stores the parts of multipart emails as separate values and the order of the parts is relevant.

This could be resolved by using a different data structure to store property values (e.g. using dict keys)

tillprochaska commented 1 year ago

@pudo I’ve added this as an issue here mostly to keep track of it as it’s part of a bigger project in Aleph. I’d be happy to put in a PR to resolve this though!

pudo commented 1 year ago

One thing I'd consider before making this guarantee to downstream consumers is how this will intersect with entity fragment aggregation. I've looked at this before a bit in the context of making the pages of a document show up in proper order in the indexText/bodyText props, and could never quite conceptualise all the places in the pipeline that would need to be updated for this guarantee to really be held. It's also something that will be really hard to do wrt. to the statement-based data model I use downstream in nomenklatura, but that's more of a "me" problem :)

tillprochaska commented 1 year ago

One thing I'd consider before making this guarantee to downstream consumers is how this will intersect with entity fragment aggregation. I've looked at this before a bit in the context of making the pages of a document show up in proper order in the indexText/bodyText props, and could never quite conceptualise all the places in the pipeline that would need to be updated for this guarantee to really be held.

Thanks for mentioning this, I wasn’t aware of the (potentially) incorrect order of indexText property values before. My (maybe naive?) assumption though is that preserving order of property values shouldn’t make anything worse (or might even be necessary in order to be able to tackle them in the future), right?

While it won’t automatically solve the potentially incorrect order of indexText values (because the values are stored in different fragments), it will solve the issue with previews of multipart emails (because all parts of the original email end up in the same fragment).

tillprochaska commented 1 year ago

@pudo Do you have a preference regarding the alternative data structure?

  1. Using dict keys (or a small wrapper around dicts)
  2. Using lists (and manually checking if a value already exists when adding new values)
  3. Using a third-party ordered set implementation.

I have a personal preference for 2 to keep it simple, and I don’t think the perf overhead for O(n) membership tests would be relevant in Aleph. Not sure though about other FtM use cases like OpenSanctions?