ResearchObject / ro-crate-py

Python library for RO-Crate
https://pypi.org/project/rocrate/
Apache License 2.0

Handle duplicates in property values #132

Open simleo opened 2 years ago

simleo commented 2 years ago
from rocrate.rocrate import ROCrate
from rocrate.model.person import Person

crate = ROCrate()
john = crate.add(Person(crate, "#johndoe"))
jane = crate.add(Person(crate, "#janedoe"))
crate.root_dataset["author"] = [john, jane, john]
crate.root_dataset.properties()
{'@id': './',
 '@type': 'Dataset',
 'datePublished': '2022-07-20T10:25:39+00:00',
 'author': [{'@id': '#johndoe'}, {'@id': '#janedoe'}, {'@id': '#johndoe'}]}

I.e., the JSON-LD is not properly flattened. Note that, while in the above example the API user can easily avoid generating the duplicate, in the general case it may be much trickier to even notice that one is being generated (e.g., subsequent calls to Entity.append_to in different sections of the code).

This should be dealt with in "real time", so that the crate stays flattened at all times and assertions like len(crate.root_dataset["author"]) == 2 don't fail while one is still working on it. Since lookup by value in a list is O(n), checking for an existing value on every call to append_to would make extending a property quadratic. We should therefore switch to sets for property values, which is also closer to their actual semantics, since they have no predefined order. Should we then add support for JSON-LD lists? Are they supported / do they make sense in Schema.org / RO-Crate?
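For illustration, here is a minimal sketch of linear-time deduplication for a list of property values. It is not part of ro-crate-py: dedupe_refs is a hypothetical helper, and it assumes that references are dicts carrying an "@id" key and that any other value is a hashable literal. A dict keyed by "@id" gives O(1) membership checks and, unlike a set, keeps first-seen order.

```python
def dedupe_refs(values):
    """Drop repeated property values, keeping first-seen order.

    Hypothetical helper, not part of ro-crate-py. Assumes references
    look like {"@id": "#johndoe"} and other values are hashable literals.
    """
    seen = {}
    for value in values:
        if isinstance(value, dict) and "@id" in value:
            key = ("@id", value["@id"])
        else:
            key = ("literal", value)
        # setdefault keeps the first occurrence and ignores later ones
        seen.setdefault(key, value)
    return list(seen.values())


authors = [{"@id": "#johndoe"}, {"@id": "#janedoe"}, {"@id": "#johndoe"}]
print(dedupe_refs(authors))
# [{'@id': '#johndoe'}, {'@id': '#janedoe'}]
```

Running this once per property is linear in the number of values, but it only cleans up after the fact; it does not by itself keep the crate flattened while it is being edited.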

simleo commented 2 years ago

We discussed ordering for multiple-value properties at yesterday's RO-Crate meeting.

simleo commented 1 year ago

> We should therefore switch to sets for property values

This is harder than it looks, since Entity uses the underlying JSON dictionary (self._jsonld) for storage (__getitem__ / __setitem__ perform conversions as needed when the value of a property is requested).
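One possible middle ground, sketched below under loud assumptions, is to keep the JSON dictionary as the storage and maintain a separate index of already-stored keys next to it. DedupEntity is a hypothetical stand-in and not rocrate's Entity: it stores raw reference dicts rather than entity objects, always treats property values as lists, and its append_to takes only a name and a value.

```python
class DedupEntity:
    """Hypothetical stand-in, NOT rocrate's Entity.

    Keeps a raw JSON-LD dict for storage and a per-property set of keys
    so that duplicate checks are O(1) instead of a list scan.
    """

    def __init__(self, identifier):
        self._jsonld = {"@id": identifier}
        self._seen = {}  # property name -> set of keys already stored

    @staticmethod
    def _key(value):
        # References are identified by "@id"; other values must be hashable
        if isinstance(value, dict) and "@id" in value:
            return ("@id", value["@id"])
        return ("literal", value)

    def __setitem__(self, name, values):
        # Reset the property, then add values one by one through append_to
        self._jsonld[name] = []
        self._seen[name] = set()
        for value in values:
            self.append_to(name, value)

    def append_to(self, name, value):
        keys = self._seen.setdefault(name, set())
        stored = self._jsonld.setdefault(name, [])
        key = self._key(value)
        if key not in keys:
            keys.add(key)
            stored.append(value)

    def __getitem__(self, name):
        # Return a copy so callers cannot bypass the index
        return list(self._jsonld[name])


root = DedupEntity("./")
root["author"] = [{"@id": "#johndoe"}, {"@id": "#janedoe"}, {"@id": "#johndoe"}]
root.append_to("author", {"@id": "#janedoe"})
print(len(root["author"]))  # 2
```

The cost of this design is that the index and the dictionary must be kept in sync on every mutation path, including deletion, which the sketch does not cover.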