ID generation based on Name could clash and result in invalid ro crate JSON

beepsoft commented 1 year ago

I tried to add two different entities: a Person and an Education level. When creating I named both "aaa". With this I ended up with two entities with the same "#aaa", or more precisely a Person entity with "#aaa" but both the Person and the Education level references this Person object, which is wrong. Here's a little video about it:

screencast-docs google com-2023 03 24-16_03_55

Wouldn't it be better to generate ID-s (eg. uuid) randomly and not based on the Name of the entity?

Or optionally a user provided callback function, which can provide the ID for a new entity. It may then decide to have an ID generated based on the name of there's no clash, or in case of clashing generating a randomized ID.

marcolarosa commented 1 year ago

Hmm this is a difficult problem and I'm not sure there's a good solution that doesn't have side effects.

Creating a random ID is easy and can be done easily. But that's not the problem.

Say a user of the app creates a person entity: Jane Doe. Describo internally creates the following JSON-LD snippet:

{ 
  "@id": "#Jane%20Doe",
  "@type": "Person",
  name: "Jane Doe"
}

At this point, the user should edit the entity and assign (preferably) an ORCID to this entity to uniquely identify them (so says the RO Crate spec). However, it's highly likely that the user won't do that. So in that case, having the @id property as "#Jane%20Doe" makes the entity more easily recognisable to the user than if the @idwere some autogenerated, unique id: #sdfkhsufw4r34tnk.

Of course this means that an entity of another type with the same id will result in describo linking the wrong entity rather than creating a new one (expected behaviour; not a bug). But the question is how likely is this to happen (I would argue not very) and if it does, does it justify losing the recognisability of an @id by using auto generated random ids (I don't think so)?

Your example is purely a testing artefact. It's not to say that it can't happen; but I don't think it's very likely is all. And I don't think the tradeoff described above justifies changing the current behaviour.

To your point of generating a random id if there is a clash: easily doable but again, I don't think it's a valid tradeoff given my comments above.

beepsoft commented 1 year ago

When creating a new entity, users must enter its name, which will be displayed to them instead of the ID. For example, if a user creates a Person entity with the following JSON object:

{ 
  "@id": "#sdfkhsufw4r34tnk",
  "@type": "Person",
  name: "Jane Doe"
}

the user will see "Jane Doe" as the linked author, rather than the @id. Additionally, in the "Browse entities" window, the entity will be identified by its name.

Or am I missing something here? I which view would an auto generated ID cause confusions for the user?

marcolarosa commented 1 year ago

No, you're not missing anything. Inside describo they see the name.

I was referring to the underlying data and how that's used or handled outside of describo. I think auto generating id's in the RO crate world is not the done thing (when I first developed describo that's exactly what I did) and we shouldn't be relying on that.

Like I said at the start of my message " I'm not sure there's a good solution that doesn't have side effects.".

marcolarosa commented 1 year ago

A reason for not generating random id's is that it forces the user to actually consider setting the @id property on an entity to something sensible and relevant. If every entity gets a random id then they never need to consider doing this meaning they would be creating new 'versions' of each entity which is also not desirable.

beepsoft commented 1 year ago

I understand your point. Still, wouldn't it be possible to let the application using the crate builder to decide about the ID format? For example, it may automatically lookup or suggest the ORCID for the Person just created and use that as the @id.

May I try to implement this and submit a PR?

marcolarosa commented 1 year ago

Looking up ORCID's is not straightforward and it's hard to disambiguate the results automatically. But it's not just an issue for people. It's an issue for any type of entity.

Feel free to have a look but the problem is not that we can't generate id's. The issue is what it means and what the flow on effects are.

beepsoft commented 1 year ago

Yes, assigning automatically ORCID or else is hardly automatable, that's why I thought of a suggestion mechanism that the application may provide (we have this in other applications dealing with authors and authorships).

I think it is generally a good idea for a component like the crate builder to provide sensible defaults like the current id generation but allow extensions or custom implementations when required. What do you think?

I also feel that some more callbacks/hooks would help make the component more adjustable for custom use cases. For example, a callback before actually adding a context entity could make it possible to set some default values (similar to the id), for example creation date, or name/link of the creator, or anything else the application may provide from some other sources.

marcolarosa commented 1 year ago

Yes, assigning automatically ORCID or else is hardly automatable, that's why I thought of a suggestion mechanism that the application may provide (we have this in other applications dealing with authors and authorships).

I think it is generally a good idea for a component like the crate builder to provide sensible defaults like the current id generation but allow extensions or custom implementations when required. What do you think?

I feel like this has veered off course. The component can easily be adapted to generate unique id's; but as explained above, I think it's a solution that potentially causes other issues. At the moment you've seen an issue in development with a typical developer testing flow. That issue has not been seen in production and I don't believe that it will be.

I also feel that some more callbacks/hooks would help make the component more adjustable for custom use cases. For example, a callback before actually adding a context entity could make it possible to set some default values (similar to the id), for example creation date, or name/link of the creator, or anything else the application may provide from some other sources.

I understand why you want to do this but I would like you to consider the complexity this kind of idea adds.

The user experience would be odd if in some cases an entity is create with id, type and name but in others it magically gets a bunch of other things.
You can already inject things into the crate before you give it to the component. You can also provide your own data packs for entities that the user can lookup and inject into the crate. I expect this existing capability will go a long way to solving at least some of your requirements.
The defining principles when developing this component is that it should be usable by the novice user (it's not a tool for experts) and it should be as self contained as possible. Of course, it would be fairly useless if it couldn't look up data sources but the idea of having hooks in various locations to inject snippets of data I think is too far away from the principle of being self contained. It isn't a framework. It's a metadata editor targeted at the user who is not a metadata expert. Anything that makes the component more complex for that type of user is potentially problematic in my eyes.

marcolarosa commented 1 year ago

Please check this. After some time thinking about it I couldn't come up with a good reason to not implement random id generation if id matches an existing entity but the type is different. I still think this could lead to issues with users not setting sensible id's in place of the random strings but we can always revisit if it introduces unintended consequences.

marcolarosa commented 1 year ago

This should no longer be an issue so closing this ticket. Reopen or create a new one if you see it again.

describo / crate-builder-component

ID generation based on Name could clash and result in invalid ro crate JSON #40