GetDKAN / dkan

DKAN Open Data Portal
https://dkan.readthedocs.io/en/latest/index.html
GNU General Public License v2.0

Dataset form unexpectedly creates new nodes for referenced schemas #4054

Open stefan-korn opened 9 months ago

stefan-korn commented 9 months ago

Describe the bug

Editing an embedded distribution inside a dataset (embedding distributions in datasets is the default in DKAN) creates a new distribution node every time something in the distribution is changed. Is this intended behavior? I would not think so.

As far as I can see, the reason is that the checkExistingReference() function in the Referencer class looks the distribution up by title, and that title is a hash of all the values of the distribution. If the values have changed, the hash is no longer the same, so checkExistingReference() fails and a new distribution is created.
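As a language-agnostic sketch of this failure mode (DKAN itself is PHP; Python is used here only for illustration, and `metadata_hash` is a stand-in for the real title-hashing logic): because the lookup key is a hash of the full metadata, any edit changes the key, so the existing node can no longer be found.

```python
import hashlib
import json

def metadata_hash(data: dict) -> str:
    """Hash of the canonical JSON encoding, standing in for the node title."""
    return hashlib.md5(json.dumps(data, sort_keys=True).encode()).hexdigest()

# A "stored" distribution, indexed by its hash-derived title.
original = {"title": "My CSV", "downloadURL": "https://example.com/data.csv"}
nodes_by_title = {metadata_hash(original): original}

# The user edits one value inside the dataset form.
edited = dict(original, title="My CSV (updated)")

# The lookup by hashed title now misses, so a new node would be created.
print(metadata_hash(edited) in nodes_by_title)  # False: a duplicate node results
```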

This does not solely affect distributions but would affect any other referenced item that can be edited inside another.

Sorry if these are stupid questions: why is this hash of the data used as the title for a referenced node? Couldn't the UUID be used as the title instead? The reference is linked by UUID as its identifier anyway.

Steps To Reproduce

  1. Create a dataset with a distribution.
  2. Check admin/dkan/datasets for distributions and see the distribution appear once.
  3. Edit the created dataset and change something in the distribution.
  4. Check admin/dkan/datasets for distributions again and see that another distribution has been created.

Expected behavior

Do not create a new distribution when editing a distribution inside a dataset.

stefan-korn commented 9 months ago

Please see PR #4055 for an idea of how to handle this differently, without creating a new distribution (or other embedded element) every time.

dafeder commented 8 months ago

@stefan-korn for now it is expected behavior, although we are aware it is counter-intuitive. Because distributions are linked to datasets through a reference, and the reference system as it exists now only uses node UUIDs and not version IDs, there is no way to see previous versions of the dataset if we don't know which version of the distribution to load inside of it. So for now, datasets are versioned but referenced items are not, and are simply saved as new nodes so that both the old and new revisions of the dataset will be dereferenced to show the intended values for the distributions.

I think your PR would suffer from the same issue. We are looking into better solutions for this, since the way it is now does work but doesn't follow expected patterns in Drupal and creates an impractical number of distributions in a lot of cases. One question is, should distributions be referenced at all? Maybe they don't need to be, and we could just store the array of distributions within the dataset node...

If they do need to be stored, we should probably figure out a way to revision them so that we don't keep making new ones. But then we need to rethink references to include the version ID, otherwise we could be showing incorrect data for old revisions of the dataset.

stefan-korn commented 8 months ago

@dafeder : Thanks for the explanation.

What do you mean by

One question is, should distributions be referenced at all? Maybe they don't need to be, and we could just store the array of distributions within the dataset node

No node for the distributions, saving them together with the dataset? Technically that option exists now by unchecking this metastore setting?

Dataset properties to be stored as separate entities

Though this probably won't work because of some special handling of the distribution. More generally, I suppose the problem of creating new entities for references applies to any other property that is stored as a separate entity and edited inside the dataset. One thing that looks a bit difficult to me is that the hash of the property's values is what decides whether a new entity is created. If you allow only a few values of the property to be edited inside the dataset, and the full range of values only when editing the property separately, this may cause some trouble (though I am not sure whether that is considered valid practice now).

dafeder commented 8 months ago

No node for the distributions, saving them together with dataset? Technically there is this option now with unchecking this metastore setting?

There is. I think the datasets would save but most likely other things would break. Certainly datastore would not work, and the frontend would have to be refactored. But I think in general we are not particularly well-served by having distributions be saved separately. The important thing is the dataset-to-resource relationship and the distribution ID/reference complicates it with no real benefits as far as I can see. So factoring out the distribution-specific code and just finding a way to signal to DKAN where resource URLs can be found in the dataset object would I think resolve a lot of these problems and also open the door to more diverse schema structures we could support.

dafeder commented 8 months ago

Another reason for having distributions decoupled from datasets was that in theory you could have datasets that are published with distributions that are not. But I think few people are doing this and the same thing could be accomplished by having a published version of the dataset w/o the distribution and an unpublished draft that has it.

stefan-korn commented 6 months ago

@dafeder : Coming back to your explanations on distribution I would like to know if you prefer (in the long term) to get rid of any "Dataset properties to be stored as separate entities" in the metastore (/admin/dkan/properties)? Or do you still see this as a valid approach?

I am currently thinking about integrating publishers into Search API by providing a SearchApiDatasource and ComplexDataFacade, analogous to how it is done for datasets. This would rely on publishers being saved as separate entities. So if DKAN is moving away from the concept of saving dataset properties as separate entities, I might not want to take this route for the Search API integration.

dafeder commented 4 months ago

@stefan-korn it is a bit of an open question still, to be honest. I think having distributions as separate entities is maybe more trouble than it's worth, but there are a lot of situations where publishers really need to exist in the system somehow, independently of the datasets. I would love to hear more about your use case and experience; maybe we can find a way to connect outside of GitHub? :)

stefan-korn commented 4 months ago

@dafeder : Thanks for your reply. We are now relying in some cases on "Dataset properties to be stored as separate entities", so hopefully you will not remove this feature entirely in the future.

I still remember your offer to connect outside GitHub. We have not managed to schedule anything on our end yet, but I hope to come back to your offer in the near future.

dafeder commented 1 month ago

I am going to close this. I have done some more testing, and I think distributions should really not be referenced in most cases. However, the datastore has a lot of dependencies on distribution IDs. I think we need to factor those out so that the datastore works with or without distribution references, and then we can stop making those references the default.

stefan-korn commented 1 month ago

@dafeder : I am coming back to this regarding publishers once again (sorry, I did not manage to get around to this earlier).

We are using a publisher as a reference in the dataset; the publisher property is set to be stored as a separate entity.

We are using the list widget with type autocomplete in the dataset to create a new publisher.

After the publisher is created we may edit the publisher node and add some values to properties of the publisher.

Now if the dataset is saved again after that, a new publisher node is created.

This has several drawbacks for us:

Currently I am considering this workaround: implement hook_node_presave() on publisher nodes and update the title of the publisher node with the current metadata hash of the JSON data property values (if the values have changed).

Do you have any opinion/suggestion on this?

Did I understand correctly that your plan for distributions is to stop using "Dataset properties to be stored as separate entities" and store the distribution directly inside the dataset? That seems reasonable, since distributions should usually not exist on their own.

But as you also mentioned above, there might still be use cases that benefit from "Dataset properties to be stored as separate entities", in the sense that you want to reuse a property in several places. For that approach, the current handling with checkExistingReference() does not seem to fit.

In a way, I suppose it makes a difference whether the referenced item is integrated directly in the schema UI, as for distributions, or provided as a list implementation, as for publishers. But a distinction based on the schema UI does not seem practical for this.

dafeder commented 1 month ago

@stefan-korn can you clarify the workflow here? I would think that you could

  1. Create a dataset and indirectly create a publisher node by using the autocomplete widget
  2. Edit the publisher node directly and change the name
  3. Open the dataset in an edit form, and you would see the updated name populated in the publisher field
  4. Make any changes you wish to the rest of the dataset form, outside of the publisher field

...without generating a new publisher node. Is that not the case? Are you saying that after changing a publisher external to the dataset form, you get a duplicate publisher node if you save the dataset form again? I don't understand why the dataset form wouldn't populate with the updated values of the publisher and then keep the publisher as is.

One thing that could be causing this is that the dataset form and the publisher form are saving the publisher JSON differently, resulting in an inconsistent hash. This could be related to a separate problem, which is that the dataset schema needs all the "sub-schemas" to be included even if you have a separate schema file for them, which inevitably leads to some inconsistencies. It's been a very long-outstanding todo to support JSON references in schemas throughout DKAN -- in the different validation points and in the form builder -- so that you would not need this redundancy in your Dataset schema.
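As a language-agnostic sketch of this second failure mode (again Python rather than DKAN's PHP): if two code paths serialize the same values differently, e.g. with different key order, the hash-derived titles diverge even though the data is identical.

```python
import hashlib
import json

def md5_of(serialized: str) -> str:
    return hashlib.md5(serialized.encode()).hexdigest()

publisher = {"name": "ACME", "subOrganizationOf": "ACME Holdings"}

# One form serializes keys in insertion order; another sorts them.
json_a = json.dumps(publisher)
json_b = json.dumps(publisher, sort_keys=True)

# Same values in a different insertion order.
reordered = {"subOrganizationOf": "ACME Holdings", "name": "ACME"}
json_c = json.dumps(reordered)

print(md5_of(json_a) == md5_of(json_b))  # True here (keys happen to be sorted already)
print(md5_of(json_a) == md5_of(json_c))  # False: key order alone breaks the match
```

Any such encoding difference between the dataset form and the publisher form would make the hash lookup miss and produce a duplicate node.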

I'm going to re-open this issue and update the name. If this would be completely fixed by resolving this issue with the schemas, we can make a new issue for that. If it's something else with the form logic or referencer logic I guess we fix there.

dafeder commented 1 month ago

Did I understand correctly that your plan for distributions is to stop using "Dataset properties to be stored as separate entities" and store the distribution directly inside the dataset? That seems reasonable, since distributions should usually not exist on their own.

What I am proposing is to continue to make this optional for specific properties. I think in most cases one would not want distributions referenced, but some people might, and there are good reasons to do it for other properties like publisher or topic/theme.

The change would be that unchecking "distribution" in that config would not break the datastore -- if you do it now, none of your CSV files will trigger datastore imports because that's baked into code that assumes a distribution node exists. I have a high-level idea of how this would work but haven't gotten very far on it yet.

stefan-korn commented 1 month ago

@stefan-korn can you clarify the workflow here? I would think that you could

1. Create a dataset and indirectly create a publisher node by using the autocomplete widget

2. Edit the publisher node directly and change the name

3. Open the dataset in an edit form, and you would see the updated name populated in the publisher field

4. Make any changes you wish to the rest of the dataset form, outside of the publisher field

...without generating a new publisher node. Is that not the case? Are you saying that after changing a publisher external to the dataset form, you get a duplicate publisher node if you save the dataset form again? I don't understand why the dataset form wouldn't populate with the updated values of the publisher and then keep the publisher as is.

One thing that could be causing this is that the dataset form and the publisher form are saving the publisher JSON differently, resulting in an inconsistent hash. This could be related to a separate problem, which is that the dataset schema needs all the "sub-schemas" to be included even if you have a separate schema file for them, which inevitably leads to some inconsistencies. It's been a very long-outstanding todo to support JSON references in schemas throughout DKAN -- in the different validation points and in the form builder -- so that you would not need this redundancy in your Dataset schema.

I'm going to re-open this issue and update the name. If this would be completely fixed by resolving this issue with the schemas, we can make a new issue for that. If it's something else with the form logic or referencer logic I guess we fix there.

@dafeder : Okay, thanks for the explanation. We did indeed tinker quite a lot with the schema in our instance, so I just tried with a plain DKAN schema.

That is the use case on our side.

But the same happens if I only change the publisher name and re-edit the dataset. The dataset form does pick up the name change, but a new publisher node is still created after saving.

dafeder commented 1 month ago

Can you check your dataset.json and publisher.json schemas and make sure they are completely identical? Also, you could compare the JSON between /api/1/metastore/schemas/publisher/items/[original-publisher-id] and /api/1/metastore/schemas/publisher/items/[duplicate-publisher-id]?

stefan-korn commented 1 month ago

@dafeder : I used the default schema coming with https://github.com/GetDKAN/dkan/tree/2.x/schema/collections now.

Is the problem maybe that the Referencer looks the node up by the title property of the reference, which is the hash of the reference's property values, and that the title does not get updated during a direct edit of the referenced entity? https://github.com/GetDKAN/dkan/blob/2.x/modules/metastore/src/Reference/Referencer.php#L448

dafeder commented 1 month ago

Oh -- hmm you are right, sorry I think I missed this in your original report. Yes this is definitely the problem 🤦🏽 .

I am not sure if there is a reason this was never done; it seems kind of obvious that the updateExistingEntity() function here should update the title as well. Do you want to try copying the md5 logic from the createNewEntity() function and see if that works? If it does, with no ill effects or broken tests, you could submit a PR.

(Also, I'm aware that the title is a weird place to store the hash; changing that is yet another todo.)

stefan-korn commented 1 month ago

@dafeder : Not so easy, it seems. When directly editing a publisher node, we apparently never reach updateExistingEntity(). Currently I am asking myself in what case we ever get to updateExistingEntity() at all.

The call comes only from store(), and store() is called from createPropertyReference(), which is called from referenceSingle() only when no existing reference is found by UUID.

So I did not manage to ever reach updateExistingEntity(), and certainly not in the case of editing a publisher directly.

Should something be added in the LifeCycle preSave phase? But something like "every schema except dataset" would be needed for this, and that is currently not supported in the LifeCycle; each schema needs to be targeted directly?

dafeder commented 1 month ago

You're right, sorry to lead you in the wrong direction there. That may just be a dead function. Rethinking this, until we have time (hopefully soon) to do a minor overhaul of this whole storage part, it may make the most sense to go with your original idea of simply doing this with a node hook for now. If you've been doing this from a custom module without issue for a while I'd take that as good evidence we can do this in DKAN core safely (and it's already quite broken so "safety" is probably not such a concern). Thanks for your patience on the back and forth here, and please do contribute a PR when you have a chance.

stefan-korn commented 1 month ago

@dafeder I use this code in custom module now:


use Drupal\metastore\MetastoreService;

/**
 * Implements hook_node_presave().
 *
 * Keeps the hash-based title of referenced metastore nodes in sync with
 * their JSON metadata, so that direct edits do not lead to duplicate nodes.
 */
function custom_module_node_presave(\Drupal\node\NodeInterface $entity) {
  // Only act on metastore "data" nodes that are not datasets themselves.
  if ($entity->bundle() === 'data' && $entity->hasField('field_data_type')
    && !$entity->get('field_data_type')->isEmpty()
    && $entity->get('field_data_type')->value !== 'dataset'
  ) {
    if ($entity->hasField('field_json_metadata') && !$entity->get('field_json_metadata')->isEmpty()) {
      // Recompute the hash-based title from the current metadata values.
      $json_metadata = json_decode($entity->get('field_json_metadata')->value);
      $entity->setTitle(MetastoreService::metadataHash($json_metadata->data));
    }
  }
}

Looks good so far. Will report if any issues occur.

What do you mean by contributing a PR? The node hook code, or a "DKAN integrated" solution?