esmero / strawberryfield

A Field of strawberries
GNU Lesser General Public License v3.0

New JSON Key providers needed #46

Open DiegoPino opened 5 years ago

DiegoPino commented 5 years ago

What is needed?

As we move forward and our SBF JSON becomes richer, we should start thinking about and coding new types of JSON Key Provider Plugin implementations. This is related to #33 but goes beyond it.

The ones I want and need:

  1. An Entity Reference Property. This is extremely useful for indexing in Solr other nodes and entities referenced from inside the JSON, and for allowing Drupal to see them natively. It matters especially because our JSON graphs are directed: Parents will often point to Children, but for Collection Membership we also want Collection Descriptions to lead, on search, to their children.

The implementation is quite simple:

ContentEntityBase already provides a method:

\Drupal\Core\Entity\ContentEntityBase::referencedEntities

It goes field by field, checking for EntityReference properties:

  /**
   * {@inheritdoc}
   */
  public function referencedEntities() {
    $referenced_entities = [];

    // Gather a list of referenced entities.
    foreach ($this->getFields() as $field_items) {
      foreach ($field_items as $field_item) {
        // Loop over all properties of a field item.
        foreach ($field_item->getProperties(TRUE) as $property) {
          if ($property instanceof EntityReference && $entity = $property->getValue()) {
            $referenced_entities[] = $entity;
          }
        }
      }
    }

    return $referenced_entities;
  }

So what we need is a JSON key provider that exposes a set of one or more JSON property values (node ids, for example) as \EntityReference class properties.

We have at least two ways of providing the JSON keys as arguments:

First, automatically, by using the new "ap:entitymapping": [] key that we preprocess (or should, because the webform maintainer dismissed my pull request for that... gosh).

Or by simply allowing people to type the keys (hopefully, in this case, a full JSON Path?) that contain entity references. Examples are ismemberof, scene, etc. With that, Solr will allow us to co-index values from those referenced entities, like their labels.
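As a rough, language-neutral illustration of both options, the Python sketch below pulls candidate node ids out of a decoded SBF JSON document. The function name `collect_reference_ids`, the sample values, and the assumed shape of `ap:entitymapping` (an entity type mapped to a list of key names) are hypothetical and not the module's actual API. A real provider would still have to load those ids as Drupal entities.

```python
from typing import Any, Iterable


def collect_reference_ids(doc: dict[str, Any], keys: Iterable[str]) -> list[int]:
    """Return unique integer ids found under the given top-level keys.

    A value may be a single id or a list of ids. Anything that is not an
    integer or a decimal string is skipped.
    """
    found: list[int] = []
    for key in keys:
        value = doc.get(key)
        candidates = value if isinstance(value, list) else [value]
        for candidate in candidates:
            if isinstance(candidate, bool):
                continue  # bool is a subclass of int, never an entity id
            if isinstance(candidate, int):
                entity_id = candidate
            elif isinstance(candidate, str) and candidate.isdecimal():
                entity_id = int(candidate)
            else:
                continue
            if entity_id not in found:
                found.append(entity_id)
    return found


# Hypothetical SBF JSON; the "ap:entitymapping" shape is an assumption.
sbf_json = {
    "label": "A scene",
    "ismemberof": [12, "15"],
    "scene": 7,
    "ap:entitymapping": {"entity:node": ["ismemberof", "scene"]},
}

# Option 1: keys come from the mapping. Option 2: pass user-typed keys instead.
mapped_keys = sbf_json.get("ap:entitymapping", {}).get("entity:node", [])
node_ids = collect_reference_ids(sbf_json, mapped_keys)
```

Passing a hand-typed list such as `["ismemberof"]` in place of `mapped_keys` would correspond to the second option.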

  2. Make use of our SBF Vocabulary generator to automatically expose all keys we are SURE will exist across the repository. Basically, if you go to http://localhost:8001/admin/structure/taxonomy/manage/strawberryfield_voc_id/overview all those vocabularies were generated from indexed content. So we should allow people to use those directly too, instead of choosing and adding keys manually.

Here is how I envision that:

The only keys that should be exposed are the leaves of a branch.

So if we have:

What we want to expose is as:document.*.checksum, for example, which is really just the value of what is inside .checksum in that hierarchy. That also seems straightforward to do; the logic would be
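One possible reading of "expose only the leaves" is sketched below in Python. The sample JSON is hypothetical (here `as:document` is assumed to hold a list of file descriptions), and `leaf_paths` and `values_at` are illustrative names rather than anything in the module. Variable levels are collapsed into `*`, which yields paths of the form as:document.*.checksum.

```python
from typing import Any


def leaf_paths(node: Any, prefix: str = "") -> set[str]:
    """Collect dotted paths to leaf values, using "*" for list positions."""
    if isinstance(node, dict):
        children = list(node.items())
    elif isinstance(node, list):
        children = [("*", item) for item in node]
    else:
        # A scalar is a leaf: its accumulated path is what gets exposed.
        return {prefix} if prefix else set()
    paths: set[str] = set()
    for key, child in children:
        path = f"{prefix}.{key}" if prefix else str(key)
        paths |= leaf_paths(child, path)
    return paths


def values_at(node: Any, path: str) -> list[Any]:
    """Return every value matched by a dotted path; "*" matches list items."""
    current = [node]
    for part in path.split("."):
        matched: list[Any] = []
        for item in current:
            if part == "*" and isinstance(item, list):
                matched.extend(item)
            elif isinstance(item, dict) and part in item:
                matched.append(item[part])
        current = matched
    return current


# Hypothetical document, made up for illustration only.
doc = {
    "label": "Two files",
    "as:document": [
        {"name": "a.pdf", "checksum": "abc"},
        {"name": "b.pdf", "checksum": "def"},
    ],
}

exposed = leaf_paths(doc)
checksums = values_at(doc, "as:document.*.checksum")
```

Intermediate keys such as `as:document` never appear on their own in `exposed`; only paths that end in a scalar value do.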

  3. I want an aggregator KeyName provider, one that takes a few different keys from all over the JSON and unites them in a single property. The UI for that could be a bit more cumbersome and, thinking out loud, it could even work on properties we are already exposing via the other KeyName Providers. Or do you think we should keep this one at the same level? Same level means fewer dependencies, which is good. Post-processing means a different level, where the keys can be selected instead of typed by the user. The need for this is: get all referenced external URLs around the JSON and put them inside a single Solr field named URIS.

The logic here is simple:

This plugin takes a bunch of keys, accumulates the values from all of them, and then exposes them all under a single, different key name.
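A minimal Python sketch of that accumulation step follows, assuming top-level keys only. The function name `aggregate_keys`, the source key names, and the example.org URLs are made up for illustration; only the target name URIS comes from the description above.

```python
from typing import Any


def aggregate_keys(
    doc: dict[str, Any], source_keys: list[str], target: str
) -> dict[str, list[Any]]:
    """Gather the values of several keys under one differently named key.

    Single values and lists are both accepted; duplicates are dropped
    while keeping first-seen order. Missing keys are ignored.
    """
    values: list[Any] = []
    for key in source_keys:
        value = doc.get(key)
        if value is None:
            continue
        items = value if isinstance(value, list) else [value]
        for item in items:
            if item not in values:
                values.append(item)
    return {target: values}


# Hypothetical SBF JSON with URLs scattered across several keys.
sbf_json = {
    "label": "A photo",
    "seealso": ["https://example.org/a", "https://example.org/b"],
    "source_url": "https://example.org/a",
    "rights_uri": "https://example.org/rights",
}

exposed = aggregate_keys(
    sbf_json, ["seealso", "source_url", "rights_uri", "not_present"], "URIS"
)
```

In this sketch the source JSON is left untouched; the aggregated property exists only in the returned mapping, which matches the idea of exposing a computed property rather than writing it back.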

Questions:

Do we want to name or prefix fields coming from a given KeyName provider differently, so people can deduce which provider exposed them? Ideas?

@giancarlobi @marlo-longley

giancarlobi commented 5 years ago

@DiegoPino I'll start from the last question: I agree with using a unique prefix per KeyName provider, because I think this can help and make the code simpler. E.g. the 'ap' prefix is reserved for Archipelago internals. I know people are free to use their own vocabulary, but we need some fixed points.

Regarding 2: I fully agree. The SBF Vocabulary is dynamic and could grow a lot, so exposing only useful keys (leaves with a value) will be really appreciated by users.

Regarding 3: if I understood correctly, the aggregator is something like a harvester that collects keys of the same type from the SBF JSON, as you explain. Is the need only to index them, or also to put them into the SBF JSON?

Regarding 1: I think it should be automatic, at least in the first stage, for internal prefixes such as 'ap'. Allowing users to set their own keys for references requires very skilled users, so I'd leave that for a later step.

Finally, sorry that my notes are not as deep into the code as yours; I hope this helps.

DiegoPino commented 5 years ago

Hi @giancarlobi thanks as always!

  1. OK, that makes sense: using prefixes for each JsonKeyProvider. Since they become dynamic properties of the SBF field, and are not always 1:1 to the actual JSON keys, we can use any prefix convention. We can test and talk about this. It could be the machine name as here http://localhost:8001/admin/structure/strawberry_keynameprovider or something taken automatically from the Plugin type like https://github.com/esmero/strawberryfield/blob/8.x-1.0-beta1/src/Plugin/StrawberryfieldKeyNameProvider/JsonldKeyNameProvider.php#L25
  2. Totally. Yes, I agree. I will plan for that.
  3. The aggregator will work almost the same as the existing KeyName Providers we have, like http://localhost:8001/admin/structure/strawberry_keynameprovider/hocr_service/edit?destination=/admin/structure/strawberry_keynameprovider The main difference is that, instead of using one JSON key to fetch data and the same JSON key as the property for the field (what we see when adding properties on our Solr Index), it will take N real JSON properties (imagine we have many JSON keys that hold Person names, like agent_name, name, creator_name, etc.) and expose their values under one single property name (e.g. aggregated:unified_agents), for example for Solr. So, as you correctly say, the need right now is only indexing, where people can see and use the properties, but they can be accessed from any piece of code too. For the entity one, we need to fetch all referenced Nodes, etc., in a single property. A call to that would be something like $node->get('field_descriptive_metadata')->referencedEntities() which is like doing, e.g., $node->get('field_descriptive_metadata')->get('entities:allmynodes')->getValue(); if we name the exposed property in the new KeyNameProvider entities:allmynodes. Hope this makes sense.

Your analysis is top notch! "strabiliante!" and your suggestions are excellent. This all feels disconnected from your strawberry runners work, but it is not. I already have a great idea on how to process data (JSON) in better ways using better exposed properties.

Big hug!

giancarlobi commented 5 years ago

> This all feels disconnected from your strawberry runners work, but it is not. I already have a great idea on how to process data in better ways (JSON) using better exposed properties.

I have no doubt that this was related to runners, I fully trust your "fantastiche" ideas, thanks again!

DiegoPino commented 4 years ago

Some progress here! Entity Reference indexing, coming from any JSON key via JMESPath:

https://github.com/esmero/strawberryfield/commit/a3a95a02be866de918a7cb08926ecdc8d544261b

More soon.