Question: should we exclude/allow exclusion of JSON elements in the automatic vocabulary generator?

DiegoPino commented 4 years ago

Automatic Vocabulary generation is (in my opinion) the coolest (++factor) feature we have, and it is almost 2 years old already. But, as cool as it is, we have not reused it much across the stack.

Today, while playing with EAD V3 import (XML to JSON) via that new Widget I wrote, I found myself producing this vocabulary:

Which, OK, makes sense, but strictly speaking it is not "our vocabulary" but one particular to a particular ingest, and we could have quite a lot of different schemas. This also applies to EXIF.

So the question is: do we add a form/setting so that certain KEYS are excluded from the vocabulary and (hear me out here) also from the JSON KEY flattener, the one that would use too much memory to be useful if the JSON goes too deep? I could exclude all the flv: prefixed vocabs, since EXIF tags are not THAT useful in a vocab, really.
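To make the idea concrete, here is a minimal sketch of prefix-based exclusion in a key flattener. It is written in Python for brevity and is not the module's actual code; `flatten_keys`, `excluded_prefixes`, and the sample data are hypothetical names chosen for illustration. It assumes that an excluded key is dropped together with everything nested under it, which is what would keep memory use down.

```python
from typing import Any, Iterable, Set


def flatten_keys(data: Any, excluded_prefixes: Iterable[str] = ()) -> Set[str]:
    """Collect every JSON key in `data`, skipping excluded keys and their subtrees."""
    prefixes = tuple(excluded_prefixes)
    found: Set[str] = set()

    def walk(node: Any) -> None:
        if isinstance(node, dict):
            for key, value in node.items():
                # Assumption: a key matching an excluded prefix is dropped
                # along with everything nested below it.
                if prefixes and key.startswith(prefixes):
                    continue
                found.add(key)
                walk(value)
        elif isinstance(node, list):
            # Lists carry no keys of their own; only descend into the items.
            for item in node:
                walk(item)

    walk(data)
    return found


# Hypothetical metadata with an EXIF-like block under an "flv:" prefixed key.
sample = {
    "label": "A photo",
    "flv:exif": {"Make": "Canon", "Model": "EOS"},
    "subjects": [{"name": "Birds"}],
}
vocabulary = flatten_keys(sample, excluded_prefixes=["flv:"])
```

With the `flv:` prefix excluded, neither the `flv:exif` key nor the `Make`/`Model` keys below it reach the vocabulary, while `label`, `subjects`, and `name` still do.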

I know @giancarlobi understands how this works. I wonder if @alliomeria knows this/has seen this vocab builder in the Archipelagos accessible to her as a user, and has an opinion?

Ideas? Opinion? Questions?

alliomeria commented 4 years ago

So the question is: do we add a form/setting so that certain KEYS are excluded from the vocabulary and (hear me out here) also from the JSON KEY flattener, the one that would use too much memory to be useful if the JSON goes too deep? I could exclude all the flv: prefixed vocabs, since EXIF tags are not THAT useful in a vocab, really.

Having a form/setting to exclude certain KEYS from vocabulary would be a useful feature. Definitely think having the option to exclude all those 'flv' EXIF bits would be especially helpful, considering the maker/model variations encountered in those prefixed vocabs and the questionable value of the info contained within for practical purposes (looking at you 'RichohRoll').

While not Archipelago-specific, I have encountered similar vocabulary/indexing configuration options in ILS/catalog/discovery layers (specific field/tag exclusions, such as those 33x RDA fields in MARC records, local notes).

DiegoPino commented 4 years ago

I agree @alliomeria, thank you. I wonder if adding a new excluded key (or prefix) should also dump any vocabularies that already exist. I guess the less effort the user needs to put in, the better.

DiegoPino commented 4 years ago

@giancarlobi @alliomeria following up here. After installing the new XML Importer Webform element from Beta3, doing some testing, and ingesting a deeply nested EAD 2002 file, I found myself convinced not only that it was better to exclude certain JSON KEYS by default, but that it was a necessity! A single XML element generated a few thousand taxonomy terms.

This is good because it is also hands-on proof that:
1.- Our JSON strategy is WAY faster and better than the internal SQL-based approach that Drupal uses for this type of data (Drupal died on me a few times just trying to delete that many terms, shame on you Drupal).
2.- Data that is particular to a certain moment/need/single Object (like this XML example), but also EXIF and metadata settings that can drive display, needs to be excluded.
3.- We can also add a depth setting. If the metadata/JSON is too deeply nested, we can also drop the finer-grained elements.
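The depth setting from point 3 could look roughly like the sketch below. Again this is Python for illustration only, not the module's code; `flatten_keys_to_depth`, `max_depth`, and the sample structure are hypothetical. It assumes depth is counted in nested objects (top-level keys are level 1) and that arrays do not add a level.

```python
from typing import Any, Set


def flatten_keys_to_depth(data: Any, max_depth: int) -> Set[str]:
    """Collect JSON keys, ignoring any key nested deeper than `max_depth` objects."""
    found: Set[str] = set()

    def walk(node: Any, depth: int) -> None:
        if isinstance(node, dict):
            # Stop descending once the configured depth has been reached.
            if depth >= max_depth:
                return
            for key, value in node.items():
                found.add(key)
                walk(value, depth + 1)
        elif isinstance(node, list):
            # Assumption: arrays do not count as an extra nesting level.
            for item in node:
                walk(item, depth)

    walk(data, 0)
    return found


# Hypothetical, heavily simplified EAD-like nesting.
nested = {"ead": {"archdesc": {"dsc": {"c01": {"c02": {}}}}}}
shallow = flatten_keys_to_depth(nested, max_depth=2)
```

Here only `ead` and `archdesc` are kept; the finer-level `dsc`, `c01`, and `c02` keys never become terms.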

I will now need a global Archipelago settings form for this kind of thing.

esmero / strawberryfield

Question: should we exclude /allow exclusion of JSON elements in the automatic vocabulary generator? #95