MI-DPLA / combine

Combine /kämˌbīn/ - Metadata Aggregator Platform
MIT License

Explore custom, ad-hoc, or alternative types of ES field mappers #187

Closed ghukill closed 6 years ago

ghukill commented 6 years ago

Currently, Combine supports two ways of mapping hierarchical XML to flat fields for ElasticSearch:

But these are clearly not enough. The GenericMapper has performed very well for a variety of metadata record types but, by virtue of being generic, it lacks the ability to craft ES indexes tailored to different types of analysis.

Currently, these mappers are hardcoded in core.spark.es, with the selector looking for classes that extend BaseMapper. But, if this were to be extended, it's likely that these should be read from the DB.

Possible alternate mapper types:

Starting this issue as a focal point for this exploration of alternate mapping types and approaches.

ghukill commented 6 years ago

It's becoming evident that this is not only desirable, but necessary. A recent harvest of 200k+ records, using the generic XPath based mapper, resulted in an ElasticSearch error of too many fields for an index:

Found unrecoverable error [159.65.241.151:9200] returned Bad Request(400) - Limit of total fields [1000] in index [j122] has been exceeded; Bailing out..

This can be bumped up, but that ignores the problem. The ballooning of ES field names from the generic mapper comes from XML attributes with unique values for each record. Take this mods:subject for example:

<mods:subject valueURI="http://id.loc.gov/authorities/subjects/sh2008100297" authorityURI="http://id.loc.gov/authorities/subjects" authority="lcsh">
    <mods:topic>Conduct of life</mods:topic>
    <mods:genre>Juvenile fiction</mods:genre>
</mods:subject>

This kind of linking to authorities is good and should be encouraged, but results in ES field names that are not helpful:

mods_subject_@authority=lcsh_@authorityURI=http://id_loc_gov/authorities/subjects_@valueURI=http://id_loc_gov/authorities/subjects/sh2008100297_topic

mods_subject_@authority=lcsh_@authorityURI=http://id_loc_gov/authorities/subjects_@valueURI=http://id_loc_gov/authorities/subjects/sh2008100297_genre

Moving forward with reworking mappers, one addition to the GenericMapper that might help stop this "ballooning" process would be supplying XML attributes to skip. In this case, that would be any valueURI attribute.

This could be supplied at the time of indexing, or better yet, could be added to the emerging idea of a FieldMapper model.
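As a rough illustration of the skip idea, here is a minimal sketch, not Combine's actual GenericMapper. The function name `flatten` and the parameters `include_attributes` and `skip_attributes` are hypothetical, and the sketch drops namespace prefixes from field names (so `subject_...` rather than `mods_subject_...`).

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def local_name(tag):
    # ElementTree renders namespaced names as "{uri}local"; keep only the local part
    return tag.rsplit('}', 1)[-1]


def flatten(xml_string, include_attributes=False, skip_attributes=()):
    """Flatten hierarchical XML into {field_name: [values]} (hypothetical helper)."""
    fields = defaultdict(list)

    def walk(element, path):
        segments = path + [local_name(element.tag)]
        if include_attributes:
            for name, value in sorted(element.attrib.items()):
                # Attributes named in skip_attributes never reach the field name
                if local_name(name) not in skip_attributes:
                    segments.append('@%s=%s' % (local_name(name), value))
        text = (element.text or '').strip()
        if text:
            fields['_'.join(segments).replace('.', '_')].append(text)
        for child in element:
            walk(child, segments)

    walk(ET.fromstring(xml_string), [])
    return dict(fields)


# Sample record from above, with the MODS namespace declared so it parses standalone
record = '''<mods:subject xmlns:mods="http://www.loc.gov/mods/v3" valueURI="http://id.loc.gov/authorities/subjects/sh2008100297" authorityURI="http://id.loc.gov/authorities/subjects" authority="lcsh">
  <mods:topic>Conduct of life</mods:topic>
  <mods:genre>Juvenile fiction</mods:genre>
</mods:subject>'''

without_attributes = flatten(record)
without_value_uri = flatten(record, include_attributes=True, skip_attributes=('valueURI',))
```

With `valueURI` skipped, every record sharing the same authority collapses onto the same field names, instead of producing a new field per record.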

ghukill commented 6 years ago

After chatting with @colehudson about mapping records, came to ideas and decisions for moving forward:

GenericMapper should skip XML element attributes by default, but allow users to include at index time

When the decision to include all attributes from XML elements for mapping was made, the testing records did not include XML elements with unique attribute values across all records. When this occurs, as in the case of the valueURI above, the number of mapped fields that will be indexed in ElasticSearch jumps astronomically, to the point that it becomes unfeasible.

However, the GenericMapper is important to Combine, as it allows for bad / messy / unexpected data to surface in the field analysis screens. Finding latent or unknown patterns is one of Combine's strengths, and it would be a shame to lose that.

To this end, I'm envisioning that when running a Job a user can opt to "include attributes in generically mapped fields." This would, however, include a warning that explains this could balloon the number of fields. This would also include a text box where users could enter attributes to skip when building generic, dynamic field names (e.g. valueURI).

Allow users to create custom mapped fields

The idea is to allow users to map XPath expressions to ElasticSearch fields that they name. For example, //dc:title could be mapped to a field they name dc_title, or even just title.

This would run in addition to the GenericMapper, but these special mapped fields would be identified in the GUI. Perhaps prefix/suffix field names, or have them stored such that they could be referenced.
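A minimal sketch of what such a named mapping might look like, using only the standard library. The function `map_custom_fields`, the `NAMESPACES` dictionary, and the sample record are all hypothetical. Note that `xml.etree.ElementTree` supports only a subset of XPath and rejects absolute paths such as `//dc:title` on an element, so the sketch uses the relative form `.//dc:title`; a full XPath engine would be needed for arbitrary expressions.

```python
import xml.etree.ElementTree as ET

# Hypothetical prefix-to-URI map; a real configuration would supply its own
NAMESPACES = {'dc': 'http://purl.org/dc/elements/1.1/'}


def map_custom_fields(xml_string, xpath_to_field, namespaces=NAMESPACES):
    """Map user-supplied XPath expressions to user-named fields (hypothetical helper)."""
    root = ET.fromstring(xml_string)
    fields = {}
    for xpath, field_name in xpath_to_field.items():
        values = [(el.text or '').strip() for el in root.findall(xpath, namespaces)]
        values = [value for value in values if value]
        if values:
            fields[field_name] = values
    return fields


# Illustrative record, not taken from a real harvest
dc_record = '''<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Conduct of life</dc:title>
  <dc:creator>Unknown</dc:creator>
</record>'''

custom_fields = map_custom_fields(dc_record, {'.//dc:title': 'title'})
```

Because the result is a plain dictionary keyed by the user's field names, it could be merged with the generically mapped fields before indexing.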

In theory this should work if XSLT mappers are supported as well, as those XSLT documents should return XML that looks like:

<fields>
    <field name='foo'>value</field>
    ...
</fields>

but is eventually turned into a dictionary like:

{
    'foo':'value'
}

that could undergo the same secondary treatment as the custom mappers.
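The conversion from that `<fields>` document to a dictionary could be sketched as follows. The function name `fields_xml_to_dict` is hypothetical, and this simple version keeps only the last value when a field name repeats.

```python
import xml.etree.ElementTree as ET


def fields_xml_to_dict(xml_string):
    """Turn <fields><field name='...'>...</field></fields> into a flat dictionary."""
    root = ET.fromstring(xml_string)
    # Repeated names overwrite earlier ones in this simplified sketch
    return {field.get('name'): (field.text or '') for field in root.findall('field')}


xslt_output = "<fields><field name='foo'>value</field></fields>"
mapped = fields_xml_to_dict(xslt_output)
```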

These "custom mappers" could also store attributes for the GenericMapper to skip, as a means to save those configurations. A custom mapper could even consist solely of attributes to skip, which would still be handy as a place to store them.

Re-run indexing in place

Allow Jobs to be re-indexed in place. Formerly, the idea was that Merge or Analysis Jobs would be used if a new mapping and indexing was desired. But it's become evident that re-indexing with custom or different mappers might be desirable, and would cut down on processing time / DB storage.

Once the ES index was dropped, and the new mapper used, the Job would reload in the GUI with the new indexed documents as if nothing had happened.

Note: This would have bearing on Published Records, as they copy the ES index from a Job wholesale. If deemed necessary, the Job could a) determine whether it was published, and b) if so, overwrite those documents in the ES index.

Organization of mappers in Combine

This would all conspire to suggest the GenericMapper can remain hardcoded in core.spark.es, while these "custom mappers" could be stored in the DB (containing special, named XPath mappings, and attributes to skip for the GenericMapper).

ghukill commented 6 years ago

Re-indexing "in place" complete.

For other points, XML2kvp is adding some rationality to this situation, and is poised to replace the GenericMapper. It allows for:

The work now is to rework and reword the index mapping throughout.

ghukill commented 6 years ago

Mostly implemented with XML2kvp; issues cover other bugs and future work.