Closed ghukill closed 6 years ago
It's becoming evident that this is not only desirable, but necessary. A recent harvest of 200k+ records, using the generic XPath-based mapper, resulted in an ElasticSearch error of too many fields for an index:
Found unrecoverable error [159.65.241.151:9200] returned Bad Request(400) - Limit of total fields [1000] in index [j122] has been exceeded; Bailing out..
This can be bumped up, but that ignores the problem. The ballooning of ES field names from the generic mapper comes from XML attributes with unique values for each record. Take this mods:subject for example:
<mods:subject valueURI="http://id.loc.gov/authorities/subjects/sh2008100297" authorityURI="http://id.loc.gov/authorities/subjects" authority="lcsh">
<mods:topic>Conduct of life</mods:topic>
<mods:genre>Juvenile fiction</mods:genre>
</mods:subject>
This kind of linking to authorities is good and should be encouraged, but results in ES field names that are not helpful:
mods_subject_@authority=lcsh_@authorityURI=http://id_loc_gov/authorities/subjects_@valueURI=http://id_loc_gov/authorities/subjects/sh2008100297_topic
mods_subject_@authority=lcsh_@authorityURI=http://id_loc_gov/authorities/subjects_@valueURI=http://id_loc_gov/authorities/subjects/sh2008100297_genre
Moving forward with reworking mappers, one addition to the GenericMapper that might help stop this "ballooning" process would be supplying XML attributes to skip. In this case, that would be any attribute named valueURI.
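As a rough illustration of the skip-attributes idea, here is a minimal Python sketch. It is not the GenericMapper itself: the function name `flatten`, the `skip_attributes` parameter, and the wrapping `mods:mods` root element with its namespace declaration are assumptions, and the field naming scheme is inferred from the field names shown above.

```python
import xml.etree.ElementTree as ET

# Example record; the mods:mods wrapper and xmlns declaration are assumed
# so that the snippet parses on its own.
RECORD = """
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:subject valueURI="http://id.loc.gov/authorities/subjects/sh2008100297" authorityURI="http://id.loc.gov/authorities/subjects" authority="lcsh">
    <mods:topic>Conduct of life</mods:topic>
    <mods:genre>Juvenile fiction</mods:genre>
  </mods:subject>
</mods:mods>
""".strip()


def flatten(elem, skip_attributes=(), path=""):
    """Yield (field_name, text) pairs for every element that carries text.

    Hypothetical helper: attributes listed in skip_attributes are left out
    of the generated field name.
    """
    parts = [elem.tag.rsplit("}", 1)[-1]]  # drop the '{namespace}' part
    for attr in sorted(elem.attrib):
        attr_name = attr.rsplit("}", 1)[-1]
        if attr_name in skip_attributes:
            continue
        parts.append(f"@{attr_name}={elem.attrib[attr]}")
    # Dots are replaced with underscores, as in the field names above.
    segment = "_".join(parts).replace(".", "_")
    path = f"{path}_{segment}" if path else segment
    text = (elem.text or "").strip()
    if text:
        yield path, text
    for child in elem:
        yield from flatten(child, skip_attributes, path)


root = ET.fromstring(RECORD)
with_value_uri = dict(flatten(root))
without_value_uri = dict(flatten(root, skip_attributes={"valueURI"}))

for name in without_value_uri:
    print(name)
```

With valueURI skipped, every record that shares the same authority and authorityURI maps to the same field names, so the field count no longer grows with the number of distinct subject URIs.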
This could be supplied at the time of indexing, or better yet, could be added to the emerging idea of a FieldMapper model.
After chatting with @colehudson about mapping records, we came to some ideas and decisions for moving forward:
When the decision to include all attributes from XML elements for mapping was made, the testing records did not include XML elements with unique attribute values across all records. When this occurs -- like in the case of the valueURI above -- the number of mapped fields that will be indexed in ElasticSearch jumps astronomically, to the point that it becomes infeasible.
However, the GenericMapper is important to Combine, as it allows for bad / messy / unexpected data to surface in the field analysis screens. Finding latent or unknown patterns is one of Combine's strengths, and it would be a shame to lose.
To this end, envisioning that when running a Job a user can opt to "include attributes in generically mapped fields." This would, however, include a warning that explains this could balloon the number of fields. This would also include a text box where users could enter attributes to skip when building generic, dynamic field names (e.g. valueURI).
The idea is to allow users to map XPath expressions to ElasticSearch fields that they name. For example, //dc:title could be mapped to a field they name dc_title, or even just title.
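A minimal sketch of what such a named mapping might look like in Python. All names here (`XPATH_FIELDS`, `map_named_fields`, the sample record) are hypothetical. It uses the standard library's ElementTree, which supports only a subset of XPath, so `//x` is rewritten to `.//x`; a full XPath engine would be needed for anything beyond simple descendant lookups.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {"dc": "http://purl.org/dc/elements/1.1/"}

# Hypothetical user-supplied mapping: XPath expression -> ES field name.
XPATH_FIELDS = {"//dc:title": "title"}

# Made-up sample record for illustration only.
RECORD = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example title</dc:title>
</record>"""


def map_named_fields(xml_string, xpath_fields, namespaces):
    """Return {field_name: [values]} for each XPath that matches something."""
    root = ET.fromstring(xml_string)
    fields = {}
    for xpath, field_name in xpath_fields.items():
        # ElementTree paths are relative to the element searched, so a
        # leading '//' is turned into './/'. Other absolute paths are not
        # handled by this sketch.
        et_path = "." + xpath if xpath.startswith("//") else xpath
        values = [
            el.text.strip()
            for el in root.findall(et_path, namespaces)
            if el.text and el.text.strip()
        ]
        if values:
            fields[field_name] = values
    return fields


named_fields = map_named_fields(RECORD, XPATH_FIELDS, NAMESPACES)
print(named_fields)
```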
This would run in addition to the GenericMapper, but these special mapped fields would be identified in the GUI. Perhaps prefix/suffix field names, or have them stored such that they could be referenced.
In theory this should work if XSLT mappers are supported as well, as those XSLT documents should return XML that looks like:
<fields>
<field name='foo'>value</field>
...
</fields>
but is eventually turned into a dictionary like:
{
'foo':'value'
}
that could undergo the same secondary treatment to the custom mappers.
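The conversion from that `<fields>` XML to a dictionary could be sketched as follows. The function name `fields_xml_to_dict` is hypothetical, and collecting repeated field names into a list is an assumption about how repeated fields should behave.

```python
import xml.etree.ElementTree as ET


def fields_xml_to_dict(fields_xml):
    """Convert <fields><field name='...'>...</field></fields> into a dict.

    Repeated names are collected into a list (an assumption of this sketch).
    """
    root = ET.fromstring(fields_xml)
    result = {}
    for field in root.findall("field"):
        name = field.get("name")
        if name is None:
            continue  # ignore <field> elements without a name attribute
        value = (field.text or "").strip()
        if name in result:
            existing = result[name]
            if isinstance(existing, list):
                existing.append(value)
            else:
                result[name] = [existing, value]
        else:
            result[name] = value
    return result


xslt_output = "<fields><field name='foo'>value</field></fields>"
print(fields_xml_to_dict(xslt_output))
```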
These "custom mappers" could also include attributes to skip during GenericMapper, as a means to save those configurations. In this way, a custom mapper could serve solely as a place to store attributes to skip, which would be equally handy.
Allow Jobs to be re-indexed in place. Formerly, the idea was that Merge or Analysis Jobs would be used if a new mapping and indexing were desired. But it's become evident that re-indexing with custom or different mappers might be desirable, and would cut down on processing time / DB storage.
Once the ES index is dropped and the new mapper applied, the Job would reload in the GUI with the newly indexed documents as if nothing had happened.
Note: This would have bearing on Published Records, as they copy the ES index from a Job wholesale. If deemed necessary, the Job could a) determine whether it was published, and b) if so, overwrite those documents in the ES index.
This would all conspire to suggest the GenericMapper can remain hardcoded in core.spark.es, while these "custom mappers" could be stored in the DB (containing special, named XPath mappings, and attributes to skip for the GenericMapper).
Re-indexing "in place" complete.
For other points, XML2kvp is adding some rationality to this situation, and is poised to replace the GenericMapper. It allows for:
The work now is to rework and reword the index mapping throughout.
Mostly implemented with XML2kvp; issues cover other bugs and future work.
Currently, Combine supports two ways of mapping hierarchical XML to flat fields for ElasticSearch:
GenericMapper
MODSMapper
But these are clearly not enough. The GenericMapper has performed very well for a variety of metadata record types. But, by virtue of being generic, it lacks the ability to craft ES indexes that are helpful for different types of analysis.
Currently, these mappers are hardcoded in core.spark.es, with the selector looking for classes that extend BaseMapper. But, if this were to be extended, it's likely that these should be read from the DB.
Possible alternate mapper types:
field_name:value
//mods:mods/mods:extension/PID,identifier
where the former is the XPath to use and the latter is the desired ES field name.
Starting this issue as a focal point for this exploration of alternate mapping types and approaches.
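To make the `XPath,field_name` line format mentioned above concrete, here is one way such lines might be parsed in Python. The function name `parse_mapping_lines` is hypothetical. Splitting on the last comma is an assumption, chosen because XPath expressions can themselves contain commas (for example inside function calls) while a field name normally would not.

```python
def parse_mapping_lines(text):
    """Parse lines of 'XPath,field_name' into {field_name: xpath}."""
    mappings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the LAST comma so commas inside the XPath survive.
        xpath, sep, field_name = line.rpartition(",")
        xpath, field_name = xpath.strip(), field_name.strip()
        if not sep or not xpath or not field_name:
            raise ValueError(f"expected 'XPath,field_name', got: {line!r}")
        mappings[field_name] = xpath
    return mappings


mapping = parse_mapping_lines("//mods:mods/mods:extension/PID,identifier")
print(mapping)
```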