gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

Disable indexing of footprintWKT elastic field for livingatlas pipelines #997

Open jack-brinkman opened 7 months ago

jack-brinkman commented 7 months ago

We've run into an issue where Elasticsearch indexing fails when a document contains a very large footprintWKT value. Example log snippet below:

Document id aaed8bbb4a046dda5f430b67e4029f8c570e519a: Document contains at least one immense term in field="occurrence.footprintWKT" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[76, 73, 78, 69, 83, 84, 82, 73, 78, 71, 32, 40, 49, 52, 54, 46, 48, 56, 57, 57, 49, 52, 57, 32, 45, 49, 55, 46, 53, 53]...', original message: bytes can be at most 32766 in length; got 107438 (illegal_argument_exception)
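For readability, the byte prefix in the log can be decoded to see what kind of value is being rejected. The snippet below is only an illustrative sketch: it copies the byte values from the message above and decodes them as UTF-8 (the variable names are made up for this example).

```python
# Byte values copied verbatim from the "prefix of the first immense term" in the log.
prefix_bytes = [
    76, 73, 78, 69, 83, 84, 82, 73, 78, 71, 32, 40, 49, 52, 54,
    46, 48, 56, 57, 57, 49, 52, 57, 32, 45, 49, 55, 46, 53, 53,
]

# The log states the term is UTF-8 encoded, so decode it the same way.
decoded_prefix = bytes(prefix_bytes).decode("utf-8")
print(decoded_prefix)  # LINESTRING (146.0899149 -17.55

# Sizes reported in the log: the limit and the actual term length, in bytes.
max_term_bytes = 32766
actual_term_bytes = 107438
print(actual_term_bytes - max_term_bytes)  # bytes over the limit
```

So the offending value is a WKT LINESTRING whose encoded length (107438 bytes) is more than three times the 32766-byte term limit.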

I've confirmed with @djtfmartin that this field doesn't need to be searchable or aggregatable, so we should disable indexing on it. That would let us successfully index datasets with these large footprintWKT values.
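As a rough sketch of what the change could look like, the snippet below builds a mapping fragment that turns off indexing for the field. It assumes the field is mapped as a `keyword` nested under an `occurrence` object (inferred from the `occurrence.footprintWKT` path in the log); the actual livingatlas schema file and its layout aren't shown here, and the helper name is hypothetical. `index` and `doc_values` are standard Elasticsearch mapping parameters. Both are set to `false` on the assumption that doc values are subject to the same per-term size limit as the inverted index.

```python
import json


def unindexed_keyword() -> dict:
    """Hypothetical helper: a keyword field that is stored in _source only."""
    return {
        "type": "keyword",
        "index": False,       # not searchable: no inverted-index terms are written
        "doc_values": False,  # not aggregatable/sortable: no doc values are written
    }


# Assumed layout: footprintWKT as a property of the "occurrence" object.
mapping_fragment = {
    "properties": {
        "occurrence": {
            "properties": {
                "footprintWKT": unindexed_keyword(),
            }
        }
    }
}

print(json.dumps(mapping_fragment, indent=2))
```

With a mapping like this the value should still be returned in `_source` with each document; it just can't be used in queries or aggregations, which matches what was agreed above.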