GSA / data.gov

Main repository for the data.gov service
https://data.gov

Fix Solr Error when editing NASA Harvest Source (prod) #4236

Closed nickumia-reisys closed 1 year ago

nickumia-reisys commented 1 year ago

How to reproduce

  1. Go to https://catalog-prod-admin-datagov.app.cloud.gov/harvest/edit/nasa-data-json
  2. Change the harvest frequency (or change nothing at all).
  3. Click Save.

Expected behavior

Action Successful

Actual behavior

Unable to update search index.('Solr returned an error: Solr responded with an error (HTTP 400): 
[Reason: Exception writing document id 17ee7418517c3e154b7275c4253b7c0c0d24a5ce to the index; possible analysis error: 
Document contains at least one immense term in field="status" (whose UTF8 encoding is longer than the max length 32766), 
all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: 
\'[123, 34, 106, 111, 98, 95, 99, 111, 117, 110, 116, 34, 58, 32, 49, 53, 44, 32, 34, 108, 97, 115, 116, 95, 106, 111, 98, 34, 58, 32]...\', 
original message: bytes can be at most 32766 in length; got 43571. 
Perhaps the document has an indexed string field (solr.StrField) which is too large]',) 
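
The byte prefix quoted in the error can be decoded to see what the oversized term starts with. A minimal sketch, using only the byte values copied from the message above (the variable names are illustrative):

```python
# Byte values copied verbatim from "The prefix of the first immense term is: [...]"
prefix_bytes = [
    123, 34, 106, 111, 98, 95, 99, 111, 117, 110, 116, 34, 58, 32, 49, 53,
    44, 32, 34, 108, 97, 115, 116, 95, 106, 111, 98, 34, 58, 32,
]

# Solr reports the term as UTF-8 bytes, so decode it the same way.
decoded_prefix = bytes(prefix_bytes).decode("utf-8")
print(decoded_prefix)
```

The decoded text is the start of a JSON object with `job_count` and `last_job` keys, which suggests the `status` field holds a serialized harvest job report.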

Other notes

Sketch

Ask @FuhuXia

FuhuXia commented 1 year ago

The issue is caused by the size of the status message from the last job report. It is hard to reproduce because other harvest sources, and other jobs from the same NASA source, do not have such a large status message.

The screenshot shows one sample object_error_summary message. There are 20 of them in the status field, so the total size is more than 40k characters, exceeding the Solr StrField limit of 32766 bytes.
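
To illustrate the size problem, here is a rough sketch of a check against the limit quoted in the Solr error. The shape of the status object is an assumption made for illustration (only the `job_count`, `last_job`, and `object_error_summary` names appear in this thread), and the helper names are hypothetical, not part of CKAN or ckanext-harvest.

```python
import json

# Limit quoted in the Solr error: max UTF-8 length of a single indexed term.
SOLR_MAX_TERM_BYTES = 32766


def utf8_len(value: str) -> int:
    """Return the length of value in UTF-8 bytes, which is what Solr measures."""
    return len(value.encode("utf-8"))


def exceeds_solr_term_limit(value: str, limit: int = SOLR_MAX_TERM_BYTES) -> bool:
    """True if value would be rejected as an 'immense term' in a StrField."""
    return utf8_len(value) > limit


# Hypothetical status payload: 20 error summaries of roughly 2,150 characters each.
# The nesting below is assumed, not taken from the real harvest job report.
status = json.dumps(
    {
        "job_count": 15,
        "last_job": {
            "object_error_summary": [{"message": "x" * 2150} for _ in range(20)]
        },
    }
)

print(utf8_len(status), exceeds_solr_term_limit(status))
```

The limit is in bytes, not characters, so messages with non-ASCII text reach it sooner than a character count suggests.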

FuhuXia commented 1 year ago

The proposed fix is to truncate the large error message. From

message: "Identifier: C1214353986-ASF; Title: UAVSAR_POLSAR_METADATA; 1 Error(s) Found. ### ERROR #1: 'theme':['[\n  "Hayward Fault', 'CA"', '"Laurentides Reserve', 'QC', 'Canada"', '"Capitol Forest', 'WA"', '"Yellowstone National Park', 'WY"', '"Sierra', 'CA"', '"Panhandle', 'FL"', '"Isla de Coiba', 'Panama"', '"Chilean Volcanoes', 'Chile"', '"Oah', 'HI southeast"', '"Sabine Refuge', 'LA"', '"Barrier Islands', 'MS"', '"Northwest Coast', 'FL"', '"Tolima Volcano', 'Colombia"', '"Napo River', 'Peru/Ecuador"', '"SMAP Drought', 'TX"', '"SMAP MOISST Flux Tower Site', 'OK"', '"Buenos Aires Province', 'Argentina"', '"Laguna Del Maule Volcano', 'Chile/Argentin"', '"Reventador Volcano', 'Ecuador"', '"Antuco Volcano', 'Chile"', '"Chillan Volcano', 'Chile"', '"Imbabura Volcano', 'Ecuador"', '"Descabezado Grande Volcano', 'Chile"', '"Cascade Volcanoes', 'WA"', '"Tonzi Ranch', 'CA"', '"Panama Canal forests', 'Panama"', '"Cerro Negro Volcano', 'Colombia/Ecuador"', '"Rosario', 'Argentina"', '"San Antonio de Areco', 'Argentina"', '"Grand Mesa', 'CO"', '"Longview', 'TX"', '"Libreville', 'Gabon"', '"Ogooue River', 'Gabon"', '"Trout Lake', 'Canada"', '"Delta Junction', 'Alaska"', '"Yukon Flats', 'Alaska"', '"Old Crow', 'Canada"', '"Trinity River', 'TX"', '"Sabine River', 'TX"', '"Lloydminster East', 'Saskatoon"', '"South Fort Smith', 'Canada"', '"Innoko Flats"', '"Coldfoot Legacy Line"', '"TomoSAR offset line 64 meters"', '"Teller NGEE"', '"Fuego Volcano', 'Guatemala"', '"Berms TomoSAR 240m baseline"', '"Croatan National Forest', 'NC"', '"Delta Junction NEON site"', '"Ridgecrest', 'CA"', '"Atchafalaya River Delta', 'LA"', '"New Orleans Levee', 'LA"', '"Dominican Republic"', '"La Amistad International Park', 'Panama"', '"Howland Forest', 'ME"', '"Grand County', 'CO"', '"San Joaquin Valley', 'CA"', '"Corcovado National Park', 'Costa Rica"', '"East Central Coast', 'LA"', '"Lanai/Maui/Molokai/Oah', 'HI"', '"Yosemite National Park', 'CA"', '"Florida Keys', 'FL"', '"Barataria Bay', 'LA"', 
'"Huila Volcano', 'Colombia"', '"Sangay Volcano', 'Ecuador"', '"Hokkaido Volcanoes', 'Japan"', '"PiSAR-L2 Nara Totsukawa-mura', 'Japan"', '"Cordoba Province', 'Argentina"', '"PiSAR-L2 Kumamoto - Aso', 'Japan"', '"Yacamane Volcano', 'Peru"', '"Tutupaca Volcano', 'Peru"', '"Pacific Mangrove'] is not valid under any of the given schemas."

to

message: "Identifier: C1214353986-ASF; Title: UAVSAR_POLSAR_METADATA; 1 Error(s) Found. ### ERROR #1: 'theme':['[\n  "Hayward Fault', 'CA"', '...] is not valid under any of the given schemas."

FuhuXia commented 1 year ago

The PR above truncates individual error messages. With this fix, the size of the whole object_error_summary message should be greatly reduced.
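
The kind of truncation shown in the From/to example can be sketched as below. This is not the code from the PR; the function name, the default length, and the choice to keep both the start and the end of the message are assumptions made for illustration.

```python
def truncate_middle(message: str, max_len: int = 300, marker: str = "...") -> str:
    """Shorten message to at most max_len characters by cutting out the middle.

    The head keeps the identifier and the start of the error; the tail keeps
    the closing explanation (e.g. "is not valid under any of the given schemas.").
    """
    if len(message) <= max_len:
        return message
    if max_len <= len(marker):
        # Not enough room for a marker; fall back to a plain cut.
        return message[:max_len]

    keep = max_len - len(marker)
    head = (keep + 1) // 2
    tail = keep - head
    # Guard against tail == 0, because message[-0:] would return the whole string.
    return message[:head] + marker + (message[-tail:] if tail else "")


# Hypothetical long validation error, loosely modelled on the message above.
long_message = (
    "Identifier: C1214353986-ASF; 1 Error(s) Found. ### ERROR #1: 'theme':["
    + ", ".join(f"'place {i}'" for i in range(500))
    + "] is not valid under any of the given schemas."
)
short_message = truncate_middle(long_message, max_len=200)
print(len(short_message), short_message)
```

This version counts characters, so a message with many non-ASCII characters can still be larger in bytes than `max_len`. With 20 summaries in the status field, a per-message cap of a few hundred characters keeps the total well below the 32766-byte limit.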

We need to manually rerun the harvest source so that the last job's error message is small enough and the source can be re-indexed to Solr successfully.

FuhuXia commented 1 year ago

Fix verified. NASA Data.json was saved as a monthly job, and rebuilding the index works:

INFO  [ckan.lib.search] Indexing just package 'nasa-data-json'...
INFO  [ckan.lib.search] Finished rebuilding search index.