gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

Badly encoded value slipping through to indexing, which fails #1036

Open MattBlissett opened 4 months ago

MattBlissett commented 4 months ago
Caused by: Illegal input [Las Piedritas Caño Uracoa,] UTF-16 codepoint [0x0] at position 1 is a reserved character (illegal_argument_exception)
    at org.apache.beam.runners.spark.SparkPipelineResult.beamExceptionFrom(SparkPipelineResult.java:71)
    at org.apache.beam.runners.spark.SparkPipelineResult.waitUntilFinish(SparkPipelineResult.java:104)
    at org.apache.beam.runners.spark.SparkPipelineResult.waitUntilFinish(SparkPipelineResult.java:92)
    at org.gbif.pipelines.ingest.pipelines.OccurrenceToEsIndexPipeline.run(OccurrenceToEsIndexPipeline.java:160)
    at org.gbif.pipelines.ingest.pipelines.OccurrenceToEsIndexPipeline.run(OccurrenceToEsIndexPipeline.java:109)

The cause is null characters (0x00) in the locality value, one after almost every character:

grep -Pa '\x00' occurrence.txt | cat -v
E5018165-71F3-40BA-8F91-E77DE59B2BEA    ...   Venezuela  Monagas          L^@a^@s^@ ^@P^@i^@e^@d^@r^@i^@t^@a^@s^@ ^@C^@a^@M-CM-1o^@ ^@U^@r^@a^@c^@o^@a^@,^@ ^@2^@4^@.^@9^@ ^@k^@m^@ ^@S^@W^@ ^@o^@f^@ ^@U^@r^@a^@c^@o^@a.                             ...

I've contacted the dataset owner and suggested they fix this single record.
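
For illustration only, here is a minimal sketch of how a value like this could be detected and cleaned on the pipeline side before it reaches indexing. The class `NullCharCleaner` and its methods are hypothetical names, not part of the pipelines codebase, and nothing in this issue says where (or whether) such a step would be added.

```java
// Hypothetical helper, not taken from gbif/pipelines: detects and strips
// U+0000 characters from a string value before it is sent to indexing.
final class NullCharCleaner {

  private NullCharCleaner() {}

  /** Returns true if the value contains at least one U+0000 character. */
  static boolean containsNullChar(String value) {
    return value != null && value.indexOf('\u0000') >= 0;
  }

  /** Returns the value with every U+0000 character removed; null stays null. */
  static String stripNullChars(String value) {
    if (!containsNullChar(value)) {
      return value; // nothing to clean, avoid allocating a new string
    }
    return value.replace("\u0000", "");
  }

  public static void main(String[] args) {
    // Same pattern as the bad record: a null after each character.
    String bad = "L\u0000a\u0000s\u0000";
    System.out.println(containsNullChar(bad)); // prints "true"
    System.out.println(stripNullChars(bad));   // prints "Las"
  }
}
```

Stripping the nulls only removes the character the indexer rejects. It does not repair whatever encoding problem produced them, so the source record would still need fixing by the dataset owner.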