AtlasOfLivingAustralia / data-management

Data management issue tracking

Ongoing Stability Issues with Pipelines #779

Status: Open. Opened by charvolant 2 years ago.

charvolant commented 2 years ago

The pipelines system is prone to failure when the systems it depends upon have trouble. This tends to lead to a failure to ingest data, which in turn causes problems for clients. This is a bad thing.

The common symptom of these failures is data resources missing from the index due to processing failures. This trips the index-swapping data consistency tests (which is a good thing); a sketch of that style of check follows.
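A minimal sketch of the kind of pre-swap consistency check involved, using SolrJ. The collection names, field name, and sample data resources are illustrative assumptions, not the actual test implementation:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

// Hypothetical pre-swap check: refuse to swap the live alias onto a freshly
// built index if a data resource present in the old index is absent from the
// new one (the "missing data resource" symptom described above).
public class PreSwapCheck {
  public static void main(String[] args) throws Exception {
    try (SolrClient solr =
        new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
      // Illustrative sample; a real check would enumerate all data resources.
      for (String dr : new String[] {"dr368", "dr376"}) {
        SolrQuery q = new SolrQuery("dataResourceUid:" + dr).setRows(0);
        long oldCount = solr.query("biocache-live", q).getResults().getNumFound();
        long newCount = solr.query("biocache-new", q).getResults().getNumFound();
        if (oldCount > 0 && newCount == 0) {
          throw new IllegalStateException(
              dr + " is missing from the new index; aborting swap");
        }
      }
    }
  }
}
```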

djtfmartin commented 2 years ago

For the name matching service and the SDS, it might be worth just testing running these services without Docker. Docker was only used to make installation a bit simpler on each node.

@vjrj has done some nice work packaging these two services up as Debian packages, which removes the need to use Docker.

The image service has had problems with URLs changing at a major provider. Solved at the moment by manually updating the URLs in the database.

My own 2 cents here: I think this is such a unique case/event that it is not worth the technical effort of doing anything special. The issue only arises with large image datasets, and we have only a handful of those, of which a smaller number are regularly updated. iNaturalist has said the URLs should be stable from now on. A low-cost (in terms of technical effort) option might be to add a check to the image-loading pipeline (similar to what we do for UUIDs) that stops the image load if the number of images has changed by more than a certain threshold (e.g. 50%). This could then be manually overridden (as it can be with UUIDs); see the sketch below.
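A hypothetical sketch of that guard, under the 50% threshold suggested above. The class, method, and parameter names are illustrative, not the actual pipeline API:

```java
// Hypothetical image-load guard: abort the load for a data resource when the
// image count has moved by more than a threshold since the last successful
// run, unless the operator has explicitly overridden the check.
public class ImageLoadGuard {
  private static final double MAX_CHANGE = 0.5; // the 50% threshold from the comment above

  public static void checkImageCount(
      String dataResourceUid, long previousCount, long newCount, boolean override) {
    if (override || previousCount == 0) {
      return; // first load, or the operator has accepted the change
    }
    double change = Math.abs(newCount - previousCount) / (double) previousCount;
    if (change > MAX_CHANGE) {
      throw new IllegalStateException(String.format(
          "Image count for %s changed by %.0f%% (%d -> %d); refusing to load without an override",
          dataResourceUid, change * 100, previousCount, newCount));
    }
  }
}
```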

vjrj commented 2 years ago

On our side, we have a transient error 500 in the collectory, described here: https://github.com/AtlasOfLivingAustralia/collectory-plugin/issues/155. As a workaround I'm calling the collectory prior to any ingestion in our Jenkins, as described here: https://github.com/AtlasOfLivingAustralia/collectory-plugin/issues/155#issuecomment-617622134, and we haven't suffered this error again during data processing.

But the error is still there, and it sounds like a transient initialisation/connection-pool error.
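A minimal sketch of that kind of warm-up step, in the spirit of the Jenkins workaround linked above. The endpoint URL, retry counts, and timeouts are illustrative assumptions:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical pre-ingestion warm-up: poll the collectory until it answers
// 200 so the first real request does not land on a cold connection pool.
public class CollectoryWarmup {
  public static void main(String[] args) throws Exception {
    URL url = new URL("https://collections.ala.org.au/ws/dataResource"); // assumed endpoint
    for (int attempt = 1; attempt <= 5; attempt++) {
      try {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(30_000);
        if (conn.getResponseCode() == 200) {
          return; // collectory is awake; safe to start ingestion
        }
      } catch (IOException e) {
        // connection refused or timed out; fall through and retry
      }
      Thread.sleep(5_000L * attempt); // back off before retrying
    }
    throw new IllegalStateException("collectory did not warm up; aborting ingestion");
  }
}
```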

vjrj commented 2 years ago

One thing to take into consideration with regard to the deb packages is that ala-namematching uses a Lucene 8 index and ala-sensitive-data-service does not yet, if I'm not wrong.

So the Docker images are using compatible indexes:
https://github.com/AtlasOfLivingAustralia/ala-sensitive-data-service/blob/e84bfd71a50fb8f7d9294717ce79a935b9c487fd/ala-sensitive-data-server/docker/Dockerfile#L20
https://github.com/AtlasOfLivingAustralia/ala-sensitive-data-service/blob/e84bfd71a50fb8f7d9294717ce79a935b9c487fd/ala-sensitive-data-server/docker/Dockerfile-test#L17

But if we want to use the Debian packages together on the same VM, both should use a Lucene 8 index, as one package depends on the other; see the sketch below for one way to verify this.
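A quick, hypothetical way to check what the on-disk indexes were created with, using Lucene's SegmentInfos API. The index paths are illustrative assumptions:

```java
import java.nio.file.Paths;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.FSDirectory;

// Hypothetical compatibility probe: report the major Lucene version each index
// was created with, so a deployment can confirm that the namematching and
// sensitive-data indexes on a VM agree before starting both services.
public class IndexVersionCheck {
  public static void main(String[] args) throws Exception {
    for (String path : new String[] {"/data/lucene/namematching", "/data/lucene/sds"}) {
      try (FSDirectory dir = FSDirectory.open(Paths.get(path))) {
        int major = SegmentInfos.readLatestCommit(dir).getIndexCreatedVersionMajor();
        System.out.println(path + " was created with Lucene " + major);
      }
    }
  }
}
```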

charvolant commented 2 years ago

Additional error:

```
[INTERPRETED_TO_INDEX] [dr368] ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
    at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:160)
    at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:140)
    at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:655)
    at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:274)
    at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
    at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
    at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1422)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:931)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:700)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:635)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:552)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:514)
    at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)
```

charvolant commented 2 years ago

And:

```
Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.RuntimeException: Failed to read data.
    at org.apache.beam.runners.spark.SparkPipelineResult.beamExceptionFrom(SparkPipelineResult.java:73)
    at org.apache.beam.runners.spark.SparkPipelineResult.waitUntilFinish(SparkPipelineResult.java:104)
    at org.apache.beam.runners.spark.SparkPipelineResult.waitUntilFinish(SparkPipelineResult.java:92)
    at au.org.ala.pipelines.beam.IndexRecordPipeline.run(IndexRecordPipeline.java:264)
    at au.org.ala.pipelines.beam.IndexRecordPipeline.main(IndexRecordPipeline.java:66)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.RuntimeException: Failed to read data.
    at org.apache.beam.runners.spark.io.SourceRDD$Bounded$ReaderToIteratorAdapter.tryProduceNext(SourceRDD.java:201)
    at org.apache.beam.runners.spark.io.SourceRDD$Bounded$ReaderToIteratorAdapter.hasNext(SourceRDD.java:242)
    at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
    at org.apache.beam.runners.spark.translation.SparkProcessContext$ProcCtxtIterator.computeNext(SparkProcessContext.java:138)
    at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
    at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
    at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
    at org.apache.beam.runners.spark.translation.SparkProcessContext$ProcCtxtIterator.computeNext(SparkProcessContext.java:138)
    at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
    at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
    at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:199)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.avro.AvroTypeException: Found string, expecting union
    at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292)
    at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
    at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
    at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
    at org.apache.avro.specific.SpecificDatumReader.readField(SpecificDatumReader.java:116)
    at org.apache.avro.reflect.ReflectDatumReader.readField(ReflectDatumReader.java:310)
    at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:222)
    at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)
    at org.apache.beam.sdk.io.AvroSource$AvroBlock.readNextRecord(AvroSource.java:647)
    at org.apache.beam.sdk.io.BlockBasedSource$BlockBasedReader.readNextRecord(BlockBasedSource.java:212)
    at org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.advanceImpl(FileBasedSource.java:487)
    at org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.advance(OffsetBasedSource.java:258)
    at org.apache.beam.runners.spark.io.SourceRDD$Bounded$ReaderToIteratorAdapter.seekNext(SourceRDD.java:219)
    at org.apache.beam.runners.spark.io.SourceRDD$Bounded$ReaderToIteratorAdapter.tryProduceNext(SourceRDD.java:190)
    ... 29 more
```

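The root cause here is Avro schema resolution: the reader's schema expects a union for a field where the writer's data holds a plain string that matches no union branch, which typically means the AVRO files on disk were written with a different schema version than the one being read with. A minimal sketch reproducing this class of failure, with hypothetical schemas rather than the actual pipeline ones:

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

// Hypothetical reproduction: write a record whose field is a plain string,
// then read it back with a schema whose field is a union containing no
// branch that a string can resolve to.
public class AvroUnionMismatch {
  public static void main(String[] args) throws Exception {
    Schema writer = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"R\",\"fields\":"
            + "[{\"name\":\"f\",\"type\":\"string\"}]}");
    Schema reader = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"R\",\"fields\":"
            + "[{\"name\":\"f\",\"type\":[\"null\",\"int\"],\"default\":null}]}");

    GenericRecord rec = new GenericData.Record(writer);
    rec.put("f", "hello");

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(writer).write(rec, enc);
    enc.flush();

    Decoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    // Throws org.apache.avro.AvroTypeException: Found string, expecting union
    new GenericDatumReader<GenericRecord>(writer, reader).read(null, dec);
  }
}
```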
charvolant commented 2 years ago

Problems connecting to the collectory have not manifested since the new year.

charvolant commented 2 years ago

New issue when uploading images inside a DwCA. We upload images separately and then load the DwCA with the expectation that the image service will identify the pre-existing images, but that doesn't seem to happen. The batch upload shows an invalid status, e.g. https://images.ala.org.au/admin/batchUpload/335873815, but we don't know why it is invalid.

charvolant commented 2 years ago

Image service performance is now up to snuff.