AtlasOfLivingAustralia / biocache-store

Occurrence processing, indexing and batch processing

Cases where Scala code catches Exception and doesn't rethrow #249

Closed: ansell closed this issue 2 years ago

ansell commented 6 years ago

There are a number of cases where the current Scala code uses the following pattern, which ends up swallowing java.lang.Error instances, including java.lang.OutOfMemoryError and java.lang.StackOverflowError, that biocache-store cannot usefully recover from.

In some cases, java.lang.Error instances are wrapped inside other Exception instances, such as java.util.concurrent.ExecutionException. The current Scala code catches all java.lang.Exception instances, and in most cases logs them without rethrowing.

Not exiting quickly from these and other Errors, or attempting to recover from them, has caused tasks to never complete, requiring manual JVM shutdowns that interrupt other tasks and may require full server restarts. This matters because biocache-store is the basis for biocache-service, which is designed for long-running application lifetimes:

  } catch {
    case e: Exception => {
      logger.error(...)
    }
  }

In some cases these catch blocks are wrapped in retry loops, which may eventually complete by exiting the loop, but others are riskier: completion isn't guaranteed, or data is silently not loaded, processed, sampled, or indexed.

The catch case e: Exception pattern is currently used 248 times in biocache-store (2.x). This issue will be used to track fixes for them where they are not already accounted for, rather than creating many issues that may each be given low priority or status.
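For reference, a minimal sketch of the safer idiom using scala.util.control.NonFatal (available since Scala 2.10), which matches recoverable exceptions but lets fatal Errors propagate. SafeCatchExample, doWork and the logger setup here are illustrative stand-ins, not existing biocache-store code:

  import org.slf4j.LoggerFactory
  import scala.util.control.NonFatal

  object SafeCatchExample {
    private val logger = LoggerFactory.getLogger(getClass)

    // stand-in for any of the task bodies quoted in the comments below
    def doWork(): Unit = throw new IllegalStateException("simulated failure")

    def main(args: Array[String]): Unit = {
      try {
        doWork()
      } catch {
        case NonFatal(e) =>
          // log, then rethrow so the failure is visible to callers;
          // fatal Errors never match NonFatal and so propagate untouched
          logger.error("Task failed: " + e.getMessage, e)
          throw e
      }
    }
  }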

ansell commented 6 years ago

The Mocks class used for testing ignores failures, and does not fail tests when the attempt to create an ALANameSearcher object fails:

src/test/scala/au/org/ala/biocache/Mocks.scala-25-    try {
src/test/scala/au/org/ala/biocache/Mocks.scala-26-      val nameIndex = new au.org.ala.names.search.ALANameSearcher(properties.getProperty("name.index.dir"))
src/test/scala/au/org/ala/biocache/Mocks.scala-27-      bind(classOf[au.org.ala.names.search.ALANameSearcher]).toInstance(nameIndex)
src/test/scala/au/org/ala/biocache/Mocks.scala-28-    } catch {
src/test/scala/au/org/ala/biocache/Mocks.scala:29:      case e: Exception => e.printStackTrace()
src/test/scala/au/org/ala/biocache/Mocks.scala-30-    }
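A minimal sketch of a fail-fast alternative for the quoted body, assuming the surrounding Guice module context from Mocks.scala (properties and bind come from that context). Wrapping and rethrowing makes every dependent test fail visibly instead of continuing with a missing binding:

  import scala.util.control.NonFatal

  try {
    val nameIndex = new au.org.ala.names.search.ALANameSearcher(properties.getProperty("name.index.dir"))
    bind(classOf[au.org.ala.names.search.ALANameSearcher]).toInstance(nameIndex)
  } catch {
    case NonFatal(e) =>
      // abort module configuration so the test run fails immediately
      throw new IllegalStateException("Could not create ALANameSearcher for tests", e)
  }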
ansell commented 6 years ago

The MigrateMedia class does not retry media saving when failures occur:

src/main/scala/au/org/ala/biocache/tool/MigrateMedia.scala-47-                  try {
src/main/scala/au/org/ala/biocache/tool/MigrateMedia.scala-48-                    Config.mediaStore.save(uuid, dataResourceUid, "file://" + filePath, None)
src/main/scala/au/org/ala/biocache/tool/MigrateMedia.scala-49-                  } catch {
src/main/scala/au/org/ala/biocache/tool/MigrateMedia.scala:50:                    case e: Exception => println(s"Problem saving $uuid, $dataResourceUid, $filePath")
src/main/scala/au/org/ala/biocache/tool/MigrateMedia.scala-51-                  }
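A minimal bounded-retry sketch for the quoted save call; the retry helper is hypothetical, not an existing biocache-store utility, and rethrows the final failure rather than printing and moving on:

  import scala.util.control.NonFatal

  // hypothetical helper: retry op up to `attempts` times, then give up loudly
  def retry[T](attempts: Int, delayMs: Long = 1000L)(op: => T): T =
    try op
    catch {
      case NonFatal(e) if attempts > 1 =>
        // back off briefly, then try again; the last failure propagates
        Thread.sleep(delayMs)
        retry(attempts - 1, delayMs)(op)
    }

  retry(3) {
    Config.mediaStore.save(uuid, dataResourceUid, "file://" + filePath, None)
  }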
ansell commented 6 years ago

The IndexLocalRecords class does not retry indexing of records when failures occur:

src/main/scala/au/org/ala/biocache/tool/IndexLocalRecords.scala-115-          indexer.indexFromMap(guid, map, batchID = tokenRangeIdx)
src/main/scala/au/org/ala/biocache/tool/IndexLocalRecords.scala-116-
src/main/scala/au/org/ala/biocache/tool/IndexLocalRecords.scala-117-
src/main/scala/au/org/ala/biocache/tool/IndexLocalRecords.scala-118-
src/main/scala/au/org/ala/biocache/tool/IndexLocalRecords.scala-119-          count += 1
src/main/scala/au/org/ala/biocache/tool/IndexLocalRecords.scala-120-          if(count % 100000 == 0){
src/main/scala/au/org/ala/biocache/tool/IndexLocalRecords.scala-121-            ZookeeperUtil.setStatus("INDEXING", "INDEXING", count)
src/main/scala/au/org/ala/biocache/tool/IndexLocalRecords.scala-122-          }
src/main/scala/au/org/ala/biocache/tool/IndexLocalRecords.scala-123-        } catch {
src/main/scala/au/org/ala/biocache/tool/IndexLocalRecords.scala:124:          case e: Exception => {
src/main/scala/au/org/ala/biocache/tool/IndexLocalRecords.scala-125-            logger.error("Problem indexing record: " + guid + " - " + e.getMessage(), e)
src/main/scala/au/org/ala/biocache/tool/IndexLocalRecords.scala-126-          }
src/main/scala/au/org/ala/biocache/tool/IndexLocalRecords.scala-127-        }
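One possible shape for the quoted catch block (with scala.util.control.NonFatal imported as in the earlier sketch), keeping the log line but rethrowing so the indexing run is marked failed instead of silently dropping the record; whether to rethrow immediately or to count failures and fail at the end is a design choice:

  } catch {
    case NonFatal(e) =>
      logger.error("Problem indexing record: " + guid + " - " + e.getMessage(), e)
      // rethrow so the batch cannot report success with missing records
      throw e
  }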
ansell commented 6 years ago

The Sampling class does not retry loading of points from the file when failures occur:

src/main/scala/au/org/ala/biocache/tool/Sampling.scala-252-    logger.info("Loading points from file: " + filePath)
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-253-    var current: Array[String] = csvReader.readNext
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-254-    val points: ArrayBuffer[Array[Double]] = new ArrayBuffer[Array[Double]]
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-255-    var count = 0
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-256-    while (current != null && count < max) {
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-257-      try {
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-258-        if(current.length == 2 && current(0) != "null" && current(0) != "null"){
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-259-          points += current.map(x => x.toDouble)
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-260-        }
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-261-      } catch {
src/main/scala/au/org/ala/biocache/tool/Sampling.scala:262:        case e: Exception => logger.error("Error reading point: " + current, e)
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-263-      }
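The only anticipated failure in the quoted block is toDouble throwing NumberFormatException, so the catch could be narrowed to that. A sketch against the quoted context (note the quoted code tests current(0) twice, which looks like a typo for current(1), and logging current directly prints an unhelpful Array toString):

  try {
    if (current.length == 2 && current(0) != "null" && current(1) != "null") {
      points += current.map(_.toDouble)
    }
  } catch {
    case e: NumberFormatException =>
      // skip the malformed point; anything unexpected now propagates
      logger.error("Error reading point: " + current.mkString(","), e)
  }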
ansell commented 6 years ago

The Sampling class does not fail when writing points to a file fails; it also doesn't close the FileWriter object when failures occur, and those failures are silently ignored:

src/main/scala/au/org/ala/biocache/tool/Sampling.scala-381-    try {
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-382-      val fw = new FileWriter(locFilePath)
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-383-      coordinates.foreach { c =>
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-384-        fw.write(c)
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-385-        fw.write("\n")
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-386-      }
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-387-      fw.flush
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-388-      fw.close
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-389-      logger.info(s"Finished writing to file: $locFilePath")
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-390-    } catch {
src/main/scala/au/org/ala/biocache/tool/Sampling.scala:391:      case e: Exception => logger.error("Failed to write - " + e.getMessage, e)
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-392-    }

and another version of the same pattern at:

src/main/scala/au/org/ala/biocache/tool/Sampling.scala-405-    try {
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-406-      val fw = new FileWriter(locFilePath)
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-407-      lce.coordinates.foreach { c =>
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-408-        fw.write(c)
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-409-        fw.write("\n")
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-410-      }
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-411-      fw.flush
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-412-      fw.close
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-413-    } catch {
src/main/scala/au/org/ala/biocache/tool/Sampling.scala:414:      case e: Exception => logger.error(e.getMessage, e)
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-415-    }
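A minimal resource-safe sketch of both quoted blocks, using try/finally so the FileWriter is closed on every path and write failures propagate to the caller (on Scala 2.13+ scala.util.Using would express the same thing):

  import java.io.FileWriter

  val fw = new FileWriter(locFilePath)
  try {
    coordinates.foreach { c =>
      fw.write(c)
      fw.write("\n")
    }
    fw.flush()
  } finally {
    fw.close() // runs whether or not a write failed
  }
  logger.info(s"Finished writing to file: $locFilePath")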
ansell commented 6 years ago

The Sampling class does not fail when errors occur while reading from a CSV file and writing to the loc table (and does not consistently close the CSV reader object if errors occur in csvReader.readNext):

src/main/scala/au/org/ala/biocache/tool/Sampling.scala-618-          if (counter % 200 == 0 && callback != null) {
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-619-            callback.setCurrentLayer(new IntersectionFile("", "", "", "finished. Processing loaded samples " + counter, "", "", "", "", null))
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-620-            callback.progressMessage("Loading sampling.")
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-621-          }
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-622-          if (counter % 1000 == 0) {
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-623-            logger.info(s"writing to loc: $counter : records per sec: " + 1000f / (((System.currentTimeMillis - nextTime).toFloat) / 1000f))
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-624-            nextTime = System.currentTimeMillis
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-625-          }
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-626-          counter += 1
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-627-        } catch {
src/main/scala/au/org/ala/biocache/tool/Sampling.scala:628:          case e: Exception => {
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-629-            logger.error(e.getMessage, e)
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-630-            logger.error("Problem writing line: " + counter + ", line length: " + line.length + ", header length: " + header.length)
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-631-          }
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-632-        }
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-633-        line = csvReader.readNext
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-634-      }
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-635-      csvReader.close
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-636-      if (batches.size > 0) {
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-637-        LocationDAO.writeLocBatch(batches.toMap)
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-638-      }
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-639-    } catch {
src/main/scala/au/org/ala/biocache/tool/Sampling.scala:640:      case e: Exception => {
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-641-        logger.error("Problem loading sampling. " + e.getMessage, e)
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-642-      }
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-643-    }
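A sketch of the same loop with the reader closed in a finally block and unexpected failures allowed to propagate; processLine is a hypothetical stand-in for the per-line body quoted above, and batches and LocationDAO come from the quoted context:

  try {
    var line = csvReader.readNext
    while (line != null) {
      processLine(line) // stand-in for the quoted per-line work
      line = csvReader.readNext
    }
  } finally {
    csvReader.close() // closed even if readNext or processLine throws
  }
  if (batches.size > 0) {
    LocationDAO.writeLocBatch(batches.toMap)
  }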
ansell commented 6 years ago

The Sampling class does not propagate failure when LocColumnExporter exits without successfully completing its tasks:

src/main/scala/au/org/ala/biocache/tool/Sampling.scala-712-          "verbatimLatitude",
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-713-          "verbatimLongitude",
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-714-          "originalDecimalLatitude",
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-715-          "originalDecimalLongitude",
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-716-          "originalSensitiveValues")
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-717-
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-718-        val fin = System.currentTimeMillis
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-719-        logger.info("[Unique Coordinate Thread " + threadId + "] " + counter + " took " + ((fin - start).toFloat) / 1000f + " seconds")
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-720-      }
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-721-    } catch {
src/main/scala/au/org/ala/biocache/tool/Sampling.scala:722:      case e: Exception => logger.error("Error reading points for thread: " + threadId, e)
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-723-    }
ansell commented 6 years ago

The Sampling class does not propagate failure when the LoadSamplingConsumer fails, does not preserve the thread's interrupted status when InterruptedException is caught and swallowed, and does not consistently close the CSV reader object when issues occur:

src/main/scala/au/org/ala/biocache/tool/Sampling.scala-766-              batches += LocationDAO.addLayerIntersects(line(1), line(0), cl, el, true)
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-767-
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-768-              if (counter % 1000 == 0 && counter > 0) {
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-769-                logger.info("writing to loc:" + counter + ": records per sec: " + 1000f / (((System.currentTimeMillis - nextTime).toFloat) / 1000f))
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-770-                nextTime = System.currentTimeMillis
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-771-              }
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-772-              counter += 1
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-773-            } catch {
src/main/scala/au/org/ala/biocache/tool/Sampling.scala:774:              case e: Exception => {
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-775-                logger.error(e.getMessage, e)
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-776-                logger.error("Problem writing line: " + counter + ", line length: " + line.length + ", header length: " + header.length)
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-777-              }
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-778-            }
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-779-            line = csvReader.readNext
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-780-          }
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-781-          csvReader.close
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-782-          if (batches.size > 0) {
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-783-            LocationDAO.writeLocBatch(batches.toMap)
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-784-          }
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-785-        } catch {
src/main/scala/au/org/ala/biocache/tool/Sampling.scala:786:          case e: Exception => {
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-787-            logger.error("Problem loading sampling. " + e.getMessage, e)
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-788-          }
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-789-        }
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-790-        logger.info("Finished loading: " + inputFileName + " in " + (System.currentTimeMillis - startTime) + "ms")
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-791-
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-792-        doneList.add(inputFileName)
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-793-      }
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-794-    } catch {
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-795-      case e: InterruptedException => {
src/main/scala/au/org/ala/biocache/tool/Sampling.scala-796-        //InterruptedException is expected when the thread is terminated with an interruption.
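Where InterruptedException has to be caught, the usual remedy is to restore the interrupt flag before winding down, so thread pools and callers can still observe the interruption. A sketch of how the quoted catch case could look:

  case e: InterruptedException =>
    // InterruptedException is expected on shutdown, but catching it clears
    // the interrupt flag, so restore it before stopping
    Thread.currentThread().interrupt()
    logger.info("LoadSamplingConsumer interrupted, stopping")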
ansell commented 6 years ago

The ExpertDistributionOutlierTool reports success even when its database operations fail, and can therefore produce partial results:

src/main/scala/au/org/ala/biocache/tool/ExpertDistributionOutlierTool.scala-524-              logger.warn(rowKey + " is no longer an outlier")
src/main/scala/au/org/ala/biocache/tool/ExpertDistributionOutlierTool.scala-525-              Config.persistenceManager.deleteColumns(rowKey, "occ", "distanceOutsideExpertRange" + Config.persistenceManager.fieldDelimiter + "p")
src/main/scala/au/org/ala/biocache/tool/ExpertDistributionOutlierTool.scala-526-              Config.occurrenceDAO.removeSystemAssertion(rowKey, AssertionCodes.SPECIES_OUTSIDE_EXPERT_RANGE)
src/main/scala/au/org/ala/biocache/tool/ExpertDistributionOutlierTool.scala-527-              rowKeysForIndexing += rowKey
src/main/scala/au/org/ala/biocache/tool/ExpertDistributionOutlierTool.scala-528-            }
src/main/scala/au/org/ala/biocache/tool/ExpertDistributionOutlierTool.scala-529-          } else {
src/main/scala/au/org/ala/biocache/tool/ExpertDistributionOutlierTool.scala-530-            //add this row key to the "oldKeys"
src/main/scala/au/org/ala/biocache/tool/ExpertDistributionOutlierTool.scala-531-            newOutlierRowKeys ++= oldRowKeys
src/main/scala/au/org/ala/biocache/tool/ExpertDistributionOutlierTool.scala-532-          }
src/main/scala/au/org/ala/biocache/tool/ExpertDistributionOutlierTool.scala-533-        } catch {
src/main/scala/au/org/ala/biocache/tool/ExpertDistributionOutlierTool.scala:534:          case e: Exception => logger.error("Unable to get parse past distribution ingto array: " + oldRowKeysJson + " for " + lsid, e)
src/main/scala/au/org/ala/biocache/tool/ExpertDistributionOutlierTool.scala-535-        }
src/main/scala/au/org/ala/biocache/tool/ExpertDistributionOutlierTool.scala-536-      }
ansell commented 6 years ago

The DuplicationDetection tool silently succeeds and does not persist duplicate details when database or other failures occur:

src/main/scala/au/org/ala/biocache/tool/DuplicationDetection.scala-618-          }
src/main/scala/au/org/ala/biocache/tool/DuplicationDetection.scala-619-
src/main/scala/au/org/ala/biocache/tool/DuplicationDetection.scala-620-          saveToOccurrenceRecord(primaryRecord, r, types)
src/main/scala/au/org/ala/biocache/tool/DuplicationDetection.scala-621-
src/main/scala/au/org/ala/biocache/tool/DuplicationDetection.scala-622-          buffer.synchronized {
src/main/scala/au/org/ala/biocache/tool/DuplicationDetection.scala-623-            buffer += r.rowKey
src/main/scala/au/org/ala/biocache/tool/DuplicationDetection.scala-624-          }
src/main/scala/au/org/ala/biocache/tool/DuplicationDetection.scala-625-        }
src/main/scala/au/org/ala/biocache/tool/DuplicationDetection.scala-626-      }
src/main/scala/au/org/ala/biocache/tool/DuplicationDetection.scala-627-    } catch {
src/main/scala/au/org/ala/biocache/tool/DuplicationDetection.scala:628:      case e: Exception => e.printStackTrace();
src/main/scala/au/org/ala/biocache/tool/DuplicationDetection.scala-629-    }
djtfmartin commented 6 years ago

thanks @ansell

Can you point out where java.lang.Error instances are wrapped inside other Exception instances such as java.util.concurrent.ExecutionException?

I can't see these.

ansell commented 6 years ago

@djtfmartin The patterns to look for, specifically in terms of ExecutionException, are any places where ExecutorService is used, particularly where the Callable or Consumer/Producer patterns appear.

I am compiling all of the catch case e: Exception instances (due to some syntax differences, the number quoted above isn't the full number, as my initial search didn't include catch case e:Exception). They will be included in the list when it is complete. If you can't see issues in the code shown in the comments above so far, then I may need to enhance the descriptions, as they all include bugs that need to be fixed. I am working on a biocache-service bug today, but will get back to this as soon as possible.

In general though, it is impossible to answer your question fully from a static code review alone, as there is AOP (both Scala-driven and driven by other libraries) that adds and modifies code paths at runtime and may introduce new wrappers. As part of fixing this widespread problem, biocache-store also needs significant software engineering work, including automated tests that prevent the catch case e: Exception pattern from being reintroduced in future.

This is a tracking issue for all of the catch case e: Exception instances, which are almost all unhelpful for a serious data processing library, so fixing the specific ExecutionException instances is not enough to close it.

djtfmartin commented 6 years ago

thanks @ansell. I can see the issues with the exception handling above. This needs to be improved.

I cannot see instances where we are catching or wrapping java.lang.Error. If we have these, we should remove them as a priority.

ansell commented 6 years ago

Wrapping of java.lang.Error in the StackOverflowError case that triggered this audit (and appeared again today) is done by a Java Thread.UncaughtExceptionHandler and/or the ExecutorService. The wrapper class whose cause we need to inspect in that case is the checked ExecutionException. The wrapping isn't done in biocache-store code, so you won't find it just by searching for java.lang.Error in this codebase. Searching for invokeAll, Callable and Executor will show some of the relevant code locations.
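A self-contained sketch of the wrapping behaviour being described, showing a fatal Error thrown inside a Callable surfacing at Future.get as a checked ExecutionException. A broad case e: Exception at that call site would match the wrapper and, if it only logged, silently swallow the Error carried in its cause; WrappedErrorDemo is illustrative, not biocache-store code:

  import java.util.concurrent.{Callable, ExecutionException, Executors}

  object WrappedErrorDemo {
    def main(args: Array[String]): Unit = {
      val pool = Executors.newSingleThreadExecutor()
      val future = pool.submit(new Callable[Unit] {
        // the task dies with a fatal Error, not an Exception
        def call(): Unit = throw new StackOverflowError("simulated")
      })
      try {
        future.get() // the Error arrives here wrapped in ExecutionException
      } catch {
        case e: ExecutionException =>
          // unwrap and rethrow the cause instead of logging the wrapper
          throw e.getCause
      } finally {
        pool.shutdown()
      }
    }
  }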

brucehyslop commented 2 years ago

biocache-store has been replaced by pipelines.