gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

LA-pipelines: expert distribution outlier detection #622

Open qifeng-bai opened 2 years ago

qifeng-bai commented 2 years ago

Check whether a species occurrence record's point falls inside or outside the expert distribution layers.

Questions:

  1. Efficiency: will testing points against raw polygons be too slow? Should we rasterise the layers into grids, and if so, how do we determine the grid size? Should we use LayerStore / the Spatial Service, or build our own local instance or Postgres?
  2. Which assertions should we raise?
  3. Values we could possibly calculate:
  • the distance from the record's point to the expert distribution range it falls inside/outside of
  • point outside the range, but its coordinate uncertainty overlaps the range
  • point outside the range, and its coordinate uncertainty is also entirely outside the range
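The three cases above can be sketched as a simple classifier. This is a hypothetical illustration in plain Java (planar distances in coordinate units, ray-casting point-in-polygon), not the pipeline's actual code; a real implementation would use JTS geometries and geodesic distances:

```java
// Hypothetical sketch (not the pipeline's actual code) of the three assertion
// cases: inside the range, outside but the coordinate uncertainty overlaps the
// range, or fully outside. Distances are planar, in coordinate units.
public class DistributionOutlierSketch {

    public enum Result { INSIDE, UNCERTAINTY_OVERLAPS, OUTSIDE }

    /** Ray-casting point-in-polygon test against a closed ring of [x, y] vertices. */
    public static boolean contains(double[][] ring, double x, double y) {
        boolean in = false;
        for (int i = 0, j = ring.length - 1; i < ring.length; j = i++) {
            double xi = ring[i][0], yi = ring[i][1];
            double xj = ring[j][0], yj = ring[j][1];
            if ((yi > y) != (yj > y) && x < (xj - xi) * (y - yi) / (yj - yi) + xi) {
                in = !in;
            }
        }
        return in;
    }

    /** Distance from (x, y) to the nearest edge of the ring. */
    public static double distanceToBoundary(double[][] ring, double x, double y) {
        double best = Double.MAX_VALUE;
        for (int i = 0, j = ring.length - 1; i < ring.length; j = i++) {
            best = Math.min(best, segmentDistance(ring[j], ring[i], x, y));
        }
        return best;
    }

    private static double segmentDistance(double[] a, double[] b, double x, double y) {
        double dx = b[0] - a[0], dy = b[1] - a[1];
        double len2 = dx * dx + dy * dy;
        double t = len2 == 0 ? 0
                : Math.max(0, Math.min(1, ((x - a[0]) * dx + (y - a[1]) * dy) / len2));
        return Math.hypot(x - (a[0] + t * dx), y - (a[1] + t * dy));
    }

    /** Classify a point with a given coordinate uncertainty radius. */
    public static Result classify(double[][] ring, double x, double y, double uncertainty) {
        if (contains(ring, x, y)) return Result.INSIDE;
        return distanceToBoundary(ring, x, y) <= uncertainty
                ? Result.UNCERTAINTY_OVERLAPS : Result.OUTSIDE;
    }
}
```

The distance value doubles as the "distance to the record" field from the list above when the point is outside the range.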

Two scenarios:

  1. Calculate all existing occurrences against the existing expert distribution layers - a one-time run.
  2. Re-calculate the related species when a new expert distribution layer is added.
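On the grids question, a minimal sketch of the rasterise-once idea, which suits the one-time batch scenario: precompute an n-by-n grid per layer so that cells wholly inside or outside the polygon answer lookups in O(1), and only cells crossing the boundary fall back to the exact polygon test. The grid size n trades memory against how often that fallback runs. All names here are illustrative, not from the codebase:

```java
// Hypothetical sketch: rasterise an expert layer once into a grid over its
// bounding box, so millions of points can be screened without a full polygon
// test each time.
import java.util.function.BiPredicate;

public class GridScreen {

    private enum Cell { INSIDE, OUTSIDE, BOUNDARY }

    private final Cell[][] cells;
    private final double minX, minY, cellSize;
    private final BiPredicate<Double, Double> exactTest; // exact point-in-polygon test

    public GridScreen(double minX, double minY, double maxX, double maxY,
                      int n, BiPredicate<Double, Double> exactTest) {
        this.minX = minX;
        this.minY = minY;
        this.cellSize = Math.max(maxX - minX, maxY - minY) / n;
        this.exactTest = exactTest;
        this.cells = new Cell[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                // Sample the four cell corners; if they disagree, the boundary
                // crosses the cell. (A thin sliver can still be missed -- a real
                // rasteriser would intersect the cell with the polygon instead.)
                boolean c00 = exactTest.test(minX + i * cellSize, minY + j * cellSize);
                boolean c10 = exactTest.test(minX + (i + 1) * cellSize, minY + j * cellSize);
                boolean c01 = exactTest.test(minX + i * cellSize, minY + (j + 1) * cellSize);
                boolean c11 = exactTest.test(minX + (i + 1) * cellSize, minY + (j + 1) * cellSize);
                if (c00 && c10 && c01 && c11) cells[i][j] = Cell.INSIDE;
                else if (!c00 && !c10 && !c01 && !c11) cells[i][j] = Cell.OUTSIDE;
                else cells[i][j] = Cell.BOUNDARY;
            }
        }
    }

    public boolean contains(double x, double y) {
        int i = (int) Math.floor((x - minX) / cellSize);
        int j = (int) Math.floor((y - minY) / cellSize);
        if (i < 0 || j < 0 || i >= cells.length || j >= cells.length) return false;
        switch (cells[i][j]) {
            case INSIDE:  return true;
            case OUTSIDE: return false;
            default:      return exactTest.test(x, y); // boundary cell: exact check
        }
    }
}
```

The one-time batch run pays the rasterisation cost once per layer; scenario 2 only rebuilds the grid for the newly added layer.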

Link to the Data Quality (DQ) project: https://github.com/AtlasOfLivingAustralia/DataQuality/issues/255
Related request to Spatial: https://github.com/AtlasOfLivingAustralia/spatial-service/issues/186

qifeng-bai commented 2 years ago

Modification to la-pipeline.yaml is needed. Add an 'outlier' section:

```yaml
outlier:
  appName: Expert distribution outliers for {datasetId}
  baseUrl: https://spatial.ala.org.au/ws/
  inputPath: '{fsPath}/pipelines-outlier'
  targetPath: &outlierTargetPath '{fsPath}/pipelines-outlier'
  allDatasetsInputPath: '{fsPath}/pipelines-all-datasets'
  runner: DirectRunner
```

and insert the following into the 'solr' section:

```yaml
outlierPath: *outlierTargetPath
includeOutlier: false
```

qifeng-bai commented 2 years ago

SLF4J issues:

  1. The pipelines use logback as the default logger, but the la-pipelines script currently uses log4j.properties, so logging is not working.
  2. Added logback.xml to the ./resource folder, and imported an external config via -Dlogback.configurationFile.
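For reference, one hypothetical way to pass that system property when launching a job (the path and use of JAVA_TOOL_OPTIONS are assumptions for illustration, not taken from the repo):

```shell
# Illustrative only: point logback at the external config fragment via the
# standard -Dlogback.configurationFile system property before launching a job.
# The file path here is an assumption, not from the repo.
export JAVA_TOOL_OPTIONS="-Dlogback.configurationFile=/data/la-pipelines/config/logback-external.xml"
```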

The external config looks like:

```xml
<included>
  <logger name="org.gbif" level="info" additivity="false">
    <appender-ref ref="CONSOLE"/>
  </logger>
  <logger name="au.org.ala.pipelines" level="debug" additivity="false">
    <appender-ref ref="CONSOLE"/>
  </logger>
</included>
```

We can use it to update the log level, and it works in my local dev environment.

However, when we run it on the NCI3-spark servers, the logger does not work: it does not accept the "included" element.

To make it work, we have to duplicate the content in ./resource/logback.xml.