gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

Re-add the ability to load DwCAs from HDFS #729

Open djtfmartin opened 2 years ago

djtfmartin commented 2 years ago

During the EMR work, the ability to run the DwCA -> Avro pipeline against archives stored in HDFS was dropped because it is not used in EMR deployments. However, it is still required in non-EMR deployments, so this functionality needs to be added back to the DwCA pipeline.
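
As a rough illustration of the kind of dispatch this implies, here is a minimal sketch that decides whether an archive path points at HDFS or at the local filesystem by inspecting its URI scheme. It is not the pipelines code: the class `ArchiveLocation`, the method `isHdfs`, and the example paths are all hypothetical, and it assumes the input path is a syntactically valid URI (no unescaped spaces).

```java
import java.net.URI;

// Hypothetical helper, not part of gbif/pipelines.
final class ArchiveLocation {

    private ArchiveLocation() {}

    /** Returns true when the input path uses the hdfs:// scheme. */
    static boolean isHdfs(String inputPath) {
        // A plain path such as /data/dwca.zip has no scheme, so getScheme() is null.
        String scheme = URI.create(inputPath).getScheme();
        return "hdfs".equalsIgnoreCase(scheme);
    }

    public static void main(String[] args) {
        // Example paths are assumptions chosen for illustration only.
        String[] paths = {
            "hdfs://namenode:8020/data/dwca.zip",
            "/data/dwca.zip",
            "file:///data/dwca.zip"
        };
        for (String p : paths) {
            System.out.println(p + " -> " + (isHdfs(p) ? "HDFS" : "local"));
        }
    }
}
```

A real implementation would then open the archive through the matching filesystem client; that part is left out here because it depends on how the pipeline is configured.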

djtfmartin commented 2 years ago

Merged into dev.

vjrj commented 2 years ago

Thanks @djtfmartin for this.

vjrj commented 10 months ago

Hi @djtfmartin

This looks like a regression of this issue: [image]

Using pipelines 2.15.1~1.gbpe572d5.

vjrj commented 10 months ago

This started to fail in https://github.com/gbif/pipelines/pull/809, where: [image] in livingatlas/pipelines/pom.xml. I was a bit lost because hadoop-core is in the parent pom.xml.

The last build in which HDFS worked is 2.13.0-SNAPSHOT+0~20221123105804.867~1.gbp85d7b3.