AtlasOfLivingAustralia / data-management

Data management issue tracking

Databox EMR fail #926

Closed: sadeghim closed this issue 1 year ago

sadeghim commented 1 year ago

This is an issue to collect some information about a problem we had the other day when ingesting large datasets. The fixed code is here: https://github.com/AtlasOfLivingAustralia/pipelines-airflow/commit/563ea38c2f4d17c1a7efbdb460c6430e6cce4896

The first issue we faced was that the bootstrap startup took more than 47 minutes: https://ap-southeast-2.console.aws.amazon.com/emr/home?region=ap-southeast-2#/clusterDetails/j-DRYFVCGYKXUN [screenshot of the cluster details]
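Not part of the original report, but for reference: one way to pull the bootstrap-action logs for this cluster from S3 with the AWS CLI. The paths assume the standard EMR log layout; the bucket, prefix and instance id below are placeholders, not values from this cluster.

```bash
# Find where the cluster writes its logs (requires a LogUri to be configured)
aws emr describe-cluster --cluster-id j-DRYFVCGYKXUN \
  --query 'Cluster.LogUri' --output text

# Bootstrap-action output is kept per node under .../node/<instance-id>/bootstrap-actions/
aws s3 ls "s3://<log-bucket>/<prefix>/j-DRYFVCGYKXUN/node/" --recursive | grep bootstrap-actions

# Stream one of the logs to the terminal
aws s3 cp "s3://<log-bucket>/<prefix>/j-DRYFVCGYKXUN/node/<instance-id>/bootstrap-actions/1/stderr.gz" - | gunzip
```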

To work out what was going on I started investigating and looked into the logs, and noticed that the bootstrap logs aren't complete:

31fe54d1616a: Pull complete
a2c746b22686: Pull complete
3c6999be7f0c: Pull complete
d837974742ac: Verifying Checksum
d837974742ac: Download complete
0884b3556289: Pull complete
8d9c7555f5be: Verifying Checksum
8d9c7555f5be: Download complete
528c65acdcba: Verifying Checksum
528c65acdcba: Download complete
e9984bdbb2b3: Verifying Checksum
e9984bdbb2b3: Download complete
2b654a3a1d3f: Verifying Checksum
2b654a3a1d3f: Download complete
20c550987e92: Verifying Checksum
20c550987e92: Download complete
ead2556a1885: Verifying Checksum
ead2556a1885: Download complete
f2a5c33ab129: Download complete
cc408e9f7472: Download complete
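For context, an illustrative sketch only (not the change in the linked commit): the step that appears to stall is the Docker pull of the ala-sensitive-data-service image that shows up later in the log. Wrapping the pull in a timeout with a few retries makes a stalled download fail loudly instead of silently eating the bootstrap window:

```bash
# Sketch only -- image name taken from the bootstrap log, timeout/retry values are arbitrary
IMAGE="djtfmartin/ala-sensitive-data-service:v20200214-4-multiarch"
for attempt in 1 2 3; do
  # give each pull attempt at most 10 minutes
  if timeout 600 sudo docker pull "$IMAGE"; then
    break
  fi
  echo "docker pull attempt ${attempt} failed or timed out, retrying" >&2
  sleep 30
done
```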

Added `set -ex` at the top of the script (`-e` aborts on the first failing command, `-x` traces each command as it runs) to see where it gets stuck, and saw this: [screenshot]

with the log:

ead2556a1885: Pull complete
f2a5c33ab129: Pull complete
cc408e9f7472: Pull complete
Digest: sha256:f955ef9dbc6940901acb4828cff7e3fb3a603486a719912890f2d455dcc44d28
Status: Downloaded newer image for djtfmartin/ala-sensitive-data-service:v20200214-4-multiarch
+ sudo ln -s /mnt /data
+ sudo mkdir -p /data/la-pipelines/config
+ sudo mkdir -p /data/biocache-load
+ sudo mkdir -p /data/pipelines-shp
+ sudo mkdir -p /data/pipelines-vocabularies
+ sudo mkdir -p /tmp/pipelines-export
+ sudo mkdir -p /data/dwca-tmp/
+ sudo chmod -R 777 /data/dwca-tmp/
+ sudo mkdir -p /data/spark-tmp
+ sudo chown hadoop:hadoop -R /mnt/dwca-tmp
+ sudo chown hadoop:hadoop -R /data/biocache-load /data/dwca-tmp /data/la-pipelines /data/pipelines-shp /data/pipelines-vocabularies /data/spark-tmp /data/tmp /data/var
chown: fts_read failed: No such file or directory
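A possible explanation (my reading of the trace above, not confirmed in the issue): /data is a symlink to /mnt, and the recursive chown also covers /data/tmp and /data/var, which the script never creates and which sit on /mnt where EMR services are already writing; if an entry disappears while chown is walking the tree, fts_read fails, and under `set -e` that aborts the whole bootstrap. One way to make this step more tolerant is to create every target first and not let a failed chown kill the script:

```bash
# Sketch only -- ensure every chown target exists, and don't let a transient
# failure during the recursive walk abort the bootstrap
for d in /data/biocache-load /data/dwca-tmp /data/la-pipelines /data/pipelines-shp \
         /data/pipelines-vocabularies /data/spark-tmp /data/tmp /data/var; do
  sudo mkdir -p "$d"
  sudo chown -R hadoop:hadoop "$d" || echo "warning: chown failed for $d" >&2
done
```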

Thought that might be a problem with the EMR version, so I tried it with emr-5.36.1 and got this: [screenshot]

Then I changed the `set -ex` to `set -x` so the bootstrap doesn't fail when a command errors, reverted the EMR version, and the ingest completed. But the log still shows a command failing in the bootstrap:

+ sudo hadoop fs -mkdir /dwca-exports
sudo: hadoop: command not found
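The "command not found" looks like a PATH issue (an assumption on my part, not verified against the linked commit): sudo resets PATH to its secure_path, which doesn't include the Hadoop bin directory, so `sudo hadoop ...` fails even where the hadoop user can run the command directly. Running it as the hadoop user with a login shell, or dropping sudo if the bootstrap already runs as hadoop, picks up the right PATH:

```bash
# Sketch only -- run the HDFS mkdir with the hadoop user's own environment
sudo -u hadoop bash -lc 'hadoop fs -mkdir -p /dwca-exports'
# or, if the bootstrap script is already executed as the hadoop user:
hadoop fs -mkdir -p /dwca-exports
```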