This issue collects information about a problem we had the other day when ingesting large datasets. The fixed code is here: https://github.com/AtlasOfLivingAustralia/pipelines-airflow/commit/563ea38c2f4d17c1a7efbdb460c6430e6cce4896

The first issue we faced was that the bootstrap startup took more than 47 min: https://ap-southeast-2.console.aws.amazon.com/emr/home?region=ap-southeast-2#/clusterDetails/j-DRYFVCGYKXUN

![image](https://github.com/AtlasOfLivingAustralia/data-management/assets/1578598/6c218e33-cff5-47c3-8f0c-cc2289747ce0)

To solve it, I started investigating and looking into the logs. I noticed that the bootstrap logs aren't complete:

I added `set -ex` at the top of the script to see where it gets stuck and saw this:

![image](https://github.com/AtlasOfLivingAustralia/data-management/assets/1578598/018b7552-22f1-4817-bf1a-de380175c476)

with the log:

I thought it might be a problem with the EMR version, so I tried `emr-5.36.1` and got this:

![image](https://github.com/AtlasOfLivingAustralia/data-management/assets/1578598/a8910269-6f25-4dce-b315-15b7fa15bd03)

Then I changed `set -ex` to `set -x` so that the script does not fail if there is a problem in bootstrap, and also reverted the EMR version. With that, the ingest completed. The bootstrap log still shows that the command fails:

    + sudo hadoop fs -mkdir /dwca-exports
    sudo: hadoop: command not found
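A possible explanation for the `sudo: hadoop: command not found` line, not confirmed in this issue: EMR bootstrap actions run before the cluster applications are installed, so `hadoop` may not be on the `PATH` yet when the script runs. The sketch below shows one way to keep `set -ex` (abort on real failures, trace every command) while skipping a step whose binary is missing. The helper name `run_if_available` is made up for illustration, and this is not necessarily what the linked commit does.

```shell
#!/usr/bin/env bash
# -e: abort on the first failing command; -x: trace each command as it runs.
set -ex

# Hypothetical helper (not from the original script): run a command only
# when the binary it depends on is present, otherwise log and carry on.
# Usage: run_if_available <required-binary> <command> [args...]
run_if_available() {
  local required="$1"
  shift
  if command -v "$required" >/dev/null 2>&1; then
    # The binary exists, so a failure here is a real error and,
    # under `set -e`, still aborts the script.
    "$@"
  else
    echo "skipping '$*': $required not found on PATH" >&2
  fi
}

# The failing step from the log above, now skipped while hadoop is absent.
run_if_available hadoop sudo hadoop fs -mkdir /dwca-exports
```

If the directory is actually needed, creating it in a regular EMR step (which runs after the applications are installed) rather than in the bootstrap script is another option to consider.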