awslabs / emr-dynamodb-connector

Implementations of open source Apache Hadoop/Hive interfaces which allow for ingesting data from Amazon DynamoDB
Apache License 2.0

java.nio.file.NoSuchFileException: /mnt/var/lib/info/job-flow.json exception from emr-dynamodb-hadoop #50

Open ryohang opened 6 years ago

ryohang commented 6 years ago

Dear team,

Thanks for making such an awesome library. I was able to set up my Spark process to connect locally. While running unit tests against local DynamoDB, I always get the following exception from the library. It didn't interrupt my test or my code, though. I am wondering what caused this exception. Do I have to add an additional config file?

[Stage 11:=====> (9 + 4) / 100]2017-10-18 23:27:19,791 WARN : org.apache.hadoop.dynamodb.util.ClusterTopologyNodeCapacityProvider - Exception when trying to determine instance types
java.nio.file.NoSuchFileException: /mnt/var/lib/info/job-flow.json
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
    at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
    at java.nio.file.Files.newByteChannel(Files.java:361)
    at java.nio.file.Files.newByteChannel(Files.java:407)
    at java.nio.file.Files.readAllBytes(Files.java:3152)
    at org.apache.hadoop.dynamodb.util.ClusterTopologyNodeCapacityProvider.readJobFlowJsonString(ClusterTopologyNodeCapacityProvider.java:105)
    at org.apache.hadoop.dynamodb.util.ClusterTopologyNodeCapacityProvider.getCoreNodeMemoryMB(ClusterTopologyNodeCapacityProvider.java:44)
    at org.apache.hadoop.dynamodb.util.TaskCalculator.getMaxMapTasks(TaskCalculator.java:54)
    at org.apache.hadoop.dynamodb.DynamoDBUtil.calcMaxMapTasks(DynamoDBUtil.java:257)
    at org.apache.hadoop.dynamodb.write.WriteIopsCalculator.calculateMaxMapTasks(WriteIopsCalculator.java:79)
    at org.apache.hadoop.dynamodb.write.WriteIopsCalculator.<init>(WriteIopsCalculator.java:64)
    at org.apache.hadoop.dynamodb.write.AbstractDynamoDBRecordWriter.<init>(AbstractDynamoDBRecordWriter.java:81)
    at org.apache.hadoop.dynamodb.write.DefaultDynamoDBRecordWriter.<init>(DefaultDynamoDBRecordWriter.java:27)
    at org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat.getRecordWriter(DynamoDBOutputFormat.java:30)
    at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1206)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1197)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

z-york commented 6 years ago

This is a file that is present on an EMR cluster; the connector reads it to determine which instance types the job is running on, in order to derive job settings such as memory. Obviously, running locally you wouldn't have this file, so this is expected. Does the job fail to start?

We could maybe improve this by adding a local mode where it won't look for this file.
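
For context, the lookup that triggers the warning boils down to something like the sketch below. This is a simplified, hypothetical rendering: only the file path and the readJobFlowJsonString name come from the stack trace above; everything else is illustrative.

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class JobFlowLookupSketch {
    // Path that only exists on EMR nodes; missing on laptops and in containers.
    private static final String JOB_FLOW_PATH = "/mnt/var/lib/info/job-flow.json";

    // Returns the raw job-flow JSON, or null when the file is absent,
    // in which case the connector logs the WARN seen above and falls
    // back to default capacity settings instead of aborting the job.
    static String readJobFlowJsonString() {
        try {
            byte[] bytes = Files.readAllBytes(Paths.get(JOB_FLOW_PATH));
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (Exception e) {
            return null;
        }
    }
}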

ryohang commented 6 years ago

Thanks for the prompt response. It didn't fail to start, though. Everything else was working except for this console exception.

sudododo commented 6 years ago

@ryohang Can you please describe what your local environment looks like? I'm trying to do a similar thing. My local Spark was not able to do a simple count operation on the table in local DynamoDB (though from the logs I clearly saw the table's information being retrieved properly). Besides the exception you mentioned above, I also get another exception, below, when the Spark program tries to do any actions. I'm using Spark 1.6, by the way.

17/11/06 17:12:15 ERROR ReadWorker: Unknown exception thrown!
java.lang.NullPointerException
    at org.apache.hadoop.dynamodb.preader.ScanRecordReadRequest.fetchPage(ScanRecordReadRequest.java:47)
    at org.apache.hadoop.dynamodb.preader.AbstractRecordReadRequest.read(AbstractRecordReadRequest.java:46)
    at org.apache.hadoop.dynamodb.preader.ReadWorker.runInternal(ReadWorker.java:84)
    at org.apache.hadoop.dynamodb.preader.ReadWorker.run(ReadWorker.java:46)

sudododo commented 6 years ago

Figured it out. I was using 4.2; 4.5 doesn't have this issue.
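
For reference, upgrading just means bumping the connector dependency, e.g. in Maven. The coordinates below are the ones published to Maven Central for this project; the version is an example, adjust as needed.

<dependency>
  <groupId>com.amazon.emr</groupId>
  <artifactId>emr-dynamodb-hadoop</artifactId>
  <version>4.5.0</version>
</dependency>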

ravishchawla commented 6 years ago

Still getting this error. Using version 4.5. Is there a resolution to this?

kali786516 commented 5 years ago

4.5 gives the same error, and so does 4.8. I am trying to run my app on my laptop (MacBook).

julienrf commented 6 months ago

My workaround consists of adding a file /mnt/var/lib/info/job-flow.json to the filesystem of my Spark cluster.

The content of the file was taken from there:

{
  "jobFlowId": "j-2AO77MNLG17NW",
  "jobFlowCreationInstant": 1429046932628,
  "instanceCount": 2,
  "masterInstanceId": "i-08dea4f4",
  "masterPrivateDnsName": "localhost",
  "masterInstanceType": "m1.medium",
  "slaveInstanceType": "m1.xlarge",
  "hadoopVersion": "2.4.0",
  "instanceGroups": [
    {
      "instanceGroupId": "ig-16NXM94TY33LB",
      "instanceGroupName": "CORE",
      "instanceRole": "Core",
      "marketType": "OnDemand",
      "instanceType": "m3.xlarge",
      "requestedInstanceCount": 1
    },
    {
      "instanceGroupId": "ig-2XQ29JGCTKLBL",
      "instanceGroupName": "MASTER",
      "instanceRole": "Master",
      "marketType": "OnDemand",
      "instanceType": "m1.medium",
      "requestedInstanceCount": 1
    }
  ]
}

If I run Spark locally, I add it to my filesystem. If I run Spark in a Docker container, I mount that file as a volume:

services:
  spark-master:
    image: bde2020/spark-master:2.4.4-hadoop2.7
    ...
    volumes:
      - ${PWD}/path/to/job-flow.json:/mnt/var/lib/info/job-flow.json
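
With the file in place, a plain local run stops logging the warning. For completeness, here is roughly what a minimal local read against DynamoDB Local looks like. This is a sketch only: the table name, endpoint, and region are placeholders, and the dynamodb.* keys are the JobConf settings commonly used with this connector.

import org.apache.hadoop.dynamodb.DynamoDBItemWritable;
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalDynamoDBCount {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("local-ddb-count").setMaster("local[*]"));

        JobConf jobConf = new JobConf(sc.hadoopConfiguration());
        jobConf.set("dynamodb.input.tableName", "MyTable");        // placeholder table name
        jobConf.set("dynamodb.endpoint", "http://localhost:8000"); // DynamoDB Local endpoint
        jobConf.set("dynamodb.regionid", "us-east-1");             // placeholder region

        // The connector's input format uses the old mapred API, so hadoopRDD is the right call.
        JavaPairRDD<Text, DynamoDBItemWritable> items = sc.hadoopRDD(
            jobConf, DynamoDBInputFormat.class, Text.class, DynamoDBItemWritable.class);

        System.out.println("item count: " + items.count());
        sc.stop();
    }
}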