ryohang opened this issue 7 years ago
This is a file that is present on an EMR cluster. The connector reads it to determine what instance type it is running against, in order to derive job settings such as memory. Obviously, running locally you wouldn't have this file, so this is expected. Does it fail to start?
We can maybe improve this to have a local mode where it won't look for this file.
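Roughly, the idea would be something like the sketch below. This is not the current code, just an illustration: the class name, fallback value, and "local mode" check are all placeholders, not the library's actual API.

// Hypothetical sketch of a local-mode guard around the job-flow.json lookup.
// The fallback constant and method names below are assumptions for illustration.
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LocalAwareNodeCapacityProvider {

  private static final String JOB_FLOW_JSON = "/mnt/var/lib/info/job-flow.json";
  // Assumed default used when the EMR metadata file is absent (e.g. local runs).
  private static final int DEFAULT_CORE_NODE_MEMORY_MB = 8 * 1024;

  public int getCoreNodeMemoryMB() {
    Path jobFlow = Paths.get(JOB_FLOW_JSON);
    if (!Files.isReadable(jobFlow)) {
      // Local mode: skip the EMR lookup instead of surfacing NoSuchFileException.
      return DEFAULT_CORE_NODE_MEMORY_MB;
    }
    // On EMR the file exists; parse it as the connector does today.
    return parseCoreNodeMemoryMB(jobFlow);
  }

  private int parseCoreNodeMemoryMB(Path jobFlow) {
    // Placeholder for the existing job-flow.json parsing logic.
    return DEFAULT_CORE_NODE_MEMORY_MB;
  }
}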
Thanks for the prompt response. It didn't fail to start, though. Everything else was working except this console exception.
@ryohang Can you please tell me what your local environment looks like? I'm trying to do a similar thing. My local Spark was not able to do a simple count operation on the table in the local DynamoDB (but from the logs I can see clearly that the table's information is retrieved properly). Besides the exception mentioned above, I also get the exception below when the Spark program tries to run any actions. I'm using Spark 1.6, by the way.
17/11/06 17:12:15 ERROR ReadWorker: Unknown exception thrown!
java.lang.NullPointerException
at org.apache.hadoop.dynamodb.preader.ScanRecordReadRequest.fetchPage(ScanRecordReadRequest.java:47)
at org.apache.hadoop.dynamodb.preader.AbstractRecordReadRequest.read(AbstractRecordReadRequest.java:46)
at org.apache.hadoop.dynamodb.preader.ReadWorker.runInternal(ReadWorker.java:84)
at org.apache.hadoop.dynamodb.preader.ReadWorker.run(ReadWorker.java:46)
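For anyone hitting the same NullPointerException, a minimal count against local DynamoDB through the connector looks roughly like this. The table name, endpoint, and region are placeholders, and the dynamodb.* property keys should be double-checked against the connector version in use.

// Sketch of a count against local DynamoDB through the connector (Spark 1.6-era API).
import org.apache.hadoop.dynamodb.DynamoDBItemWritable;
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalDynamoCount {
  public static void main(String[] args) {
    JavaSparkContext sc =
        new JavaSparkContext(new SparkConf().setMaster("local[*]").setAppName("local-dynamo-count"));

    JobConf jobConf = new JobConf(sc.hadoopConfiguration());
    jobConf.set("dynamodb.input.tableName", "my-table");        // placeholder table
    jobConf.set("dynamodb.endpoint", "http://localhost:8000");  // local DynamoDB endpoint
    jobConf.set("dynamodb.regionid", "us-east-1");              // placeholder region

    // Reads the table as (Text, DynamoDBItemWritable) pairs via the connector's InputFormat.
    JavaPairRDD<Text, DynamoDBItemWritable> rows =
        sc.hadoopRDD(jobConf, DynamoDBInputFormat.class, Text.class, DynamoDBItemWritable.class);

    System.out.println("count = " + rows.count());
    sc.stop();
  }
}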
Figured that out. I was using 4.2. 4.5 doesn't have this issue.
Still getting this error. Using version 4.5. Is there a resolution to this?
4.5 gives the same error, and even 4.8 does. I am trying to run my app on my laptop (MacBook).
My workaround consists of adding a file /mnt/var/lib/info/job-flow.json
in the filesystem of my Spark cluster.
The content of the file was taken from there:
{
  "jobFlowId": "j-2AO77MNLG17NW",
  "jobFlowCreationInstant": 1429046932628,
  "instanceCount": 2,
  "masterInstanceId": "i-08dea4f4",
  "masterPrivateDnsName": "localhost",
  "masterInstanceType": "m1.medium",
  "slaveInstanceType": "m1.xlarge",
  "hadoopVersion": "2.4.0",
  "instanceGroups": [
    {
      "instanceGroupId": "ig-16NXM94TY33LB",
      "instanceGroupName": "CORE",
      "instanceRole": "Core",
      "marketType": "OnDemand",
      "instanceType": "m3.xlarge",
      "requestedInstanceCount": 1
    },
    {
      "instanceGroupId": "ig-2XQ29JGCTKLBL",
      "instanceGroupName": "MASTER",
      "instanceRole": "Master",
      "marketType": "OnDemand",
      "instanceType": "m1.medium",
      "requestedInstanceCount": 1
    }
  ]
}
If I run Spark locally, I add it to my filesystem. If I run Spark in a Docker container, I mount that file as a volume:
services:
  spark-master:
    image: bde2020/spark-master:2.4.4-hadoop2.7
    ...
    volumes:
      - ${PWD}/path/to/job-flow.json:/mnt/var/lib/info/job-flow.json
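If you are running plain unit tests instead of Docker, a small fixture can create the same file before the suite runs. This is only a sketch: the path must be creatable and writable on your machine (on macOS/Linux you may need to create /mnt/var/lib/info with elevated permissions first), and the JSON body just mirrors a cut-down version of the example above.

// Sketch of a test fixture that writes a minimal job-flow.json before tests run.
// Assumes /mnt/var/lib/info is writable; the JSON fields mirror the example above.
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class JobFlowJsonFixture {

  private static final Path JOB_FLOW_JSON = Paths.get("/mnt/var/lib/info/job-flow.json");

  public static void writeIfMissing() throws Exception {
    if (Files.exists(JOB_FLOW_JSON)) {
      return; // already provisioned (e.g. on a real EMR cluster)
    }
    Files.createDirectories(JOB_FLOW_JSON.getParent());
    String json = "{\n"
        + "  \"jobFlowId\": \"j-LOCAL\",\n"
        + "  \"instanceCount\": 1,\n"
        + "  \"masterInstanceType\": \"m1.medium\",\n"
        + "  \"slaveInstanceType\": \"m1.xlarge\",\n"
        + "  \"instanceGroups\": [\n"
        + "    { \"instanceGroupName\": \"CORE\", \"instanceRole\": \"Core\",\n"
        + "      \"instanceType\": \"m3.xlarge\", \"requestedInstanceCount\": 1 }\n"
        + "  ]\n"
        + "}\n";
    Files.write(JOB_FLOW_JSON, json.getBytes(StandardCharsets.UTF_8));
  }
}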
Dear team,
Thanks for making such an awesome library. I was able to set up my Spark process to connect locally. While running unit tests against local DynamoDB, I always get the following exception from the library. It didn't interrupt my test or my code, though. I am wondering what caused this exception. Do I have to add an additional config file?
[Stage 11:=====> (9 + 4) / 100]
2017-10-18 23:27:19,791 WARN : org.apache.hadoop.dynamodb.util.ClusterTopologyNodeCapacityProvider - Exception when trying to determine instance types
java.nio.file.NoSuchFileException: /mnt/var/lib/info/job-flow.json
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361)
at java.nio.file.Files.newByteChannel(Files.java:407)
at java.nio.file.Files.readAllBytes(Files.java:3152)
at org.apache.hadoop.dynamodb.util.ClusterTopologyNodeCapacityProvider.readJobFlowJsonString(ClusterTopologyNodeCapacityProvider.java:105)
at org.apache.hadoop.dynamodb.util.ClusterTopologyNodeCapacityProvider.getCoreNodeMemoryMB(ClusterTopologyNodeCapacityProvider.java:44)
at org.apache.hadoop.dynamodb.util.TaskCalculator.getMaxMapTasks(TaskCalculator.java:54)
at org.apache.hadoop.dynamodb.DynamoDBUtil.calcMaxMapTasks(DynamoDBUtil.java:257)
at org.apache.hadoop.dynamodb.write.WriteIopsCalculator.calculateMaxMapTasks(WriteIopsCalculator.java:79)
at org.apache.hadoop.dynamodb.write.WriteIopsCalculator.<init>(WriteIopsCalculator.java:64)
at org.apache.hadoop.dynamodb.write.AbstractDynamoDBRecordWriter.<init>(AbstractDynamoDBRecordWriter.java:81)
at org.apache.hadoop.dynamodb.write.DefaultDynamoDBRecordWriter.<init>(DefaultDynamoDBRecordWriter.java:27)
at org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat.getRecordWriter(DynamoDBOutputFormat.java:30)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1206)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1197)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
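For completeness, the write path that triggers the warning is wired up roughly like this in my test. The table name and endpoint are placeholders for my local setup, and the dynamodb.* keys and DynamoDBItemWritable usage reflect my understanding of the connector API, so treat them as assumptions.

// Sketch of writing to local DynamoDB via the connector's OutputFormat (Spark 1.6-era API).
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.dynamodb.DynamoDBItemWritable;
import org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class LocalDynamoWrite {
  public static void main(String[] args) {
    JavaSparkContext sc =
        new JavaSparkContext(new SparkConf().setMaster("local[*]").setAppName("local-dynamo-write"));

    JobConf jobConf = new JobConf(sc.hadoopConfiguration());
    jobConf.set("dynamodb.output.tableName", "my-table");        // placeholder table
    jobConf.set("dynamodb.endpoint", "http://localhost:8000");   // local DynamoDB endpoint
    jobConf.setOutputFormat(DynamoDBOutputFormat.class);
    jobConf.setOutputKeyClass(Text.class);
    jobConf.setOutputValueClass(DynamoDBItemWritable.class);

    // Build the writables inside the task to avoid serializing them from the driver.
    JavaPairRDD<Text, DynamoDBItemWritable> rdd =
        sc.parallelize(Arrays.asList("1", "2", "3"))
          .mapToPair(id -> {
            Map<String, AttributeValue> attributes = new HashMap<>();
            attributes.put("id", new AttributeValue().withS(id));
            DynamoDBItemWritable item = new DynamoDBItemWritable();
            item.setItem(attributes);
            return new Tuple2<>(new Text(""), item);
          });

    // This is the call that goes through WriteIopsCalculator and hits the warning above.
    rdd.saveAsHadoopDataset(jobConf);
    sc.stop();
  }
}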