EmanuelOverflow / LSTM-TensorSpark

Implementation of an LSTM with TensorFlow, distributed on Apache Spark
MIT License

The Spark mode cannot be started due to org.apache.spark.api.python.PythonException #4

Open wangao1236 opened 4 years ago

wangao1236 commented 4 years ago

Hello, I am a beginner in deep learning and am currently learning about LSTMs. I am trying to apply an LSTM on Spark, so I referred to your code. I ran your code in Spark mode in an Anaconda3 Python 2.7 environment, but the exception below occurred. Could you tell me what causes this? The logs are as follows:

(tensor-spark) ➜  src git:(master) ✗ spark-submit rnn.py --training_path ../dataset/iris.data --labels_path ../dataset/labels.data --output_path train_dir_iris --partitions 4 > tmp.log
19/10/20 23:26:08 WARN Utils: Your hostname, wangaodeMacBook-Pro resolves to a loopback address: 127.0.0.1; using 10.135.139.166 instead (on interface en0)
19/10/20 23:26:08 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
19/10/20 23:26:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/10/20 23:26:09 INFO SparkContext: Running Spark version 2.4.4
19/10/20 23:26:09 INFO SparkContext: Submitted application: RNN-LSTM
19/10/20 23:26:09 INFO SecurityManager: Changing view acls to: wangao
19/10/20 23:26:09 INFO SecurityManager: Changing modify acls to: wangao
19/10/20 23:26:09 INFO SecurityManager: Changing view acls groups to: 
19/10/20 23:26:09 INFO SecurityManager: Changing modify acls groups to: 
19/10/20 23:26:09 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(wangao); groups with view permissions: Set(); users  with modify permissions: Set(wangao); groups with modify permissions: Set()
19/10/20 23:26:10 INFO Utils: Successfully started service 'sparkDriver' on port 55568.
19/10/20 23:26:10 INFO SparkEnv: Registering MapOutputTracker
19/10/20 23:26:10 INFO SparkEnv: Registering BlockManagerMaster
19/10/20 23:26:10 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
19/10/20 23:26:10 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
19/10/20 23:26:10 INFO DiskBlockManager: Created local directory at /private/var/folders/4p/f91tyykn4293vsb57ccfxz_m0000gn/T/blockmgr-41f34878-dc8b-413a-9999-f940d04ca46a
19/10/20 23:26:10 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
19/10/20 23:26:10 INFO SparkEnv: Registering OutputCommitCoordinator
19/10/20 23:26:10 INFO Utils: Successfully started service 'SparkUI' on port 4040.
19/10/20 23:26:10 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.135.139.166:4040
19/10/20 23:26:10 INFO Executor: Starting executor ID driver on host localhost
19/10/20 23:26:10 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 55569.
19/10/20 23:26:10 INFO NettyBlockTransferService: Server created on 10.135.139.166:55569
19/10/20 23:26:10 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
19/10/20 23:26:10 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.135.139.166, 55569, None)
19/10/20 23:26:10 INFO BlockManagerMasterEndpoint: Registering block manager 10.135.139.166:55569 with 366.3 MB RAM, BlockManagerId(driver, 10.135.139.166, 55569, None)
19/10/20 23:26:10 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.135.139.166, 55569, None)
19/10/20 23:26:10 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.135.139.166, 55569, None)
LSTM - Partition: 1
LSTM - Partition: 2
LSTM - Partition: 3
LSTM - Partition: 0
2019-10-20 23:26:14.293966: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2019-10-20 23:26:14.294493: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 12. Tune using inter_op_parallelism_threads for best performance.
2019-10-20 23:26:14.301192: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2019-10-20 23:26:14.301805: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 12. Tune using inter_op_parallelism_threads for best performance.
2019-10-20 23:26:14.336795: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2019-10-20 23:26:14.337356: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 12. Tune using inter_op_parallelism_threads for best performance.
2019-10-20 23:26:14.339251: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2019-10-20 23:26:14.339749: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 12. Tune using inter_op_parallelism_threads for best performance.
Loss: 0.9999 - t_acc 0.500: 100%|██████████| 10/10 [00:01<00:00,  6.32it/s]                    
Loss: 1.4896 - t_acc 0.500: 100%|██████████| 10/10 [00:01<00:00,  6.28it/s]                    
Loss: 1.2986 - t_acc 0.800: 100%|██████████| 10/10 [00:01<00:00,  6.37it/s]
Loss: 2.1635 - t_acc 0.600: 100%|██████████| 10/10 [00:01<00:00,  6.27it/s]                    
RNN-LSTM - Partition: 1 - Time: 1.66123604774s
RNN-LSTM - Partition: 2 - Time: 1.67428016663s
RNN-LSTM - Partition: 3 - Time: 1.65914416313s
LSTM - Partition: 4
RNN-LSTM - Partition: 0 - Time: 1.71885895729s
19/10/20 23:26:16 ERROR Executor: Exception in task 4.0 in stage 1.0 (TID 5)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/wangao/anaconda3/envs/tensor-spark/lib/python2.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/Users/wangao/anaconda3/envs/tensor-spark/lib/python2.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/wangao/anaconda3/envs/tensor-spark/lib/python2.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/rdd.py", line 2499, in pipeline_func
  File "/Users/wangao/anaconda3/envs/tensor-spark/lib/python2.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/rdd.py", line 2499, in pipeline_func
  File "/Users/wangao/anaconda3/envs/tensor-spark/lib/python2.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/rdd.py", line 2499, in pipeline_func
  File "/Users/wangao/anaconda3/envs/tensor-spark/lib/python2.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/rdd.py", line 2499, in pipeline_func
  File "/Users/wangao/anaconda3/envs/tensor-spark/lib/python2.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/rdd.py", line 352, in func
  File "/Users/wangao/My Code/Python/LSTM-TensorSpark/src/rnn.py", line 394, in <lambda>
    lambda x: train_rnn(x, net_settings, FLAGS), True)
  File "/Users/wangao/My Code/Python/LSTM-TensorSpark/src/rnn.py", line 209, in train_rnn
    rnn_model = rnn.RNN(net_settings)
  File "models/recurrent/rnn.py", line 11, in __init__
    dim_size=setting['dim_size'], batch_size=setting['batch_size'])
  File "models/recurrent/lstm.py", line 31, in __init__
    dtype=tf.float32),
  File "models/recurrent/lstm.py", line 7, in create_variable
    var = tf.compat.v1.get_variable(name=name, shape=shape, dtype=dtype, initializer=initializer())
  File "/Users/wangao/anaconda3/envs/tensor-spark/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 1496, in get_variable
    aggregation=aggregation)
  File "/Users/wangao/anaconda3/envs/tensor-spark/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 1239, in get_variable
    aggregation=aggregation)
  File "/Users/wangao/anaconda3/envs/tensor-spark/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 562, in get_variable
    aggregation=aggregation)
  File "/Users/wangao/anaconda3/envs/tensor-spark/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 514, in _true_getter
    aggregation=aggregation)
  File "/Users/wangao/anaconda3/envs/tensor-spark/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 864, in _get_single_variable
    (err_msg, "".join(traceback.format_list(tb))))
ValueError: Variable LSTMLayer0/weights_forget_h already exists, disallowed. Did you mean to set reuse=True or reuse=tf.AUTO_REUSE in VarScope? Originally defined at:

  File "models/recurrent/lstm.py", line 7, in create_variable
    var = tf.compat.v1.get_variable(name=name, shape=shape, dtype=dtype, initializer=initializer())
  File "models/recurrent/lstm.py", line 31, in __init__
    dtype=tf.float32),
  File "models/recurrent/rnn.py", line 11, in __init__
    dim_size=setting['dim_size'], batch_size=setting['batch_size'])
  File "/Users/wangao/My Code/Python/LSTM-TensorSpark/src/rnn.py", line 209, in train_rnn
    rnn_model = rnn.RNN(net_settings)
  File "/Users/wangao/My Code/Python/LSTM-TensorSpark/src/rnn.py", line 394, in <lambda>
    lambda x: train_rnn(x, net_settings, FLAGS), True)

        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:592)
        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:575)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$class.foreach(Iterator.scala:891)
        at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
        at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
        at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
        at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
        at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
        at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
        at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
        at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:123)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
19/10/20 23:26:16 ERROR TaskSetManager: Task 4 in stage 1.0 failed 1 times; aborting job
EmanuelOverflow commented 4 years ago

Maybe it is due to a different version of TensorFlow; the problem seems to be in variable initialization/reuse. This repository is not maintained because of its academic nature. Please use the official LSTM in TensorFlow/Keras instead.
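For context, the ValueError above is the TF 1.x-style variable system refusing to create `LSTMLayer0/weights_forget_h` a second time in the same process. Spark reuses Python worker processes across tasks by default, which is likely why the extra partition (task 4, started after the first four finished) hits variables created by an earlier task. A minimal sketch of two common workarounds, where `build_lstm_weights` and `train_partition` are hypothetical stand-ins for the code in `models/recurrent/lstm.py` and `src/rnn.py`, not the repository's actual functions:

```python
import tensorflow as tf

tf1 = tf.compat.v1  # the TF 1.x-style variable API the traceback goes through


def build_lstm_weights():
    # Hypothetical stand-in for create_variable in models/recurrent/lstm.py.
    # With the default reuse=None, building this twice in one process raises
    # "ValueError: Variable LSTMLayer0/weights_forget_h already exists".
    # reuse=tf1.AUTO_REUSE returns the existing variable instead of raising.
    with tf1.variable_scope("LSTMLayer0", reuse=tf1.AUTO_REUSE):
        return tf1.get_variable(
            name="weights_forget_h", shape=[128, 128], dtype=tf.float32,
            initializer=tf1.glorot_uniform_initializer())


def train_partition(partition_data):
    # Hypothetical stand-in for train_rnn in src/rnn.py. Alternative fix:
    # clear the default graph at the start of every Spark task, so variables
    # left over from a previous task in the same reused Python worker cannot
    # collide with the new model's variables.
    tf1.reset_default_graph()
    weights = build_lstm_weights()
    # ... build the rest of the model and run the training loop ...
```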


wangao1236 commented 4 years ago

> Maybe it is due to a different version of TensorFlow; the problem seems to be in variable initialization/reuse. This repository is not maintained because of its academic nature. Please use the official LSTM in TensorFlow/Keras instead.

Thank you, I'll try it.
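A minimal sketch of the suggested TensorFlow/Keras replacement, with assumed shapes for the Iris data (4 features read as a length-4 sequence of scalars, 3 classes) and random placeholder arrays standing in for the real `--training_path`/`--labels_path` loading code:

```python
import numpy as np
import tensorflow as tf

# Placeholder data with Iris-like shapes: 150 samples, 4 features
# treated as a length-4 sequence of scalars, 3 classes.
x = np.random.rand(150, 4, 1).astype("float32")
y = np.random.randint(0, 3, size=(150,))

# The official LSTM layer replaces the hand-rolled cell in this repo.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(4, 1)),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x, y, epochs=10, batch_size=10)
```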