andabi / deep-voice-conversion

Deep neural networks for voice conversion (voice style transfer) in Tensorflow
MIT License

STUCK at train1.py, line 60: launch_train_with_config(train_conf, trainer=trainer) #103

Open sallyjoy opened 5 years ago

sallyjoy commented 5 years ago

Any idea?

[0409 18:16:28 @parallel.py:193] [MultiProcessPrefetchData] Will fork a dataflow more than one times. This assumes the datapoints are i.i.d.
[0409 18:16:28 @argtools.py:146] WRN "import prctl" failed! Install python-prctl so that processes can be cleaned with guarantee.
[0409 18:16:28 @training.py:50] [DataParallel] Training a model of 2 towers.
[0409 18:16:28 @interface.py:43] Automatically applying StagingInput on the DataFlow.
Traceback (most recent call last):
  File "train1.py", line 80, in <module>
    train(args, logdir=logdir_train1)
  File "train1.py", line 60, in train
    launch_train_with_config(train_conf, trainer=trainer)
  File "/usr/local/lib/python3.6/dist-packages/tensorpack/train/interface.py", line 90, in launch_train_with_config
    model.get_input_signature(), input,
  File "/usr/local/lib/python3.6/dist-packages/tensorpack/utils/argtools.py", line 200, in wrapper
    value = func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorpack/graph_builder/model_desc.py", line 86, in get_input_signature
    inputs = self.inputs()
  File "/usr/local/lib/python3.6/dist-packages/tensorpack/graph_builder/model_desc.py", line 116, in inputs
    raise NotImplementedError()
NotImplementedError

ash13 commented 5 years ago

Were you able to fix this @sallyjoy ? I am also stuck at this!

YashBangera7 commented 5 years ago

@sallyjoy what command did you use to run train1.py?

hallcacrx commented 5 years ago

I modified the model.py by renaming the functions, and that error is gone, but I also ran into another problem: build_graph() takes exactly 2 arguments (3 given). I suppose I don't have the proper version of tensorpack.
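For anyone trying the same rename, the newer tensorpack ModelDesc interface looks roughly like this (only a sketch: the real input specs and layers live in this repo's model file, and the shapes and cost below are placeholders):

```python
import tensorflow as tf
from tensorpack import ModelDesc, InputDesc  # newer tensorpack also accepts tf.TensorSpec here

class Net1(ModelDesc):
    def inputs(self):                         # was: def _get_inputs(self)
        # placeholder shapes -- use the real MFCC / phoneme specs from the repo
        return [InputDesc(tf.float32, (None, None, 40), 'x_mfccs'),
                InputDesc(tf.int32, (None, None), 'y_ppgs')]

    def build_graph(self, x_mfccs, y_ppgs):   # was: def _build_graph(self, inputs)
        # the new interface passes the inputs unpacked, which is why keeping the old
        # one-argument signature raises "build_graph() takes exactly 2 arguments (3 given)"
        loss = tf.reduce_mean(x_mfccs) + tf.reduce_mean(tf.cast(y_ppgs, tf.float32))  # dummy cost
        return loss                            # newer tensorpack expects the cost to be returned

    def optimizer(self):                       # was: def _get_optimizer(self)
        return tf.train.AdamOptimizer(1e-4)
```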

kushmisra commented 5 years ago

I am also stuck at the same point. Does anybody know a way to fix it?

sebasdeldi commented 5 years ago

stuck at the same point

sallyjoy commented 5 years ago

@sallyjoy what command did you use to run train1.py?

Thanks for replying. I used this command: python train1.py case -gpu 0

LucasMoskun commented 5 years ago

It looks like the function being called in tensorpack's model_desc module has been deprecated; the body of the function has been removed entirely and now just throws the error: Link to model_desc.py (see line 136)
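In other words, the base method that the traceback lands in is now effectively abstract, so a model that only defines the old _get_inputs() name falls straight through to it (reconstructed from the traceback above, not copied from the file):

```python
# tensorpack/graph_builder/model_desc.py, as seen in the tracebacks in this thread:
# the base inputs() no longer falls back to _get_inputs(), it just raises.
def inputs(self):
    raise NotImplementedError()
```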

LucasMoskun commented 5 years ago

As a hack fix, this version/release of tensorpack (0.9.0.1) seems to work: https://github.com/tensorpack/tensorpack/archive/0.9.0.1.zip
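If it helps, pip can install that exact release straight from the archive URL, e.g. pip install --upgrade https://github.com/tensorpack/tensorpack/archive/0.9.0.1.zip (it is probably safest to uninstall any existing tensorpack first).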

sallyjoy commented 5 years ago

As a hack fix, this version/release of tensorpack (0.9.0.1) seems to work: https://github.com/tensorpack/tensorpack/archive/0.9.0.1.zip


Thanks for the suggestion.

I have installed tensorpack 0.9.0.1 and that error is gone. Unfortunately, I got other strange errors. I am testing the code with the following datasets: TIMIT and Arctic. Later, if it works, I am planning to replace Arctic with my own dataset.


case: case, logdir: /data/private/vc/logdir/case/train1
/data/private/vc/datasets/timit/TIMIT/TRAIN///*.wav
[0428 13:15:37 @logger.py:108] WRN Log directory /data/private/vc/logdir/case/train1 exists! Use 'd' to delete it.
[0428 13:15:37 @logger.py:111] WRN If you're resuming from a previous run, you can choose to keep it. Press any other key to exit.
Select Action: k (keep) / d (delete) / q (quit):d
[0428 13:15:46 @logger.py:73] Argv: train1.py case -gpu 0
[0428 13:15:46 @parallel.py:186] [MultiProcessPrefetchData] Will fork a dataflow more than one times. This assumes the datapoints are i.i.d.
[0428 13:15:46 @argtools.py:146] WRN Install python-prctl so that processes can be cleaned with guarantee.
[0428 13:15:46 @config.py:165] WRN TrainConfig.nr_tower was deprecated! Set the number of GPUs on the trainer instead!
[0428 13:15:47 @config.py:166] WRN See https://github.com/tensorpack/tensorpack/issues/458 for more information.
-----OK(----------
[0428 13:15:47 @training.py:52] [DataParallel] Training a model of 2 towers.
[0428 13:15:47 @training.py:54] ERR [DataParallel] TensorFlow was not built with CUDA support!
[0428 13:15:47 @interface.py:46] Automatically applying StagingInput on the DataFlow.
[0428 13:15:47 @develop.py:96] WRN [Deprecated] ModelDescBase._get_inputs() interface will be deprecated after 30 Mar. Use inputs() instead!
[0428 13:15:47 @input_source.py:220] Setting up the queue 'QueueInput/input_queue' for CPU prefetching ...
[0428 13:15:47 @training.py:112] Building graph for training tower 0 on device /gpu:0 ...
[0428 13:15:47 @develop.py:96] WRN [Deprecated] ModelDescBase._build_graph() interface will be deprecated after 30 Mar. Use build_graph() instead!
Process _Worker-4:
Process _Worker-2:
Process _Worker-1:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/root/.local/lib/python3.6/site-packages/tensorpack/dataflow/parallel.py", line 163, in run
    for dp in self.ds:
  File "/root/.local/lib/python3.6/site-packages/tensorpack/dataflow/common.py", line 116, in __iter__
    for data in self.ds:
  File "/content/deep-voice-conversion/data_load.py", line 35, in get_data
    yield get_mfccs_and_phones(wav_file=wav_file)
  File "/content/deep-voice-conversion/data_load.py", line 76, in get_mfccs_and_phones
    hp.default.hop_length)
  File "/content/deep-voice-conversion/data_load.py", line 148, in _get_mfcc_and_spec
    mel_basis = librosa.filters.mel(hp.default.sr, hp.default.n_fft, hp.default.n_mels)  # (n_mels, 1+n_fft//2)
  File "/usr/local/lib/python3.6/dist-packages/librosa/filters.py", line 247, in mel
    lower = -ramps[i] / fdiff[i]
ValueError: operands could not be broadcast together with shapes (1,257) (0,)
(the same traceback and ValueError are printed for each of _Worker-1 through _Worker-4)
[0428 13:15:48 @develop.py:96] WRN [Deprecated] get_cost() and self.cost will be deprecated after 30 Mar. Return the cost tensor directly in build_graph() instead!
[0428 13:15:48 @develop.py:96] WRN [Deprecated] ModelDescBase._get_optimizer() interface will be deprecated after 30 Mar. Use optimizer() instead!
[0428 13:15:49 @training.py:112] Building graph for training tower 1 on device /gpu:1 ...
[0428 13:15:49 @develop.py:96] WRN [Deprecated] ModelDescBase._build_graph() interface will be deprecated after 30 Mar. Use build_graph() instead!
[0428 13:15:49 @develop.py:96] WRN [Deprecated] get_cost() and self.cost will be deprecated after 30 Mar. Return the cost tensor directly in build_graph() instead!
[0428 13:15:51 @collection.py:164] These collections were modified but restored in tower1: (tf.GraphKeys.SUMMARIES: 3->5)
[0428 13:15:52 @training.py:322] 'sync_variables_from_main_tower' includes 174 operations.
[0428 13:15:52 @model_utils.py:64] Trainable Variables:
name shape dim
net1/prenet/dense1/kernel:0 [40, 128] 5120
net1/prenet/dense1/bias:0 [128] 128
net1/prenet/dense2/kernel:0 [128, 64] 8192
net1/prenet/dense2/bias:0 [64] 64
net1/cbhg/conv1d_banks/num_1/conv1d/conv1d/kernel:0 [1, 64, 64] 4096
net1/cbhg/conv1d_banks/num_1/normalize/beta:0 [64] 64
net1/cbhg/conv1d_banks/num_1/normalize/gamma:0 [64] 64
net1/cbhg/conv1d_banks/num_2/conv1d/conv1d/kernel:0 [2, 64, 64] 8192
net1/cbhg/conv1d_banks/num_2/normalize/beta:0 [64] 64
net1/cbhg/conv1d_banks/num_2/normalize/gamma:0 [64] 64
net1/cbhg/conv1d_banks/num_3/conv1d/conv1d/kernel:0 [3, 64, 64] 12288
net1/cbhg/conv1d_banks/num_3/normalize/beta:0 [64] 64
net1/cbhg/conv1d_banks/num_3/normalize/gamma:0 [64] 64
net1/cbhg/conv1d_banks/num_4/conv1d/conv1d/kernel:0 [4, 64, 64] 16384
net1/cbhg/conv1d_banks/num_4/normalize/beta:0 [64] 64
net1/cbhg/conv1d_banks/num_4/normalize/gamma:0 [64] 64
net1/cbhg/conv1d_banks/num_5/conv1d/conv1d/kernel:0 [5, 64, 64] 20480
net1/cbhg/conv1d_banks/num_5/normalize/beta:0 [64] 64
net1/cbhg/conv1d_banks/num_5/normalize/gamma:0 [64] 64
net1/cbhg/conv1d_banks/num_6/conv1d/conv1d/kernel:0 [6, 64, 64] 24576
net1/cbhg/conv1d_banks/num_6/normalize/beta:0 [64] 64
net1/cbhg/conv1d_banks/num_6/normalize/gamma:0 [64] 64
net1/cbhg/conv1d_banks/num_7/conv1d/conv1d/kernel:0 [7, 64, 64] 28672
net1/cbhg/conv1d_banks/num_7/normalize/beta:0 [64] 64
net1/cbhg/conv1d_banks/num_7/normalize/gamma:0 [64] 64
net1/cbhg/conv1d_banks/num_8/conv1d/conv1d/kernel:0 [8, 64, 64] 32768
net1/cbhg/conv1d_banks/num_8/normalize/beta:0 [64] 64
net1/cbhg/conv1d_banks/num_8/normalize/gamma:0 [64] 64
net1/cbhg/conv1d_1/conv1d/kernel:0 [3, 512, 64] 98304
net1/cbhg/normalize/beta:0 [64] 64
net1/cbhg/normalize/gamma:0 [64] 64
net1/cbhg/conv1d_2/conv1d/kernel:0 [3, 64, 64] 12288
net1/cbhg/highwaynet_0/dense1/kernel:0 [64, 64] 4096
net1/cbhg/highwaynet_0/dense1/bias:0 [64] 64
net1/cbhg/highwaynet_0/dense2/kernel:0 [64, 64] 4096
net1/cbhg/highwaynet_0/dense2/bias:0 [64] 64
net1/cbhg/highwaynet_1/dense1/kernel:0 [64, 64] 4096
net1/cbhg/highwaynet_1/dense1/bias:0 [64] 64
net1/cbhg/highwaynet_1/dense2/kernel:0 [64, 64] 4096
net1/cbhg/highwaynet_1/dense2/bias:0 [64] 64
net1/cbhg/highwaynet_2/dense1/kernel:0 [64, 64] 4096
net1/cbhg/highwaynet_2/dense1/bias:0 [64] 64
net1/cbhg/highwaynet_2/dense2/kernel:0 [64, 64] 4096
net1/cbhg/highwaynet_2/dense2/bias:0 [64] 64
net1/cbhg/highwaynet_3/dense1/kernel:0 [64, 64] 4096
net1/cbhg/highwaynet_3/dense1/bias:0 [64] 64
net1/cbhg/highwaynet_3/dense2/kernel:0 [64, 64] 4096
net1/cbhg/highwaynet_3/dense2/bias:0 [64] 64
net1/cbhg/gru/bidirectional_rnn/fw/gru_cell/gates/kernel:0 [128, 128] 16384
net1/cbhg/gru/bidirectional_rnn/fw/gru_cell/gates/bias:0 [128] 128
net1/cbhg/gru/bidirectional_rnn/fw/gru_cell/candidate/kernel:0 [128, 64] 8192
net1/cbhg/gru/bidirectional_rnn/fw/gru_cell/candidate/bias:0 [64] 64
net1/cbhg/gru/bidirectional_rnn/bw/gru_cell/gates/kernel:0 [128, 128] 16384
net1/cbhg/gru/bidirectional_rnn/bw/gru_cell/gates/bias:0 [128] 128
net1/cbhg/gru/bidirectional_rnn/bw/gru_cell/candidate/kernel:0 [128, 64] 8192
net1/cbhg/gru/bidirectional_rnn/bw/gru_cell/candidate/bias:0 [64] 64
net1/dense/kernel:0 [128, 61] 7808
net1/dense/bias:0 [61] 61
Total #vars=58, #params=363389, size=1.39MB
[0428 13:15:52 @base.py:209] Setup callbacks graph ...
[0428 13:15:52 @summary.py:38] Maintain moving average summary of 0 tensors in collection MOVING_SUMMARY_OPS.
[0428 13:15:52 @summary.py:75] Summarizing collection 'summaries' of size 3.
[0428 13:15:53 @base.py:227] Creating the session ...
2019-04-28 13:15:53.199861: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2019-04-28 13:15:53.199907: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2019-04-28 13:15:53.199917: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2019-04-28 13:15:53.199929: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2019-04-28 13:15:53.199939: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1327, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1297, in _run_fn
    self._extend_graph()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1358, in _extend_graph
    self._session, graph_def.SerializeToString(), status)
  File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' with these attrs.
Registered devices: [CPU], Registered kernels:

  [[Node: AllReduceGrads/NcclAllReduce_105 = NcclAllReduce[T=DT_FLOAT, num_devices=2, reduction="sum", shared_name="c52", _device="/device:GPU:1"](tower1/gradients/tower1/net1/cbhg/gru/bidirectional_rnn/bw/bw/while/bw/gru_cell/gates/gates/MatMul/Enter_grad/b_acc_3)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train1.py", line 82, in <module>
    train(args, logdir=logdir_train1)
  File "train1.py", line 62, in train
    launch_train_with_config(train_conf, trainer=trainer)
  File "/root/.local/lib/python3.6/site-packages/tensorpack/train/interface.py", line 97, in launch_train_with_config
    extra_callbacks=config.extra_callbacks)
  File "/root/.local/lib/python3.6/site-packages/tensorpack/train/base.py", line 341, in train_with_defaults
    steps_per_epoch, starting_epoch, max_epoch)
  File "/root/.local/lib/python3.6/site-packages/tensorpack/train/base.py", line 312, in train
    self.initialize(session_creator, session_init)
  File "/root/.local/lib/python3.6/site-packages/tensorpack/utils/argtools.py", line 176, in wrapper
    return func(*args, **kwargs)
  File "/root/.local/lib/python3.6/site-packages/tensorpack/train/tower.py", line 144, in initialize
    super(TowerTrainer, self).initialize(session_creator, session_init)
  File "/root/.local/lib/python3.6/site-packages/tensorpack/utils/argtools.py", line 176, in wrapper
    return func(*args, **kwargs)
  File "/root/.local/lib/python3.6/site-packages/tensorpack/train/base.py", line 229, in initialize
    self.sess = session_creator.create_session()
  File "/root/.local/lib/python3.6/site-packages/tensorpack/tfutils/sesscreate.py", line 43, in create_session
    sess.run(tf.global_variables_initializer())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1124, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' with these attrs.
Registered devices: [CPU], Registered kernels:
  [[Node: AllReduceGrads/NcclAllReduce_105 = NcclAllReduce[T=DT_FLOAT, num_devices=2, reduction="sum", shared_name="c52", _device="/device:GPU:1"](tower1/gradients/tower1/net1/cbhg/gru/bidirectional_rnn/bw/bw/while/bw/gru_cell/gates/gates/MatMul/Enter_grad/b_acc_3)]]

Caused by op 'AllReduceGrads/NcclAllReduce_105', defined at:
  File "train1.py", line 82, in <module>
    train(args, logdir=logdir_train1)
  File "train1.py", line 62, in train
    launch_train_with_config(train_conf, trainer=trainer)
  File "/root/.local/lib/python3.6/site-packages/tensorpack/train/interface.py", line 87, in launch_train_with_config
    model._build_graph_get_cost, model.get_optimizer)
  File "/root/.local/lib/python3.6/site-packages/tensorpack/utils/argtools.py", line 176, in wrapper
    return func(*args, **kwargs)
  File "/root/.local/lib/python3.6/site-packages/tensorpack/train/tower.py", line 204, in setup_graph
    train_callbacks = self._setup_graph(input, get_cost_fn, get_opt_fn)
  File "/root/.local/lib/python3.6/site-packages/tensorpack/train/trainers.py", line 186, in _setup_graph
    self._make_get_grad_fn(input, get_cost_fn, get_opt_fn), get_opt_fn)
  File "/root/.local/lib/python3.6/site-packages/tensorpack/graph_builder/training.py", line 244, in build
    all_grads = allreduce_grads(all_grads, average=self._average)  # #gpu x #param
  File "/root/.local/lib/python3.6/site-packages/tensorpack/tfutils/scope_utils.py", line 94, in wrapper
    return func(*args, **kwargs)
  File "/root/.local/lib/python3.6/site-packages/tensorpack/graph_builder/utils.py", line 157, in allreduce_grads
    summed = nccl.all_sum(grads)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py", line 48, in all_sum
    return _apply_all_reduce('sum', tensors)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py", line 154, in _apply_all_reduce
    shared_name=shared_name))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/nccl/ops/gen_nccl_ops.py", line 43, in nccl_all_reduce
    shared_name=shared_name, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): No OpKernel was registered to support Op 'NcclAllReduce' with these attrs.
Registered devices: [CPU], Registered kernels:
  [[Node: AllReduceGrads/NcclAllReduce_105 = NcclAllReduce[T=DT_FLOAT, num_devices=2, reduction="sum", shared_name="c52", _device="/device:GPU:1"](tower1/gradients/tower1/net1/cbhg/gru/bidirectional_rnn/bw/bw/while/bw/gru_cell/gates/gates/MatMul/Enter_grad/b_acc_3)]]
LucasMoskun commented 5 years ago

Try running without the GPU flag. I'm fairly sure that if you are using tensorflow-gpu but aren't using multiple GPUs, the tensorflow-gpu build will automatically use the GPU you have already set up.
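For example, that would just be python train1.py case, with no -gpu 0 at the end.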

LucasMoskun commented 5 years ago

Also, I was receiving a lot of strange NCCL errors, and I don't need NCCL since I am only using one GPU. In train1.py and train2.py I've added from tensorpack.train.trainers import SimpleTrainer and then changed the line trainer = SyncMultiGPUTrainerReplicated(hp.train2.num_gpu) to trainer = SimpleTrainer().
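In code, the swap looks roughly like this (a sketch; the exact line and the hp.trainX.num_gpu name depend on which script you are editing):

```python
# train1.py / train2.py -- replace the multi-GPU trainer with SimpleTrainer
from tensorpack.train.trainers import SimpleTrainer

# old line (multi-GPU, needs NCCL for the all-reduce between towers):
# trainer = SyncMultiGPUTrainerReplicated(hp.train2.num_gpu)

# new line (single GPU or CPU, no NCCL involved):
trainer = SimpleTrainer()
```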

Also, in tensorpack 0.9.0.1, in graph_builder/utils.py, I had to change the line from tensorflow.contrib import nccl to from tensorflow.python.ops.nccl_ops import all_sum, and then summed = all_sum(grads) a few lines below. This might not be necessary depending on your TensorFlow version.
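The corresponding edit inside tensorpack's graph_builder/utils.py is roughly this (a sketch of the patch; allreduce_grads and grads belong to tensorpack, and whether you need it depends on your TensorFlow build):

```python
# tensorpack/graph_builder/utils.py, inside allreduce_grads()

# old import and call (tensorflow.contrib.nccl is missing in some TF builds):
# from tensorflow.contrib import nccl
# summed = nccl.all_sum(grads)

# replacement:
from tensorflow.python.ops.nccl_ops import all_sum

summed = all_sum(grads)  # grads: one gradient tensor per GPU tower, as built by tensorpack
```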

sallyjoy commented 5 years ago

Also, I was receiving a lot of strange NCCL errors, and I don't need NCCL since I am only using one GPU. In train1.py and train2.py I've added from tensorpack.train.trainers import SimpleTrainer and then changed the line trainer = SyncMultiGPUTrainerReplicated(hp.train2.num_gpu) to trainer = SimpleTrainer().

Also, in tensorpack 0.9.0.1, in graph_builder/utils.py, I had to change the line from tensorflow.contrib import nccl to from tensorflow.python.ops.nccl_ops import all_sum, and then summed = all_sum(grads) a few lines below. This might not be necessary depending on your TensorFlow version.


I have changed train1.py as you said and it seems to work: no error is displayed. But it is a little strange, because it starts Epoch 1 and shows this message, which never changes over time.


[0429 14:01:19 @base.py:233] Initializing the session ...
[0429 14:01:19 @base.py:240] Graph Finalized.
[0429 14:01:19 @concurrency.py:37] Starting EnqueueThread QueueInput/input_queue ...
[0429 14:01:20 @base.py:272] Start Epoch 1 ...
 0%|          |0/100[00:00<?,?it/s]
2019-04-29 14:01:21.923305: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally

sallyjoy commented 5 years ago

It starts Epoch 1, but there is no change in the progress bar; it never moves to Epoch 2, and no checkpoint is stored after about 8 hours of execution.

[0429 14:01:19 @base.py:233] Initializing the session ...
[0429 14:01:19 @base.py:240] Graph Finalized.
[0429 14:01:19 @concurrency.py:37] Starting EnqueueThread QueueInput/input_queue ...
[0429 14:01:20 @base.py:272] Start Epoch 1 ...
 0%|          |0/100[00:00<?,?it/s]
2019-04-29 14:01:21.923305: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally

syedKhutub commented 5 years ago

It starts Epoch 1, but there is no change in the progress bar; it never moves to Epoch 2, and no checkpoint is stored after about 8 hours of execution.

[0429 14:01:19 @base.py:233] Initializing the session ...
[0429 14:01:19 @base.py:240] Graph Finalized.
[0429 14:01:19 @concurrency.py:37] Starting EnqueueThread QueueInput/input_queue ...
[0429 14:01:20 @base.py:272] Start Epoch 1 ...
 0%|          |0/100[00:00<?,?it/s]
2019-04-29 14:01:21.923305: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally

I am also facing the same issue. Please reply if you have been able to solve it.

Muhammad-MujtabaSaeed commented 5 years ago

Hi guys, after following your solution for this problem I was also stuck at the same point, but then I was able to figure out the fix and start the training. The issue is with the .yaml files, where the data path starts with '/data/private/..'; just change it to './data/private/....' by editing those files in the hparams folder. Then, if an issue arises with librosa, update librosa to 0.6.2 and it will start working. Maybe this could help you guys.
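Concretely (a sketch only; the exact key names inside the hparams .yaml files may differ): wherever a path reads '/data/private/vc/...', change it to './data/private/vc/...' so it resolves relative to the repo, and pin librosa with pip install librosa==0.6.2 before rerunning train1.py.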

syedKhutub commented 5 years ago

@Muhammad-MujtabaSaeed I have applied those changes but I am still stuck on this issue. It would be nice if you could share the changes you made so that I can cross-check them.

sallyjoy commented 5 years ago

@Muhammad-MujtabaSaeed I have applied those changes but I am still stuck on this issue. It would be nice if you could share the changes you made so that I can cross-check them.

Still stuck. It doesn't change anything for me either.

mattpeng3 commented 5 years ago

Anyone figure this out?^

sinKettu commented 5 years ago

Is there any progress with this issue? Stuck too.

yifanliuu commented 4 years ago

After installing tensorpack 0.9.0.1, another error comes out:

Traceback (most recent call last):
  File "train1.py", line 78, in <module>
    train(args, logdir=logdir_train1)
  File "train1.py", line 60, in train
    launch_train_with_config(train_conf, trainer=trainer)
  File "/usr/local/lib/python2.7/dist-packages/tensorpack/train/interface.py", line 90, in launch_train_with_config
    model.get_input_signature(), input,
  File "/usr/local/lib/python2.7/dist-packages/tensorpack/utils/argtools.py", line 200, in wrapper
    value = func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorpack/graph_builder/model_desc.py", line 90, in get_input_signature
    inputs = self.inputs()
  File "/usr/local/lib/python2.7/dist-packages/tensorpack/graph_builder/model_desc.py", line 122, in inputs
    raise NotImplementedError()
NotImplementedError

I'm not sure whether there is some problem with my tensorflow version. Has anyone met the same problem? Still stuck...

neil3212080 commented 4 years ago

Is there any progress on this issue? I'm stuck too.

Hello, I am also experiencing the same problem now. Did you solve it later?