Closed wdon021 closed 3 years ago
Hi,
which version of Chainer are you using? It looks like your version is too new!
If you make sure that you are using the version specified in requirements.txt, you should not have any problems with some arrays still residing on the CPU while others are already on the GPU.
If you cannot use the old version, you'll have to adapt the code to the way newer versions of Chainer specify where arrays should live.
If you do so, I would be happy to review a PR :wink:
Thank you @Bartzi for your reply, really appreciated.
I ran into another error once I changed the Chainer version with !pip install chainer==3.2.0:
RuntimeError Traceback (most recent call last)
in ()
    200 # use the MultiProcessParallelUpdater in order to harness the full power of data parallel computation
    201 # updater = MultiprocessParallelUpdater(train_iterators, optimizer, devices=gpus)
--> 202 updater = StandardUpdater(train_iterators, optimizer, device=0)
    203 log_dir = os.path.join(log_dir, "{}_{}".format(datetime.datetime.now().isoformat(), log_name))
    204 log_dir = log_dir

3 frames
/usr/local/lib/python3.6/dist-packages/chainer/training/updater.py in __init__(self, iterator, optimizer, converter, device, loss_func)
    144 if device is not None and device >= 0:
    145     for optimizer in six.itervalues(self._optimizers):
--> 146         optimizer.target.to_gpu(device)
    147
    148 self.converter = converter

/usr/local/lib/python3.6/dist-packages/chainer/link.py in to_gpu(self, device)
    727
    728 def to_gpu(self, device=None):
--> 729     with cuda._get_device(device):
    730         super(Chain, self).to_gpu()
    731         d = self.__dict__

/usr/local/lib/python3.6/dist-packages/chainer/cuda.py in _get_device(*args)
    217 for arg in args:
    218     if type(arg) in _integer_types:
--> 219         check_cuda_available()
    220         return Device(arg)
    221     if isinstance(arg, ndarray):

/usr/local/lib/python3.6/dist-packages/chainer/cuda.py in check_cuda_available()
     78 '(see https://github.com/chainer/chainer#installation).')
     79 msg += str(_resolution_error)
---> 80 raise RuntimeError(msg)
     81 if (not cudnn_enabled and
     82 not _cudnn_disabled_by_user and

RuntimeError: CUDA environment is not correctly set up (see https://github.com/chainer/chainer#installation).cannot import name 'sqrt_fixed'
I tried re-installing CUDA at version 8.0:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61
But the error remained. Thank you again for helping me out here.
Oh, I forgot to add: you should also install cupy for it to work. I suggest you install it with !pip install cupy==3.2.0
Chainer as such only delivers the computing logic, but everything that has to do with CUDA is handled by cupy, which is a drop-in replacement for numpy.
Please also check the Chainer documentation about using GPUs.
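As a quick sanity check before training, the installed versions and CUDA visibility can be printed from the notebook. This is only a minimal sketch: the helper name describe_gpu_stack is made up for illustration, and it assumes that chainer.cuda exposes an `available` flag, which may differ between Chainer releases (hence the guarded lookup).

```python
import importlib


def describe_gpu_stack():
    """Report installed Chainer/CuPy versions and whether Chainer sees CUDA.

    Hypothetical helper, not part of Chainer or of this repository.
    """
    report = {"chainer": None, "cupy": None, "cuda_available": False}
    for name in ("chainer", "cupy"):
        try:
            module = importlib.import_module(name)
        except Exception:
            # Not installed, or installed but unable to load its CUDA libraries.
            continue
        report[name] = getattr(module, "__version__", "unknown")
    if report["chainer"] is not None:
        try:
            cuda = importlib.import_module("chainer.cuda")
            # Assumption: the module exposes a boolean `available` attribute.
            report["cuda_available"] = bool(getattr(cuda, "available", False))
        except Exception:
            pass
    return report


print(describe_gpu_stack())
```

If cupy is reported as None while chainer has a version, the CUDA error above is the expected outcome, since Chainer relies on cupy for all GPU work.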
Hi @Bartzi Thank you for the reply.
I have tried !pip install cupy==2.2.0 as well.
I am currently going through every input and printing its type to see if I can locate the problem.
Thanks for your time.
From what I observe, the error happens at the beginning of the second epoch, at the start of the localization network, in the first convolution layer.
It looks like a <class 'numpy.ndarray'> is carried over from somewhere, or from epoch 1.
I fixed it by adding '.to_device(0)' to the model (which is the Classifier). My assumption is that the parameters (weights) were somehow not on the GPU at the end of the epoch, and were then passed on to the next epoch's convolution.
epoch_evaluator = (
    chainer.training.extensions.Evaluator(
        epoch_validation_iterator,
        model.to_device(0),
        device=updater.device,
        converter=concat_and_pad_examples,
    ),
    (1, 'epoch')
)
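To confirm which arrays are still on the CPU, the type check described above can be automated. The following is a rough sketch, not code from this repository: find_host_arrays is a hypothetical helper, and it assumes that host arrays come from the numpy package while GPU arrays come from cupy.

```python
def find_host_arrays(named_arrays):
    """Return the names of arrays whose type is defined in numpy (CPU memory).

    `named_arrays` is any iterable of (name, array) pairs.
    Hypothetical helper for debugging mixed CPU/GPU parameters.
    """
    return [
        name
        for name, array in named_arrays
        # numpy.ndarray lives in the "numpy" package, cupy arrays in "cupy".
        if type(array).__module__.split(".")[0] == "numpy"
    ]


# Possible usage with a Chainer model (assumes a variable called `model`):
# print(find_host_arrays((n, p.data) for n, p in model.namedparams()))
```

An empty list after moving the model to the GPU would suggest that all parameters were transferred; any names that remain point at the links that were left behind.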
Now I have a second problem: the model keeps running after the 1st epoch, but I no longer see the following information. Do you know what went wrong?
total [##................................................] 4.50%
this epoch [#############################################.....] 90.07%
301 iter, 0 epoch / 20 epochs
0.42648 iters/sec. Estimated time to finish: 4:09:25.048276.
Thank you. (I am slowly picking up Chainer; I mostly used Keras before. I am also slowly adapting this to the newer versions of Chainer, CuPy, and CUDA, as I didn't downgrade any of the packages.)
The progress bar needs to be enabled explicitly by adding progress_bar=True to the epoch_evaluator.
But now the validation process keeps re-iterating. Do you know how to turn this off?
validation [##########...##########] 528.85%
506 / 1754 iterations
-10.788 iters/sec. Estimated time to finish: -1 day, 23:54:27.815703.
starts classifier ======================
Localization got to the start point===================
Localization got to first conv===================
Localization got to the end point===================
fsns resnet got to the end point===================
into the loss=================================
into the accuracy=================================
classifier got to the end point===================
validation [##########...##########] 529.53%
518 / 1754 iterations
-10.779 iters/sec. Estimated time to finish: -1 day, 23:54:26.993988.
starts classifier ======================
Localization got to the start point===================
Localization got to first conv===================
Localization got to the end point===================
fsns resnet got to the end point===================
into the loss=================================
into the accuracy=================================
classifier got to the end point===================
validation [##########...##########] 530.22%
530 / 1754 iterations
thanks,
I see: if I change (1, 'epoch') to (10, 'epoch'), it will stop doing validation every epoch. Is my understanding correct?
Hmm, the validation iterator does not stop.
This is most probably because repeat is not set to False (see for example this line).
It should work if you set repeat=False when building the validation iterator.
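The effect of repeat can be illustrated without Chainer. The sketch below uses a plain Python stand-in (iterate_batches and count_batches are hypothetical names, not Chainer APIs) to show why a validation pass over a repeating iterator never ends on its own. With Chainer itself, the corresponding setting would be something like chainer.iterators.SerialIterator(dataset, batch_size, repeat=False, shuffle=False), assuming a serial iterator is used for validation.

```python
def iterate_batches(dataset, batch_size, repeat):
    """Yield batches; with repeat=True, wrap around forever like a training iterator."""
    while True:
        for start in range(0, len(dataset), batch_size):
            yield dataset[start:start + batch_size]
        if not repeat:
            # One pass over the data is enough for validation.
            return


def count_batches(dataset, batch_size, repeat, limit=100):
    """Count the batches one validation pass consumes, giving up after `limit`."""
    count = 0
    for _ in iterate_batches(dataset, batch_size, repeat):
        count += 1
        if count >= limit:
            break
    return count


data = list(range(10))
print(count_batches(data, 4, repeat=False))  # 3 batches, then the pass ends
print(count_batches(data, 4, repeat=True))   # only stops at the limit of 100
```

This matches the symptom in the log above: the validation counter keeps climbing past 100% because the iterator never signals the end of the dataset.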
@Bartzi I see, thank you for your time, really appreciated!
Hi Christian,
I have encountered this error at the end of an epoch of training (99.75%). I am using Google Colab for the training.
Some online resources suggested adding to_gpu() at the end of every Convolution2DFunction, but that didn't work either. Can you please help me with it? Thank you.