Thanks for using our library!
Actually, I have developed something similar to what you have suggested. https://github.com/yuyu2172/image-labelling-tool/tree/master/examples/ssd#example
This example contains a script to train SSD (another object detector) on a dataset annotated by an annotation tool. Afterwards, it should be straightforward to train on your own annotated dataset (just swap the dataset path).
Cool, I will give it a try to see how easy it is. Also, I am guessing that if it works easily, it would not be that hard to swap in any of the other chainercv examples, SegNet and Faster R-CNN. Would love to see Mask R-CNN here too ;)
Hello. I tried the sample apple-orange train.py (Ubuntu 14.04, chainer 2.0.2, chainercv 0.6.0, anaconda3, Python 3.5.3). Nothing happened for a long time, even after reducing --val_iteration to 1. I gave up, and hitting Ctrl-C showed the following output. Thanks!
Traceback (most recent call last):
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
cnt, mem_index, index = in_queue.get()
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/multiprocessing/queues.py", line 93, in get
with self._rlock:
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
Process Process-3:
Traceback (most recent call last):
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
cnt, mem_index, index = in_queue.get()
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/multiprocessing/queues.py", line 93, in get
with self._rlock:
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
Process Process-8:
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
cnt, mem_index, index = in_queue.get()
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/multiprocessing/queues.py", line 94, in get
res = self._recv_bytes()
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
KeyboardInterrupt
Traceback (most recent call last):
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
cnt, mem_index, index = in_queue.get()
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/multiprocessing/queues.py", line 93, in get
with self._rlock:
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
cnt, mem_index, index = in_queue.get()
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/multiprocessing/queues.py", line 93, in get
with self._rlock:
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
cnt, mem_index, index = in_queue.get()
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/multiprocessing/queues.py", line 93, in get
with self._rlock:
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
^CTraceback (most recent call last):
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/site-packages/chainer/training/trainer.py", line 299, in run
entry.extension(self)
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/site-packages/chainer/training/extensions/_snapshot.py", line 39, in snapshot_object
_snapshot_object(trainer, target, filename.format(trainer), savefun)
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/site-packages/chainer/training/extensions/_snapshot.py", line 86, in _snapshot_object
savefun(tmppath, target)
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/site-packages/chainer/serializers/npz.py", line 70, in save_npz
numpy.savez_compressed(f, **s.target)
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/site-packages/numpy/lib/npyio.py", line 657, in savez_compressed
_savez(file, args, kwds, True)
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/site-packages/numpy/lib/npyio.py", line 706, in _savez
zipf.write(tmpfile, arcname=fname)
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/zipfile.py", line 1499, in write
buf = cmpr.compress(buf)
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 70, in <module>
args.resume)
File "/home/owner/chainer/labelling_tool/examples/ssd/train_utils.py", line 167, in train
trainer.run()
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/site-packages/chainer/training/trainer.py", line 305, in run
self.updater.finalize()
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/site-packages/chainer/training/updater.py", line 178, in finalize
iterator.finalize()
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/site-packages/chainer/iterators/multiprocess_iterator.py", line 105, in finalize
worker.join()
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/multiprocessing/process.py", line 121, in join
res = self._popen.wait(timeout)
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/multiprocessing/popen_fork.py", line 51, in wait
return self.poll(os.WNOHANG if timeout == 0.0 else 0)
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/multiprocessing/popen_fork.py", line 29, in poll
pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
Exception ignored in: <bound method ZipFile.__del__ of <zipfile.ZipFile [closed]>>
Traceback (most recent call last):
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/zipfile.py", line 1597, in __del__
self.close()
File "/home/owner/anaconda3/envs/chainercv/lib/python3.5/zipfile.py", line 1609, in close
self.fp.seek(self.start_dir)
ValueError: seek of closed file
I also ran into issues:
PS C:\dev\chainer\meter> docker run --rm -it -v C:/dev/chainer/meter/data/apple_orange_annotations/apple_orange_annotations:/data image-labelling-tool /bin/bash
chainer@74b47c9ba1a5:/src/image-labelling-tool/examples/ssd$ python train.py --train /data --label_names /data/apple_orange_label_names.yml --val_iteration 100
Downloading ...
From: https://github.com/yuyu2172/share-weights/releases/download/0.0.3/ssd300_voc0712_2017_06_06.npz
To: /home/chainer/.chainer/dataset/_dl_cache/39be0d7ab69efbab2387ccbf556b7e8d
% Total Recv Speed Time left
100 93MiB 93MiB 8281KiB/s 0:00:00Killed
chainer@74b47c9ba1a5:/src/image-labelling-tool/examples/ssd$ python train.py --train /data --label_names /data/apple_orange_label_names.yml --val_iteration 100
^C^CTraceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/chainer/training/trainer.py", line 296, in run
update()
File "/opt/conda/lib/python3.6/site-packages/chainer/training/updater.py", line 223, in update
self.update_core()
File "/opt/conda/lib/python3.6/site-packages/chainer/training/updater.py", line 234, in update_core
optimizer.update(loss_func, *in_arrays)
File "/opt/conda/lib/python3.6/site-packages/chainer/optimizer.py", line 527, in update
loss = lossfun(*args, **kwds)
File "/src/image-labelling-tool/examples/ssd/train_utils.py", line 33, in __call__
mb_locs, mb_confs = self.model(imgs)
File "/opt/conda/lib/python3.6/site-packages/chainercv/links/model/ssd/ssd.py", line 130, in __call__
return self.multibox(self.extractor(x))
File "/opt/conda/lib/python3.6/site-packages/chainercv/links/model/ssd/ssd_vgg16.py", line 153, in __call__
ys = super(VGG16Extractor300, self).__call__(x)
File "/opt/conda/lib/python3.6/site-packages/chainercv/links/model/ssd/ssd_vgg16.py", line 82, in __call__
h = F.relu(self.conv2_2(h))
File "/opt/conda/lib/python3.6/site-packages/chainer/links/connection/convolution_2d.py", line 154, in __call__
x, self.W, self.b, self.stride, self.pad)
File "/opt/conda/lib/python3.6/site-packages/chainer/functions/connection/convolution_2d.py", line 439, in convolution_2d
return func(x, W, b)
File "/opt/conda/lib/python3.6/site-packages/chainer/function.py", line 200, in __call__
outputs = self.forward(in_data)
File "/opt/conda/lib/python3.6/site-packages/chainer/function.py", line 329, in forward
return self.forward_cpu(inputs)
File "/opt/conda/lib/python3.6/site-packages/chainer/functions/connection/convolution_2d.py", line 89, in forward_cpu
self.col, W, ((1, 2, 3), (1, 2, 3))).astype(x.dtype, copy=False)
File "/opt/conda/lib/python3.6/site-packages/numpy/core/numeric.py", line 1337, in tensordot
at = a.transpose(newaxes_a).reshape(newshape_a)
^C
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 70, in <module>
args.resume)
File "/src/image-labelling-tool/examples/ssd/train_utils.py", line 167, in train
trainer.run()
File "/opt/conda/lib/python3.6/site-packages/chainer/training/trainer.py", line 305, in run
self.updater.finalize()
File "/opt/conda/lib/python3.6/site-packages/chainer/training/updater.py", line 178, in finalize
iterator.finalize()
File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 105, in finalize
worker.join()
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 121, in join
res = self._popen.wait(timeout)
File "/opt/conda/lib/python3.6/multiprocessing/popen_fork.py", line 51, in wait
return self.poll(os.WNOHANG if timeout == 0.0 else 0)
File "/opt/conda/lib/python3.6/multiprocessing/popen_fork.py", line 29, in poll
pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
chainer@74b47c9ba1a5:/src/image-labelling-tool/examples/ssd$
This is the setup I used:
####################Python 3#########################
ARG python_version=3.5.2
RUN conda install -y python=${python_version} && \
pip install -U numpy && \
pip install chainercv && \
pip install opencv-python && \
pip install pyyaml && \
conda clean -yt
My best guess is that it's due to running out of memory in my case.
Thank you for commenting. (1) Regarding installation, there is no mention of chainer itself or where it falls in the install order. Also, cupy had to be installed. (2) After installing the same setup plus chainer and cupy, I again got no response. My machine also has an NVIDIA GTX 1080Ti and 32GB RAM, so memory should not be the issue.
If nothing happens in training, please check #386. https://github.com/chainer/chainercv/issues/386#issuecomment-321485827
I also got it running now that I have a machine with more RAM, but same experience as above: it does not look like anything is happening.
Also, being new to chainer: what does one do after training (assuming I get it working)? Is the new model file stored somewhere so you can use it for prediction?
So, the possible reasons for nothing happening in my case are the following?
1: RAM is too small. Reducing the batchsize from 8 to 2 did not help.
2: An OpenCV issue. Uninstalling opencv-python and installing cv3 with "conda install -c https://conda.anaconda.org/menpo opencv3" did not change it.
3: MultiprocessIterator has a bug.
I solved the RAM issue; it needs around 8GB. I tried compiling OpenCV 3.3 myself, but that did not solve it.
Here's the output when interrupting it, if it helps anyone.
chainer@5a84acfb0706:/src/image-labelling-tool/examples/ssd$ python train.py --train /data --label_names /data/apple_orange_label_names.yml --out /data/logs --val_iteration 1
^CProcess Process-2:
Process Process-9:
Process Process-1:
Process Process-5:
Process Process-3:
Process Process-4:
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
cnt, mem_index, index = in_queue.get()
File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
with self._rlock:
File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
cnt, mem_index, index = in_queue.get()
File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 94, in get
res = self._recv_bytes()
File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
KeyboardInterrupt
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
cnt, mem_index, index = in_queue.get()
File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
with self._rlock:
File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
cnt, mem_index, index = in_queue.get()
File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
with self._rlock:
File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
cnt, mem_index, index = in_queue.get()
File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
with self._rlock:
File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
Traceback (most recent call last):
KeyboardInterrupt
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
cnt, mem_index, index = in_queue.get()
File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
with self._rlock:
File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
Process Process-10:
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
cnt, mem_index, index = in_queue.get()
File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
with self._rlock:
File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
Process Process-11:
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
cnt, mem_index, index = in_queue.get()
File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
with self._rlock:
File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
Process Process-12:
Process Process-13:
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
cnt, mem_index, index = in_queue.get()
File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
with self._rlock:
File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
cnt, mem_index, index = in_queue.get()
File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
with self._rlock:
File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
Process Process-14:
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
cnt, mem_index, index = in_queue.get()
File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
with self._rlock:
File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
Process Process-17:
Process Process-20:
Process Process-19:
Process Process-18:
Process Process-15:
Process Process-16:
Process Process-8:
Process Process-6:
Traceback (most recent call last):
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
cnt, mem_index, index = in_queue.get()
File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
with self._rlock:
File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
Process Process-7:
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
cnt, mem_index, index = in_queue.get()
File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
with self._rlock:
File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
cnt, mem_index, index = in_queue.get()
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
cnt, mem_index, index = in_queue.get()
File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
with self._rlock:
File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
cnt, mem_index, index = in_queue.get()
File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
with self._rlock:
File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
cnt, mem_index, index = in_queue.get()
File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
with self._rlock:
File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
with self._rlock:
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
cnt, mem_index, index = in_queue.get()
File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
KeyboardInterrupt
File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
with self._rlock:
File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
cnt, mem_index, index = in_queue.get()
File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
with self._rlock:
File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
KeyboardInterrupt
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.6/site-packages/chainer/iterators/multiprocess_iterator.py", line 386, in _worker
cnt, mem_index, index = in_queue.get()
File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 93, in get
with self._rlock:
File "/opt/conda/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
File "train.py", line 70, in <module>
args.resume)
File "/src/image-labelling-tool/examples/ssd/train_utils.py", line 167, in train
trainer.run()
File "/opt/conda/lib/python3.6/site-packages/chainer/training/trainer.py", line 299, in run
entry.extension(self)
File "/opt/conda/lib/python3.6/site-packages/chainer/training/extensions/evaluator.py", line 137, in __call__
result = self.evaluate()
File "/opt/conda/lib/python3.6/site-packages/chainercv/extensions/evaluator/detection_voc_evaluator.py", line 76, in evaluate
target.predict, it)
File "/opt/conda/lib/python3.6/site-packages/chainercv/utils/iterator/apply_prediction_to_iterator.py", line 93, in apply_prediction_to_iterator
_apply(predict, iterator, hook))
File "/opt/conda/lib/python3.6/site-packages/chainercv/utils/iterator/unzip.py", line 87, in unzip
values = next(iterator)
File "/opt/conda/lib/python3.6/site-packages/chainercv/utils/iterator/apply_prediction_to_iterator.py", line 129, in _apply
pred_values = predict(imgs)
File "/opt/conda/lib/python3.6/site-packages/chainercv/links/model/ssd/ssd.py", line 204, in predict
mb_locs, mb_confs = self(x)
File "/opt/conda/lib/python3.6/site-packages/chainercv/links/model/ssd/ssd.py", line 130, in __call__
return self.multibox(self.extractor(x))
File "/opt/conda/lib/python3.6/site-packages/chainercv/links/model/ssd/ssd_vgg16.py", line 153, in __call__
ys = super(VGG16Extractor300, self).__call__(x)
File "/opt/conda/lib/python3.6/site-packages/chainercv/links/model/ssd/ssd_vgg16.py", line 91, in __call__
h = F.relu(self.conv4_2(h))
File "/opt/conda/lib/python3.6/site-packages/chainer/links/connection/convolution_2d.py", line 154, in __call__
x, self.W, self.b, self.stride, self.pad)
File "/opt/conda/lib/python3.6/site-packages/chainer/functions/connection/convolution_2d.py", line 439, in convolution_2d
return func(x, W, b)
File "/opt/conda/lib/python3.6/site-packages/chainer/function.py", line 200, in __call__
outputs = self.forward(in_data)
File "/opt/conda/lib/python3.6/site-packages/chainer/function.py", line 329, in forward
return self.forward_cpu(inputs)
File "/opt/conda/lib/python3.6/site-packages/chainer/functions/connection/convolution_2d.py", line 89, in forward_cpu
self.col, W, ((1, 2, 3), (1, 2, 3))).astype(x.dtype, copy=False)
File "/opt/conda/lib/python3.6/site-packages/numpy/core/numeric.py", line 1337, in tensordot
at = a.transpose(newaxes_a).reshape(newshape_a)
KeyboardInterrupt
@HIN0209 @pksorensen
Thanks for trying my script.
I looked at your command line outputs, and it seems that the script is working as expected.
It does not produce any output because it has not reached a checkpoint yet.
The default options are optimized for a setting with a GPU.
It would be difficult to reach the checkpoint on a CPU in a reasonable time.
I added an option --log_iteration to the script. This sets the number of iterations between log reports. Please try it again with the --log_iteration 1 and --batchsize 1 options.
@HIN0209
Looking at your output, the script is saving a model to a file. By setting --val_iteration to 1, the model is saved every iteration.
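For the "what does one do after training" question above: a minimal sketch of loading such a snapshot for prediction with ChainerCV (the snapshot name, image path, and label list are illustrative, not from this thread):

import chainer
from chainercv.links import SSD300
from chainercv.utils import read_image

label_names = ('apple', 'orange')  # illustrative; take these from your label_names YAML

# Build the model with the same number of foreground classes used in
# training, then load the snapshot the trainer wrote (e.g. model_iter_400).
model = SSD300(n_fg_class=len(label_names))
chainer.serializers.load_npz('result/model_iter_400', model)

# predict() takes a list of CHW, RGB, float32 images and returns lists of
# bounding boxes, label indices and scores, one entry per input image.
img = read_image('example.jpg')
bboxes, labels, scores = model.predict([img])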
Cool, thanks for updating us @yuyu2172
What training time should we expect on a GPU?
I don't have any NVIDIA graphics cards (assuming it doesn't work with Radeon?), but I do have 24 CPU cores.
I used the updated code and it is working! Thank you for the quick work. I will finish training and then do the rest (demo, etc.). FYI: 11.2 iters/sec training speed with an NVIDIA GTX 1080Ti on the provided apple-orange set (batchsize=1).
"model_iter_400" would be nice to be provided in the package for other people who just want the demo.
Also, a nice document like this tensorflow object-detection API would be really appreciated. https://medium.com/towards-data-science/how-to-train-your-own-object-detector-with-tensorflows-object-detector-api-bec72ecfe1d9
If @yuyu2172 is not much for writing, i can write up a post about this whenever I have gathered enough demo matrial :)
@yuyu2172 It works with --batchsize 1, and this means there is a bug in the multiprocessing part.
I tried building OpenCV with TPP instead, as suggested in the other issues, but it's all the same. If anyone finds out why it fails with multiple processes, that would be nice.
chainer@1f96ee0f2adc:/src/image-labelling-tool/examples/ssd$ python train.py --train /data --label_names /data/apple_orange_label_names.yml --out /data/logs --val_iteration 100 --log_iteration 1 --batchsize 2
epoch       iteration   lr          main/loss   main/loss/loc  main/loss/conf  validation/main/map
0           1           0.0001      14.9106     2.15745        12.7531
0           2           0.0001      11.9335     1.71874        10.2148
0           3           0.0001      10.9854     1.35317        9.63226
0           4           0.0001      10.7581     1.18375        9.57433
0           5           0.0001      10.1219     1.07298        9.04888
0           6           0.0001      10.6807     1.71468        8.966
0           7           0.0001      10.4193     1.95761        8.46173
It works
Thanks guys for the nice feedback :)
"model_iter_400" would be nice to be provided in the package for other people who just want the demo.
OK. I will work on it.
> What training time should we expect on a GPU?
The speed HIN0209 posted is reasonable. For training a detector on a more complex dataset like VOC, it takes about a day and a half with a Titan X. I am talking about the script here: https://github.com/chainer/chainercv/tree/master/examples/ssd. It is important to note that the data augmentation used in these scripts is quite intensive.
> If @yuyu2172 is not much for writing, I can write up a post about this whenever I have gathered enough demo material :)

Big +1 for this! The current README is just a raw list of features and commands. It would be great to have a write-up post.
Cool, I estimate that I will spend 1 to 2 weeks playing around, and at the end of it I will do a small tutorial/write-up about it for others to reproduce.
When running with 20 CPUs, the output is the following:
epoch iteration lr main/loss main/loss/loc main/loss/conf validation/main/map
0 1 0.0001 12.6748 1.21717 11.4576
0 2 0.0001 12.1243 1.07264 11.0517
0 3 0.0001 11.0204 1.13749 9.88291
0 4 0.0001 11.2276 1.83673 9.39091
0 5 0.0001 10.5403 1.55973 8.98054
0 6 0.0001 10.2046 1.22746 8.97713
0 7 0.0001 9.44188 1.12225 8.31962
0 8 0.0001 9.34091 1.43901 7.9019
0 9 0.0001 8.64663 0.897129 7.7495
0 10 0.0001 8.61286 1.02696 7.5859
0 11 0.0001 8.40919 1.08716 7.32203
0 12 0.0001 8.05794 1.05238 7.00556
0 13 0.0001 7.55833 0.9543 6.60403
0 14 0.0001 7.73507 1.23755 6.49752
0 15 0.0001 7.43038 1.11817 6.31221
0 16 0.0001 7.78986 1.65317 6.13669
0 17 0.0001 6.92077 0.992295 5.92847
0 18 0.0001 6.85563 0.7304 6.12523
0 19 0.0001 6.82733 1.04058 5.78675
0 20 0.0001 6.07858 0.752546 5.32603
1 21 0.0001 6.3543 1.1031 5.2512
1 22 0.0001 6.02154 0.749306 5.27223
1 23 0.0001 6.05218 0.767107 5.28507
1 24 0.0001 6.41004 1.26777 5.14227
1 25 0.0001 5.5617 0.855585 4.70612
1 26 0.0001 5.52922 0.662999 4.86622
1 27 0.0001 6.09916 1.37244 4.72672
1 28 0.0001 5.50713 1.14605 4.36108
1 29 0.0001 5.18924 0.803506 4.38573
1 30 0.0001 4.88892 0.784509 4.10441
1 31 0.0001 4.79013 0.793774 3.99636
1 32 0.0001 5.08947 0.997038 4.09243
1 33 0.0001 4.88988 0.652114 4.23777
1 34 0.0001 5.36768 1.01028 4.3574
1 35 0.0001 4.78902 0.847532 3.94149
1 36 0.0001 4.42266 0.711169 3.71149
1 37 0.0001 4.63285 0.76253 3.87032
1 38 0.0001 4.87928 0.860469 4.01881
1 39 0.0001 4.65221 0.893757 3.75845
1 40 0.0001 5.23696 1.0522 4.18475
1 41 0.0001 4.69475 0.789808 3.90494
2 42 0.0001 4.77091 1.02642 3.74449
2 43 0.0001 4.79025 1.18098 3.60927
2 44 0.0001 3.99751 0.624271 3.37324
2 45 0.0001 4.3825 0.8421 3.5404
2 46 0.0001 4.15634 0.418492 3.73785
2 47 0.0001 4.30043 0.770104 3.53032
2 48 0.0001 4.29414 0.724917 3.56922
2 49 0.0001 4.43415 0.96855 3.4656
2 50 0.0001 4.40443 1.01633 3.3881
2 51 0.0001 3.97535 0.804308 3.17104
2 52 0.0001 4.09503 0.681377 3.41366
2 53 0.0001 4.09407 0.782554 3.31151
2 54 0.0001 4.2508 0.89875 3.35205
2 55 0.0001 3.9467 0.668529 3.27817
2 56 0.0001 4.00805 0.87974 3.12831
2 57 0.0001 4.66475 1.39154 3.27321
Do you have a view on whether it is better to run with fewer cores and get faster epochs, or whether the above adds up to the same work as 60 epochs using only one CPU?
I ask because there are only two physical CPUs in the machine, and with Docker this gives 24 virtual CPUs. So when using 20 as in the above setting, some resources are shared, and I am wondering if it is better to run with fewer cores. If so, is it possible to add a command-line argument for setting the number of processes in the multiprocessing setup?
Also, just saw this in my PowerShell window:
3 79 0.0001 3.53066 0.596175 2.93449
3 80 0.0001 3.5481 0.603821 2.94428
^[[B total [..................................................] 0.07%
this epoch [#########################################.........] 83.23%
80 iter, 3 epoch / 120000 iterations
0.016281 iters/sec. Estimated time to finish: 85 days, 5:57:27.408362.
Not sure how/when it started showing, but hehe, 85 days for my hardware setup :D I need to spin up a VM in the cloud instead :)
> If so, is it possible to add a command-line argument for setting the number of processes in the multiprocessing setup?

You mean the number of processes to launch for MultiprocessIterator? Yeah, it can be configured. I just added an option --loaderjob INTEGER to do that.
I cannot say much about the number of virtual cores. FYI, the bulk of the time is spent calling NumPy matrix multiplication functions.
85 days is a crazy number, but for small datasets like this you do not need to train for 12000 iterations. Usually, a model converges much faster than that.
It is nice to hear that OpenCV worked fine with multiprocessing. It did not work in one of my environments. It is frustrating, and I hope to find a solution to this problem.
I just changed the code to stop using multiprocessing when --loaderjob is 0.
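For reference, a minimal sketch of what such a switch can look like with Chainer's built-in iterators (the function name is mine, not the script's):

from chainer import iterators

def make_iterator(dataset, batchsize, loaderjob):
    # loaderjob > 0: load batches in parallel with worker processes.
    if loaderjob and loaderjob > 0:
        return iterators.MultiprocessIterator(
            dataset, batchsize, n_processes=loaderjob)
    # loaderjob == 0: plain single-process loading, no multiprocessing.
    return iterators.SerialIterator(dataset, batchsize)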
I just tried with --batchsize 1, and it takes 1/10 of the time compared to when I used 20 processes.
I will rebuild and try with loaderjob = 2 (the number of physical cores) and see if I can get it down to 1/20 of the time, i.e. 4 days on CPU.
And if you say 12000 instead of the default 120000, then it's less than one day to train a simple model like this on standard hardware, which is okay for starters.
It will be fun trying with a GPU on Azure soon as well.
@yuyu2172 Thanks for all the help btw.
Also, do you know if it's possible to specify where chainer stores the precomputed model that it downloads in the beginning? (I would like to tell it to store it in one of the mounted volumes in Docker instead of inside the Docker container.)
I just finished training and demo, which both worked fine on the apple-orange dataset. I will then move on to my own dataset to see how it goes.
The README.md for "Demo code to visualize output" has a typo: "python demo.py --pratrained_model result/model_iter_400" should say pretrained instead of pratrained, just FYI.
> Also, do you know if it's possible to specify where chainer stores the precomputed model that it downloads in the beginning? (I would like to tell it to store it in one of the mounted volumes in Docker instead of inside the Docker container.)

You can do this with the following command:
export CHAINER_DATASET_ROOT=/mnt/
More details can be found here: https://docs.chainer.org/en/stable/reference/core/dataset.html#dataset-abstraction
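If you prefer to set it from Python instead of the environment, a small sketch (the path is illustrative):

import chainer

# Downloaded weights and datasets are cached under this directory.
chainer.dataset.set_dataset_root('/mnt/')
print(chainer.dataset.get_dataset_root())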
> The README.md for "Demo code to visualize output" has a typo: "python demo.py --pratrained_model result/model_iter_400" should say pretrained instead of pratrained, just FYI.
Thank you!
> And if you say 12000 instead of the default 120000, then it's less than one day to train a simple model like this on standard hardware, which is okay for starters.

Sorry, that was a typo. I wanted to say that 120000 is overkill.
Yeah, I noticed it the second time I read it :)
I just tried the loaderjob option, and it turns out it was the reduced batchsize that reduced the compute time:
100 93MiB 93MiB 9243KiB/s 0:00:00
epoch       iteration   lr          main/loss   main/loss/loc  main/loss/conf  validation/main/map
0           1           0.0001      12.8458     1.24489        11.601
0           2           0.0001      11.4097     1.25192        10.1578
0           3           0.0001      11.5204     1.40347        10.117
0           4           0.0001      11.1528     0.975648       10.1772
0           5           0.0001      11.0459     1.58965        9.45627
0           6           0.0001      9.79693     1.28163        8.5153
0           7           0.0001      9.89153     1.13055        8.76098
0           8           0.0001      9.61068     1.18566        8.42502
0           9           0.0001      9.36866     1.50945        7.85922
0           10          0.0001      8.27591     0.609303       7.66661
0           11          0.0001      8.18028     0.790003       7.39028
0           12          0.0001      8.294       1.11646        7.17754
0           13          0.0001      7.69968     0.950753       6.74893
0           14          0.0001      7.9388      1.21336        6.72544
0           15          0.0001      7.03268     0.750632       6.28205
0           16          0.0001      7.68714     1.28199        6.40515
0           17          0.0001      6.90286     0.75934        6.14352
0           18          0.0001      6.93581     0.968773       5.96704
0           19          0.0001      6.5233      0.860986       5.66232
0           20          0.0001      6.81927     1.13721        5.68206
total [..................................................] 0.02%
this epoch [###############################################...] 95.81%
20 iter, 0 epoch / 120000 iterations
0.015884 iters/sec. Estimated time to finish: 87 days, 10:13:51.144739.
I will let it run to 100 again to get validation/main/map for comparison.
Btw, about the columns main/loss, main/loss/loc, main/loss/conf, and validation/main/map: I was not able to google what these are exactly. I assume validation/main/map is the validation score on the test set?
They are values reported during training, except for validation/main/map.
https://github.com/yuyu2172/image-labelling-tool/blob/master/examples/ssd/train_utils.py#L38
They are loss values. You can get more details by reading the SSD paper and the ChainerCV code.
validation/main/map is the mAP (mean average precision) computed on the validation dataset. It is computed inside DetectionVOCEvaluator.
https://github.com/yuyu2172/image-labelling-tool/blob/master/examples/ssd/train_utils.py#L148
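For anyone who wants to compute the same metric outside the trainer, here is a toy, self-contained sketch using ChainerCV's evaluation function (the numbers are made up):

import numpy as np
from chainercv.evaluations import eval_detection_voc

# One image; boxes are (y_min, x_min, y_max, x_max).
pred_bboxes = [np.array([[10, 10, 50, 50]], dtype=np.float32)]
pred_labels = [np.array([0], dtype=np.int32)]
pred_scores = [np.array([0.9], dtype=np.float32)]
gt_bboxes = [np.array([[12, 12, 48, 48]], dtype=np.float32)]
gt_labels = [np.array([0], dtype=np.int32)]

result = eval_detection_voc(
    pred_bboxes, pred_labels, pred_scores, gt_bboxes, gt_labels)
print(result['map'])  # mean average precision over classes
print(result['ap'])   # per-class average precision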
I have set up a machine with 4 GPUs, but if I read this correctly (http://docs.chainer.org/en/stable/tutorial/gpu.html#data-parallel-computation-on-multiple-gpus-with-trainer),
the training code in your repo also needs to be updated, right? Want me to try it myself, or do you see an easy change/fix?
I heard that simply using the default multi-GPU Updater does not work.
(I am sorry for using Chainer terms. They are well explained in the documentation. I also encourage you to read the source code of these abstractions; they are pretty straightforward.)
This is because SSD uses hard negative mining.
I would recommend you stick with a single GPU for the time being. I think it is fast enough in most cases. We "might" add a reference implementation that trains with multiple GPUs.
Cool, thanks for the insight. And no worries about Chainer terms; I am a quick study.
Hello, I have now moved on to the Image Labelling Tool: https://github.com/yuyu2172/image-labelling-tool#example
I noted a few things that need clarifying. (1) flask_app.py did not work with Python 3.5. I corrected syntax errors like print() and TypeError(), but it then showed the following error: Intel MKL FATAL ERROR: Cannot load libmkl_avx2.so or libmkl_def.so
Googling the error pointed to numpy problems, with suggestions to reinstall from either conda or pip. After a few rounds of installing/uninstalling, my conda environment finally crashed. I gave up and instead used Python 2.7, which worked.
(2) Your README.md is vague regarding the suggested format. For example, do you recommend using the [--slic] option or not? And instead of "For those who want to have label images", please specify what I should do to make the code work.
Help is appreciated.
A small update here as well: I got it running on GPU:
chainer@f583d4b31f8f:/src/image-labelling-tool/examples/ssd$ python train.py --train /data/apple_orange_annotations --label_names /data/apple_orange_annotations/apple_orange_label_names.yml --out /data/logs --gpu 0 --val_iteration 100 --log_iteration 10
epoch iteration lr main/loss main/loss/loc main/loss/conf validation/main/map
0 10 0.0001 10.7232 1.27872 9.44453
0 20 0.0001 7.54169 1.11881 6.42288
1 30 0.0001 5.72171 0.97805 4.74366
1 40 0.0001 4.698 0.795822 3.90218
2 50 0.0001 4.30619 0.780229 3.52596
this epoch [###################...............................] 39.52%
50 iter, 2 epoch / 120000 iterations
1.7213 iters/sec. Estimated time to finish: 19:21:24.683424.
How many iterations did you see before converging? (We look at validation/main/map to see this, right?)
I am also working on a small web-based interface for labelling; my focus is on reducing the barrier and enabling less geeky people to train a custom set. If you don't mind me asking: if you could, with 3 clicks, sign in, upload a folder of images, annotate them, and download the annotations like this tool does, do you estimate that would have value, or does it not matter to set up Python and run applications like this?
My training log is below. It accidentally stopped after 46000 iterations due to a full SSD (disk). The last main/loss was around 0.9 (apple-orange dataset), and it seemed to converge at 30000 or so.
[
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 5,
"elapsed_time": 79.22051453590393,
"main/loss/loc": 0.7805992364883423,
"iteration": 1000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 2.432619571685791,
"main/loss": 3.213221549987793,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 11,
"elapsed_time": 171.39589166641235,
"main/loss/loc": 0.5678575038909912,
"iteration": 2000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 1.6511310338974,
"main/loss": 2.2189900875091553,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 17,
"elapsed_time": 263.30492639541626,
"main/loss/loc": 0.4849354326725006,
"iteration": 3000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 1.4860771894454956,
"main/loss": 1.971013069152832,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 23,
"elapsed_time": 355.13462233543396,
"main/loss/loc": 0.43977463245391846,
"iteration": 4000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 1.3815630674362183,
"main/loss": 1.8213376998901367,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 29,
"elapsed_time": 445.9900951385498,
"main/loss/loc": 0.38603052496910095,
"iteration": 5000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 1.2903273105621338,
"main/loss": 1.6763561964035034,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 35,
"elapsed_time": 537.2248640060425,
"main/loss/loc": 0.3692229092121124,
"iteration": 6000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 1.2776328325271606,
"main/loss": 1.6468547582626343,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 41,
"elapsed_time": 629.2061340808868,
"main/loss/loc": 0.33294421434402466,
"iteration": 7000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 1.2454191446304321,
"main/loss": 1.5783628225326538,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 47,
"elapsed_time": 719.9931819438934,
"main/loss/loc": 0.3216134309768677,
"iteration": 8000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 1.1471179723739624,
"main/loss": 1.4687303304672241,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 53,
"elapsed_time": 810.4598653316498,
"main/loss/loc": 0.2914847135543823,
"iteration": 9000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 1.0824588537216187,
"main/loss": 1.3739433288574219,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 59,
"elapsed_time": 899.7752742767334,
"main/loss/loc": 0.2933015525341034,
"iteration": 10000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 1.0578984022140503,
"main/loss": 1.3512004613876343,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 65,
"elapsed_time": 989.7144713401794,
"main/loss/loc": 0.2640993297100067,
"iteration": 11000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 0.9952711462974548,
"main/loss": 1.259369134902954,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 71,
"elapsed_time": 1078.3362791538239,
"main/loss/loc": 0.267562597990036,
"iteration": 12000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 1.0545027256011963,
"main/loss": 1.3220648765563965,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 77,
"elapsed_time": 1167.5032584667206,
"main/loss/loc": 0.2557935416698456,
"iteration": 13000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 1.0422357320785522,
"main/loss": 1.2980290651321411,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 83,
"elapsed_time": 1258.6743831634521,
"main/loss/loc": 0.24655602872371674,
"iteration": 14000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 1.0473687648773193,
"main/loss": 1.2939248085021973,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 89,
"elapsed_time": 1349.2105464935303,
"main/loss/loc": 0.22720617055892944,
"iteration": 15000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 0.9975312352180481,
"main/loss": 1.224737286567688,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 95,
"elapsed_time": 1440.3668973445892,
"main/loss/loc": 0.23588189482688904,
"iteration": 16000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 0.9708943367004395,
"main/loss": 1.2067770957946777,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 101,
"elapsed_time": 1530.970549583435,
"main/loss/loc": 0.2335476577281952,
"iteration": 17000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 0.980478048324585,
"main/loss": 1.2140252590179443,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 107,
"elapsed_time": 1620.2500042915344,
"main/loss/loc": 0.21971601247787476,
"iteration": 18000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 0.9179012179374695,
"main/loss": 1.1376179456710815,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 113,
"elapsed_time": 1709.0796911716461,
"main/loss/loc": 0.202173113822937,
"iteration": 19000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 0.894392728805542,
"main/loss": 1.0965656042099,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 119,
"elapsed_time": 1797.5150842666626,
"main/loss/loc": 0.20833085477352142,
"iteration": 20000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 0.9276025891304016,
"main/loss": 1.1359336376190186,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 125,
"elapsed_time": 1886.2288718223572,
"main/loss/loc": 0.18990658223628998,
"iteration": 21000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 0.8623952865600586,
"main/loss": 1.0523018836975098,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 131,
"elapsed_time": 1975.2177803516388,
"main/loss/loc": 0.19189295172691345,
"iteration": 22000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 0.8291560411453247,
"main/loss": 1.021048665046692,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 0.9930069930069931,
"lr": 0.0001,
"epoch": 137,
"elapsed_time": 2064.356647014618,
"main/loss/loc": 0.18369287252426147,
"iteration": 23000,
"validation/main/ap/Orange": 0.986013986013986,
"main/loss/conf": 0.9099375009536743,
"main/loss": 1.093629240989685,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 143,
"elapsed_time": 2153.256175994873,
"main/loss/loc": 0.19064386188983917,
"iteration": 24000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 0.8053968548774719,
"main/loss": 0.9960411190986633,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 149,
"elapsed_time": 2244.549509525299,
"main/loss/loc": 0.1867237240076065,
"iteration": 25000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 0.8994454741477966,
"main/loss": 1.0861691236495972,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 155,
"elapsed_time": 2334.299106836319,
"main/loss/loc": 0.18454843759536743,
"iteration": 26000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 0.8269904255867004,
"main/loss": 1.0115389823913574,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 161,
"elapsed_time": 2424.789150238037,
"main/loss/loc": 0.1966702789068222,
"iteration": 27000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 0.8844095468521118,
"main/loss": 1.0810796022415161,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 0.9509090909090911,
"lr": 0.0001,
"epoch": 167,
"elapsed_time": 2513.2767944335938,
"main/loss/loc": 0.16937251389026642,
"iteration": 28000,
"validation/main/ap/Orange": 0.901818181818182,
"main/loss/conf": 0.745681881904602,
"main/loss": 0.91505366563797,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 173,
"elapsed_time": 2605.0023188591003,
"main/loss/loc": 0.18461140990257263,
"iteration": 29000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 0.8689676523208618,
"main/loss": 1.0535788536071777,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 179,
"elapsed_time": 2696.3014690876007,
"main/loss/loc": 0.18137913942337036,
"iteration": 30000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 0.8449581265449524,
"main/loss": 1.0263373851776123,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 185,
"elapsed_time": 2786.8104159832,
"main/loss/loc": 0.17437873780727386,
"iteration": 31000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 0.8081318736076355,
"main/loss": 0.9825111627578735,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 191,
"elapsed_time": 2875.750657081604,
"main/loss/loc": 0.1655825674533844,
"iteration": 32000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 0.7402241230010986,
"main/loss": 0.9058061838150024,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 197,
"elapsed_time": 2965.155656814575,
"main/loss/loc": 0.15511693060398102,
"iteration": 33000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 0.7123785018920898,
"main/loss": 0.8674951195716858,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 203,
"elapsed_time": 3054.330314874649,
"main/loss/loc": 0.15841209888458252,
"iteration": 34000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 0.7834293246269226,
"main/loss": 0.9418414235115051,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 209,
"elapsed_time": 3143.364190340042,
"main/loss/loc": 0.17163033783435822,
"iteration": 35000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 0.787468433380127,
"main/loss": 0.9590983986854553,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 215,
"elapsed_time": 3232.4324741363525,
"main/loss/loc": 0.15274818241596222,
"iteration": 36000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 0.7713798880577087,
"main/loss": 0.9241276979446411,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 221,
"elapsed_time": 3321.141427755356,
"main/loss/loc": 0.14548836648464203,
"iteration": 37000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 0.7560657858848572,
"main/loss": 0.9015540480613708,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 227,
"elapsed_time": 3411.388104200363,
"main/loss/loc": 0.1442233920097351,
"iteration": 38000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 0.7834077477455139,
"main/loss": 0.9276309609413147,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 233,
"elapsed_time": 3500.377106666565,
"main/loss/loc": 0.15287399291992188,
"iteration": 39000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 0.7748618125915527,
"main/loss": 0.9277357459068298,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 239,
"elapsed_time": 3590.9456102848053,
"main/loss/loc": 0.13595137000083923,
"iteration": 40000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 0.7328604459762573,
"main/loss": 0.8688119649887085,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 245,
"elapsed_time": 3680.542897462845,
"main/loss/loc": 0.15140685439109802,
"iteration": 41000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 0.7466314435005188,
"main/loss": 0.8980385065078735,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 251,
"elapsed_time": 3769.9805710315704,
"main/loss/loc": 0.15165714919567108,
"iteration": 42000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 0.7843142151832581,
"main/loss": 0.9359715580940247,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 257,
"elapsed_time": 3859.4798498153687,
"main/loss/loc": 0.13653425872325897,
"iteration": 43000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 0.7511242032051086,
"main/loss": 0.8876590132713318,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 263,
"elapsed_time": 3949.208746910095,
"main/loss/loc": 0.13751305639743805,
"iteration": 44000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 0.7098026275634766,
"main/loss": 0.8473147749900818,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 269,
"elapsed_time": 4038.1616518497467,
"main/loss/loc": 0.1474171131849289,
"iteration": 45000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 0.7404646873474121,
"main/loss": 0.8878818154335022,
"validation/main/ap/Apple": 1.0000000000000002
},
{
"validation/main/map": 1.0000000000000002,
"lr": 0.0001,
"epoch": 275,
"elapsed_time": 4127.079681634903,
"main/loss/loc": 0.13585035502910614,
"iteration": 46000,
"validation/main/ap/Orange": 1.0000000000000002,
"main/loss/conf": 0.7742376327514648,
"main/loss": 0.9100871086120605,
"validation/main/ap/Apple": 1.0000000000000002
}
]
Crazy that you get 11 iterations per sec :) On the best Azure single-GPU setup I got 1.17 iterations per sec.
What hardware is that?
The Azure setup is described here: https://azure.microsoft.com/en-us/blog/azure-n-series-general-availability-on-december-1/ 1 x K80 GPU, 6 cores, 56 GB RAM.
Is the speed I saw expected, or could there be something wrong?
FYI, my additional hardware description (11.2 iters/sec, training locally): motherboard ASUS X99-E WS, CPU Core i7 6850K (3.60 GHz, 6 cores/12 threads), memory 32 GB (8 GB x 4), storage 500 GB SATA SSD, GPU NVIDIA GTX 1080 Ti (11 GB) (I have 2 GPUs, but only 1 is used, I guess), Ubuntu 14.04.
Please make sure that chainer.cuda.cudnn_enabled == True.
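For example, you can check this from a Python session (a minimal sketch, assuming Chainer 2.x with CuPy installed):

import chainer

# True if CUDA (CuPy) is usable at all
print(chainer.cuda.available)
# True if cuDNN was found when chainer.cuda was imported
print(chainer.cuda.cudnn_enabled)

If cudnn_enabled is False, convolutions fall back to slower kernels, which could explain a large speed gap.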
LabelMe would be the direct competitor if you are making a web-based app for annotation.
HIN0209, thanks for your feedback. I will update the documentation.
No, mine was the default batchsize. I can run another experiment one of the following days when I am off work.
Also, chainer.cuda.cudnn_enabled was True.
I got up to 9.5 iterations per sec with a batchsize of 1 :) From the theory I learned in school, though, I think the conclusion was that stochastic gradient descent does not work properly with a batchsize of 1.
But it dropped after running for a bit.
What is your recommendation for batchsize?
Another request is a Python script to convert the .xml annotations of the VOC dataset to the .json format used here. I only found scripts that go the other way around, or that simply change the format without actually converting the bounding box coordinates (from xmin, ymin, etc. to center/box size).
Thanks!
What is your recommendation for batchsize?
Usually, as big as you can make it if your dataset is big. That way, the GPU is fully utilized.
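For reference, the batchsize is just the batch_size argument of the iterator in a typical Chainer training script. A minimal sketch with a toy dataset (the arrays below are placeholders, not the apple-orange data):

import numpy as np
import chainer

# Toy stand-in for the real annotated dataset (100 fake 32x32 RGB images).
images = np.random.rand(100, 3, 32, 32).astype(np.float32)
labels = np.random.randint(0, 2, size=100).astype(np.int32)
dataset = chainer.datasets.TupleDataset(images, labels)

# batch_size is the knob discussed above; raise it until GPU memory is full.
train_iter = chainer.iterators.SerialIterator(dataset, batch_size=32)
batch = train_iter.next()
print(len(batch))  # 32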
Another request is a Python script to convert the .xml annotations of the VOC dataset to the .json format used here. I only found scripts that go the other way around, or that simply change the format without actually converting the bounding box coordinates (from xmin, ymin, etc. to center/box size).
Thanks for your suggestion, but it seems that this is out of scope for this project. I would rather not specialize the project for the VOC dataset.
Fine, I understand your limited time. Having said that, LabelMe creates .xml, so it does not help with this apple-orange project either. There must be others with similar requests, so I will look around other GitHub projects.
@HIN0209 If you provide me the links to the VOC dataset labels, I can probably put something together that converts them to the json format used here.
Thanks. Here are the links to the VOC datasets (2007 and 2012), according to another excellent GitHub project: https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects
http://pjreddie.com/media/files/VOCtrainval_11-May-2012.tar http://pjreddie.com/media/files/VOCtrainval_06-Nov-2007.tar http://pjreddie.com/media/files/VOCtest_06-Nov-2007.tar
I have tried to train my custom dataset by modifying the code provided in chainercv/examples/ssd and faster_rcnn. It turned out that corrections were needed in many .py files located very deep in the tree (even inside anaconda3/env), and I almost gave up. This orange-apple project provides much simpler code; hopefully I can train/test my own data with it. While working with the chainercv example code, I also experienced the "no response forever" issue we discussed earlier in the respective training scripts. I hope this gets solved in the more general code as well.
I will take a look at http://pjreddie.com/media/files/VOCtrainval_11-May-2012.tar and see how hard it is to convert it to the same structure as this project. That is what you are asking, right?
A folder with a yml file containing the classes, and then a folder per class. For each image in each folder there is a json file with the bounding boxes?
I assume there can be images that contain multiple classes; should those be represented in all class folders?
A converter from xml to json is found here; I can try this later: https://github.com/NervanaSystems/neon/tree/master/examples/faster-rcnn
I assume there can be images that contain multiple classes; should those be represented in all class folders?
There can be multiple classes in one image.
The directory name is arbitrary.
The output only depends on what is written in the .json.
Source code: https://github.com/yuyu2172/image-labelling-tool/blob/master/examples/ssd/original_detection_dataset.py
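A minimal sketch of the corner-to-center conversion asked about above, using only the Python standard library. The output keys (label, center_x, center_y, width, height) and the __labels.json suffix are assumptions for illustration, not the confirmed schema, so check original_detection_dataset.py for what the loader actually expects:

import json
import xml.etree.ElementTree as ET

def voc_xml_to_json(xml_path, json_path):
    # One VOC annotation file -> one json file with center/size boxes.
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.findall('object'):
        bb = obj.find('bndbox')
        xmin = float(bb.find('xmin').text)
        ymin = float(bb.find('ymin').text)
        xmax = float(bb.find('xmax').text)
        ymax = float(bb.find('ymax').text)
        boxes.append({
            'label': obj.find('name').text,
            # VOC corner format -> center/box-size format
            'center_x': (xmin + xmax) / 2.0,
            'center_y': (ymin + ymax) / 2.0,
            'width': xmax - xmin,
            'height': ymax - ymin,
        })
    with open(json_path, 'w') as f:
        json.dump(boxes, f, indent=2)

# e.g. voc_xml_to_json('000005.xml', '000005__labels.json')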
@HIN0209 Please, could I have the trained weights for the model on the orange-apple dataset?
@rida1896 Thanks for your comment. Actually, my comment is now outdated. If you could do that on Mask-RCNN, it would be great.
@HIN0209 Do you have the dataset with annotations?
Hello
First off, I really appreciate how simple it was to get started with the info in the README. My background is that I understand the principles of deep networks from theory but have not done anything real with them.
So coming here and, with less than an hour of work, being able (on Windows) to set up a Docker container running chainercv and run a Faster R-CNN detection on a sample image was extremely satisfying.
Would it not be possible to author the same experience for a simple getting-started tutorial on training with your own data?
Let's say someone has 10 images and takes the time to annotate them, puts them in a folder structure, and then runs a simple demo.py script that trains on 9 of them and runs detection on the last. I know this will yield poor results, but these kinds of demos really accelerate adoption, because they are something everyone can do, and it yields much more value when people can try it on their own use cases instead of always the same sample datasets that always work.
If anyone can contribute a folder structure and a demo training script, I can write up the tutorial and documentation to help people.
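To make the request concrete, the layout could simply follow what was described earlier in this thread (all names below are placeholders, not an agreed convention):

my_dataset/
    classes.yml                # list of class names
    apple/
        img001.jpg
        img001__labels.json    # bounding boxes for img001.jpg
    orange/
        img002.jpg
        img002__labels.json

demo.py would then train on all but one image and run detection on the held-out one.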