Closed IbrahimSobh closed 7 years ago
Hi Ibrahim,
Did you encounter an error in the worker_0 process? Otherwise it should certainly be plotting.
I have been having an issue that I think might be related to, or even explain, what you are seeing. When running the A3C Doom code on 8-16 CPUs, I noticed that sometimes one or two threads would not launch, or at least were failing silently. The thread name, when printed from worker.work or elsewhere, would show up repeated or mixed up, so I would see two "1" threads or some other repeated index in the worker name.
When extending the code for personal use I ran into this repeatedly, as my environment was much lighter and its workers started up faster. To fix the issue I added a simple sleep(0.5) (see code below). Now, when I print the thread name from worker.work, I no longer see repeated names, and the mixed-up print locations and other bugs caused by the issue are gone.
It appears the workers were spooling up too quickly in my case and repeating or mixing up their context? I'm used to the Scoop and Multiprocessing modules, so I am unsure whether this is a common issue with global scope and Threading.
for worker in workers:
    worker_work = lambda: worker.work(max_episode_length,gamma,sess,coord,saver)
    t = threading.Thread(target=(worker_work))
    t.start()
    worker_threads.append(t)
    sleep(0.5)
coord.join(worker_threads)
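A possible explanation for the repeated names, offered as an assumption and not something confirmed in this thread: the lambda looks up `worker` when the thread actually runs, not when the lambda is created. If the loop has already moved on to the next worker by then, two threads can end up calling the same worker's `work`, and the sleep merely gives each thread time to start first. A minimal sketch that binds the worker when the thread is created, using a hypothetical `DummyWorker` stand-in (not the repo's `Worker` class):

```python
import threading


class DummyWorker:
    """Hypothetical stand-in for the A3C Worker; it only records its own name."""

    def __init__(self, name, seen, lock):
        self.name = name
        self.seen = seen
        self.lock = lock

    def work(self):
        with self.lock:
            self.seen.append(self.name)


def launch(workers):
    """Start one thread per worker, binding each worker when the thread is created."""
    threads = []
    for worker in workers:
        # Passing the bound method as target fixes the worker now; a bare
        # `lambda: worker.work()` would look up `worker` only when called.
        t = threading.Thread(target=worker.work)
        t.start()
        threads.append(t)
    for t in threads:
        t.join()


seen, lock = [], threading.Lock()
workers = [DummyWorker("worker_%d" % i, seen, lock) for i in range(8)]
launch(workers)
print(sorted(seen))
```

In the loop above, the same idea would be `threading.Thread(target=worker.work, args=(max_episode_length, gamma, sess, coord, saver))`, or a lambda with a default argument such as `lambda worker=worker: ...`.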
Thanks for the suggestion, DMTSource! I have incorporated the sleep line into the notebook.
I have added the sleep as follows:
worker_threads = []
for worker in workers:
    worker_work = lambda: worker.work(max_episode_length,gamma,sess,coord,saver)
    t = threading.Thread(target=(worker_work))
    t.start()
    worker_threads.append(t)
    sleep(0.5) # here it is
coord.join(worker_threads)
worker_0 is in orange color
Then I used worker_1 instead of worker_0 for saving the model and frames, but then worker_1 stopped.
I also tried adding sleep in the code that is responsible for saving the model and frames, but the problem remained.
Regards
So the plots are showing and all workers are alive and well with their respective names... until what appears to be step ~10 (saving time).
It looks like there is trouble with the model saving, as the "master" worker is making it to that point and then shutting down. If the crash is truly silent, you might want to add some print statements to see how far the code gets once it reaches the block responsible for saving the model.
My ignorant guess is that something like ffmpeg is causing the trouble, as it's a very external tool to this code, and saving checkpoint files should be trivial for TensorFlow regardless of the system. You could try commenting out the gif generation code to check whether that is the case. I had trouble getting a working ffmpeg installation on my system the first time I ran the code, as some versions threw errors (Ubuntu 14.04). But I was able to get it working once the issue was identified.
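As an alternative to scattered print statements, one option is to wrap the suspect step so that any exception is printed with a full traceback before the thread dies. A minimal sketch; `run_logged` and `failing_save` are hypothetical names used only for illustration, and the failing function merely simulates a broken gif save:

```python
import traceback


def run_logged(label, step, *args, **kwargs):
    """Run step(*args, **kwargs); print a traceback and return False if it raises."""
    try:
        step(*args, **kwargs)
        return True
    except Exception:
        print("%s failed:" % label)
        traceback.print_exc()
        return False


def failing_save(path):
    # Simulates a gif writer that breaks, e.g. because of a missing encoder.
    raise RuntimeError("could not write %s" % path)


ok = run_logged("gif save", failing_save, "./frames/image0.gif")
print("worker continues, gif saved:", ok)
```

In the worker loop this would wrap the `make_gif` and `saver.save` calls separately, so the output shows which of the two is failing.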
Possible cause:
After removing model and gif saving code, things worked fine!
Removed code:
if self.name == 'worker_1' and episode_count % 25 == 0:
    time_per_step = 0.05
    images = np.array(episode_frames)
    make_gif(images,'./frames/image'+str(episode_count)+'.gif',
        duration=len(images)*time_per_step,true_image=True,salience=False)
if episode_count % 250 == 0 and self.name == 'worker_1':
    saver.save(sess,self.model_path+'/model-'+str(episode_count)+'.cptk')
    print ("Saved Model")
Figure: (all threads are there) I think I have some error in saving!
Any clue?
Regards
Thanks DMTSource
I can save the model but not the frames!
Is there any other way to save gifs or video?
Hi Ibrahim,
Are you sure that you have both moviepy and ffmpeg installed? You will also need to ensure the version of imageio you have is 1.6.
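A quick way to check all three at once is sketched below. It assumes Python 3.8+ for `importlib.metadata`, and the function names are hypothetical. Note that moviepy may use an ffmpeg binary other than the one on PATH, so that last check is only a hint.

```python
import importlib.util
import shutil
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def check_gif_dependencies():
    """Report whether the pieces needed for gif saving can be found."""
    return {
        "moviepy importable": importlib.util.find_spec("moviepy") is not None,
        "imageio version": installed_version("imageio"),
        "ffmpeg on PATH": shutil.which("ffmpeg"),
    }


report = check_gif_dependencies()
for item, value in report.items():
    print(item, "->", value)
```

A value of `None` for the imageio version or the ffmpeg path means that piece was not found in the current environment.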
Hi Arthur
imageio: print imageio.__version__ gives 2.1.2
ffmpeg -version
ffmpeg version N-80901-gfebc862 Copyright (c) 2000-2016 the FFmpeg developers
built with gcc 4.8 (Ubuntu 4.8.4-2ubuntu1~14.04.3)
configuration: --extra-libs=-ldl --prefix=/opt/ffmpeg --mandir=/usr/share/man --enable-avresample --disable-debug --enable-nonfree --enable-gpl --enable-version3 --enable-libopencore-amrnb --enable-libopencore-amrwb --disable-decoder=amrnb --disable-decoder=amrwb --enable-libpulse --enable-libfreetype --enable-gnutls --enable-libx264 --enable-libx265 --enable-libfdk-aac --enable-libvorbis --enable-libmp3lame --enable-libopus --enable-libvpx --enable-libspeex --enable-libass --enable-avisynth --enable-libsoxr --enable-libxvid --enable-libvidstab
libavutil 55. 28.100 / 55. 28.100
libavcodec 57. 48.101 / 57. 48.101
libavformat 57. 41.100 / 57. 41.100
libavdevice 57. 0.102 / 57. 0.102
libavfilter 6. 47.100 / 6. 47.100
libavresample 3. 0. 0 / 3. 0. 0
libswscale 4. 1.100 / 4. 1.100
libswresample 2. 1.100 / 2. 1.100
libpostproc 54. 0.100 / 54. 0.100
I believe you will need imageio 1.6, and not 2.1 in order for the gif generation to work. Unfortunately they changed the encoder in 2.1 and broke the gif code I used. If you have a fix that works with 2.1, I would be happy to incorporate it.
Dear
When using:
tensorboard --logdir=worker_0:'./train_0',worker_1:'./train_1',worker_2:'./train_2',worker_3:'./train_3'
worker_0 is not plotted