awjuliani / DeepRL-Agents

A set of Deep Reinforcement Learning Agents implemented in Tensorflow.
MIT License
2.24k stars 826 forks

A3C-Doom: worker_0 plot is lost #14

Closed IbrahimSobh closed 7 years ago

IbrahimSobh commented 7 years ago

Dear all,

When using:

tensorboard --logdir=worker_0:'./train_0',worker_1:'./train_1',worker_2:'./train_2',worker_3:'./train_3'

worker_0 is not plotted

awjuliani commented 7 years ago

Hi Ibrahim,

Did you encounter an error in the worker_0 process? Otherwise it should certainly be plotting.

DMTSource commented 7 years ago

I have been having an issue that might be related and could explain what you are seeing. When running the A3C Doom code on 8-16 CPUs, I noticed that sometimes one or two threads would not launch, or at least failed silently. The thread name, when printed from worker.work or elsewhere, was sometimes repeated or mixed up, so I would see two "1" threads or some other repeated index among the worker names.

When extending the code for personal use I ran into this repeatedly, since my environment was much lighter and workers started up faster. To fix it I added a simple sleep(0.5) (see code below). Now, when I print the thread name from worker.work, I no longer see repeated names, and the mixed-up print locations and other bugs caused by the issue are gone.

It appears the workers were spooling up too quickly in my case and repeating or mixing up their context. I'm used to the Scoop and multiprocessing modules, so I'm unsure whether this is a common issue with global scope and threading.

    # requires: import threading; from time import sleep
    worker_threads = []
    for worker in workers:
        worker_work = lambda: worker.work(max_episode_length, gamma, sess, coord, saver)
        t = threading.Thread(target=worker_work)
        t.start()
        worker_threads.append(t)
        sleep(0.5)  # stagger thread start-up so worker names do not collide
    coord.join(worker_threads)
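One possible explanation for the repeated names, beyond raw timing: Python closures bind variables late, so every lambda in the loop reads the loop variable `worker` at call time, and a thread that actually starts running after the loop has advanced sees the wrong worker. A minimal, self-contained sketch (not the notebook code) showing the difference:

```python
names = ["worker_0", "worker_1", "worker_2"]

# Late binding: each lambda reads `name` when it is *called*, so any
# lambda invoked after the loop has finished sees only "worker_2".
late = []
for name in names:
    late.append(lambda: name)
print([fn() for fn in late])   # ['worker_2', 'worker_2', 'worker_2']

# Early binding: a default argument freezes the value at definition time.
early = []
for name in names:
    early.append(lambda n=name: n)
print([fn() for fn in early])  # ['worker_0', 'worker_1', 'worker_2']
```

Applied to the notebook loop, `lambda worker=worker: worker.work(max_episode_length, gamma, sess, coord, saver)` would pin each thread to its own worker regardless of start-up timing; the sleep(0.5) then becomes a start-up stagger rather than a correctness requirement.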
awjuliani commented 7 years ago

Thanks for the suggestion, DMTSource! I have incorporated the sleep line into the notebook.

IbrahimSobh commented 7 years ago

I have added the sleep as follows:

    worker_threads = []
    for worker in workers:
        worker_work = lambda: worker.work(max_episode_length,gamma,sess,coord,saver)
        t = threading.Thread(target=(worker_work))
        t.start()
        worker_threads.append(t)
        sleep(0.5)  # here it is
    coord.join(worker_threads)

worker_0 is shown in orange:

[figure: w_0 — TensorBoard screenshot]

Then I used worker_1 instead of worker_0 for saving the model and frames, but then worker_1 stopped:

[figure: w_1 — TensorBoard screenshot]

I tried sleep in the code that is responsible for saving the model and frames, but the problem persists.

Regards

DMTSource commented 7 years ago

So the plots are showing and all workers are alive and well with their respective names... until what appears to be step ~10 (saving time).

Looks like there is trouble with the model saving: the "master" worker is making it to that point and then shutting down. If the crash is truly silent, you might want to add some print statements to see how far the code gets once it reaches the block responsible for saving the model.

My ignorant guess is that something like ffmpeg is causing the trouble, since it is a tool external to this code, and saving checkpoint files should be trivial for TensorFlow on any system. You could try commenting out the gif-generation code to confirm. I had trouble getting a working ffmpeg installation the first time I ran the code, as some versions threw errors (Ubuntu 14.04), but I was able to get it working once the issue was identified.
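One way to make a silent failure in the save step visible is to wrap it so the traceback gets printed instead of killing the thread. A hedged sketch; `save_safely` is a hypothetical helper, and the commented call sites reuse names from the notebook:

```python
import traceback

def save_safely(fn, *args, **kwargs):
    """Run a save step; print the full traceback instead of dying silently."""
    try:
        fn(*args, **kwargs)
        print("save step succeeded:", getattr(fn, "__name__", fn))
    except Exception:
        print("save step FAILED:")
        traceback.print_exc()

# Hypothetical use inside the worker loop:
# save_safely(make_gif, images, './frames/image' + str(episode_count) + '.gif')
# save_safely(saver.save, sess, self.model_path + '/model-' + str(episode_count) + '.cptk')
```

This at least distinguishes "the gif code crashed" from "the checkpoint code crashed" without restructuring the worker.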

IbrahimSobh commented 7 years ago

Possible cause:

After removing model and gif saving code, things worked fine!

Removed code:

    if self.name == 'worker_1' and episode_count % 25 == 0:
        time_per_step = 0.05
        images = np.array(episode_frames)
        make_gif(images, './frames/image' + str(episode_count) + '.gif',
                 duration=len(images) * time_per_step, true_image=True, salience=False)
    if episode_count % 250 == 0 and self.name == 'worker_1':
        saver.save(sess, self.model_path + '/model-' + str(episode_count) + '.cptk')
        print("Saved Model")

Figure (all threads are there): I think I have some error in saving!

[figure: w_no_save — TensorBoard screenshot]

Any clue?

Regards

IbrahimSobh commented 7 years ago

Thanks DMTSource

I can save the model but not the frames!

Is there any other way to save gifs or video?
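One alternative worth trying with a newer imageio (2.x): write the frames directly with `imageio.mimsave`, bypassing the moviepy/ffmpeg path entirely. A minimal sketch, assuming `episode_frames` is a list of uint8 image arrays (random frames stand in for it here):

```python
import numpy as np
import imageio

# Stand-in for episode_frames: a few random 84x84 RGB frames.
frames = [np.random.randint(0, 255, (84, 84, 3), dtype=np.uint8)
          for _ in range(10)]

# duration is seconds per frame; 0.05 matches time_per_step in the notebook.
imageio.mimsave('./episode.gif', frames, duration=0.05)
```

This loses the salience overlay that make_gif supports, but for a plain replay gif it avoids the encoder dependency that broke between imageio versions.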

awjuliani commented 7 years ago

Hi Ibrahim,

Are you sure that you have both moviepy and ffmpeg installed? You will also need to ensure the version of imageio you have is 1.6.

IbrahimSobh commented 7 years ago

Hi Arthur

imageio:

    print imageio.__version__   # 2.1.2

ffmpeg -version:

    ffmpeg version N-80901-gfebc862 Copyright (c) 2000-2016 the FFmpeg developers
    built with gcc 4.8 (Ubuntu 4.8.4-2ubuntu1~14.04.3)
    configuration: --extra-libs=-ldl --prefix=/opt/ffmpeg --mandir=/usr/share/man --enable-avresample --disable-debug --enable-nonfree --enable-gpl --enable-version3 --enable-libopencore-amrnb --enable-libopencore-amrwb --disable-decoder=amrnb --disable-decoder=amrwb --enable-libpulse --enable-libfreetype --enable-gnutls --enable-libx264 --enable-libx265 --enable-libfdk-aac --enable-libvorbis --enable-libmp3lame --enable-libopus --enable-libvpx --enable-libspeex --enable-libass --enable-avisynth --enable-libsoxr --enable-libxvid --enable-libvidstab
    libavutil      55. 28.100 / 55. 28.100
    libavcodec     57. 48.101 / 57. 48.101
    libavformat    57. 41.100 / 57. 41.100
    libavdevice    57.  0.102 / 57.  0.102
    libavfilter     6. 47.100 /  6. 47.100
    libavresample   3.  0.  0 /  3.  0.  0
    libswscale      4.  1.100 /  4.  1.100
    libswresample   2.  1.100 /  2.  1.100
    libpostproc    54.  0.100 / 54.  0.100

awjuliani commented 7 years ago

I believe you will need imageio 1.6, not 2.1, for the gif generation to work. Unfortunately they changed the encoder in 2.1 and broke the gif code I used. If you have a fix that works with 2.1, I would be happy to incorporate it.
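A quick sanity check along these lines: inspect the installed version string before running the notebook and warn when a downgrade to 1.6 is needed. A small sketch; `needs_downgrade` is a hypothetical helper, not part of the repo:

```python
def needs_downgrade(version):
    """Return True when an imageio version string is 2.x or later,
    i.e. incompatible with the notebook's make_gif encoder path."""
    return int(version.split('.')[0]) >= 2

print(needs_downgrade("2.1.2"))  # True
print(needs_downgrade("1.6"))    # False

# Typical use (assumes imageio is importable):
# import imageio
# if needs_downgrade(imageio.__version__):
#     print("imageio %s detected; make_gif expects 1.6" % imageio.__version__)
```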