coli-saar / am-parser

Modular implementation of an AM dependency parser in AllenNLP.
Apache License 2.0

Multiple threads slower than one thread for decoder? #90

Open weissenh opened 3 years ago

weissenh commented 3 years ago

TL;DR

I noticed that generating predictions for larger datasets was faster with --threads 1 than with --threads 4, which seems counter-intuitive to me.

(fixed-tree decoder)

Background:

I am working on the cogs branch, which branched off from JG's unsupervised branch. Because the unsupervised case introduces automata, I had to adapt the prediction code a bit (I ended up with my own prediction bash script for now, heavily inspired by scripts/predict.sh, and also had to change the dataset reader and predictor class used in parse_file.py). Both my own prediction script and the original predict.sh internally call parse_file.py, which in turn calls the fixed-tree decoder. The number of threads can be specified there; the default is 4. I'm trying to generate predictions on the test, dev and gen sets of COGS.

The problem:

For my experiments I have used k=6 supertags so far, used the give-up option throughout (tried 5 and 15 seconds), and started with the default of 4 threads; all of this ran on jones-2 with the GPU option enabled. On small datasets (50-500 samples) this usually worked, but as soon as I tried larger ones (sometimes already stuck at 1k samples, and definitely for the 3k test and dev sets), the script ran forever and never completed. Even the 3k-sample dev set wasn't done after half a day, although during training (>6 hrs for 100 epochs, 92 of them with validation turned on) decoding the dev set was possible.

I expected the runtime to grow approximately linearly with dataset size, but that wasn't the case. I know there is some noise, but if a 50-sample set needs less than a minute, why is the full set (3k = 50*60) not finished after half a day, especially with the give-up parameter set to just a few seconds? It also seemed to me that the program gets stuck pretty quickly (within minutes), judging by the 'last modified' timestamps of the log files (note that I added a print statement, so I know when a new sentence has been decoded) and by summing up the runtime info printed for each sentence.

I looked at the log file and even inserted a print statement in graph_dependency_parser/am_algebra/label_decoder.pyx:call_viterbi to verify that it's not just one outlier sentence that causes the whole process to get stuck (note that I used the --give_up parameter the whole time). Most sentences took less than a second, with a few outliers taking several seconds (because they triggered the back-off). However, rerunning on different prefixes of the same corpus files revealed mixed patterns, so it doesn't seem to be one specific sentence that is difficult for the decoder.
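For reference, this is roughly the kind of per-sentence timing instrumentation I mean; it's only a sketch, and `decode_sentence` is a hypothetical placeholder for the decoder call (in my case the print statement actually sits inside label_decoder.pyx:call_viterbi), not the real parser API:

```python
import time

def decode_with_timing(sentences, decode_sentence, log_path="decode_times.log"):
    """Time each sentence individually so a single slow outlier is easy to spot.

    `decode_sentence` is a hypothetical stand-in for the per-sentence decoder call.
    """
    with open(log_path, "w") as log:
        for i, sentence in enumerate(sentences):
            start = time.perf_counter()
            result = decode_sentence(sentence)
            elapsed = time.perf_counter() - start
            # One line per sentence: index and wall-clock seconds.
            log.write(f"sentence {i}\t{elapsed:.3f}s\n")
            log.flush()  # so 'last modified' timestamps reflect real progress
            yield result
```

Summing the per-sentence numbers from such a log is how I compared the reported decoding time against the overall wall-clock time.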

Because I know from the successful training runs that the dev set should be doable quickly, I looked at the config file used during training again and found that only one thread is used there for evaluation. Once I changed my prediction script to use only one thread, the dev and test sets (3k samples each) were done in minutes or less (I haven't timed it exactly), and the other 21k-sample set needed less than 1.5 hours (that one contains some very long sentences).
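In case it helps reproduce the comparison, here is a minimal sketch of how one could benchmark the same decoding workload under different thread counts. The ThreadPoolExecutor and the `decode_sentence` placeholder are my own illustration of the measurement, not how parse_file.py actually distributes work across threads:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def benchmark_thread_counts(sentences, decode_sentence, thread_counts=(1, 4)):
    """Compare total wall-clock time for the same workload with different thread counts.

    `decode_sentence` is a hypothetical per-sentence decoding function; in the real
    setup the parallelism lives inside parse_file.py / the fixed-tree decoder.
    """
    results = {}
    for n in thread_counts:
        start = time.perf_counter()
        if n == 1:
            # Sequential baseline, no pool overhead.
            for sentence in sentences:
                decode_sentence(sentence)
        else:
            with ThreadPoolExecutor(max_workers=n) as pool:
                # map() blocks until all sentences have been decoded.
                list(pool.map(decode_sentence, sentences))
        results[n] = time.perf_counter() - start
    return results

# Example usage: print the wall-clock time per thread count.
# times = benchmark_thread_counts(dev_sentences, decode_sentence)
# for n, t in sorted(times.items()):
#     print(f"{n} thread(s): {t:.1f}s")
```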


@namednil Do you have any ideas about this?

I don't have time to investigate this further right now; for now I'm happy with using just one thread, and the current runtimes are OK-ish (of course, faster is always better). I'm opening this issue first of all to document my surprise that prediction was faster with one thread than with four.