dair-iitd / OpenIE-standalone


Multicore support? #22

Open hthuwal opened 5 years ago

hthuwal commented 5 years ago

I am trying to run it on a 1.5 GB text file. The model uses only a single core and hence it's taking too long.

I couldn't find a flag to specify the number of threads to use. Is there a way to run the model on multiple cores?

guilherme-salome commented 5 years ago

I couldn't either; an easy workaround is to split the text file and launch two processes, one for each half of the file.

hthuwal commented 5 years ago

Yeah, running several processes on file splits is exactly what I am doing right now.

But it would be nice to have a flag or method that allows using all cores, like Stanford NLP does.
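
For reference, the split-and-spawn workaround looks roughly like this (a sketch only; the jar name and the input/output argument form are placeholders, check the README for the exact invocation):

    # Split the corpus into two halves without breaking lines,
    # then run one OpenIE process per half in the background.
    split -n l/2 input.txt part_
    java -Xmx10g -XX:+UseConcMarkSweepGC -jar openie-assembly.jar part_aa part_aa.out &
    java -Xmx10g -XX:+UseConcMarkSweepGC -jar openie-assembly.jar part_ab part_ab.out &
    wait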

guilherme-salome commented 5 years ago

If you find a more efficient solution or update the code to allow for multicores please post it here!

guilherme-salome commented 5 years ago

@hthuwal I've been using this project to go over a lot of text. I was running it on a single powerful machine and it was very slow, so I went to DigitalOcean, got one of the high-tier droplets, and started running OpenIE 5 in parallel (with https://www.gnu.org/software/parallel/) on small batches of sentences (1,000 at a time, more to debug really, but that could be increased). The processing time was about 3 minutes per 1,000 sentences (roughly 1 minute of that is just loading OpenIE 5).
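
The setup was along these lines (illustrative only; the jar name and its arguments are placeholders, and -j controls how many processes run at once):

    # Split the corpus into 1000-sentence chunks and let GNU parallel
    # keep 4 OpenIE processes running at a time (each uses 10-13 GB of RAM).
    split -l 1000 corpus.txt chunk_
    ls chunk_* | parallel -j 4 'java -jar openie-assembly.jar {} {}.oie'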

At first I was trying with the -Xmx10g -XX:+UseConcMarkSweepGC options and it was not working at all: no lines were being parsed. These options seem to work on RedHat and macOS but did not work for me on Ubuntu 18.04. I removed them and it started working. However, I noticed that memory usage was higher than 10 GB, about 13 GB per OpenIE 5 process. I was also using top to monitor CPU usage, and each process was using about 150% CPU on average. With 64 GB of RAM I was able to run 4 processes simultaneously (a 5th would crash because of low memory).

The droplet I was using has 64 GB of RAM and 32 vCPUs; it is the "CPU Optimized" type. There is another type called "Standard Droplets", whose highest tier has 192 GB of RAM and 32 vCPUs. Since the bottleneck on the CPU Optimized droplet was RAM, it may be possible to run more processes on a Standard droplet, even though its CPUs are less powerful.

guilherme-salome commented 5 years ago

Update: I tested their Standard Droplet with 192 GB of memory and 32 vCPUs, and I was able to run 8 processes at the same time. That consumed roughly 92% of the memory. The average CPU use was 1200%. So the bottleneck is definitely memory.

Update: Looking at the top output, it seems there is still some memory free with 8 processes, so maybe 9 or 10 could run in parallel. 12 processes definitely does not work, and neither does 11.

Anyway, maybe this can help you speed things up. By the way, DigitalOcean (referral link) is giving $100 of credit for use during October, which can buy about 60 hours of the most expensive droplet.

hthuwal commented 5 years ago

Thanks @Salompas. Yes, memory is the bottleneck, because each process requires ~10 GB of memory just to run. I have access to a machine with about 80 GB of RAM and 32 cores and was able to run 3 processes simultaneously; any further increase in the number of processes chokes the machine.

Thanks for reminding me about the parallel command. I had completely forgotten about it and instead wrote a script that splits the data and spawns processes in multiple tmux windows.

bhadramani commented 5 years ago

OpenIE 4.2+ had multicore support in a multithreaded environment (with approximately constant RAM usage). Performance recommendations:

  1. Use one core per thread, i.e. for N threads use N cores. You may observe an Nx improvement for up to 8 cores.
  2. Use taskset to pin the processes to specific cores (see the sketch below).
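
For example (an illustrative sketch only, not from the codebase; the jar name, arguments, and core ranges are placeholders):

    # Pin each OpenIE process to its own block of 8 cores on a 32-core machine.
    taskset -c 0-7  java -Xmx10g -jar openie-assembly.jar chunk1.txt chunk1.out &
    taskset -c 8-15 java -Xmx10g -jar openie-assembly.jar chunk2.txt chunk2.out &
    wait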

Swarna may be able to confirm whether OpenIE 5.x is thread-safe.

bhadramani commented 5 years ago

One more performance-related suggestion: reading the files is costly, so processing in smaller chunks should help, and choosing the chunk size well is another smart thing to do. Similarly, writing the output should be done carefully (for very large data, consider using RabbitMQ or any similar system that maintains the queue and saves asynchronously).

ambujpd commented 4 years ago

@vaibhavad @swarnaHub @harrysethi @schmmd @bhadramani Could you please suggest on the approach to multicore support? Alternatively, is it possible to load the model in a separate process such that it can be shared (since model size is one of the major bottlenecks)?

I tried to naively use concurrent Futures in Scala and divided the sentences among them (in OpenIECli.scala). (I found the OpenNLP Chunker to be non-thread-safe, so I wrapped that call in blocking{}.) But this is not giving me any improvement: for 8 concurrent Futures (and 80 sentences), the run time is slightly slower than serial. The extractions are getting serialized at some point, even though they are running in different threads.

PS: I also see nthreads set to 1 in some of the bundled tagger props files:

edu/stanford/nlp/models/pos-tagger/wsj-0-18-left3words-nodistsim.tagger.props:                nthreads = 1
edu/stanford/nlp/models/pos-tagger/english-bidirectional/english-bidirectional-distsim.tagger.props:                nthreads = 1
edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger.props:                nthreads = 1
ambujpd commented 4 years ago

The multithreaded implementation is working now, giving a 4x improvement with 6 threads (tried on a 20-core machine; increasing the thread count further showed no additional improvement). The reason it wasn't showing any improvement earlier was that I was giving it too little heap memory (10G). Increasing the heap from 10G to 12G already gave a substantial improvement in runtime (around 10x in extractions).
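
In other words, only the heap size on the launch command changed; a minimal sketch, assuming the usual java -jar invocation (the jar name and input/output arguments are placeholders):

    # With -Xmx10g the threads showed no speedup; -Xmx12g gave ~4x with 6 threads.
    java -Xmx12g -XX:+UseConcMarkSweepGC -jar openie-assembly.jar input.txt output.txt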

vaibhavad commented 4 years ago

@ambujpd Glad to know that the multithreaded implementation is working. Can you share the changes you made to make it work, in a pull request? We can test them and merge them into the codebase.

ambujpd commented 4 years ago

@vaibhavad

With a higher number of threads (8+), I sporadically see one or two sentences (out of 80) throwing a NullPointerException from the OpenNLP Chunker, even though I've put that call within blocking{}. I'm looking into it currently.

moinnadeem commented 3 years ago

@ambujpd Hey! Are you able to share your multithreaded implementation? It would be super useful for me personally and would cut down my development time by quite a bit. Happy to spend time on the code to help if necessary.

ambujpd commented 3 years ago

@moinnadeem I don't have the code with me, unfortunately (I remember I was able to use a thread-safe NLP chunker, along with Scala concurrency, and had gotten rid of the sporadic NullPointerException issue). But in the end I found it was not worth the effort, as the scalability was quite limited. A much better alternative is multiprocessing (at the cost of extra memory), which is what I eventually ended up using.

vaibhavad commented 3 years ago

Hi @ambujpd @moinnadeem @bhadramani @hthuwal @Salompas,

We have just released a neural OpenIE system - OpenIE6, which is better in performance and at least 10x faster than OpenIE-5 (if you run it on a GPU). You can check it out here - https://github.com/dair-iitd/openie6