duydvu / Scalable-VnCoreNLP

Easily increase the capacity of the VnCoreNLP service to handle huge dataset.
GNU General Public License v3.0

Why is your code slower than py_vncorenlp? #1

Open duongkstn opened 1 year ago

duongkstn commented 1 year ago

Hi, I ran your code on 1000 samples from a list of texts (`texts`):

import time
from vncorenlp import VnCoreNLP

annotator = VnCoreNLP('http://localhost', 8000)
times = 1000
start = time.time()
for i in range(times):
    annotator.tokenize(texts[i])
end = time.time()

and compared the total time with py_vncorenlp code:

import time
import py_vncorenlp

annotator = py_vncorenlp.VnCoreNLP(annotators=["wseg"], save_dir="..../VnCoreNLP")
times = 1000
start = time.time()
for i in range(times):
    annotator.word_segment(texts[i])
end = time.time()

The result I got is that your code is much slower than the py_vncorenlp version. Am I doing something wrong? Maybe my testing approach is flawed? Please let me know your solution.

duydvu commented 1 year ago

Hi @duongkstn, your code sends requests to the server sequentially, which means at most one container is processing a request at any time. You need to create a thread or process pool to call word_segment concurrently, using threading or multiprocessing. Also, you might want to monitor the CPU usage of each container with docker stats.
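
For illustration, a minimal sketch of this concurrent pattern, assuming the service is listening on http://localhost:8000 and sharing one client object across threads, as the later comments in this thread also do (the placeholder `texts` stands in for a real input list):

import time
from concurrent.futures import ThreadPoolExecutor
from vncorenlp import VnCoreNLP

annotator = VnCoreNLP('http://localhost', 8000)
texts = ['hôm nay tôi đi học'] * 1000  # placeholder inputs

start = time.time()
# 4 worker threads issue independent HTTP requests, so up to 4
# containers can process requests at the same time.
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(annotator.tokenize, texts))
end = time.time()
print(end - start)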

duongkstn commented 1 year ago

May I ask you a question? Why did you use Nginx here? What are its advantages?

duydvu commented 1 year ago

Because you can only bind one container to a given port on the host. So if you remove Nginx and bind vncorenlp to port 8000 directly, you will get an error when scaling, since that port is already allocated. (link)

With Nginx, you only need to bind it to port 8000; it balances the load across all vncorenlp containers, which don't have to bind to any host port.
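
For illustration, a hypothetical compose sketch of this topology (the actual docker-compose.yml in this repo may differ): only the nginx service publishes a host port, while the vncorenlp service stays unpublished and can be scaled freely.

# Hypothetical sketch; the repo's actual docker-compose.yml may differ.
services:
  nginx:
    image: nginx
    ports:
      - "8000:80"   # the only host-port binding in the stack
    depends_on:
      - vncorenlp
  vncorenlp:
    build: .
    # No `ports:` section here, so `--scale vncorenlp=4` starts four
    # containers without any "port is already allocated" conflict.

Nginx then proxies to the service name `vncorenlp`, which Docker's internal DNS resolves to the containers' internal addresses.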

duongkstn commented 1 year ago

@duydvu I tested again, but there are some problems. Here are my 4 methods:

  1. py_vncorenlp

    import time
    import py_vncorenlp

    annotator = py_vncorenlp.VnCoreNLP(annotators=["wseg"], save_dir="..../VnCoreNLP")
    times = 1000
    kq = []
    start = time.time()
    for i in range(times):
        kq.append(annotator.word_segment(texts[i]))
    end = time.time()
  2. Sequentially call your docker service (sudo docker compose up --scale vncorenlp=4):

    import time
    from vncorenlp import VnCoreNLP

    annotator = VnCoreNLP('http://localhost', 8000)
    times = 1000
    kq = []
    start = time.time()
    for i in range(times):
        kq.append(annotator.tokenize(texts[i]))
    end = time.time()
  3. Concurrently call your docker service (sudo docker compose up --scale vncorenlp=4):

    import time
    from threading import Thread
    from vncorenlp import VnCoreNLP

    annotator = VnCoreNLP('http://localhost', 8000)
    times = 1000
    threads = [None] * 4  # since your --scale param is 4, I use 4 threads
    kq = [None] * times

    def call_docker_service(list_texts, result, indices):
        for j, index in enumerate(indices):
            result[index] = annotator.tokenize(list_texts[j])

    batch_size = 1000 // 4  # 250 samples per thread
    start = time.time()
    for i in range(4):
        _start = i * batch_size
        _end = (i + 1) * batch_size
        threads[i] = Thread(target=call_docker_service, args=(texts[_start:_end], kq, list(range(_start, _end))))
        threads[i].start()
    for i in range(len(threads)):
        threads[i].join()
    end = time.time()

  4. `VnCoreNLP` with `threading` (instead of using your docker service, I start the server myself with `vncorenlp -Xmx2g .../VnCoreNLP -p 8012 -a "wseg"`). The client code is the same as in method 3, but with `annotator = VnCoreNLP('http://localhost', 8012)`.

(8012 is just an arbitrary port number.)

And here are my results (`end - start`): 
1. solution 1 took 1.5912044048309326 seconds 
2. solution 2 took 9.347352981567383 seconds
3. solution 3 took 3.405658006668091 seconds
4. solution 4 took 2.7315216064453125 seconds

- Like you said, concurrent is better than sequential (3 beats 2). I agree!
- Method 1 is always the fastest, even though it is sequential.
- Sometimes 3 is faster than 4 and sometimes 4 is faster than 3, so I don't see exactly what your code (method 3) improves over concurrently calling a single `VnCoreNLP` server (method 4).

Please let me know your thoughts; maybe I am doing something wrong. Are there any faster solutions? Maybe my `threading` code is wrong?
Thanks

duydvu commented 1 year ago

It seems that multiprocessing is better than threading. I tested it myself and saw that using many threads causes a bottleneck, since Python has the GIL.

Here is my code:

from vncorenlp import VnCoreNLP
from multiprocessing import Pool
import time

annotator = VnCoreNLP('http://localhost', 8000)
n = 10
batch_size = 10000 // n  # 10,000 requests split into 10 batches

def call_docker_service(i):
    # Each worker process sends its batch of requests sequentially.
    for _ in range(batch_size):
        annotator.tokenize('hôm nay tôi đi học')

start = time.time()
with Pool(16) as pool:  # pool of 16 worker processes
    pool.map(call_docker_service, range(n))
end = time.time()

print(end - start)

In my setup, using multiprocessing is 3 times faster than using threading.

duongkstn commented 1 year ago

Hi, your code above ran successfully, but with batch_size = 100000 // n or batch_size = 1000000 // n (more zeros), I got the following error: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')). Is this a bug in nginx or VnCoreNLP?


Have you ever encountered this error? Please let me know how to fix it. Thanks!

duydvu commented 1 year ago

@duongkstn This is expected when you send too many requests to the server. You will encounter this error more often as the number of processes increases.

Simply catch this error and retry. But if the error occurs too many times, it indicates that you have reached the server's limit, so try increasing the number of containers or decreasing the number of processes.
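
For illustration, a minimal retry sketch along these lines (the helper name and retry counts are illustrative; it assumes the client surfaces requests' ConnectionError, as the error message above suggests):

import time
from requests.exceptions import ConnectionError

def tokenize_with_retry(annotator, text, retries=3, backoff=0.5):
    # Illustrative helper: retry on connection resets with a growing pause.
    for attempt in range(retries):
        try:
            return annotator.tokenize(text)
        except ConnectionError:
            if attempt == retries - 1:
                # Persistent failures suggest the server is saturated:
                # add containers or reduce the number of client processes.
                raise
            time.sleep(backoff * (attempt + 1))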