crawler-commons / url-frontier

API definition, resources and reference implementation of URL Frontiers
Apache License 2.0

Multithread reading from queues #66

Closed jnioche closed 1 year ago

jnioche commented 1 year ago

See #41

Work in progress: it currently returns more results than the client asked for. @rzo1, any suggestions on how to fix that?

jnioche commented 1 year ago

Commit faf5c03 stops threads from sending results once the requested number of queues has already been reached, by keeping a counter within the SynchronizedStreamObserver.
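The idea of a counter inside the observer can be sketched as follows. This is a minimal illustration and not the code from faf5c03: `QueueLimitedSink`, `tryAcquireQueue` and `QueueLimitDemo` are hypothetical names, and a plain `Consumer` stands in for the gRPC stream observer so that the example runs without any dependency. Each worker must claim a slot atomically before sending, so the number of queues returned can never exceed the number requested, whatever the thread count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Hypothetical stand-in for the synchronized observer: caps how many queues may send results. */
final class QueueLimitedSink<T> {
    private final int maxQueues;
    private final Consumer<T> delegate; // assumption: replaces the real gRPC observer
    private final AtomicInteger queuesSent = new AtomicInteger();

    QueueLimitedSink(int maxQueues, Consumer<T> delegate) {
        this.maxQueues = maxQueues;
        this.delegate = delegate;
    }

    /** Atomically claims one of the maxQueues slots; returns false once the limit is reached. */
    boolean tryAcquireQueue() {
        while (true) {
            int current = queuesSent.get();
            if (current >= maxQueues) {
                return false;
            }
            if (queuesSent.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    /** Serialises writes, since the underlying stream is not safe for concurrent calls. */
    synchronized void onNext(T value) {
        delegate.accept(value);
    }
}

final class QueueLimitDemo {
    /** Offers `candidates` queues from `threads` workers and returns how many were actually sent. */
    static int run(int threads, int candidates, int maxQueues) {
        List<String> received = new ArrayList<>();
        QueueLimitedSink<String> sink = new QueueLimitedSink<>(maxQueues, received::add);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < candidates; i++) {
            final String queue = "queue-" + i;
            pool.submit(() -> {
                // Only send if a slot could be claimed; otherwise drop the result.
                if (sink.tryAcquireQueue()) {
                    sink.onNext(queue);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return received.size();
    }

    public static void main(String[] args) {
        // 3 threads, 1000 candidate queues, client asked for 100
        System.out.println(run(3, 1000, 100));
    }
}
```

One simplification to keep in mind: the sketch claims a slot before knowing whether the queue has anything to send, whereas a real service would probably only count queues that actually produced URLs.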

To test the performance, I populated a Frontier with the same list of URLs as in #63:

Number of queues: 801800
Active URLs: 20320328
In process: 0
active_queues = 801800
completed = 0

and measured the time taken when running

for i in {1..20}; do java -jar client/target/urlfrontier-client-2.3-SNAPSHOT.jar -p 7071 GetURLs -q 100 | grep "Total time"; done

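| Number of threads | Number of queues | Average time per queue |
|---|---|---|
| 1 | 100 | 4.45 |
| 1 | 500 | 4.25 |
| 1 | 1000 | 2.56 |
| 3 | 100 | 2.02 |
| 3 | 500 | 0.57 |
| 3 | 1000 | 0.38 |
| 5 | 100 | 2.61 |
| 5 | 500 | 0.79 |
| 5 | 1000 | 0.45 |
| 10 | 100 | 2.13 |
| 10 | 500 | 0.75 |
| 10 | 1000 | 0.48 |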

The overhead of the gRPC call could explain why it is disproportionately faster to request a large number of queues in one go. In addition, the time to transfer the results would be the same regardless of the number of threads used. Looking at the Frontier logs, we can see that the time it takes internally to retrieve the results is substantially shorter.

Using 3 threads seems to give the best performance.

When using a single thread, I can see:

Exception in thread "pool-1-thread-702" java.lang.IllegalStateException: Stream is already completed, no further calls are allowed
    at com.google.common.base.Preconditions.checkState(Preconditions.java:502)
    at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:375)
    at crawlercommons.urlfrontier.service.SynchronizedStreamObserver.onNext(SynchronizedStreamObserver.java:56)
    at crawlercommons.urlfrontier.service.rocksdb.RocksDBService.sendURLsForQueue(RocksDBService.java:339)
    at crawlercommons.urlfrontier.service.AbstractFrontierService.lambda$getURLs$0(AbstractFrontierService.java:673)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)

I will have a look at that later.

rzo1 commented 1 year ago

> Using 3 threads seems to give the best performance.

I think it depends on the system running the software (Docker, bare metal, etc.). Perhaps we could use a default value based on the available cores.

jnioche commented 1 year ago

> > Using 3 threads seems to give the best performance.
>
> Think it depends on the system running the software (docker, bare metal, etc). Perhaps, we could use a default value based on the available cores.

Yes, if there is a way of doing it programmatically, this would be a good default. In my tests, I got the best results with # cores / 2.

rzo1 commented 1 year ago

> > > Using 3 threads seems to give the best performance.
> >
> > Think it depends on the system running the software (docker, bare metal, etc). Perhaps, we could use a default value based on the available cores.
>
> yes, if there is a way of doing it programmatically, this would be a good default. What my tests showed was that I got the best results with # cores / 2

Maybe something like

Runtime.getRuntime().availableProcessors();

which returns the number of logical processors, so hyper-threads are included in the count.
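Putting the two suggestions together, a default could be derived as in the sketch below. It is only an illustration of the "# cores / 2" observation applied to `availableProcessors()`; `DefaultThreads`, `forProcessors` and `defaultThreads` are hypothetical names, and the formula actually used in the project may differ.

```java
/** Hypothetical helper: derives a default number of reading threads from the processor count. */
final class DefaultThreads {

    /** Half the logical processors, but never fewer than one thread. */
    static int forProcessors(int logicalProcessors) {
        return Math.max(1, logicalProcessors / 2);
    }

    /** Uses the logical processor count reported by the JVM (hyper-threads included). */
    static int defaultThreads() {
        return forProcessors(Runtime.getRuntime().availableProcessors());
    }

    public static void main(String[] args) {
        System.out.println("default reading threads: " + defaultThreads());
    }
}
```

The lower bound of one matters on single-core hosts or in containers limited to one CPU, where integer division by two would otherwise give zero threads.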

jnioche commented 1 year ago

Thanks for your comments @rzo1. See fcbd628c, which bases the default number of threads on the available processors.