crawler-commons / url-frontier

API definition, resources and reference implementation of URL Frontiers
Apache License 2.0
44 stars 11 forks source link

Use multiple threads for putting URLs #63

Closed jnioche closed 1 year ago

jnioche commented 2 years ago

Similar to #41 but on the writing side. A stream of incoming updates is processed by a single thread. To make things parallel, a client needs to create multiple connections to the server. It would be more flexible to have an ExecutorService with n threads in AbstractFrontierService and use it in the method putURLs, possibly changing https://github.com/crawler-commons/url-frontier/blob/master/service/src/main/java/crawlercommons/urlfrontier/service/AbstractFrontierService.java#L767 to return a Future. This should result in a better use of the CPU and increased overall performance.

jnioche commented 2 years ago

@FelixEngl is this something you could be interested in having a look at?

FelixEngl commented 2 years ago

Yes, I'll look into it. (I think I'll have some free time next week or so.)

FelixEngl commented 2 years ago

Do you know how I can pin an issue to my github-dashboard?

Or does this only work when someone is assigned to it? Then it would be nice, if you could assign me to this issue. I'm afraid, that I forget it when I don't see it on my GH-page. (same as for the status-bold-thing)

rzo1 commented 2 years ago

Think it needs to be assigned to you.

FelixEngl commented 2 years ago

Thanks!

jnioche commented 1 year ago

I'll have a go at this one, will ask @rzo1 and @FelixEngl for feedback once I have produced something.

FelixEngl commented 1 year ago

Hi @jnioche,

uff, I totally forgot about this. (You know how it goes sometimes...)

Right now I have a lot of things to do (lots of family stuff etc.), hence I can not tell when I have some free "just-some-programming"-time. Sorry for not notifying you.

jnioche commented 1 year ago

Have pushed an initial attempt in branch 63 see 8bcce55b90b4ca790a7fe5a2c5d5fcef8d697358

on my computer with 6 cores x 2 threads I am getting the following average OPS with RocksDB as a backend

1 thread -> 52.6K 2 threads -> 76.9K 3 threads -> 71.4K 5 threads -> 62.5K 10 threads -> 58.8K

with the memory based implementation

1 thread -> 83K 2 threads -> 71K 5 threads -> 71K 10 threads -> 66k

the latter is less relevant as it it probably not used apart from the unit testing but nonetheless it shows that there are overheads in using the multithreading. For the RocksDB based implementation, there is a gain in using more than one thread but it is not as big as I thought it would be.

I will run the profiler while benchmarking to get a better understanding of where things could be improved. This is just an initial attempt and there is certainly more to be gained.

This is also a worse-case scenario as the benchmarks were done on an empty Frontier; the put operations are fast (which is why the memory based stuff doesn't show any gains) but as the number of URLs in the Frontier grows, the puts will take longer and the benefits of the multithreading would be better felt.

jnioche commented 1 year ago

Using a more realistic list of URLs containing 26M entries (20.3M of which are unique) going into a total of 801800 queues.

In the current version, the RAM-based backend does not do multi-threading (because it can't be configured) but again, no one uses it for real. It can't handle so many URLs with the amount of RAM I have on my computer anyway.

Sent: 25948931 OK: 20320328 Skipped: 5628603

Number of Threads Time in secs KOPS Improvement
1 613 42.3 0%
2 497 52.3 23.64%
3 397 65.3 54.37%
5 426 60.9 7.22%
10 467 55.5 -8.86%

This shows a comparable improvement gain when used at scale but also revealed a potential issue as the number of skipped URLs differs from one run to another, while the total number of URLs acked is the same. This could be due to copies of the same URL coming in at the same time and being processed by two different threads.

jnioche commented 1 year ago

Fixed in 3442f6fa