crawler-commons / url-frontier

API definition, resources and reference implementation of URL Frontiers
Apache License 2.0

Forwarding the config map in DistributedFrontierService constructor (Issue #73) #80

Closed michaeldinzinger closed 1 year ago

michaeldinzinger commented 1 year ago

Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>

Thanks for contributing to URL Frontier, your efforts are appreciated!

Developer Certificate of Origin

By contributing to URL Frontier, you accept and agree to the following terms and conditions (the Developer Certificate of Origin) for your present and future contributions submitted to URL Frontier. Please refer to the Developer Certificate of Origin section in CONTRIBUTING.md for details.

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.

Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.


jnioche commented 1 year ago

Thanks @michaeldinzinger. It has been a while since I last looked at that part of the code. As it stands, the Ignite implementation overrides putURLs from the abstract class, which is where the multithreading is used, but it does call getURLs, so at least there would be multithreading on the reads. Ideally we should check that adding the super call to DistributedFrontierService does not have a negative impact on ShardedRocksDBService but I think all it would do is instantiate the executor services despite them not being used by ShardedRocksDBService itself.

michaeldinzinger commented 1 year ago

> Ideally we should check that adding the super call to DistributedFrontierService does not have a negative impact on ShardedRocksDBService but I think all it would do is instantiate the executor services despite them not being used by ShardedRocksDBService itself.

I don't think that the added line super(configuration); in ShardedRocksDBService has a negative impact. However, one thing that could still be improved is that the readExecutorService and the writeExecutorService are instantiated twice whenever ShardedRocksDBService is used (one extra time for the RocksDBService instance (line 38), even though this one never uses multithreading as far as I can see).

This is how it looked before:

michael@pc:~/Desktop/Git/url-frontier/service$ java -Xmx2G -cp target/urlfrontier-service-*.jar crawlercommons.urlfrontier.service.URLFrontierServer implementation=crawlercommons.urlfrontier.service.rocksdb.ShardedRocksDBService nodes=2 read.thread.num=2 write.thread.num=4
17:06:46.830 [main] INFO  c.u.service.AbstractFrontierService - Available processor(s) 12
17:06:46.832 [main] INFO  c.u.service.AbstractFrontierService - Using 3 threads for reading from queues
17:06:46.833 [main] INFO  c.u.service.AbstractFrontierService - Using 3 threads for writing to queues
17:06:46.930 [main] INFO  c.u.service.AbstractFrontierService - Available processor(s) 12
17:06:46.930 [main] INFO  c.u.service.AbstractFrontierService - Using 2 threads for reading from queues
17:06:46.930 [main] INFO  c.u.service.AbstractFrontierService - Using 4 threads for writing to queues
17:06:46.930 [main] INFO  c.u.service.rocksdb.RocksDBService - RocksDB data stored in ./rocksdb 
17:06:47.139 [main] INFO  c.u.service.rocksdb.RocksDBService - RocksDB loaded in 207 msec
17:06:47.142 [main] INFO  c.u.service.rocksdb.RocksDBService - readQueueInfos read stats for 0 queues in 3 msec
17:06:47.142 [main] INFO  c.u.service.rocksdb.RocksDBService - Recovering queues from existing RocksDB
17:06:47.142 [main] INFO  c.u.service.rocksdb.RocksDBService - 0 queues discovered in 3 msec
17:06:47.143 [main] INFO  c.u.service.AbstractFrontierService - Node 0: 2
17:06:47.327 [main] INFO  c.u.service.URLFrontierServer - Started URLFrontierServer [ShardedRocksDBService] on port 7071 as localhost:7071

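This is how it looks with the modification. The only difference is in the first of the two instantiations of the executor services: it now picks up the configured thread counts (2 for reading, 4 for writing) instead of the 3 and 3 that were used when the configuration was not forwarded.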

michael@pc:~/Desktop/Git/url-frontier/service$ java -Xmx2G -cp target/urlfrontier-service-*.jar crawlercommons.urlfrontier.service.URLFrontierServer implementation=crawlercommons.urlfrontier.service.rocksdb.ShardedRocksDBService nodes=2 read.thread.num=2 write.thread.num=4
17:06:46.830 [main] INFO  c.u.service.AbstractFrontierService - Available processor(s) 12
17:06:46.832 [main] INFO  c.u.service.AbstractFrontierService - Using 2 threads for reading from queues
17:06:46.833 [main] INFO  c.u.service.AbstractFrontierService - Using 4 threads for writing to queues
17:06:46.930 [main] INFO  c.u.service.AbstractFrontierService - Available processor(s) 12
17:06:46.930 [main] INFO  c.u.service.AbstractFrontierService - Using 2 threads for reading from queues
17:06:46.930 [main] INFO  c.u.service.AbstractFrontierService - Using 4 threads for writing to queues
17:06:46.930 [main] INFO  c.u.service.rocksdb.RocksDBService - RocksDB data stored in ./rocksdb 
17:06:47.139 [main] INFO  c.u.service.rocksdb.RocksDBService - RocksDB loaded in 207 msec
17:06:47.142 [main] INFO  c.u.service.rocksdb.RocksDBService - readQueueInfos read stats for 0 queues in 3 msec
17:06:47.142 [main] INFO  c.u.service.rocksdb.RocksDBService - Recovering queues from existing RocksDB
17:06:47.142 [main] INFO  c.u.service.rocksdb.RocksDBService - 0 queues discovered in 3 msec
17:06:47.143 [main] INFO  c.u.service.AbstractFrontierService - Node 0: 2
17:06:47.327 [main] INFO  c.u.service.URLFrontierServer - Started URLFrontierServer [ShardedRocksDBService] on port 7071 as localhost:7071
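
The difference between the two logs can be illustrated with a minimal sketch. It is not the project's code: BaseService, ShardedWithoutForwarding and ShardedWithForwarding are hypothetical stand-ins for the real class hierarchy, the no-argument base constructor is an assumption, and the fallback of 3 threads is chosen only to mirror the first log. The keys read.thread.num and write.thread.num are the ones passed on the command line above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the abstract base service: it reads the
// thread counts from the configuration map it is given.
abstract class BaseService {
    // Assumed fallback, picked only so the sketch mirrors the "3 threads" log lines.
    static final String FALLBACK_THREADS = "3";

    final int readThreads;
    final int writeThreads;

    // Assumed no-argument constructor: behaves as if the configuration were empty.
    BaseService() {
        this(new HashMap<>());
    }

    BaseService(Map<String, String> configuration) {
        readThreads = Integer.parseInt(
                configuration.getOrDefault("read.thread.num", FALLBACK_THREADS));
        writeThreads = Integer.parseInt(
                configuration.getOrDefault("write.thread.num", FALLBACK_THREADS));
    }
}

// Before: the subclass receives the map but never hands it to the base class,
// so Java inserts an implicit super() call and the settings are lost.
class ShardedWithoutForwarding extends BaseService {
    ShardedWithoutForwarding(Map<String, String> configuration) {
        // implicit super();
    }
}

// After: the explicit super(configuration) call forwards the settings.
class ShardedWithForwarding extends BaseService {
    ShardedWithForwarding(Map<String, String> configuration) {
        super(configuration);
    }
}

class ConfigForwardingDemo {
    public static void main(String[] args) {
        Map<String, String> cfg = new HashMap<>();
        cfg.put("read.thread.num", "2");
        cfg.put("write.thread.num", "4");

        BaseService before = new ShardedWithoutForwarding(cfg);
        BaseService after = new ShardedWithForwarding(cfg);
        System.out.println("before: " + before.readThreads + "/" + before.writeThreads);
        System.out.println("after:  " + after.readThreads + "/" + after.writeThreads);
    }
}
```

In this sketch the only thing that changes between the two subclasses is the explicit super call, which matches what the logs suggest about the first pair of "Using ... threads" lines.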

One solution to get rid of the second instantiation of the executor services could be a useMultithreading flag in the constructor of AbstractFrontierService. This flag is set when DistributedFrontierService calls the constructor and it is not set when RocksDBService does so.
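
A rough sketch of how such a flag could look is given below. It is only an illustration under assumptions: SketchFrontierService, SketchDistributedService and SketchLocalService are hypothetical names standing in for AbstractFrontierService, DistributedFrontierService and RocksDBService, the fallback of one thread is arbitrary, and the real constructors may take different arguments.

```java
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical base class: the executor services are only created
// when the caller asks for multithreading.
abstract class SketchFrontierService implements AutoCloseable {
    protected final ExecutorService readExecutorService;
    protected final ExecutorService writeExecutorService;

    protected SketchFrontierService(Map<String, String> configuration,
                                    boolean useMultithreading) {
        if (useMultithreading) {
            // Fallback of 1 thread is an assumption made for this sketch.
            int readThreads = Integer.parseInt(
                    configuration.getOrDefault("read.thread.num", "1"));
            int writeThreads = Integer.parseInt(
                    configuration.getOrDefault("write.thread.num", "1"));
            readExecutorService = Executors.newFixedThreadPool(readThreads);
            writeExecutorService = Executors.newFixedThreadPool(writeThreads);
        } else {
            // No pools are allocated for services that never use them.
            readExecutorService = null;
            writeExecutorService = null;
        }
    }

    boolean hasExecutors() {
        return readExecutorService != null && writeExecutorService != null;
    }

    @Override
    public void close() {
        if (readExecutorService != null) readExecutorService.shutdown();
        if (writeExecutorService != null) writeExecutorService.shutdown();
    }
}

// Stand-in for the distributed service: sets the flag.
class SketchDistributedService extends SketchFrontierService {
    SketchDistributedService(Map<String, String> configuration) {
        super(configuration, true);
    }
}

// Stand-in for the local service wrapped by the sharded one: leaves the flag unset.
class SketchLocalService extends SketchFrontierService {
    SketchLocalService(Map<String, String> configuration) {
        super(configuration, false);
    }
}
```

One consequence of this design is that any base-class method relying on the executor services would have to handle the case where they are null, so the flag would need to be checked wherever the pools are used.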

michaeldinzinger commented 1 year ago

The code changes are contained in PR #79.