Closed anjackson closed 1 year ago
Other oddities. This works, listing URLs from a specific queue:
key = "anjackson.net",
crawlID = ""
but this returns nothing:
key = "anjackson.net",
crawlID = None
and so does this:
key = "",
crawlID = ""
but this returns all URLs
key = "",
crawlID = None,
i.e. when querying a particular queue, crawlID=""
matches the default crawl, but when no key is specified, this seems not to work, and it's necessary to use None
, which seems to be equivalent to the default crawl (rather than AnyCrawlID
).
I think maybe this code silently swallows all errors?
So there might be some underlying error that we can't see?
thanks for reporting these issues @anjackson
So there might be some underlying error that we can't see?
I think this section is about handling the errors reported by the client but we could indeed add a log there, wouldn't hurt
The latter exception is repeated many times.
Note that in the first few lines, the
Sent 0 from 0 queue(s) in 0 msec; tried 0 queues.
is also not accurate, as multiple URLs were viewed.
could be some discrepancies due to the recent introduction of the multithreaded reads/writes
i.e. when querying a particular queue,
crawlID=""
matches the default crawl, but when no key is specified, this seems not to work, and it's necessary to useNone
, which seems to be equivalent to the default crawl (rather thanAnyCrawlID
).
Setting an empty crawl ID is different from not sending one at all. The former is the equivalent of sending "DEFAULT" whereas the latter means "any crawl".
I agree this is maybe a bit opaque in the documentation. Any suggestions on how to improve it will be very welcome.
Querying a specific queue requires a crawl ID to be set, see https://github.com/crawler-commons/url-frontier/blob/1b6c2ec4b14cff24810c718103eca16c8fa17d48/service/src/main/java/crawlercommons/urlfrontier/service/AbstractFrontierService.java#L567
which is why
key = "anjackson.net",
crawlID = None
doesn't return anything.
However,
key = "",
crawlID = ""
is a bug which I will fix shortly (and add the logging for the errors reported by the client at the same time).
Thanks @anjackson
I'm testing against the latest version, and the above fix seems fine.
However, I'm still getting the same IllegalStateException
. Running exactly the same client query multiple times (pausing for a while in between), often I get the correct response, but every few calls it throws the Exception
and I get nothing back.
So, I backported this fix to version 2.2 locally. Running with that the system behaves as I expect, but with the latest master
I get intermittent IllegalStateException
errors.
thanks @anjackson. The IllegalStateExceptions are certainly related to the introduction of the multithreaded reads/ writes. I will try to reproduce the issue
@anjackson I haven't been able to reproduce the issue. Do you have a way of reproducing it e.g. list of URLs to inject and queries using the client? Thanks!
In one terminal I'm running the server (v2.3) via Docker. In another, I'm using a freshly compiled version of the reference client implementation. Once it's running, I create a test file containing a single URL, and do the following (waiting for a couple of seconds between running each command):
$ java -jar client/target/urlfrontier-client-2.4-SNAPSHOT.jar PutURLs -f client/test.txt
Sent: 1
Acked: 1
OK: 1
Skipped: 0
Failed: 0
Total time: 476 msec
Average OPS: 1
$ java -jar client/target/urlfrontier-client-2.4-SNAPSHOT.jar GetURLs -d 1
url: "http://example.com/"
key: "example.com"
crawlID: "DEFAULT"
Total time: 443 msec
$ java -jar client/target/urlfrontier-client-2.4-SNAPSHOT.jar GetURLs -d 1
url: "http://example.com/"
key: "example.com"
crawlID: "DEFAULT"
Total time: 410 msec
$ java -jar client/target/urlfrontier-client-2.4-SNAPSHOT.jar GetURLs -d 1
url: "http://example.com/"
key: "example.com"
crawlID: "DEFAULT"
Total time: 429 msec
$ java -jar client/target/urlfrontier-client-2.4-SNAPSHOT.jar GetURLs -d 1
Total time: 324 msec
The corresponding logging from the server looks like this:
$ docker compose up urlfrontier
[+] Running 1/0
⠿ Container scrapy-url-frontier-urlfrontier-1 Created 0.0s
Attaching to scrapy-url-frontier-urlfrontier-1
scrapy-url-frontier-urlfrontier-1 | 22:22:13.080 [main] INFO c.u.service.URLFrontierServer - Starting Prometheus server on port 9100
scrapy-url-frontier-urlfrontier-1 | 22:22:13.393 [main] INFO c.u.service.AbstractFrontierService - Available processor(s) 2
scrapy-url-frontier-urlfrontier-1 | 22:22:13.393 [main] INFO c.u.service.AbstractFrontierService - Using 1 threads for reading from queues
scrapy-url-frontier-urlfrontier-1 | 22:22:13.393 [main] INFO c.u.service.AbstractFrontierService - Using 1 threads for writing to queues
scrapy-url-frontier-urlfrontier-1 | 22:22:13.394 [main] INFO c.u.service.rocksdb.RocksDBService - RocksDB data stored in /storage/rocksdb
scrapy-url-frontier-urlfrontier-1 | 22:22:13.525 [main] INFO c.u.service.rocksdb.RocksDBService - RocksDB loaded in 123 msec
scrapy-url-frontier-urlfrontier-1 | 22:22:13.527 [main] INFO c.u.service.rocksdb.RocksDBService - readQueueInfos read stats for 0 queues in 2 msec
scrapy-url-frontier-urlfrontier-1 | 22:22:13.527 [main] INFO c.u.service.rocksdb.RocksDBService - Recovering queues from existing RocksDB
scrapy-url-frontier-urlfrontier-1 | 22:22:13.528 [main] INFO c.u.service.rocksdb.RocksDBService - 0 queues discovered in 4 msec
scrapy-url-frontier-urlfrontier-1 | 22:22:13.899 [main] INFO c.u.service.URLFrontierServer - Started URLFrontierServer [RocksDBService] on port 7071 as localhost:7071
scrapy-url-frontier-urlfrontier-1 | 22:24:56.786 [grpc-default-executor-0] INFO c.u.service.AbstractFrontierService - Received request to get fetchable URLs [max queues 1, max URLs 1, delay 1] e1500285-5af8-4145-afed-e96ca95a5a49
scrapy-url-frontier-urlfrontier-1 | 22:24:56.791 [grpc-default-executor-0] INFO c.u.service.AbstractFrontierService - Sent 0 from 0 queue(s) in 4 msec; tried 0 queues. e1500285-5af8-4145-afed-e96ca95a5a49
scrapy-url-frontier-urlfrontier-1 | 22:24:59.217 [grpc-default-executor-0] INFO c.u.service.AbstractFrontierService - Received request to get fetchable URLs [max queues 1, max URLs 1, delay 1] 0868f6cb-ee05-4733-98a7-02d04991f3bc
scrapy-url-frontier-urlfrontier-1 | 22:24:59.222 [grpc-default-executor-0] INFO c.u.service.AbstractFrontierService - Sent 0 from 0 queue(s) in 5 msec; tried 0 queues. 0868f6cb-ee05-4733-98a7-02d04991f3bc
scrapy-url-frontier-urlfrontier-1 | 22:25:01.627 [grpc-default-executor-0] INFO c.u.service.AbstractFrontierService - Received request to get fetchable URLs [max queues 1, max URLs 1, delay 1] ea64d1db-2f5c-4939-8297-d2e4d3a2cbc3
scrapy-url-frontier-urlfrontier-1 | 22:25:01.629 [grpc-default-executor-0] INFO c.u.service.AbstractFrontierService - Sent 0 from 0 queue(s) in 2 msec; tried 0 queues. ea64d1db-2f5c-4939-8297-d2e4d3a2cbc3
scrapy-url-frontier-urlfrontier-1 | 22:25:04.142 [grpc-default-executor-0] INFO c.u.service.AbstractFrontierService - Received request to get fetchable URLs [max queues 1, max URLs 1, delay 1] 70b96a6c-8b8c-4321-bbc6-d9b7de8c1805
scrapy-url-frontier-urlfrontier-1 | 22:25:04.143 [grpc-default-executor-0] INFO c.u.service.AbstractFrontierService - Sent 0 from 0 queue(s) in 1 msec; tried 0 queues. 70b96a6c-8b8c-4321-bbc6-d9b7de8c1805
scrapy-url-frontier-urlfrontier-1 | 22:25:04.145 [pool-2-thread-1] ERROR c.u.service.rocksdb.RocksDBService - Caught unlikely error
scrapy-url-frontier-urlfrontier-1 | java.lang.IllegalStateException: Stream is already completed, no further calls are allowed
scrapy-url-frontier-urlfrontier-1 | at com.google.common.base.Preconditions.checkState(Preconditions.java:502)
scrapy-url-frontier-urlfrontier-1 | at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:375)
scrapy-url-frontier-urlfrontier-1 | at crawlercommons.urlfrontier.service.SynchronizedStreamObserver.onNext(SynchronizedStreamObserver.java:59)
scrapy-url-frontier-urlfrontier-1 | at crawlercommons.urlfrontier.service.rocksdb.RocksDBService.sendURLsForQueue(RocksDBService.java:338)
scrapy-url-frontier-urlfrontier-1 | at crawlercommons.urlfrontier.service.AbstractFrontierService.lambda$getURLs$0(AbstractFrontierService.java:681)
scrapy-url-frontier-urlfrontier-1 | at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
scrapy-url-frontier-urlfrontier-1 | at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
scrapy-url-frontier-urlfrontier-1 | at java.base/java.lang.Thread.run(Thread.java:829)
i.e. it says Sent 0
when it's sent one, and the fourth call throws an exception.
Hmm, weird. Can't reproduce that when running the latest server directly (rather than using 2.3 in Docker). I'll try a couple more things but it's very odd that Docker seems to be getting in the way?
I compiled the 2.3 tag and running the server directly I could reproduce the issue (although it took around 10 calls before the error showed up). Going back to master
and re-running and the error goes away. but the only thing that changed was your original fix for this issue which seems to have made it go away somehow, but I apparently hadn't tested that properly before?
Okay, building a Docker image of the master branch to try locally...
Running the same version under Docker and now the error is easy to reproduce?! Then going back to running the server directly, if I repeat the query for long enough, it happens there too. e.g.
$ watch -n 2 java -jar client/target/urlfrontier-client-2.4-SNAPSHOT.jar GetURLs -d 1
So, it seems like it's more likely to happen if running over Docker, but still happens when running the server directly.
Doesn't seem to matter if I'm running over the host network or a Docker network. Doesn't seem to matter if I'm running on a RAM disk or local drive. It still throws that error from time to time.
EDIT Seems to happen a bit more if I run two clients in parallel.
thanks @anjackson, I managed to reproduce the issue by setting the number of threads to 1
java -Xmx2G -cp target/urlfrontier-service-*.jar crawlercommons.urlfrontier.service.URLFrontierServer read.thread.num=1 write.thread.num=1
and using a single URL for testing.
I have pushed a fix to prevent a premature termination of the stream; it also fixes the counts of URLs sent. Could you please give it a try?
Thanks!
Yes, that seems to be working very well now. Thank you!
thanks @anjackson I'll push a new release so that these fixes can be used asap
Perfect! Thanks again.
I'm using this client to talk to a URLFronter v2.3 service running via Docker on an Ubuntu OS. I'm attempting to write a command line tool that queries the frontier to list URLs, but without interfering too much with the crawl by setting
delay_requestable=1
. Sometimes it works as expected, other times no results come back, and I see this in the URLFrontier service logs:The latter exception is repeated many times.
Note that in the first few lines, the
Sent 0 from 0 queue(s) in 0 msec; tried 0 queues.
is also not accurate, as multiple URLs were viewed.