crawler-commons / url-frontier

API definition, resources and reference implementation of URL Frontiers
Apache License 2.0
44 stars 11 forks source link

Some calls to GetURLs failing with java.lang.IllegalStateException #71

Closed anjackson closed 1 year ago

anjackson commented 1 year ago

I'm using this client to talk to a URLFronter v2.3 service running via Docker on an Ubuntu OS. I'm attempting to write a command line tool that queries the frontier to list URLs, but without interfering too much with the crawl by setting delay_requestable=1. Sometimes it works as expected, other times no results come back, and I see this in the URLFrontier service logs:

scrapy-url-frontier-urlfrontier-1  | 17:33:54.999 [grpc-default-executor-4] INFO  c.u.service.AbstractFrontierService - Received request to get fetchable URLs [max queues 2147483647, max URLs 2147483647, delay 5] 53a5f051-bb28-4e7b-954a-43bb5e626e80
scrapy-url-frontier-urlfrontier-1  | 17:33:55.000 [grpc-default-executor-4] INFO  c.u.service.AbstractFrontierService - Sent 0 from 0 queue(s) in 0 msec; tried 0 queues. 53a5f051-bb28-4e7b-954a-43bb5e626e80
scrapy-url-frontier-urlfrontier-1  | 17:33:56.032 [grpc-default-executor-4] INFO  c.u.service.AbstractFrontierService - Received request to get fetchable URLs [max queues 2147483647, max URLs 2147483647, delay 5] 52b9dec0-5df4-40e0-ac7a-0e54c00dd5bc
scrapy-url-frontier-urlfrontier-1  | 17:33:56.032 [grpc-default-executor-4] INFO  c.u.service.AbstractFrontierService - Sent 0 from 0 queue(s) in 0 msec; tried 0 queues. 52b9dec0-5df4-40e0-ac7a-0e54c00dd5bc
^[[A
scrapy-url-frontier-urlfrontier-1  | 17:34:05.119 [grpc-default-executor-4] INFO  c.u.service.AbstractFrontierService - Received request to get fetchable URLs [max queues 2147483647, max URLs 2147483647, delay 5] 0a5d48e3-d4e6-49d4-ba86-83cba5e59fda
scrapy-url-frontier-urlfrontier-1  | 17:34:05.120 [grpc-default-executor-4] INFO  c.u.service.AbstractFrontierService - Sent 0 from 0 queue(s) in 0 msec; tried 0 queues. 0a5d48e3-d4e6-49d4-ba86-83cba5e59fda
scrapy-url-frontier-urlfrontier-1  | 17:34:07.414 [grpc-default-executor-4] INFO  c.u.service.AbstractFrontierService - Received request to get fetchable URLs [max queues 2147483647, max URLs 2147483647, delay 5] 22c7f96a-dec4-43f6-ae99-d85dec75013a
scrapy-url-frontier-urlfrontier-1  | 17:34:07.415 [grpc-default-executor-4] INFO  c.u.service.AbstractFrontierService - Sent 0 from 0 queue(s) in 0 msec; tried 0 queues. 22c7f96a-dec4-43f6-ae99-d85dec75013a
scrapy-url-frontier-urlfrontier-1  | 17:34:10.814 [grpc-default-executor-4] INFO  c.u.service.AbstractFrontierService - Received request to get fetchable URLs [max queues 2147483647, max URLs 2147483647, delay 5] 0dc31afb-e60c-4f37-82c4-6d0c8663f297
scrapy-url-frontier-urlfrontier-1  | 17:34:10.815 [grpc-default-executor-4] INFO  c.u.service.AbstractFrontierService - Sent 0 from 0 queue(s) in 1 msec; tried 0 queues. 0dc31afb-e60c-4f37-82c4-6d0c8663f297
scrapy-url-frontier-urlfrontier-1  | 17:34:10.816 [pool-2-thread-1] ERROR c.u.service.rocksdb.RocksDBService - Caught unlikely error 
scrapy-url-frontier-urlfrontier-1  | java.lang.IllegalStateException: Stream is already completed, no further calls are allowed
scrapy-url-frontier-urlfrontier-1  |    at com.google.common.base.Preconditions.checkState(Preconditions.java:502)
scrapy-url-frontier-urlfrontier-1  |    at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:375)
scrapy-url-frontier-urlfrontier-1  |    at crawlercommons.urlfrontier.service.SynchronizedStreamObserver.onNext(SynchronizedStreamObserver.java:59)
scrapy-url-frontier-urlfrontier-1  |    at crawlercommons.urlfrontier.service.rocksdb.RocksDBService.sendURLsForQueue(RocksDBService.java:338)
scrapy-url-frontier-urlfrontier-1  |    at crawlercommons.urlfrontier.service.AbstractFrontierService.lambda$getURLs$0(AbstractFrontierService.java:681)
scrapy-url-frontier-urlfrontier-1  |    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
scrapy-url-frontier-urlfrontier-1  |    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
scrapy-url-frontier-urlfrontier-1  |    at java.base/java.lang.Thread.run(Thread.java:829)
scrapy-url-frontier-urlfrontier-1  | 17:34:10.816 [pool-2-thread-1] ERROR c.u.service.rocksdb.RocksDBService - Caught unlikely error 
scrapy-url-frontier-urlfrontier-1  | java.lang.IllegalStateException: Stream is already completed, no further calls are allowed
scrapy-url-frontier-urlfrontier-1  |    at com.google.common.base.Preconditions.checkState(Preconditions.java:502)
scrapy-url-frontier-urlfrontier-1  |    at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:375)
scrapy-url-frontier-urlfrontier-1  |    at crawlercommons.urlfrontier.service.SynchronizedStreamObserver.onNext(SynchronizedStreamObserver.java:59)
scrapy-url-frontier-urlfrontier-1  |    at crawlercommons.urlfrontier.service.rocksdb.RocksDBService.sendURLsForQueue(RocksDBService.java:338)
scrapy-url-frontier-urlfrontier-1  |    at crawlercommons.urlfrontier.service.AbstractFrontierService.lambda$getURLs$0(AbstractFrontierService.java:681)
scrapy-url-frontier-urlfrontier-1  |    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
scrapy-url-frontier-urlfrontier-1  |    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
scrapy-url-frontier-urlfrontier-1  |    at java.base/java.lang.Thread.run(Thread.java:829)

The latter exception is repeated many times.

Note that in the first few lines, the Sent 0 from 0 queue(s) in 0 msec; tried 0 queues. is also not accurate, as multiple URLs were viewed.

anjackson commented 1 year ago

Other oddities. This works, listing URLs from a specific queue:

                key = "anjackson.net",
                crawlID = ""

but this returns nothing:

                key = "anjackson.net",
                crawlID = None

and so does this:

                key = "",
                crawlID = ""

but this returns all URLs

                key = "",
                crawlID = None,

i.e. when querying a particular queue, crawlID="" matches the default crawl, but when no key is specified, this seems not to work, and it's necessary to use None, which seems to be equivalent to the default crawl (rather than AnyCrawlID).

anjackson commented 1 year ago

I think maybe this code silently swallows all errors?

https://github.com/crawler-commons/url-frontier/blob/1b6c2ec4b14cff24810c718103eca16c8fa17d48/service/src/main/java/crawlercommons/urlfrontier/service/AbstractFrontierService.java#L802-L813

So there might be some underlying error that we can't see?

jnioche commented 1 year ago

thanks for reporting these issues @anjackson

So there might be some underlying error that we can't see?

I think this section is about handling the errors reported by the client but we could indeed add a log there, wouldn't hurt

jnioche commented 1 year ago

The latter exception is repeated many times.

Note that in the first few lines, the Sent 0 from 0 queue(s) in 0 msec; tried 0 queues. is also not accurate, as multiple URLs were viewed.

could be some discrepancies due to the recent introduction of the multithreaded reads/writes

jnioche commented 1 year ago

i.e. when querying a particular queue, crawlID="" matches the default crawl, but when no key is specified, this seems not to work, and it's necessary to use None, which seems to be equivalent to the default crawl (rather than AnyCrawlID).

Setting an empty crawl ID is different from not sending one at all. The former is the equivalent of sending "DEFAULT" whereas the latter means "any crawl".

I agree this is maybe a bit opaque in the documentation. Any suggestions on how to improve it will be very welcome.

Querying a specific queue requires a crawl ID to be set, see https://github.com/crawler-commons/url-frontier/blob/1b6c2ec4b14cff24810c718103eca16c8fa17d48/service/src/main/java/crawlercommons/urlfrontier/service/AbstractFrontierService.java#L567

which is why

                key = "anjackson.net",
                crawlID = None

doesn't return anything.

However,

                key = "",
                crawlID = ""

is a bug which I will fix shortly (and add the logging for the errors reported by the client at the same time).

Thanks @anjackson

anjackson commented 1 year ago

I'm testing against the latest version, and the above fix seems fine.

However, I'm still getting the same IllegalStateException. Running exactly the same client query multiple times (pausing for a while in between), often I get the correct response, but every few calls it throws the Exception and I get nothing back.

anjackson commented 1 year ago

So, I backported this fix to version 2.2 locally. Running with that the system behaves as I expect, but with the latest master I get intermittent IllegalStateException errors.

jnioche commented 1 year ago

thanks @anjackson. The IllegalStateExceptions are certainly related to the introduction of the multithreaded reads/ writes. I will try to reproduce the issue

jnioche commented 1 year ago

@anjackson I haven't been able to reproduce the issue. Do you have a way of reproducing it e.g. list of URLs to inject and queries using the client? Thanks!

anjackson commented 1 year ago

In one terminal I'm running the server (v2.3) via Docker. In another, I'm using a freshly compiled version of the reference client implementation. Once it's running, I create a test file containing a single URL, and do the following (waiting for a couple of seconds between running each command):

$ java -jar client/target/urlfrontier-client-2.4-SNAPSHOT.jar PutURLs -f client/test.txt 
Sent: 1
Acked: 1
OK: 1
Skipped: 0
Failed: 0
Total time: 476 msec
Average OPS: 1
$ java -jar client/target/urlfrontier-client-2.4-SNAPSHOT.jar GetURLs -d 1
url: "http://example.com/"
key: "example.com"
crawlID: "DEFAULT"

Total time: 443 msec
$ java -jar client/target/urlfrontier-client-2.4-SNAPSHOT.jar GetURLs -d 1
url: "http://example.com/"
key: "example.com"
crawlID: "DEFAULT"

Total time: 410 msec
$ java -jar client/target/urlfrontier-client-2.4-SNAPSHOT.jar GetURLs -d 1
url: "http://example.com/"
key: "example.com"
crawlID: "DEFAULT"

Total time: 429 msec
$ java -jar client/target/urlfrontier-client-2.4-SNAPSHOT.jar GetURLs -d 1
Total time: 324 msec

The corresponding logging from the server looks like this:

$ docker compose up urlfrontier 
[+] Running 1/0
 ⠿ Container scrapy-url-frontier-urlfrontier-1  Created                                                                                0.0s
Attaching to scrapy-url-frontier-urlfrontier-1
scrapy-url-frontier-urlfrontier-1  | 22:22:13.080 [main] INFO  c.u.service.URLFrontierServer - Starting Prometheus server on port 9100
scrapy-url-frontier-urlfrontier-1  | 22:22:13.393 [main] INFO  c.u.service.AbstractFrontierService - Available processor(s) 2
scrapy-url-frontier-urlfrontier-1  | 22:22:13.393 [main] INFO  c.u.service.AbstractFrontierService - Using 1 threads for reading from queues
scrapy-url-frontier-urlfrontier-1  | 22:22:13.393 [main] INFO  c.u.service.AbstractFrontierService - Using 1 threads for writing to queues
scrapy-url-frontier-urlfrontier-1  | 22:22:13.394 [main] INFO  c.u.service.rocksdb.RocksDBService - RocksDB data stored in /storage/rocksdb 
scrapy-url-frontier-urlfrontier-1  | 22:22:13.525 [main] INFO  c.u.service.rocksdb.RocksDBService - RocksDB loaded in 123 msec
scrapy-url-frontier-urlfrontier-1  | 22:22:13.527 [main] INFO  c.u.service.rocksdb.RocksDBService - readQueueInfos read stats for 0 queues in 2 msec
scrapy-url-frontier-urlfrontier-1  | 22:22:13.527 [main] INFO  c.u.service.rocksdb.RocksDBService - Recovering queues from existing RocksDB
scrapy-url-frontier-urlfrontier-1  | 22:22:13.528 [main] INFO  c.u.service.rocksdb.RocksDBService - 0 queues discovered in 4 msec
scrapy-url-frontier-urlfrontier-1  | 22:22:13.899 [main] INFO  c.u.service.URLFrontierServer - Started URLFrontierServer [RocksDBService] on port 7071 as localhost:7071
scrapy-url-frontier-urlfrontier-1  | 22:24:56.786 [grpc-default-executor-0] INFO  c.u.service.AbstractFrontierService - Received request to get fetchable URLs [max queues 1, max URLs 1, delay 1] e1500285-5af8-4145-afed-e96ca95a5a49
scrapy-url-frontier-urlfrontier-1  | 22:24:56.791 [grpc-default-executor-0] INFO  c.u.service.AbstractFrontierService - Sent 0 from 0 queue(s) in 4 msec; tried 0 queues. e1500285-5af8-4145-afed-e96ca95a5a49
scrapy-url-frontier-urlfrontier-1  | 22:24:59.217 [grpc-default-executor-0] INFO  c.u.service.AbstractFrontierService - Received request to get fetchable URLs [max queues 1, max URLs 1, delay 1] 0868f6cb-ee05-4733-98a7-02d04991f3bc
scrapy-url-frontier-urlfrontier-1  | 22:24:59.222 [grpc-default-executor-0] INFO  c.u.service.AbstractFrontierService - Sent 0 from 0 queue(s) in 5 msec; tried 0 queues. 0868f6cb-ee05-4733-98a7-02d04991f3bc
scrapy-url-frontier-urlfrontier-1  | 22:25:01.627 [grpc-default-executor-0] INFO  c.u.service.AbstractFrontierService - Received request to get fetchable URLs [max queues 1, max URLs 1, delay 1] ea64d1db-2f5c-4939-8297-d2e4d3a2cbc3
scrapy-url-frontier-urlfrontier-1  | 22:25:01.629 [grpc-default-executor-0] INFO  c.u.service.AbstractFrontierService - Sent 0 from 0 queue(s) in 2 msec; tried 0 queues. ea64d1db-2f5c-4939-8297-d2e4d3a2cbc3
scrapy-url-frontier-urlfrontier-1  | 22:25:04.142 [grpc-default-executor-0] INFO  c.u.service.AbstractFrontierService - Received request to get fetchable URLs [max queues 1, max URLs 1, delay 1] 70b96a6c-8b8c-4321-bbc6-d9b7de8c1805
scrapy-url-frontier-urlfrontier-1  | 22:25:04.143 [grpc-default-executor-0] INFO  c.u.service.AbstractFrontierService - Sent 0 from 0 queue(s) in 1 msec; tried 0 queues. 70b96a6c-8b8c-4321-bbc6-d9b7de8c1805
scrapy-url-frontier-urlfrontier-1  | 22:25:04.145 [pool-2-thread-1] ERROR c.u.service.rocksdb.RocksDBService - Caught unlikely error 
scrapy-url-frontier-urlfrontier-1  | java.lang.IllegalStateException: Stream is already completed, no further calls are allowed
scrapy-url-frontier-urlfrontier-1  |    at com.google.common.base.Preconditions.checkState(Preconditions.java:502)
scrapy-url-frontier-urlfrontier-1  |    at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:375)
scrapy-url-frontier-urlfrontier-1  |    at crawlercommons.urlfrontier.service.SynchronizedStreamObserver.onNext(SynchronizedStreamObserver.java:59)
scrapy-url-frontier-urlfrontier-1  |    at crawlercommons.urlfrontier.service.rocksdb.RocksDBService.sendURLsForQueue(RocksDBService.java:338)
scrapy-url-frontier-urlfrontier-1  |    at crawlercommons.urlfrontier.service.AbstractFrontierService.lambda$getURLs$0(AbstractFrontierService.java:681)
scrapy-url-frontier-urlfrontier-1  |    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
scrapy-url-frontier-urlfrontier-1  |    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
scrapy-url-frontier-urlfrontier-1  |    at java.base/java.lang.Thread.run(Thread.java:829)

i.e. it says Sent 0 when it's sent one, and the fourth call throws an exception.

anjackson commented 1 year ago

Hmm, weird. Can't reproduce that when running the latest server directly (rather than using 2.3 in Docker). I'll try a couple more things but it's very odd that Docker seems to be getting in the way?

anjackson commented 1 year ago

I compiled the 2.3 tag and running the server directly I could reproduce the issue (although it took around 10 calls before the error showed up). Going back to master and re-running and the error goes away. but the only thing that changed was your original fix for this issue which seems to have made it go away somehow, but I apparently hadn't tested that properly before?

Okay, building a Docker image of the master branch to try locally...

anjackson commented 1 year ago

Running the same version under Docker and now the error is easy to reproduce?! Then going back to running the server directly, if I repeat the query for long enough, it happens there too. e.g.

$ watch -n 2 java -jar client/target/urlfrontier-client-2.4-SNAPSHOT.jar GetURLs -d 1

So, it seems like it's more likely to happen if running over Docker, but still happens when running the server directly.

anjackson commented 1 year ago

Doesn't seem to matter if I'm running over the host network or a Docker network. Doesn't seem to matter if I'm running on a RAM disk or local drive. It still throws that error from time to time.

EDIT Seems to happen a bit more if I run two clients in parallel.

jnioche commented 1 year ago

thanks @anjackson, I managed to reproduce the issue by setting the number of threads to 1

java -Xmx2G -cp target/urlfrontier-service-*.jar crawlercommons.urlfrontier.service.URLFrontierServer read.thread.num=1 write.thread.num=1

and using a single URL for testing.

I have pushed a fix to prevent a premature termination of the stream; it also fixes the counts of URLs sent. Could you please give it a try?

Thanks!

anjackson commented 1 year ago

Yes, that seems to be working very well now. Thank you!

jnioche commented 1 year ago

thanks @anjackson I'll push a new release so that these fixes can be used asap

anjackson commented 1 year ago

Perfect! Thanks again.