Open Voyewodayer opened 1 year ago
Can you dump the master node docker logs?
Below are the logs from the master node for the most recent occurrence, covering one minute before to one minute after it happened. The Crawlab container rebuilt itself at exactly 2023-09-13T16:30:47.886726955Z. Around the same time, as can be seen on line 15 of the logs, the error panic: runtime error: index out of range [243] with length 62 occurred. Afterwards the service reset.
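To cut the same one-minute window around a restart out of a longer dump, something like the following could be used. It is only a sketch: it assumes the logs were captured with `docker logs --timestamps`, so that every line starts with an RFC 3339 timestamp, and the helper names `parse_ts` and `window` are made up for illustration.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches the timestamp prefix written by `docker logs --timestamps`,
# e.g. 2023-09-13T16:30:47.886726955Z (nanosecond precision, UTC).
_TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z")


def parse_ts(line):
    """Return the UTC timestamp at the start of a log line, or None."""
    m = _TS.match(line)
    if m is None:
        return None
    base = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
    # datetime only stores microseconds, so truncate the fraction to 6 digits.
    frac = (m.group(2) or "0")[:6].ljust(6, "0")
    return base.replace(microsecond=int(frac), tzinfo=timezone.utc)


def window(lines, center, seconds=60):
    """Keep the lines whose timestamp lies within +/- `seconds` of `center`."""
    lo = center - timedelta(seconds=seconds)
    hi = center + timedelta(seconds=seconds)
    kept = []
    for line in lines:
        ts = parse_ts(line)
        if ts is not None and lo <= ts <= hi:
            kept.append(line)
    return kept
```

Calling `window(all_lines, parse_ts("2023-09-13T16:30:47.886726955Z"))` would keep only the lines from the minute before and after the restart. The `--since` and `--until` options of `docker logs` can narrow the dump in a similar way before it is saved.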
@tikazyq It is the same error on the pro version, and it happens frequently
I have probably pinpointed it: the error tends to happen while an API call to the /metrics endpoint is being made. Shortly afterwards the container crashes and is rebuilt, with errors like those in the logs from my previous comment.
@tikazyq same issue on pro version
@tikazyq hi, any progress on this task?
You can switch to the latest version of docker image to resolve this issue.
@tikazyq hey, we've updated the images to crawlab-pro:latest, but we still have this abnormal issue.
I've noticed that the last modification to crawlab-pro:latest was made 5 months ago and to crawlab-pro:develop 4 months ago. Can you please verify that the fix for the abnormal issue is in crawlab-pro:latest and not in crawlab-pro:develop?
@tikazyq hi again, do you have any update to the above message?
@tikazyq Hi, could you tell us how to disable the metrics flag?
We have observed some kind of correlation between the metrics and the abnormal status. As long as the metrics are off there is no issue, but somehow they get turned on automatically and all of a sudden the abnormal status starts popping up.
It would be really helpful to know of a way to disable it.
@tikazyq please answer us. We have the same problem.
Describe the bug
While executing a task that runs for quite a long time (>1 day), the docker logs tend to show various errors related to the gRPC connection, such as
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp <MASTER_NODE_IP_ADDRESS>:<GRPC_PORT>: connect: connection refused"
or connection reset by peer. Both the worker node on which the task runs and the master node show continuous, uninterrupted availability, network- and hardware-wise. Sometimes this is accompanied by license verification failures:
error verify license error: Post "https://license.crawlab.cn/release/license/verify": net/http: TLS handshake timeout. retry in 5 seconds
Expected behavior
Nodes don't lose the gRPC connection randomly.
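One way to check, independently of Crawlab, whether the master's gRPC port accepts connections at the moment these errors appear is a plain TCP probe run from the worker node. The following is a minimal sketch, not part of Crawlab: `probe` is a hypothetical helper, and the host and port must be filled in with the master node's address and gRPC port, which are placeholders in the report above.

```python
import socket


def probe(host, port, timeout=3.0):
    """Try one TCP connection; return (True, "") or (False, reason)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, ""
    except OSError as exc:
        # Refused connections, resets, timeouts and DNS failures
        # are all subclasses of OSError.
        return False, f"{type(exc).__name__}: {exc}"
```

Running `probe(master_ip, grpc_port)` in a loop and logging the result with a timestamp could show whether the port is briefly closed. A refused connection means nothing was listening on the port at that instant, which would be consistent with the master container restarting rather than with a network outage.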