Clustering fails with divide by 0 error from the database

CygnusAlpha commented 7 years ago

[removed messed up log... It's uploaded below]

It looks like this is the culprit:

"res"."tinyhash" % 0 =

So, probably the cluster thinks there are 0 nodes.
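
To illustrate the failure mode: if each instance selects its share of the queue by taking a hash modulo the cluster size, then a cluster size of 0 makes that expression a division by zero. Below is a minimal Python sketch, assuming that kind of modulo partitioning. The function name, the parameters, and the 0-based index are hypothetical, not anansi's actual code.

```python
def owns_item(tinyhash: int, base_index: int, workers: int, total: int) -> bool:
    """Return True if this instance should process an item with this hash.

    Hypothetical model of modulo partitioning: the instance owns slots
    base_index .. base_index + workers - 1 out of `total` slots (0-based).
    """
    if total <= 0 or base_index < 0:
        # Not currently a cluster member: claim nothing rather than
        # evaluating `tinyhash % 0`, which is a division by zero.
        return False
    slot = tinyhash % total
    return base_index <= slot < base_index + workers
```

With two single-threaded instances, `owns_item(7, 1, 1, 2)` is true for the second instance and false for the first. With a total of 0, the guard returns false instead of raising.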

Commit e8331cb; probably the result of e13639d ("Perform dequeue operation within a transaction").

Internal tracking: RESDATA-1129

nevali commented 7 years ago

Can you attach the full run log from start to finish, and a copy of your config with any credentials blanked out?

nevali commented 7 years ago

…because whatever generated that is hateful

nevali commented 7 years ago

also are you using etcd as suggested in that note, or are you just configuring it statically? because the latter would be much easier, I'd imagine. we're not using etcd in production at all.

CygnusAlpha commented 7 years ago

anansi-cluster-log.txt

crawl.conf.txt

Steps:

git clone git@repo.ch.internal:oss/acropolis.git --recursive
cd acropolis
git checkout 6d88b7d
docker-compose up
docker-compose scale anansi=2
docker-compose run anansi crawler-add http://sws.geonames.org/2643743/

It's possible (quite likely) that I misconfigured it or am running it wrongly of course.

CygnusAlpha commented 7 years ago

"also are you using etcd as suggested in that note, or are you just configuring it statically? because the latter would be much easier, I'd imagine. we're not using etcd in production at all."

I'm guessing at how to configure it. I found limited documentation about it, and this was based on what was in anansi/crawler/crawl.conf.

A proper config example mirroring what's on live would be good.

nevali commented 7 years ago

it's configurable precisely because what's useful on live and what's useful for day-to-day development are generally not the same

nevali commented 7 years ago

what live does though, is to give it a cluster registry URI that matches the queue database URI (for the moment, it will change again on AWS to use a different database).

nevali commented 7 years ago

Oh, actually https://github.com/bbcarchdev/anansi/wiki

CygnusAlpha commented 7 years ago

I have configured crawld as follows (getting rid of etcd):

[crawler]
detach=no
verbose=no
threads=1

[cluster]
name=anansi
registry=pgsql://postgres:postgres@postgres/anansi
environment=development


[processor]

And ran a second crawld (docker-compose scale anansi=2).

Here is the log which eventually aborts with:

anansi_1         | crawld[1]: processor_handler: following 303 redirect to <http://dbpedia.org/data/Coopers_School.xml>
anansi_1         | crawld[1]: Adding URI <http://dbpedia.org/data/Coopers_School.xml> to crawler queue
anansi_1         | crawld[1]: libcluster: SQL: this instance is no longer a member of anansi/development
anansi_1         | crawld[1]: libcluster: re-balanced; this instance has base index -1 (1 workers) from a total of 0
anansi_1         | crawld[1]: %ANANSI-N-2011: cluster has re-balanced: instance faac917d46c047d294d095da703ce3b7 has left cluster anansi/development
anansi_1         | crawld[1]: %ANANSI-N-2030: crawl thread suspended due to re-balancing [development] crawler 2/2 (thread 1/1)
anansi_1         | crawld[1]: %ANSNSI-E-5005: SQL error [22012]: ERROR:  division by zero
anansi_1         |

anansi_log_static_cluster.txt

nevali commented 7 years ago

aha

the

anansi_1         | crawld[1]: %ANANSI-N-2030: crawl thread suspended due to re-balancing [development] crawler 2/2 (thread 1/1)

line looks key - it would seem there's a race-condition there, where the instance ID is being changed in the crawl context itself before it has a chance to suspend and not use it any more.
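
As a rough illustration of that kind of race, and one way to avoid it: the crawl thread could take a consistent snapshot of the partition values under a lock, and skip the dequeue entirely when the instance is not a member. This is a Python sketch under stated assumptions. `CrawlContext`, `rebalanced` and `partition` are hypothetical names, not the crawld or libcluster API.

```python
import threading
from typing import Optional, Tuple


class CrawlContext:
    """Hypothetical context shared by the cluster callback and a crawl thread."""

    def __init__(self, base_index: int, total: int) -> None:
        self._lock = threading.Lock()
        self._base_index = base_index
        self._total = total

    def rebalanced(self, base_index: int, total: int) -> None:
        # Called from the cluster thread: update both values together so
        # the crawl thread never sees a half-updated pair.
        with self._lock:
            self._base_index = base_index
            self._total = total

    def partition(self) -> Optional[Tuple[int, int]]:
        # Called from the crawl thread before each dequeue: return a
        # consistent snapshot, or None if not currently a cluster member.
        with self._lock:
            if self._total <= 0 or self._base_index < 0:
                return None
            return (self._base_index, self._total)
```

The crawl thread would call `partition()` once per dequeue and build its query only from the returned pair, so a re-balance to "base index -1 from a total of 0" leads to a skipped dequeue rather than a query containing `% 0`.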

nevali commented 7 years ago

that said, unless a signal's been received, this shouldn't happen:

anansi_1         | crawld[1]: libcluster: SQL: this instance is no longer a member of anansi/development

it'd be good to see the SQL query which triggered this; given we're trying to diagnose wtf is going on, I'd set verbose=yes in both the [crawler] and [cluster] sections
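
If editing the config by hand across several containers is tedious, a small script can flip the setting in both sections. This is a sketch using Python's standard `configparser`, on the assumption that crawl.conf is plain INI as in the excerpt above. `enable_verbose` is a hypothetical helper, and note that `configparser` drops comments when it rewrites the file.

```python
import configparser
import io


def enable_verbose(conf_text: str) -> str:
    """Return conf_text with verbose=yes set in [crawler] and [cluster]."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(conf_text)
    for section in ("crawler", "cluster"):
        if not parser.has_section(section):
            parser.add_section(section)
        parser.set(section, "verbose", "yes")
    out = io.StringIO()
    # Write key=value without spaces, matching the style of the excerpt.
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()
```

Calling `enable_verbose(open("crawl.conf").read())` returns the updated text, which can then be written back to the file.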

bbcarchdev / anansi

Clustering fails with divide by 0 error from the database #69