bbcarchdev / spindle

RES Linked Open Data aggregation engine
https://bbcarchdev.github.io/spindle/
Apache License 2.0

If the connection to S3 is lost URIs are marked as REJECTED and not re-visited #81

Open cgueret opened 8 years ago

cgueret commented 8 years ago

If the connection to S3 is lost while the queue is being processed, the affected URIs are marked as REJECTED and never re-visited. The ingest of the data set is then never "complete":

[screenshot: 2016-07-12 at 12.33.56]

In the DB, the REJECTED URIs look like this:

 0e519fdb-4e0b-4fbd-b1f7-7a3ddae15b95 |  240230363 |      219 | REJECTED | 2016-07-11 14:33:10 |     0
 36d4bede-4109-4d82-b6ab-a6e2a5d8d5f9 |  919912158 |      222 | REJECTED | 2016-07-11 14:33:09 |     0
 6f2cb395-102e-44cc-ae80-627dddca037a | 1865200533 |      149 | REJECTED | 2016-07-11 14:33:08 |     0
 780d6ccd-d2f6-4ad4-88c7-80fd9003e7eb | 2014145741 |      205 | REJECTED | 2016-07-11 14:33:07 |     0
 763bfc27-40e1-481c-8e6f-6e3020159505 | 1983642663 |       39 | REJECTED | 2016-07-11 14:33:06 |     0
 7abf4218-44b7-41ab-b7f3-b466eae639bd | 2059354648 |       24 | REJECTED | 2016-07-11 14:39:04 |     0
 f96ca0b2-6f4f-4648-b790-624007342e8d | 4184645810 |      178 | REJECTED | 2016-07-11 14:32:26 |     0
 e6a23080-eb25-4e69-a983-17940514186e | 3869388928 |      128 | REJECTED | 2016-07-11 14:38:10 |     0
 59cc47c3-d31a-4377-a405-04ff0b6d7e2a | 1506559939 |      195 | REJECTED | 2016-07-11 14:32:26 |     0
 f45a9d36-57ee-49c3-8291-1a8d28802418 | 4099579190 |       54 | REJECTED | 2016-07-11 14:39:19 |     0
 6caca1cd-0fe2-4531-bbd7-034bb91fd1b3 | 1823252941 |      205 | REJECTED | 2016-07-11 14:32:22 |     0
 871ba237-99c7-47f1-979a-f796fb14ca5f | 2266735159 |       55 | REJECTED | 2016-07-11 14:32:08 |     0

nevali commented 8 years ago

The simplest fix is probably to do three things:

  1. Apply a schema update to add an index on the modified column of the state table.
  2. Modify spindle_mqmessage_reject_() to update the modified column.
  3. Execute an additional query in the libmq implementation (i.e., in spindle_mq_next_(), prior to attempting the current one) which does something like SELECT "id" FROM "state" WHERE "status" = 'REJECTED' AND "tinyhash" % nodecount = nodeid AND "modified" <= cutoff, where cutoff is a timestamp 24 hours in the past. Better still, make the cutoff a configurable value, which would mean you could ask Twine to re-process all rejected items by specifying the cutoff as 0 on the command line.