RIPE-NCC / rpki-validator-3

RIPE NCC RPKI Validator 3
Other
63 stars 27 forks source link

3.0-394 can't complete a boot? #77

Closed job closed 5 years ago

job commented 5 years ago

Using rpki-validator-3.0-394-dist.tar.gz (md5sum fafb95e829a5d5f5905b748bbd95e2dc).

Steps to reproduce:

root@rpki:~# tar fxz rpki-validator-3.0-394-dist.tar.gz
root@rpki:~# cp rpki-validator-3.0-376/preconfigured-tals/arin.tal rpki-validator-3.0-394/preconfigured-tals/
root@rpki:~# cd rpki-validator-3.0-394
root@rpki:~/rpki-validator-3.0-394# ./rpki-validator-3.sh start &

Within a few minutes it'll have rsynced most of the data:

root@rpki:~/rpki-validator-3.0-394/rsync# du -sh *
27M repository.lacnic.net
3.2M    rpki.afrinic.net
36M rpki.apnic.net
27M rpki.arin.net
16K rpki.ripe.net

The web interface now shows one TAL operational (APNIC): https://rpki.meerval.net/trust-anchors

The log spits out lots of this:

2019-03-19 11:32:53,590      ERROR   quartzScheduler_Worker-10           o.h.e.j.s.SqlExceptionHelper        Unique index or primary key violation: "RPKI_OBJECT__SHA256_IDX ON PUBLIC.RPKI_OBJECT(SHA256) VALUES (X'29b04e4eb4ef2a79744d0ab2ff8ab6e8e8ef1cbbcc67df4139f5ac739aec126d', 85542)"; SQL statement:
/* insert net.ripe.rpki.validator3.domain.RpkiObject */ insert into rpki_object (created_at, updated_at, version, authority_key_identifier, last_marked_reachable_at, serial_number, sha256, signing_time, type, id) values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?) [23505-197]
2019-03-19 11:33:52,865      WARN    quartzScheduler_Worker-6            o.h.e.j.s.SqlExceptionHelper        SQL Error: 23505, SQLState: 23505
2019-03-19 11:33:52,873      ERROR   quartzScheduler_Worker-6            o.h.e.j.s.SqlExceptionHelper        Unique index or primary key violation: "RPKI_OBJECT__SHA256_IDX ON PUBLIC.RPKI_OBJECT(SHA256) VALUES (X'29b04e4eb4ef2a79744d0ab2ff8ab6e8e8ef1cbbcc67df4139f5ac739aec126d', 85542)"; SQL statement:
/* insert net.ripe.rpki.validator3.domain.RpkiObject */ insert into rpki_object (created_at, updated_at, version, authority_key_identifier, last_marked_reachable_at, serial_number, sha256, signing_time, type, id) values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?) [23505-197]

2 hours later the RPKI validator still hasn't booted beyond the APNIC tal

job commented 5 years ago

I can give ssh access to the server that runs this software if RIPE NCC can't replicate

FvDxxx commented 5 years ago

You are not alone. Well, I got all except RIPE up after some time, but RIPE remains down on one box. IIRC I once had similar issues on other boxes and got that fixed by wiping out the DB and rsync directory, but this time it doesn't seem to help (and for now I'm not touching the others).

2019-03-19_13h10_22

Mar 19 12:01:05 rpki-rtr-server.sh: 2019-03-19 12:01:05.881  WARN 20200 --- [eduler_Worker-2] n.r.r.r.a.v.RefreshCacheController       : validator http://localhost:8080/api/objects/validated not ready yet, will retry later
Mar 19 12:01:25 rpki-validator-3.sh: 2019-03-19 12:01:25,425 #011 WARN  #011 quartzScheduler_Worker-9 #011#011#011 o.h.e.j.s.SqlExceptionHelper #011#011 SQL Error: 23505, SQLState: 23505
Mar 19 12:01:25 rpki-validator-3.sh: 2019-03-19 12:01:25,425 #011 ERROR #011 quartzScheduler_Worker-9 #011#011#011 o.h.e.j.s.SqlExceptionHelper #011#011 Unique index or primary key violation: "RPKI_OBJECT__SHA256_IDX ON PUBLIC.RPKI_OBJECT(SHA256) VALUES (X'020218fa6a03578506d1338e39474bc4a8c8ddf5530853d01730182d1953787f', 260064)"; SQL statement:
Mar 19 12:01:25 rpki-validator-3.sh: /* insert net.ripe.rpki.validator3.domain.RpkiObject */ insert into rpki_object (created_at, updated_at, version, authority_key_identifier, last_marked_reachable_at, serial_number, sha256, signing_time, type, id) values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?) [23505-197]
Mar 19 12:01:40 rpki-validator-3.sh: 2019-03-19 12:01:40,438 #011 WARN  #011 quartzScheduler_Worker-4 #011#011#011 o.h.e.j.s.SqlExceptionHelper #011#011 SQL Error: 23505, SQLState: 23505
Mar 19 12:01:40 rpki-validator-3.sh: 2019-03-19 12:01:40,438 #011 ERROR #011 quartzScheduler_Worker-4 #011#011#011 o.h.e.j.s.SqlExceptionHelper #011#011 Unique index or primary key violation: "RPKI_OBJECT__SHA256_IDX ON PUBLIC.RPKI_OBJECT(SHA256) VALUES (X'b48be5a31f6c8ac276d7f70f31f9a22d0a44580b40583a606983255ced6525c6', 215964)"; SQL statement:
Mar 19 12:01:40 rpki-validator-3.sh: /* insert net.ripe.rpki.validator3.domain.RpkiObject */ insert into rpki_object (created_at, updated_at, version, authority_key_identifier, last_marked_reachable_at, serial_number, sha256, signing_time, type, id) values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?) [23505-197]

I vague recall someone once mentioned that the SQL primary key violation errors could be safely ignored … can't they?

Nusenu also reported on routing-wg, that >370 slowed down rsync operations (esp. with RIPE). Not sure if that anyhow links to this.

lolepezy commented 5 years ago

It's been a performance regression, fixed in build 397. The unique key violation problem depends on the state of a repository and in principle can be ignored, but we are working on eliminating it completely.

job commented 5 years ago

I loaded rpki-validator-3.0-397 - similar issue. After 30 minutes only APNIC & RIPE are loaded. Seems to have stalled somewhere in processing the other 3 repositories.

d4nys3k commented 5 years ago

I can confirm this, after DB cleanup (https://github.com/RIPE-NCC/rpki-validator-3/issues/61#issuecomment-475523608) I observe similar issue with 3.0-397, after clean start only RIPE and APNIC TALs are loaded after more than 30 minutes, other repositories are stuck in pending state.

In logs, I also see following errors (repeatedly):

Mar 22 09:05:11 rpki rpki-validator-3.sh: 2019-03-22 09:05:11,152 #011 WARN  #011 quartzScheduler_Worker-8 #011#011#011 o.h.e.j.s.SqlExceptionHelper #011#011 SQL Error: 23505, SQLState: 23505
Mar 22 09:05:11 rpki rpki-validator-3.sh: 2019-03-22 09:05:11,152 #011 ERROR #011 quartzScheduler_Worker-8 #011#011#011 o.h.e.j.s.SqlExceptionHelper #011#011 Unique index or primary key violation: "RPKI_OBJECT__SHA256_IDX ON PUBLIC.RPKI_OBJECT(SHA256) VALUES (X'e2c6bd586f389e206a8ad6748943982ff8018106da4e4443d1d2a62c949e9ba5', 185199)"; SQL statement:
Mar 22 09:05:11 rpki rpki-validator-3.sh: /* insert net.ripe.rpki.validator3.domain.RpkiObject */ insert into rpki_object (created_at, updated_at, version, authority_key_identifier, last_marked_reachable_at, serial_number, sha256, signing_time, type, id) values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?) [23505-197]
Mar 22 09:05:11 rpki rpki-validator-3.sh: 2019-03-22 09:05:11,152 #011 ERROR #011 quartzScheduler_Worker-8 #011#011#011 o.h.i.ExceptionMapperStandardImpl #011#011 HHH000346: Error during managed flush [org.hibernate.exception.ConstraintViolationException: could not execute statement]
Mar 22 09:05:16 rpki rpki-rtr-server.sh: 2019-03-22 09:05:16.461  WARN 3410 --- [eduler_Worker-1] n.r.r.r.a.v.RefreshCacheController       : validator http://localhost:8080/api/objects/validated not ready yet, will retry later

Reported primary key violations were there also before upgrade from build 386 to 397.

wibisono commented 5 years ago

We are trying to be a bit more careful on releasing, but there is a release candidate version 406 if you wanted to have help us for early testing: https://ftp.ripe.net/tools/rpki/validator3/rc/centos7/repo/

Main changes:

On Sat, Mar 30, 2019 at 3:06 PM Job Snijders notifications@github.com wrote:

Almost two weeks have passed since reporting this issue. It appears that at this point in time there no longer is any version of the RPKI v3 validator that runs stable.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/RIPE-NCC/rpki-validator-3/issues/77#issuecomment-478248967, or mute the thread https://github.com/notifications/unsubscribe-auth/AAijg1MlrhvReoUGOr5U7PfHV05n_sMxks5vb29cgaJpZM4b76Kk .

d4nys3k commented 5 years ago

3.1 (_2019.07.0406.41.46) startup with clean (empty) database was pretty fast. Database backend change from H2 to Xodus probably solves this issue.