NICMx / FORT-validator

RPKI cache validator

[1.5.0] Mode standalone sometimes generates no output #49

Closed dorpauli closed 3 years ago

dorpauli commented 3 years ago

Hello,

I am currently running some tests with Fort 1.5.0 and am observing some unusual behaviour. On a Raspberry Pi 4 with 4 GB of RAM (Raspberry Pi OS Lite 64-bit), I ran the validation every half hour with fort --mode standalone --log.level debug --validation-log.enabled=true --validation-log.level warning --local-repository /home/pi/.rpki-cache --output.roa /home/pi/csv/$date.csv --tal /home/pi/tals. Most of the time the validation completes smoothly within 6 to 10 minutes, but sometimes (about 10 times within 2 days) the CSV file is empty, and more often the output is generated 1-2 hours after the validation started. The logs show varying behaviour: sometimes there are validation entries and sometimes not. A timeout also occurs whenever the output is empty.

I thought it might be related to the SBC, so I ran a one-day test on a powerful server (Ubuntu 20.04). In that day there were only 2 empty files and no delay in the output generation. So it is partly related to the Raspberry Pi, but I think the 2 empty files are still problematic.

Here are some log outputs for the different cases (the date is always the start time, CET). Raspberry Pi 4, empty output: 2021-04-02_16-00.log 2021-04-02_17-30.log 2021-04-03_00-00.log 2021-04-03_04-00.log 2021-04-04_05-30.log

Raspberry Pi 4, validation runs for hours but full output: 2021-04-02_12-30.log 2021-04-02_13-30.log 2021-04-02_19-30.log 2021-04-03_09-30.log

Ubuntu server, no output: 2021-04-07_00-00.log 2021-04-07_07-30.log

ydahhrk commented 3 years ago

Hang on; don't run the code I just uploaded. It has another bug.

Working...

ydahhrk commented 3 years ago

Ok, new commit: https://github.com/NICMx/FORT-validator/commit/3700699020894fef7b34f9e8daf8a4c28cd2150b

Thank you. The Ubuntu output was pretty telling; it led me to a race condition that is now fixed.

Can you please test the patch? https://github.com/NICMx/FORT-validator/tree/issue49 The Ubuntu problem should be gone, but I'm not completely sure the Raspberry issue is the same bug.

dorpauli commented 3 years ago

Thank you for your efforts.

I will compile the repository and restart the test on Ubuntu in a few hours. I will also do it on the Raspberry Pi if I have time today. Perhaps the fix also affects the Pi's behaviour; I will find out.

Did this also affect Fort running as a service, or is it standalone only?

dorpauli commented 3 years ago

I now see a connection between the two issues on the Raspberry Pi. Maybe I uploaded the wrong logs.

In my new logs, one Fort process begins the validation right after cron started the next Fort process. Maybe the older worker then receives the signal of the new process and starts validating, while the new worker runs into a timeout or terminates immediately.

dorpauli commented 3 years ago

Both systems have now been running for more than a day, every half hour. The empty-output problem seems to be gone. Also, on the Ubuntu server the output was generated within a few minutes of Fort starting, at least until a few hours ago.

Since then I can also see a delay in termination on this machine. In such cases the execution terminates after the next validation process has started, and sometimes even after it has finished. On the Raspberry Pi this behavior occurs more often, but I can also see it on the powerful Ubuntu server.

So one problem seems to be solved and the other still exists. Maybe the new logs from Ubuntu can help:

2021-04-09_09-00.log 2021-04-09_09-30.log 2021-04-09_10-00.log

ydahhrk commented 3 years ago

Did this also affect Fort running as a service, or is it standalone only?

Both. But it only triggered during Fort startup, so if it didn't happen once, the server's daemon nature prevented it from happening again.

In my new logs, one Fort process begins the validation right after cron started the next Fort process.

Hmm... ok, are you saying they might overlap?

They don't block each other on their own. If one Fort starts while another is running, and they operate on the same repository... well, I don't see much havoc ensuing, but Fort wasn't developed with this scenario in mind. Results are undefined, I suppose.

If this is a problem, I suppose Fort should acquire a lock to make sure it's the only one with access to a local repository.

Maybe the older worker then receives the signal of the new process and starts validating, while the new worker runs into a timeout or terminates immediately.

They don't communicate with each other.

The empty output problem seems to be gone.

This is easy to explain: Because of the race condition, the worker threads sent the ready signal before the parent thread started listening to it.

Therefore, the parent would time out, and it would never start the validation.

It wasn't just the file; the whole validation cycle was never performed.

Also, on the Ubuntu Server the output was generated in a few minutes after fort was started. At least until a few hours ago.

Are you talking about the ROA file or the logs?

If you're talking about the ROA file, then it's probably normal. Fort has to update and validate the repository first, and THEN write the file. Otherwise it would write dirty ROAs. ("Print after validation to avoid duplicated info")

the execution terminates after the next validation process was started and sometimes also finished.

Ok, so they are overlapping.

Do you know how to manage locks from the shell?
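
For example, here is a minimal sketch with flock(1); the lock path and Fort arguments below are placeholders, not your actual setup. Wrapping the cron command in flock makes a new run abort immediately if the previous one still holds the lock:

#!/bin/bash
# Sketch: /tmp/fort.lock and FORT_ARGS are hypothetical placeholders.
# -n tells flock to fail immediately instead of waiting if the lock is held.
FORT_ARGS="--mode standalone --output.roa /tmp/roas.csv --tal /tmp/tals"
exec flock -n /tmp/fort.lock /usr/local/bin/fort $FORT_ARGS

With such a wrapper, overlapping cron invocations simply exit instead of operating on the same local repository at once.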

dorpauli commented 3 years ago

Hmm... ok, are you saying they might overlap?

They don't block each other on their own. If one Fort starts while another is running, and they operate on the same repository... well, I don't see much havoc ensuing, but Fort wasn't developed with this scenario in mind. Results are undefined, I suppose.

If this is a problem, I suppose Fort should acquire a lock to make sure it's the only one with access to a local repository. They don't communicate with each other.

OK, so it is possible that they overlap, and that they don't affect each other.

Are you talking about the ROA file or the logs?

If you're talking about the ROA file, then it's probably normal. Fort has to update and validate the repository first, and THEN write the file. Otherwise it would write dirty ROAs. ("Print after validation to avoid duplicated info")

The only reason I mentioned this is that most of the time Fort finishes validation after 3 to 5 minutes, but there are scenarios where it finishes 3 to 5 minutes after cron starts the next Fort instance. And cron is triggered every 30 minutes.

Ok, so they are overlapping.

Do you know how to manage locks from the shell?

Not yet, but I will learn, thanks. For now I just wanted to understand the behavior I'm currently seeing.

Sadly, there are new empty outputs, by the way. The logs are again at the bottom:

2021-04-09_10-30.log 2021-04-10_05-00.log 2021-04-10_05-30.log 2021-04-10_06-00.log 2021-04-10_08-00.log 2021-04-10_15-00.log

lukastribus commented 3 years ago

This has nothing to do with the problem directly, please forgive the OT, but when you run in standalone mode from cron, you should have some kind of monitoring to know when things go wrong (or when the cron job doesn't even start in the first place for some reason).

I've used healthchecks.io for this, but I'm sure there are other solutions.

I'm mentioning this because when the validator doesn't run for whatever reason, your routers will act based on obsolete VRP data. I also suggest monitoring the RTR server.
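
For instance, a sketch of how the cron script could ping healthchecks.io after a successful run (the URL below is a placeholder; you get the real one from your healthchecks.io project):

#!/bin/bash
# Sketch: FORT_ARGS stands in for the usual validation arguments.
FORT_ARGS="--mode standalone --output.roa /home/pi/roas.csv --tal /home/pi/tals"
if /usr/local/bin/fort $FORT_ARGS; then
    # Ping only on success; a missing ping raises an alert.
    curl -fsS -m 10 --retry 3 https://hc-ping.com/your-uuid-here > /dev/null
fi

If the ping stops arriving on schedule, the service notifies you that the job failed or never started.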

ydahhrk commented 3 years ago

Hello again,

Still haven't found the problem. I just uploaded a commit that will dump more debug information into the logs. I'd be grateful if you would try it.

New question: Have you tried Fort 1.4.2? It would be useful to know whether the bug started in 1.5.0. Also, if 1.4.2 doesn't have the bug, you could use it instead of 1.5.0 in the meantime; 1.4.2 hasn't been deprecated yet.

dorpauli commented 3 years ago

I started a new test run with the new debug commit from branch "issue49", and another with version 1.4.2, on two different instances of the powerful server.

By the way, I didn't use 1.4.2 in the past. I am just evaluating some validators, found this behavior (bug), and wanted to help.

dorpauli commented 3 years ago

After 3 days, the validator from branch "issue49" now runs without any anomalies. From Fort 1.4.2 I see only one empty file, so it is probably the same issue.

I have to admit one mistake here. The last time I tested your code changes, I downloaded the zip archive. While verifying the setup for the new test, I noticed the download contained the master branch instead of the selected branch "issue49" (although the zip file was named after it!). For the current test I checked out the right branch with git and verified that the code changes are present, so the results from the last days are from the right code. Sorry about that; I was sure the download had given me the right branch.

I will let the tests run for a while and maybe also start them on the Raspberry Pi. If nothing happens, I think this issue can be closed.

dorpauli commented 3 years ago

By the way, there are now log entries where the line break is wrong:

Apr 27 10:30:01 DBG: Pool 'Internal': Thread #8 spawnedApr 27 10:30:01 
DBG: Thread Internal.8: Waiting for work...
ydahhrk commented 3 years ago

From Fort 1.4.2 I see only one empty file, so it is probably the same issue.

Wait, what do you mean? 1.4.2 is persistently writing no logs?

If you're talking about the thread pool debugging messages, then I think it's normal. 1.4.2 didn't have a thread pool.

By the way, there are now Log entries where the line break is not right.

Presumably, this is normal; it's two threads writing simultaneously.

If it's a problem, I could add a mutex to properly synchronize all logging; maybe behind a configuration option.

dorpauli commented 3 years ago

Wait, what do you mean? 1.4.2 is persistently writing no logs?

1.4.2 is writing all logs; it just has the same issue as in my first post. To be exact, there are 244 output files with VRPs so far (it has been running since Friday), and 3 of them are empty.

I thought your new code for 1.5.0 had solved the problem, but this morning 2 validation runs also produced no output. Here are the logs from 1.5.0 again:

2021-04-28_04-00.log (By the way, at this time both validators (1.4.2 and 1.5.0) generated no output. Coincidence?) 2021-04-28_07-30.log

Secondly, it's curious that I can't see any delay anymore in any of the tests. I described earlier what I mean:

Since then I can also see a delay in termination on this machine. In such cases the execution terminates after the next validation process has started, and sometimes even after it has finished. On the Raspberry Pi this behavior occurs more often, but I can also see it on the powerful Ubuntu server.

Could it be that this behavior was caused by publication points?

ydahhrk commented 3 years ago

Could it be that this behavior was caused by publication points?

Yes, that's the logical next theory. Either Fort is hanging forever (well, you said you haven't seen delays, so I guess that's not the case), or something is killing it while validating. Can you enable validation logs?

dorpauli commented 3 years ago

I run all validations with validation log level warning:

fort --mode standalone --log.level debug --validation-log.enabled=true --validation-log.level warning --local-repository /home/pi/.rpki-cache --output.roa /home/pi/csv/$date.csv --tal /home/pi/tals

What level do you recommend?

ydahhrk commented 3 years ago

Rats, I need to pay more attention.

Debug, please. Hopefully, the last message should appear near the bug.


dorpauli commented 3 years ago

The weekend is over now, and again everything worked fine until this morning (2 empty output files). I really think the current problem is caused by the internet connection.

So the issue can be closed?

Nevertheless, here are the logs from the empty VRP runs:

2021-05-04_05-30.log 2021-05-04_06-00.log

ydahhrk commented 3 years ago

I really think the current problem is caused by the internet connection.

Well, even if the problem is a connection failure, Fort shouldn't terminate wordlessly. That, if nothing else, is certainly a bug.

So the issue can be closed?

The latest output did lead me to a handful of code that's rather lax on error handling. I'm currently improving it, and with a bit of luck, this should be the end of the bug.

I'm aware I haven't been stellar at solving this problem, so if you're tired of testing, feel free to drop the issue. I'll release a new version once I'm done polishing in any case.

dorpauli commented 3 years ago

I can test it again if you want. Are the necessary changes done?

ydahhrk commented 3 years ago

Not all of them, but I've already uploaded several self-contained improvements that should be stable, and will probably upload more at the end of the day.

If you can run them, that'd be great.

ydahhrk commented 3 years ago

Note: Last relevant commit is 7870445a52a93e8385b9a702feb45820952e989f (issue49, as usual.)

dorpauli commented 3 years ago

Note: Last relevant commit is 7870445 (issue49, as usual.)

But it seems that you have also pushed it to master... I will let it run.

dorpauli commented 3 years ago

I have another log for you. It's the same behavior again: at 6 AM this morning, the CSV file was empty. Every other run is fine. I started the validator every 30 minutes with cron.

I hope fetching the latest code and compiling went all right; I checked the latest commit with git show and it was the one you mentioned. But I think you will see that in the logs.

2021-05-27_06-00.log

ydahhrk commented 3 years ago

New test commit: issue49

Also:

  1. Can you please provide Fort's entire crontab line (the one from crontab -l or crontab -e)? (Including redirections, if any)
  2. It's time to start investigating whether some external agent is killing Fort. Can you see log messages near the empty run's time in any of the following files? (A sketch of such a check follows this list.)
    1. /var/log/syslog
    2. /var/log/messages
    3. /var/log/kern.log
    4. Any other suspicious files in /var/log
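
As a sketch, something along these lines could surface an external kill (the patterns and time are illustrative only):

# Kernel-level kills (e.g. the OOM killer) show up in dmesg:
dmesg -T | grep -iE 'oom|killed process'
# Any syslog lines mentioning fort around the empty run's time:
grep -i 'fort' /var/log/syslog | grep ' 06:0'
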
dorpauli commented 3 years ago

It's very strange, because today it happened at exactly the same time! 2021-05-28_06-00.log

The cron job 0,30 * * * * /home/paul/collect_vrps.sh > /dev/null runs this script:

#!/bin/bash

datum=$(date +"%Y-%m-%d_%H-%M")

/usr/local/bin/fort --mode standalone --log.level debug --validation-log.enabled=true --validation-log.level warning --local-repository /home/paul/.rpki-cache --output.roa /home/paul/csv/$datum.csv --tal /home/paul/tals &> /home/paul/log/$datum.log

Syslog around this time, with the Fort runs before and after:

May 28 05:30:01 fort fort[19059]: INF: Console log output configured; disabling validation logging on syslog.
May 28 05:30:01 fort fort[19059]: INF: (Validation Logs will be sent to the standard streams only.)
May 28 05:30:01 fort fort[19059]: INF: Console log output configured; disabling operation logging on syslog.
May 28 05:30:01 fort fort[19059]: INF: (Operation Logs will be sent to the standard streams only.)
May 28 05:53:19 fort systemd[1]: fwupd.service: Succeeded.
May 28 06:00:01 fort CRON[19207]: (paul) CMD (/home/paul/collect_vrps.sh > /dev/null)
May 28 06:00:01 fort fort[19210]: INF: Console log output configured; disabling validation logging on syslog.
May 28 06:00:01 fort fort[19210]: INF: (Validation Logs will be sent to the standard streams only.)
May 28 06:00:01 fort fort[19210]: INF: Console log output configured; disabling operation logging on syslog.
May 28 06:00:01 fort fort[19210]: INF: (Operation Logs will be sent to the standard streams only.)
May 28 06:17:01 fort CRON[19245]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
May 28 06:25:01 fort CRON[19250]: (root) CMD (test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ))
May 28 06:30:01 fort CRON[19318]: (paul) CMD (/home/paul/collect_vrps.sh > /dev/null)
May 28 06:30:01 fort fort[19321]: INF: Console log output configured; disabling validation logging on syslog.

Since the server started at 10 PM on May 26th, the kernel log contains nothing but this:

May 28 06:55:30 fort kernel: [160503.022966] kauditd_printk_skb: 20 callbacks suppressed
May 28 06:55:30 fort kernel: [160503.022989] audit: type=1400 audit(1622177730.612:32): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=19821 comm="apparmor_parser"
May 28 06:55:30 fort kernel: [160503.023174] audit: type=1400 audit(1622177730.612:33): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-helper" pid=19821 comm="apparmor_parser"
May 28 06:55:30 fort kernel: [160503.023577] audit: type=1400 audit(1622177730.616:34): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/lib/connman/scripts/dhclient-script" pid=19821 comm="apparmor_parser"
May 28 06:55:30 fort kernel: [160503.027498] audit: type=1400 audit(1622177730.620:35): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/{,usr/}sbin/dhclient" pid=19821 comm="apparmor_parser"

In auth.log you can see that this cron session lasts only about 6 seconds, compared to 3 minutes on normal runs:

May 28 05:30:01 fort CRON[19055]: pam_unix(cron:session): session opened for user paul by (uid=0)
May 28 05:33:25 fort CRON[19055]: pam_unix(cron:session): session closed for user paul
May 28 06:00:01 fort CRON[19206]: pam_unix(cron:session): session opened for user paul by (uid=0)
May 28 06:00:07 fort CRON[19206]: pam_unix(cron:session): session closed for user paul
dorpauli commented 3 years ago

The latest log of a run whose output file is empty:

Jun  1 06:00:01 INF: Console log output configured; disabling validation logging on syslog.
Jun  1 06:00:01 INF: (Validation Logs will be sent to the standard streams only.)
Jun  1 06:00:01 INF: Console log output configured; disabling operation logging on syslog.
Jun  1 06:00:01 INF: (Operation Logs will be sent to the standard streams only.)
Jun  1 06:00:01 INF: Configuration {
Jun  1 06:00:01 INF: tal: /home/paul/tals
Jun  1 06:00:01 INF: local-repository: /home/paul/.rpki-cache
Jun  1 06:00:01 INF: sync-strategy: root-except-ta
Jun  1 06:00:01 INF: shuffle-uris: false
Jun  1 06:00:01 INF: maximum-certificate-depth: 32
Jun  1 06:00:01 INF: slurm: (null)
Jun  1 06:00:01 INF: mode: standalone
Jun  1 06:00:01 INF: work-offline: false
Jun  1 06:00:01 INF: daemon: false
Jun  1 06:00:01 INF: server.address:
Jun  1 06:00:01 INF:   <Nothing>
Jun  1 06:00:01 INF: server.port: 323
Jun  1 06:00:01 INF: server.backlog: 4096
Jun  1 06:00:01 INF: server.interval.validation: 3600
Jun  1 06:00:01 INF: server.interval.refresh: 3600
Jun  1 06:00:01 INF: server.interval.retry: 600
Jun  1 06:00:01 INF: server.interval.expire: 7200
Jun  1 06:00:01 INF: rsync.enabled: true
Jun  1 06:00:01 INF: rsync.priority: 50
Jun  1 06:00:01 INF: rsync.strategy: root-except-ta
Jun  1 06:00:01 INF: rsync.retry.count: 2
Jun  1 06:00:01 INF: rsync.retry.interval: 5
Jun  1 06:00:01 INF: rsync.program: rsync
Jun  1 06:00:01 INF: rsync.arguments-recursive:
Jun  1 06:00:01 INF:   --recursive
Jun  1 06:00:01 INF:   --delete
Jun  1 06:00:01 INF:   --times
Jun  1 06:00:01 INF:   --contimeout=20
Jun  1 06:00:01 INF:   --timeout=15
Jun  1 06:00:01 INF:   $REMOTE
Jun  1 06:00:01 INF:   $LOCAL
Jun  1 06:00:01 INF: rsync.arguments-flat:
Jun  1 06:00:01 INF:   --times
Jun  1 06:00:01 INF:   --contimeout=20
Jun  1 06:00:01 INF:   --timeout=15
Jun  1 06:00:01 INF:   --dirs
Jun  1 06:00:01 INF:   $REMOTE
Jun  1 06:00:01 INF:   $LOCAL
Jun  1 06:00:01 INF: rrdp.enabled: true
Jun  1 06:00:01 INF: rrdp.priority: 60
Jun  1 06:00:01 INF: rrdp.retry.count: 2
Jun  1 06:00:01 INF: rrdp.retry.interval: 5
Jun  1 06:00:01 INF: http.enabled: true
Jun  1 06:00:01 INF: http.priority: 60
Jun  1 06:00:01 INF: http.retry.count: 2
Jun  1 06:00:01 INF: http.retry.interval: 5
Jun  1 06:00:01 INF: http.user-agent: fort/1.5.0
Jun  1 06:00:01 INF: http.connect-timeout: 30
Jun  1 06:00:01 INF: http.transfer-timeout: 0
Jun  1 06:00:01 INF: http.idle-timeout: 15
Jun  1 06:00:01 INF: http.ca-path: (null)
Jun  1 06:00:01 INF: log.enabled: true
Jun  1 06:00:01 INF: log.output: console
Jun  1 06:00:01 INF: log.level: debug
Jun  1 06:00:01 INF: log.tag: (null)
Jun  1 06:00:01 INF: log.facility: daemon
Jun  1 06:00:01 INF: log.file-name-format: global-url
Jun  1 06:00:01 INF: log.color-output: false
Jun  1 06:00:01 INF: validation-log.enabled: true
Jun  1 06:00:01 INF: validation-log.output: console
Jun  1 06:00:01 INF: validation-log.level: warning
Jun  1 06:00:01 INF: validation-log.tag: Validation
Jun  1 06:00:01 INF: validation-log.facility: daemon
Jun  1 06:00:01 INF: validation-log.file-name-format: global-url
Jun  1 06:00:01 INF: validation-log.color-output: false
Jun  1 06:00:01 INF: Custom incidences:
Jun  1 06:00:01 INF:   incid-hashalg-has-params (Signed Object's hash algorithm has NULL object as parameters): ignore
Jun  1 06:00:01 INF:   incid-obj-not-der-encoded (Object isn't DER encoded): ignore
Jun  1 06:00:01 INF:   incid-file-at-mft-not-found (File listed at manifest doesn't exist): error
Jun  1 06:00:01 INF:   incid-file-at-mft-hash-not-match (File hash listed at manifest doesn't match the actual file hash): error
Jun  1 06:00:01 INF:   incid-mft-stale (The current time is after the nextUpdate field at the manifest): error
Jun  1 06:00:01 INF:   incid-crl-stale (The current time is after the nextUpdate field at the CRL): error
Jun  1 06:00:01 INF: output.roa: /home/paul/csv/2021-06-01_06-00.csv
Jun  1 06:00:01 INF: output.bgpsec: (null)
Jun  1 06:00:01 INF: output.format: csv
Jun  1 06:00:01 INF: asn1-decode-max-stack: 4096
Jun  1 06:00:01 INF: stale-repository-period: 43200
Jun  1 06:00:01 INF: init-tals: false
Jun  1 06:00:01 INF: thread-pool.server.max: 20
Jun  1 06:00:01 INF: thread-pool.validation.max: 5
Jun  1 06:00:01 INF: }
Jun  1 06:00:01 DBG: rpkiManifest registered. Its nid is 1195.
Jun  1 06:00:01 DBG: signedObject registered. Its nid is 1196.
Jun  1 06:00:01 DBG: rpkiNotify registered. Its nid is 1197.
Jun  1 06:00:01 DBG: id-cp-ipAddr-asNumber (RFC 6484) registered. Its nid is 1198.
Jun  1 06:00:01 DBG: id-cp-ipAddr-asNumber-v2 (RFC 8360) registered. Its nid is 1199.
Jun  1 06:00:01 DBG: id-pe-ipAddrBlocks-v2 registered. Its nid is 1200.
Jun  1 06:00:01 DBG: id-pe-autonomousSysIds-v2 registered. Its nid is 1201.
Jun  1 06:00:01 DBG: id-kp-bgpsec-router registered. Its nid is 1202.
Jun  1 06:00:01 DBG: Pool 'Internal': Thread #0 spawned
Jun  1 06:00:01 DBG: Thread Internal.0: Waiting for work...
Jun  1 06:00:01 DBG: Pool 'Internal': Thread #1 spawned
Jun  1 06:00:01 DBG: Thread Internal.1: Waiting for work...
Jun  1 06:00:01 DBG: Thread Internal.2: Waiting for work...
Jun  1 06:00:01 DBG: Pool 'Internal': Thread #2 spawned
Jun  1 06:00:01 DBG: Pool 'Internal': Thread #3 spawned
Jun  1 06:00:01 DBG: Pool 'Internal': Thread #4 spawned
Jun  1 06:00:01 DBG: Thread Internal.3: Waiting for work...
Jun  1 06:00:01 DBG: Pool 'Internal': Thread #5 spawned
Jun  1 06:00:01 DBG: Thread Internal.4: Waiting for work...
Jun  1 06:00:01 DBG: Thread Internal.5: Waiting for work...
Jun  1 06:00:01 DBG: Pool 'Internal': Thread #6 spawned
Jun  1 06:00:01 DBG: Thread Internal.6: Waiting for work...
Jun  1 06:00:01 DBG: Pool 'Internal': Thread #7 spawned
Jun  1 06:00:01 DBG: Pool 'Internal': Thread #8 spawned
Jun  1 06:00:01 DBG: Pool 'Internal': Thread #9 spawned
Jun  1 06:00:01 DBG: Thread Internal.7: Waiting for work...
Jun  1 06:00:01 DBG: Thread Internal.8: Waiting for work...
Jun  1 06:00:01 DBG: Pool 'Validation': Thread #0 spawned
Jun  1 06:00:01 DBG: Pool 'Validation': Thread #1 spawned
Jun  1 06:00:01 DBG: Pool 'Validation': Thread #2 spawned
Jun  1 06:00:01 DBG: Pool 'Validation': Thread #3 spawned
Jun  1 06:00:01 DBG: Pool 'Validation': Thread #4 spawned
Jun  1 06:00:01 WRN: The validation has begun.
Jun  1 06:00:01 INF: Starting validation.
Jun  1 06:00:01 DBG: Thread Validation.0: Waiting for work...
Jun  1 06:00:01 DBG: Pool 'Validation': Pushed task '/home/paul/tals/lacnic.tal'
Jun  1 06:00:01 DBG: Pool 'Validation': Pushed task '/home/paul/tals/afrinic.tal'
Jun  1 06:00:01 DBG: Pool 'Validation': Pushed task '/home/paul/tals/apnic.tal'
Jun  1 06:00:01 DBG: Pool 'Validation': Pushed task '/home/paul/tals/arin.tal'
Jun  1 06:00:01 DBG: Pool 'Validation': Pushed task '/home/paul/tals/ripe.tal'
Jun  1 06:00:01 DBG: - Stop: false
Jun  1 06:00:01 DBG: - Working count: 0
Jun  1 06:00:01 DBG: - Thread count: 1
Jun  1 06:00:01 DBG: - Empty task queue: false
Jun  1 06:00:01 DBG: Pool 'Validation': Waiting for tasks to be completed
Jun  1 06:00:01 DBG: Thread Validation.0: Claimed task '/home/paul/tals/lacnic.tal'
thread/thread_pool.c:218 (tasks_poll())
object/tal.c:690 (do_file_validation())
object/tal.c:695 (do_file_validation())
object/tal.c:699 (do_file_validation())
object/tal.c:705 (do_file_validation())
object/tal.c:711 (do_file_validation())
object/tal.c:667 (__handle_tal_uri_sync())
object/tal.c:522 (handle_tal_uri())
object/tal.c:529 (handle_tal_uri())
object/tal.c:537 (handle_tal_uri())
object/tal.c:540 (handle_tal_uri())
Jun  1 06:00:01 DBG: Thread Validation.1: Claimed task '/home/paul/tals/afrinic.tal'
thread/thread_pool.c:218 (tasks_poll())
object/tal.c:690 (do_file_validation())
object/tal.c:695 (do_file_validation())
object/tal.c:699 (do_file_validation())
object/tal.c:705 (do_file_validation())
object/tal.c:711 (do_file_validation())
object/tal.c:667 (__handle_tal_uri_sync())
object/tal.c:522 (handle_tal_uri())
object/tal.c:529 (handle_tal_uri())
object/tal.c:537 (handle_tal_uri())
object/tal.c:540 (handle_tal_uri())
Jun  1 06:00:01 DBG: Thread Internal.9: Waiting for work...
Jun  1 06:00:01 DBG: Thread Validation.2: Claimed task '/home/paul/tals/apnic.tal'
thread/thread_pool.c:218 (tasks_poll())
object/tal.c:690 (do_file_validation())
object/tal.c:695 (do_file_validation())
object/tal.c:699 (do_file_validation())
object/tal.c:705 (do_file_validation())
object/tal.c:711 (do_file_validation())
object/tal.c:667 (__handle_tal_uri_sync())
object/tal.c:522 (handle_tal_uri())
object/tal.c:529 (handle_tal_uri())
object/tal.c:537 (handle_tal_uri())
object/tal.c:540 (handle_tal_uri())
Jun  1 06:00:01 DBG: Thread Validation.3: Claimed task '/home/paul/tals/arin.tal'
thread/thread_pool.c:218 (tasks_poll())
object/tal.c:690 (do_file_validation())
object/tal.c:695 (do_file_validation())
object/tal.c:699 (do_file_validation())
object/tal.c:705 (do_file_validation())
object/tal.c:711 (do_file_validation())
object/tal.c:667 (__handle_tal_uri_sync())
object/tal.c:522 (handle_tal_uri())
object/tal.c:529 (handle_tal_uri())
object/tal.c:537 (handle_tal_uri())
object/tal.c:540 (handle_tal_uri())
Jun  1 06:00:01 DBG: Thread Validation.4: Claimed task '/home/paul/tals/ripe.tal'
ydahhrk commented 3 years ago

Interesting. What are the URLs contained in lacnic.tal, afrinic.tal, apnic.tal and arin.tal?

dorpauli commented 3 years ago

arin.tal

rsync://rpki.arin.net/repository/arin-rpki-ta.cer
https://rrdp.arin.net/arin-rpki-ta.cer

apnic.tal

https://rpki.apnic.net/repository/apnic-rpki-root-iana-origin.cer
rsync://rpki.apnic.net/repository/apnic-rpki-root-iana-origin.cer

afrinic.tal

https://rpki.afrinic.net/repository/AfriNIC.cer
rsync://rpki.afrinic.net/repository/AfriNIC.cer

lacnic.tal

https://rrdp.lacnic.net/ta/rta-lacnic-rpki.cer
rsync://repository.lacnic.net/rpki/lacnic/rta-lacnic-rpki.cer

ripe.tal

https://rpki.ripe.net/ta/ripe-ncc-ta.cer
rsync://rpki.ripe.net/ta/ripe-ncc-ta.cer
ydahhrk commented 3 years ago

I'm sorry this has taken so long.

New commit.

dorpauli commented 3 years ago

I'm sorry this has taken so long.

No problem, I hope you will find the bug.

Here's another log from before installing the new commit (I don't think it will help, but it can't do any harm). I will send new logs from the new version when another empty file is generated:

Jun  4 06:00:01 INF: Console log output configured; disabling validation logging on syslog.
Jun  4 06:00:01 INF: (Validation Logs will be sent to the standard streams only.)
Jun  4 06:00:01 INF: Console log output configured; disabling operation logging on syslog.
Jun  4 06:00:01 INF: (Operation Logs will be sent to the standard streams only.)
Jun  4 06:00:01 INF: Configuration {
Jun  4 06:00:01 INF: tal: /home/paul/tals
Jun  4 06:00:01 INF: local-repository: /home/paul/.rpki-cache
Jun  4 06:00:01 INF: sync-strategy: root-except-ta
Jun  4 06:00:01 INF: shuffle-uris: false
Jun  4 06:00:01 INF: maximum-certificate-depth: 32
Jun  4 06:00:01 INF: slurm: (null)
Jun  4 06:00:01 INF: mode: standalone
Jun  4 06:00:01 INF: work-offline: false
Jun  4 06:00:01 INF: daemon: false
Jun  4 06:00:01 INF: server.address:
Jun  4 06:00:01 INF:   <Nothing>
Jun  4 06:00:01 INF: server.port: 323
Jun  4 06:00:01 INF: server.backlog: 4096
Jun  4 06:00:01 INF: server.interval.validation: 3600
Jun  4 06:00:01 INF: server.interval.refresh: 3600
Jun  4 06:00:01 INF: server.interval.retry: 600
Jun  4 06:00:01 INF: server.interval.expire: 7200
Jun  4 06:00:01 INF: rsync.enabled: true
Jun  4 06:00:01 INF: rsync.priority: 50
Jun  4 06:00:01 INF: rsync.strategy: root-except-ta
Jun  4 06:00:01 INF: rsync.retry.count: 2
Jun  4 06:00:01 INF: rsync.retry.interval: 5
Jun  4 06:00:01 INF: rsync.program: rsync
Jun  4 06:00:01 INF: rsync.arguments-recursive:
Jun  4 06:00:01 INF:   --recursive
Jun  4 06:00:01 INF:   --delete
Jun  4 06:00:01 INF:   --times
Jun  4 06:00:01 INF:   --contimeout=20
Jun  4 06:00:01 INF:   --timeout=15
Jun  4 06:00:01 INF:   $REMOTE
Jun  4 06:00:01 INF:   $LOCAL
Jun  4 06:00:01 INF: rsync.arguments-flat:
Jun  4 06:00:01 INF:   --times
Jun  4 06:00:01 INF:   --contimeout=20
Jun  4 06:00:01 INF:   --timeout=15
Jun  4 06:00:01 INF:   --dirs
Jun  4 06:00:01 INF:   $REMOTE
Jun  4 06:00:01 INF:   $LOCAL
Jun  4 06:00:01 INF: rrdp.enabled: true
Jun  4 06:00:01 INF: rrdp.priority: 60
Jun  4 06:00:01 INF: rrdp.retry.count: 2
Jun  4 06:00:01 INF: rrdp.retry.interval: 5
Jun  4 06:00:01 INF: http.enabled: true
Jun  4 06:00:01 INF: http.priority: 60
Jun  4 06:00:01 INF: http.retry.count: 2
Jun  4 06:00:01 INF: http.retry.interval: 5
Jun  4 06:00:01 INF: http.user-agent: fort/1.5.0
Jun  4 06:00:01 INF: http.connect-timeout: 30
Jun  4 06:00:01 INF: http.transfer-timeout: 0
Jun  4 06:00:01 INF: http.idle-timeout: 15
Jun  4 06:00:01 INF: http.ca-path: (null)
Jun  4 06:00:01 INF: log.enabled: true
Jun  4 06:00:01 INF: log.output: console
Jun  4 06:00:01 INF: log.level: debug
Jun  4 06:00:01 INF: log.tag: (null)
Jun  4 06:00:01 INF: log.facility: daemon
Jun  4 06:00:01 INF: log.file-name-format: global-url
Jun  4 06:00:01 INF: log.color-output: false
Jun  4 06:00:01 INF: validation-log.enabled: true
Jun  4 06:00:01 INF: validation-log.output: console
Jun  4 06:00:01 INF: validation-log.level: warning
Jun  4 06:00:01 INF: validation-log.tag: Validation
Jun  4 06:00:01 INF: validation-log.facility: daemon
Jun  4 06:00:01 INF: validation-log.file-name-format: global-url
Jun  4 06:00:01 INF: validation-log.color-output: false
Jun  4 06:00:01 INF: Custom incidences:
Jun  4 06:00:01 INF:   incid-hashalg-has-params (Signed Object's hash algorithm has NULL object as parameters): ignore
Jun  4 06:00:01 INF:   incid-obj-not-der-encoded (Object isn't DER encoded): ignore
Jun  4 06:00:01 INF:   incid-file-at-mft-not-found (File listed at manifest doesn't exist): error
Jun  4 06:00:01 INF:   incid-file-at-mft-hash-not-match (File hash listed at manifest doesn't match the actual file hash): error
Jun  4 06:00:01 INF:   incid-mft-stale (The current time is after the nextUpdate field at the manifest): error
Jun  4 06:00:01 INF:   incid-crl-stale (The current time is after the nextUpdate field at the CRL): error
Jun  4 06:00:01 INF: output.roa: /home/paul/csv/2021-06-04_06-00.csv
Jun  4 06:00:01 INF: output.bgpsec: (null)
Jun  4 06:00:01 INF: output.format: csv
Jun  4 06:00:01 INF: asn1-decode-max-stack: 4096
Jun  4 06:00:01 INF: stale-repository-period: 43200
Jun  4 06:00:01 INF: init-tals: false
Jun  4 06:00:01 INF: thread-pool.server.max: 20
Jun  4 06:00:01 INF: thread-pool.validation.max: 5
Jun  4 06:00:01 INF: }
Jun  4 06:00:01 DBG: rpkiManifest registered. Its nid is 1195.
Jun  4 06:00:01 DBG: signedObject registered. Its nid is 1196.
Jun  4 06:00:01 DBG: rpkiNotify registered. Its nid is 1197.
Jun  4 06:00:01 DBG: id-cp-ipAddr-asNumber (RFC 6484) registered. Its nid is 1198.
Jun  4 06:00:01 DBG: id-cp-ipAddr-asNumber-v2 (RFC 8360) registered. Its nid is 1199.
Jun  4 06:00:01 DBG: id-pe-ipAddrBlocks-v2 registered. Its nid is 1200.
Jun  4 06:00:01 DBG: id-pe-autonomousSysIds-v2 registered. Its nid is 1201.
Jun  4 06:00:01 DBG: id-kp-bgpsec-router registered. Its nid is 1202.
Jun  4 06:00:01 DBG: Pool 'Internal': Thread #0 spawned
Jun  4 06:00:01 DBG: Pool 'Internal': Thread #1 spawned
Jun  4 06:00:01 DBG: Thread Internal.0: Waiting for work...
Jun  4 06:00:01 DBG: Thread Internal.1: Waiting for work...
Jun  4 06:00:01 DBG: Pool 'Internal': Thread #2 spawned
Jun  4 06:00:01 DBG: Thread Internal.2: Waiting for work...
Jun  4 06:00:01 DBG: Pool 'Internal': Thread #3 spawned
Jun  4 06:00:01 DBG: Thread Internal.3: Waiting for work...
Jun  4 06:00:01 DBG: Pool 'Internal': Thread #4 spawned
Jun  4 06:00:01 DBG: Thread Internal.4: Waiting for work...
Jun  4 06:00:01 DBG: Pool 'Internal': Thread #5 spawned
Jun  4 06:00:01 DBG: Thread Internal.5: Waiting for work...
Jun  4 06:00:01 DBG: Pool 'Internal': Thread #6 spawned
Jun  4 06:00:01 DBG: Thread Internal.6: Waiting for work...
Jun  4 06:00:01 DBG: Pool 'Internal': Thread #7 spawned
Jun  4 06:00:01 DBG: Pool 'Internal': Thread #8 spawned
Jun  4 06:00:01 DBG: Thread Internal.7: Waiting for work...
Jun  4 06:00:01 DBG: Pool 'Internal': Thread #9 spawned
Jun  4 06:00:01 DBG: Thread Internal.8: Waiting for work...
Jun  4 06:00:01 DBG: Pool 'Validation': Thread #0 spawned
Jun  4 06:00:01 DBG: Pool 'Validation': Thread #1 spawned
Jun  4 06:00:01 DBG: Pool 'Validation': Thread #2 spawned
Jun  4 06:00:01 DBG: Thread Validation.0: Waiting for work...
Jun  4 06:00:01 DBG: Pool 'Validation': Thread #3 spawned
Jun  4 06:00:01 DBG: Thread Validation.1: Waiting for work...
Jun  4 06:00:01 DBG: Thread Internal.9: Waiting for work...
Jun  4 06:00:01 DBG: Thread Validation.2: Waiting for work...
Jun  4 06:00:01 DBG: Pool 'Validation': Thread #4 spawned
Jun  4 06:00:01 WRN: The validation has begun.
Jun  4 06:00:01 INF: Starting validation.
Jun  4 06:00:01 DBG: Pool 'Validation': Pushed task '/home/paul/tals/lacnic.tal'
Jun  4 06:00:01 DBG: Thread Validation.0: Claimed task '/home/paul/tals/lacnic.tal'
thread/thread_pool.c:218 (tasks_poll())
object/tal.c:690 (do_file_validation())
Jun  4 06:00:01 DBG: Pool 'Validation': Pushed task '/home/paul/tals/afrinic.tal'
Jun  4 06:00:01 DBG: Pool 'Validation': Pushed task '/home/paul/tals/apnic.tal'
Jun  4 06:00:01 DBG: Pool 'Validation': Pushed task '/home/paul/tals/arin.tal'
object/tal.c:695 (do_file_validation())
object/tal.c:699 (do_file_validation())
Jun  4 06:00:01 DBG: Pool 'Validation': Pushed task '/home/paul/tals/ripe.tal'
Jun  4 06:00:01 DBG: - Stop: false
Jun  4 06:00:01 DBG: - Working count: 1
Jun  4 06:00:01 DBG: - Thread count: 3
Jun  4 06:00:01 DBG: - Empty task queue: false
Jun  4 06:00:01 DBG: Pool 'Validation': Waiting for tasks to be completed
Jun  4 06:00:01 DBG: Thread Validation.3: Claimed task '/home/paul/tals/afrinic.tal'
thread/thread_pool.c:218 (tasks_poll())
object/tal.c:690 (do_file_validation())
object/tal.c:695 (do_file_validation())
object/tal.c:699 (do_file_validation())
object/tal.c:705 (do_file_validation())
object/tal.c:711 (do_file_validation())
object/tal.c:667 (__handle_tal_uri_sync())
object/tal.c:522 (handle_tal_uri())
object/tal.c:529 (handle_tal_uri())
object/tal.c:705 (do_file_validation())
object/tal.c:711 (do_file_validation())
object/tal.c:667 (__handle_tal_uri_sync())
object/tal.c:522 (handle_tal_uri())
object/tal.c:529 (handle_tal_uri())
object/tal.c:537 (handle_tal_uri())
object/tal.c:540 (handle_tal_uri())
object/tal.c:537 (handle_tal_uri())
object/tal.c:540 (handle_tal_uri())
Jun  4 06:00:01 DBG: Thread Validation.2: Claimed task '/home/paul/tals/apnic.tal'
thread/thread_pool.c:218 (tasks_poll())
object/tal.c:690 (do_file_validation())
object/tal.c:695 (do_file_validation())
object/tal.c:699 (do_file_validation())
object/tal.c:705 (do_file_validation())
object/tal.c:711 (do_file_validation())
object/tal.c:667 (__handle_tal_uri_sync())
object/tal.c:522 (handle_tal_uri())
object/tal.c:529 (handle_tal_uri())
object/tal.c:537 (handle_tal_uri())
object/tal.c:540 (handle_tal_uri())
Jun  4 06:00:01 DBG: Thread Validation.1: Claimed task '/home/paul/tals/arin.tal'
thread/thread_pool.c:218 (tasks_poll())
object/tal.c:690 (do_file_validation())
Jun  4 06:00:01 DBG: Thread Validation.4: Claimed task '/home/paul/tals/ripe.tal'
dorpauli commented 3 years ago

New log file: 2021-06-07_06-00.log

It's very noticeable that when the bug occurs, it's at 6 AM.

ydahhrk commented 3 years ago

Ok. The output tells me that, most likely, Fort is somehow misusing the libcurl library, or this is a libcurl bug.

What's your libcurl version?

This is how I found mine:

$ apt list --installed | grep libcurl
libcurl3-gnutls/focal-updates,focal-security,now 7.68.0-1ubuntu2.5 amd64 [installed,automatic]
libcurl3-gnutls/focal-updates,focal-security,now 7.68.0-1ubuntu2.5 i386 [installed,automatic]
libcurl4-openssl-dev/focal-updates,focal-security,now 7.68.0-1ubuntu2.5 amd64 [installed]
libcurl4/focal-updates,focal-security,now 7.68.0-1ubuntu2.5 amd64 [installed,automatic]

It's very noticeable that when the bug occurs, it's at 6 AM.

Ok. This is probably the reason why I haven't been able to reproduce it. What's your time zone?

dorpauli commented 3 years ago

This is what I have installed. And I'm on CEST (currently UTC+0200).

libcurl3-gnutls/focal-updates,focal-security,now 7.68.0-1ubuntu2.5 amd64  [Installiert,automatisch]
libcurl4-openssl-dev/focal-updates,focal-security,now 7.68.0-1ubuntu2.5 amd64  [installiert]
libcurl4/focal-updates,focal-security,now 7.68.0-1ubuntu2.5 amd64  [Installiert,automatisch]

PS: On some days the bug also occurs at other times, as you can see a few comments above.

ydahhrk commented 3 years ago

PS: On some days the bug also occurs at other times, as you can see a few comments above.

It's ok; I just want to increase the probability of triggering it.

Debugging libcurl is going to be a project.

ydahhrk commented 3 years ago

I might have fixed the bug. There's a loose end that's going to receive a fresh review tomorrow (my eyes hurt right now), but if it's what I suspect it is, it will turn out to be a non-issue. Fingers crossed.

So, solid commit tomorrow, but if you want to run what's already uploaded, it might lead to valuable output.

dorpauli commented 3 years ago

It only takes five minutes to put the new code to use, so yes, I did. Just tell me whenever I can run new code and I will give it a try.

I have my fingers crossed too. Good luck. :smile:

ydahhrk commented 3 years ago

Today's review was not uplifting. The hypothesis does not entirely match the logs.

Although, it's somewhat possible the behavior of the bug might have changed: Instead of quitting early and writing no ROAs, Fort might take a normal amount of time, but still write no ROAs. It'd be helpful to get this result, at least.

I recently uploaded another test commit. I further disinfected the stack trace printing function (I noticed it was using operations that were probably fine but technically dangerous), and set up handlers for all 31 standard signals except SIGKILL and SIGSTOP (which cannot be caught).

Aside: Would you be able to provide a core dump?
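
In case it's useful, here is a sketch of how core dumps can be enabled for a cron-launched process (the paths are placeholders, and distributions differ in how they route core files):

# Write core files to a predictable location:
sudo sysctl -w kernel.core_pattern=/tmp/core.%e.%p
# In the cron script, before starting fort, lift the core size limit:
ulimit -c unlimited

A resulting core could then be opened with gdb /usr/local/bin/fort /tmp/core.fort.<pid>.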

dorpauli commented 3 years ago

I tried to get a core dump, and with the settings from your link it worked for the provided example, also when run from cron. So I think Fort didn't crash in that way?

But at 6 AM CET this morning the output was empty again. Log file: 2021-06-14_06-00.log

ydahhrk commented 3 years ago

Well. We finally have a stack trace.

SIGSYS The SIGSYS signal is sent to a process when it passes a bad argument to a system call. In practice, this kind of signal is rarely encountered since applications rely on libraries (e.g. libc) to make the call for them. SIGSYS can be received by applications violating the Linux Seccomp security rules configured to restrict them. SIGSYS can also be used to emulate foreign system calls, e.g. emulate Windows system calls on Linux.[18]

https://en.wikipedia.org/wiki/Signal_(IPC)#SIGSYS

Fort is getting killed by that signal. And the signal is being produced by a bug in either Fort, libcurl, libssl, libcrypto, pthread or libc.

Guess I'm going to need their version numbers. Do you know how to find them out?

Alternate question: Are you using the latest Ubuntu, without any special PPAs?
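
In case it helps, a sketch of one way to gather the version numbers (assuming the binary lives at /usr/local/bin/fort):

# Shared libraries the binary actually resolves at load time:
ldd /usr/local/bin/fort | grep -E 'curl|ssl|crypto|pthread|libc'
# Package versions behind them:
dpkg -s libcurl4 libssl1.1 libc6 | grep -E '^Package|^Version'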

dorpauli commented 3 years ago

Ubuntu version: 20.04.2 LTS (GNU/Linux 5.4.0-74-generic x86_64)

apt list --installed | grep -E "libcurl|libssl|libcry|libpth|pth|libc":

klibc-utils/focal,now 2.0.7-1ubuntu5 amd64  [Installiert,automatisch]
libc-bin/focal-updates,now 2.31-0ubuntu9.2 amd64  [Installiert,automatisch]
libc-dev-bin/focal-updates,now 2.31-0ubuntu9.2 amd64  [Installiert,automatisch]
libc6-dbg/focal-updates,now 2.31-0ubuntu9.2 amd64  [Installiert,automatisch]
libc6-dev/focal-updates,now 2.31-0ubuntu9.2 amd64  [Installiert,automatisch]
libc6/focal-updates,now 2.31-0ubuntu9.2 amd64  [Installiert,automatisch]
libcanberra0/focal,now 0.30-7ubuntu1 amd64  [Installiert,automatisch]
libcap-ng0/focal,now 0.7.9-2.1build1 amd64  [Installiert,automatisch]
libcap2-bin/focal,now 1:2.32-1 amd64  [Installiert,automatisch]
libcap2/focal,now 1:2.32-1 amd64  [Installiert,automatisch]
libcbor0.6/focal,now 0.6.0-0ubuntu1 amd64  [Installiert,automatisch]
libcc1-0/focal-updates,focal-security,now 10.2.0-5ubuntu1~20.04 amd64  [Installiert,automatisch]
libcom-err2/focal,now 1.45.5-2ubuntu1 amd64  [Installiert,automatisch]
libcrypt-dev/focal,now 1:4.4.10-10ubuntu4 amd64  [Installiert,automatisch]
libcrypt1/focal,now 1:4.4.10-10ubuntu4 amd64  [Installiert,automatisch]
libcryptsetup12/focal-updates,now 2:2.2.2-3ubuntu2.3 amd64  [Installiert,automatisch]
libctf-nobfd0/focal-updates,focal-security,now 2.34-6ubuntu1.1 amd64  [Installiert,automatisch]
libctf0/focal-updates,focal-security,now 2.34-6ubuntu1.1 amd64  [Installiert,automatisch]
libcurl3-gnutls/focal-updates,focal-security,now 7.68.0-1ubuntu2.5 amd64  [Installiert,automatisch]
libcurl4-openssl-dev/focal-updates,focal-security,now 7.68.0-1ubuntu2.5 amd64  [installiert]
libcurl4/focal-updates,focal-security,now 7.68.0-1ubuntu2.5 amd64  [Installiert,automatisch]
libklibc/focal,now 2.0.7-1ubuntu5 amd64  [Installiert,automatisch]
libnpth0/focal,now 1.6-1 amd64  [Installiert,automatisch]
libssl-dev/focal-updates,now 1.1.1f-1ubuntu2.4 amd64  [installiert]
libssl1.1/focal-updates,now 1.1.1f-1ubuntu2.4 amd64  [Installiert,automatisch]
linux-libc-dev/focal-updates,focal-security,now 5.4.0-74.83 amd64  [Installiert,automatisch]
ldconfig -v | grep -E "libcurl|libssl|libcryp|pth|libc":

    libnpth.so.0 -> libnpth.so.0.1.2
    libcom_err.so.2 -> libcom_err.so.2.1
    libcap.so.2 -> libcap.so.2.32
    libcryptsetup.so.12 -> libcryptsetup.so.12.5.0
    libcrypt.so.1 -> libcrypt.so.1.1.0
    libc.so.6 -> libc-2.31.so
    libcap-ng.so.0 -> libcap-ng.so.0.0.0
    libssl.so.1.1 -> libssl.so.1.1
    libcc1.so.0 -> libcc1.so.0.0.0
    libcurl.so.4 -> libcurl.so.4.6.0
    libctf.so.0 -> libctf.so.0.0.0
    libcurl-gnutls.so.4 -> libcurl-gnutls.so.4.6.0
    libctf-nobfd.so.0 -> libctf-nobfd.so.0.0.0
    libpthread.so.0 -> libpthread-2.31.so
    libcbor.so.0.6 -> libcbor.so.0.6.0
    libcrypto.so.1.1 -> libcrypto.so.1.1
dorpauli commented 3 years ago

Since yesterday it has happened 3 times.

(screenshot attached)

2021-06-14_21-00.log 2021-06-15_06-00.log 2021-06-15_07-00.log

ydahhrk commented 3 years ago

I don't think this will fix the problem, but I just uploaded two new commits.

issue49, as usual.

dorpauli commented 3 years ago

It's as you expected. The log from this morning:

2021-06-18_06-00.log

ydahhrk commented 3 years ago

Ok. This time, I have a theory that fits like a glove.

It's a temporary I/O error which happens to be blowing up in an unconventional way, for what I can best describe as dumb historical reasons.

It should now be caught by --http.retry.count, just like all other I/O errors.

Even if it works, it would be great if you would provide a few 6 AM logs, so I can make sure the error is being handled gracefully. I still can't reproduce it, after all.

dorpauli commented 3 years ago

It doesn't happen at 6 AM every day. Yesterday it happened at 11:30 PM... So I cannot guarantee that I will catch a log where the error is handled. I can just let it run and tell you if the bug doesn't happen anymore.

ydahhrk commented 3 years ago

Yes, but when it happens, it's usually at 6 AM. By reading the log, I want to make sure I fixed the bug cleanly, without quirks.

I/O errors are varied and can happen at any time, so if we get an HTTP retry at 6 AM specifically, that would have a decent chance of yielding the "patched" log.

I suggest grepping 6 AM logs for the following (incomplete) string:

Retrying HTTP request
dorpauli commented 3 years ago

It's very surprising. I grepped for the string in every log this morning, and exactly the 6 AM run gave output:

for i in $(ls log/2021-06-23*); do echo $i; cat $i | grep "Retrying HTTP request"; done
log/2021-06-23_00-00.log
log/2021-06-23_00-30.log
log/2021-06-23_01-00.log
log/2021-06-23_01-30.log
log/2021-06-23_02-00.log
log/2021-06-23_02-30.log
log/2021-06-23_03-00.log
log/2021-06-23_03-30.log
log/2021-06-23_04-00.log
log/2021-06-23_04-30.log
log/2021-06-23_05-00.log
log/2021-06-23_05-30.log
log/2021-06-23_06-00.log
Jun 23 06:00:07 WRN [Validation]: https://rpki.apnic.net/repository/apnic-rpki-root-iana-origin.cer: Retrying HTTP request 'https://rrdp.apnic.net/notification.xml' in 5 seconds, 2 attempts remaining.
Jun 23 06:00:16 WRN [Validation]: https://rpki.apnic.net/repository/apnic-rpki-root-iana-origin.cer: Retrying HTTP request 'https://rrdp.apnic.net/notification.xml' in 5 seconds, 1 attempts remaining.
log/2021-06-23_06-30.log
log/2021-06-23_07-00.log
log/2021-06-23_07-30.log
log/2021-06-23_08-00.log
log/2021-06-23_08-30.log

The CSV file is not empty, by the way. So, well done, I think. 😊

So here's the rather big log for this one: 2021-06-23_06-00.log

ydahhrk commented 3 years ago

The log makes sense. I think we're finally done.

Thank you very much for your patience and continuous availability. Feel free to close the bug if you don't see other problems.

dorpauli commented 3 years ago

The log makes sense. I think we're finally done.

This is good news.

Thank you very much for your patience and continuous availability. Feel free to close the bug if you don't see other problems.

You're welcome. I am glad the problem could be solved. In the meantime I had already begun to doubt myself and the server; I thought maybe there was no bug in the software at all.

There have been no more problems since my last comment. I am closing the issue now. Good job. 👍🏻