filecoin-project / helm-charts


add datastore resets #180

Closed coryschwartz closed 1 year ago

coryschwartz commented 1 year ago

This does not use the archive service locks, but will do datastore resets on a schedule.
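
For context, a minimal sketch of the shape such a scheduled reset might take as a Kubernetes CronJob; the names, schedule, image, and script path below are illustrative placeholders, not the actual chart contents:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: lotus-datastore-reset        # placeholder name
spec:
  schedule: "0 4 * * 0"              # illustrative: weekly, Sunday 04:00 UTC
  concurrencyPolicy: Forbid          # never run two resets at once
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: datastore-reset   # needs RBAC to exec into / delete the lotus pod
          restartPolicy: Never
          containers:
            - name: reset
              image: bitnami/kubectl:latest     # any image with kubectl works
              command: ["/bin/sh", "-c", "/scripts/reset.sh lotus-a-lotus-0"]  # hypothetical reset script
```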

travisperson commented 1 year ago

Why is this part of the lotus-chain-export helm chart?

coryschwartz commented 1 year ago

@travisperson -- that was an oversight. I moved this to the correct chart now.

ognots commented 1 year ago

This dashboard shows the average execution time is 2.2h (max 2.9h) over the last 7 days, but that's only because we had an outage caused by leaving the datastore reset too late.

The active jobs from this dashboard still show most job execution in under an hour, with some at around 1.5h

Given this data, my vote is to merge this as-is and add the node locker functionality at a later time if needed.

travisperson commented 1 year ago

The active jobs from this dashboard still show most job execution in under an hour, with some at around 1.5h

I don't think this is correct. Which graph are you using to draw the conclusion that most job executions are under an hour? If you are looking at the Active Jobs dashboard, it shows that there is always at least one job running, and for 30 minutes there are two, which would indicate that execution takes around 150 minutes, or 2.5 hours.

ognots commented 1 year ago

@travisperson good call, I was reading the graph wrong. @coryschwartz please check the nodelocker. Using the wait function you added is likely the best solution to keep logic out of the cleanup job: https://github.com/filecoin-project/filecoin-chain-archiver/pull/52

coryschwartz commented 1 year ago

@travisperson I added the locker in, and it works well enough.

The reset container outputs the following:

waiting for lock
lock acquired. executing reset on lotus-a-lotus-0
Defaulted container "daemon" out of: daemon, temp-jwt (init), keystore-transfer-jwt (init), temp-libp2p (init), keystore-transfer-libp2p (init), keystore-verifier (init), chain-import (init)
pod "lotus-a-lotus-0" deleted
reset executed on lotus-a-lotus-0 with exit codes 0 0
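
Reading the log, the `Defaulted container "daemon"` line comes from a `kubectl exec` without `-c`, and the pod deletion from `kubectl delete`, so the reset appears to be two kubectl calls whose exit codes are reported at the end. A rough sketch of what that container command might look like; the script name, pod name, and lock hand-off below are placeholders inferred from the output, not the actual implementation:

```yaml
# sketch of the reset container in the Job spec
containers:
  - name: reset
    image: bitnami/kubectl:latest
    command:
      - /bin/sh
      - -c
      - |
        POD="lotus-a-lotus-0"                                 # target pod, placeholder
        echo "waiting for lock"
        # ... wait for the acquire-lock container to signal (mechanism elided) ...
        echo "lock acquired. executing reset on ${POD}"
        kubectl exec "${POD}" -- /scripts/clear-datastore.sh  # hypothetical in-pod reset step
        rc1=$?
        kubectl delete pod "${POD}"
        rc2=$?
        echo "reset executed on ${POD} with exit codes ${rc1} ${rc2}"
```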

The acquire-lock container does the following:

acquiring lock for lotus-a-lotus-0 with password reset-790f-7af0
peerid:lotus-a-lotus-0, acquired:true, expiry:2022-11-30 10:09:01.856816254 +0000 UTC
lock acquired. Waiting for reset to complete.
reset complete. exiting.

And, of course, the fullnode container is reset.
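
The two containers appear to run in the same Job pod and hand state off to each other: acquire-lock takes and holds the nodelocker lock while reset does its work, then both exit. One way they could coordinate is through sentinel files on a shared emptyDir; the nodelocker call itself is elided here, and the file names, images, and env var are assumptions for illustration, not the actual implementation:

```yaml
# sketch of the Job pod spec: lock holder and reset runner hand off via a shared emptyDir
spec:
  volumes:
    - name: sync
      emptyDir: {}
  containers:
    - name: acquire-lock
      image: filecoin-chain-archiver:latest          # placeholder image for the nodelocker client
      volumeMounts: [{ name: sync, mountPath: /sync }]
      command:
        - /bin/sh
        - -c
        - |
          echo "acquiring lock for lotus-a-lotus-0 with password ${RESET_PASSWORD}"
          # ... acquire (and keep renewing) the nodelocker lock here; exact CLI/API elided ...
          touch /sync/lock-acquired
          echo "lock acquired. Waiting for reset to complete."
          until [ -f /sync/reset-done ]; do sleep 5; done
          echo "reset complete. exiting."
    - name: reset
      image: bitnami/kubectl:latest
      volumeMounts: [{ name: sync, mountPath: /sync }]
      command:
        - /bin/sh
        - -c
        - |
          echo "waiting for lock"
          until [ -f /sync/lock-acquired ]; do sleep 5; done
          # ... kubectl exec / kubectl delete pod as sketched above ...
          touch /sync/reset-done
```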

By the way, we will need to change these CronJobs from batch/v1beta1 to batch/v1 when we move this to the new clusters. I had to do this for my local tests.
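
For reference, that change is just the apiVersion on the CronJob manifests; batch/v1 became available in Kubernetes 1.21 and batch/v1beta1 is removed in 1.25:

```yaml
# before: removed in Kubernetes 1.25
apiVersion: batch/v1beta1
kind: CronJob

# after: available since Kubernetes 1.21
apiVersion: batch/v1
kind: CronJob
```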