Closed coryschwartz closed 1 year ago
Why is this part of the lotus-chain-export helm chart?
@travisperson -- that was an oversight. I moved this to the correct chart now.
This dashboard shows an average execution time of 2.2h (max 2.9h) over the last 7 days, but that's only because we had an outage caused by leaving the datastore reset too late.
The active jobs from this dashboard still show most job execution in under an hour, with some at around 1.5h
Given this data, my vote is to merge this as is, and add the node locker functionality at a later time if needed.
> The active jobs from this dashboard still show most job execution in under an hour, with some at around 1.5h
I don't think this is correct. Which graph are you using to draw the conclusion that most job execution is under an hour? If you are looking at the Active Jobs dashboard, it shows that there is always at least one job running, and for 30 minutes there are two, which would indicate that execution takes around 150 minutes, or 2.5 hours.
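To spell out the arithmetic: this is a rough sketch that assumes jobs start at a fixed interval $P$ and each runs for a duration $D$ with $P < D < 2P$, so that two jobs overlap for $D - P$ of every interval. The 120-minute interval is an assumption inferred from the numbers above, not a value read from the chart configuration.

```latex
D = P + t_{\text{overlap}} = 120\,\text{min} + 30\,\text{min} = 150\,\text{min} = 2.5\,\text{h}
```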
@travisperson good call, I was reading the graph wrong. @coryschwartz please check the nodelocker. Using the wait function you added is likely the best solution to keep logic out of the cleanup job: https://github.com/filecoin-project/filecoin-chain-archiver/pull/52
@travisperson I added the locker in, and it works well enough.
The reset container outputs the following:
waiting for lock
lock acquired. executing reset on lotus-a-lotus-0
Defaulted container "daemon" out of: daemon, temp-jwt (init), keystore-transfer-jwt (init), temp-libp2p (init), keystore-transfer-libp2p (init), keystore-verifier (init), chain-import (init)
pod "lotus-a-lotus-0" deleted
reset executed on lotus-a-lotus-0 with exit codes 0 0
and the acquire-lock container outputs the following:
acquiring lock for lotus-a-lotus-0 with password reset-790f-7af0
peerid:lotus-a-lotus-0, acquired:true, expiry:2022-11-30 10:09:01.856816254 +0000 UTC
lock acquired. Waiting for reset to complete.
reset complete. exiting.
And, of course, the fullnode container is reset.
By the way, we will need to change these CronJobs to v1, not v1beta1, when we move this to the new clusters. I had to do this for my local tests.
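For reference, a minimal sketch of what that change might look like in a CronJob manifest. Only the `apiVersion` line reflects the change described above (CronJob is served under `batch/v1` on current Kubernetes versions, and `batch/v1beta1` has been removed from newer ones). The name, schedule, image, and command are hypothetical placeholders, not values from this chart.

```yaml
# Sketch only: everything except apiVersion and kind is a placeholder.
apiVersion: batch/v1            # previously: batch/v1beta1
kind: CronJob
metadata:
  name: example-datastore-reset # hypothetical name
spec:
  schedule: "0 0 * * *"         # hypothetical schedule
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: reset
              image: example.invalid/reset:latest  # placeholder image
              command: ["/bin/sh", "-c", "echo reset placeholder"]
```

The rest of the CronJob spec has the same shape under both API versions, so the templates should mostly need just the `apiVersion` bump.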
This does not use the archive service locks, but will do datastore resets on a schedule.