These are the things we need to accomplish to get Estuary to the alpha and post-alpha stage. I’d like to treat each as a pillar: each must be built and perfectly levelled to stabilize the Estuary platform.
This is all tech. No productization / product-lifecycle steps here.
[ ] Write an SQL script to remove the majority of the non-active pins (14M+ records) on shuttle-4, i.e. delete all non-active pins.
[ ] The downside of removing these records: if some of those pins are still in the blockstore, anyone who uses the /gw gateway will fail to look up the CID, since the gateway relies on the database record.
[ ] Some failed pins may not yet have been processed by the Storage Provider (SP), so there is a loss of opportunity there.
[ ] Another option is a clean-up script on shuttle-4 that traverses the blockstore using the CIDs from the pins table, identifies the ones the shuttle can't "walk" - meaning the pin is in the database but not in the blockstore (using merkledag.Walk) - and deletes them from the database. It would be an "estuary shuttle reconciler" tool that matches the blockstore CIDs against the pins table.
[ ] Write scripts that can perform backups based on specific filters.
[ ] Write an SQL script to delete the CIDs that don't exist in the local node's blockstore.
Debugging
[ ] Ensure developers have the proper debugging tools (GoLand).
[ ] Set up dedicated shuttles for each developer (for dev testing)
[ ] Enable pprof on all shuttles and the API node
[ ] Enable Grafana agents
Functional
[ ] Revisit the pinning mechanism
[ ] We need to revisit the pinning process, specifically the infinite loops and the initialization of workers that pin specific content. The current process causes a buildup of unnecessary memory allocations in PinningOperation, which contributes to the OOM issue.
[ ] I’d like to explore separating the queuing from the main API node. We’ve discussed this before, and I’d like to revisit it.
[ ] Revisit all the infinite for-loops and check whether we need to introduce intervals or otherwise optimize them.
[ ] Unit Tests (Quality Assurance) - there is a unit-tests branch that has placeholder unit-test source files in Go. It’s not ideal, so we should collectively, slowly and piece by piece, put up “chore” commits to clean up and create unit tests as we go.
Estuary Stability
Overview
Github Project: https://github.com/orgs/application-research/projects/7/views/5
System Errors (Panics)
All on its own page. We need to handle all the panics.
Log file: log_file_from_shuttle6

```
msg":"couldnt decode pid
pinning queue error: context canceled\nfailed to walk DAG\nmain.
failed to handle rpc command: Unable to send restart request: exhausted 5 attempts but failed to open stream to
pinning queue error: context deadline exceeded\nfallback provide failed\nmain
tried to add pin for content we failed to pin previously
failed to handle rpc command: failed to compute commP
failed to handle rpc command
```
Infrastructure
Data Clean up
https://filecoinproject.slack.com/archives/C016APFREQK/p1660258369066179
Functional
- `shuttle.go`
- `handleShuttleMessages`
- `autoretrieve.go`
- `shuttle/main.go`
- `RunRpcConnection`
- websocket connection `handleRpcCmd`
- `websocket.JSON.Send`
- `addDatabaseTrackingToContent`
- `handlers.go`
- `addDatabaseTrackingContent` (duplicate code)
- `websocket.JSON` (duplicate code)
- `handleShuttleConnection`
- `pinmgr.go`
- `Run(workers int)`
- `replication.go`
- `runStagingBucketWorker`
- `runDealWorker`
- `trackbs.go`
- `benchtest/main.go`
- AutoRetrieve (AR)
[ ] Automated / Regression Tests - we should at least run the shell or Postman jobs that exercise the API endpoints.
Functional Improvements
Proposal: Collections API V2
Proposal: Directory API
Proposal: API Versioning for Estuary
Proposal: Proxy-Forwarder
Proposal: API Gateway
Support
Refactor / Rearchitecture