Any comments, suggestions, criticism or objections are appreciated and encouraged!
Quick summary:
Problems:
1- Size
2- Performance
3- Breaking changes
4- Population time
5- Scalability
6- REST only (no pub/sub)

Solutions:
1- BigData technologies
2- Apache Arrow technologies
3- Data lakes of Arrow/Parquet partitions
4- Simplification of scalability and management operations
5- Adding a pub/sub daemon
Thanks for the contribution, @emg110! These are great problems to point out; we've noted your feedback and we're actively thinking about a lot of this ourselves. I'm closing this issue since it's not actionable as a ticket, but I look forward to getting your thoughts as we make progress on some of the above areas.
Indexter is coming, addressing everything mentioned in this ancient ticket. This ticket is kept as a record of how and when it all started.
Algorand Indexer is a great product which serves the progress and growth of Algorand very well, and like all other good software solutions, Indexer is in need of continuous improvement. This issue tries to point out some of those improvements.
Problem
These are, in order of criticality, the most commonly expressed concerns of the developer and enthusiast community, as I could aggregate them from different channels (comments are most welcome):
1- Size: The archival node (node plus indexer) total volume now exceeds one terabyte, and a great portion of this volume exists in both the node and the indexer PostgreSQL database.
2- Performance: `SELECT` query performance is not satisfactory and leads to timeouts in some cases (especially with unwindowed queries, big aggregations, and search operations).
3- Breaking changes: Having two sources of truth and two schemas has led, and will continue to lead, to service breaks during change cycles.
4- Population time: For archival nodes it is already beyond the borders of reason, hence a rapid solution here would be paramount.
5- Scalability: With the current code base and use of PostgreSQL, only a small fraction of the developer community can scale horizontally with ease. For ordinary users it is near impossible!
6- PUBSUB: Instead of HTTP polling via REST endpoints alone, a pub/sub daemon and endpoints are strongly recommended and demanded by the community, so that clients are not forced to implement schedules for HTTP polling (see the sketch after this list).
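To make item 6 concrete, here is a minimal Python sketch contrasting the two models. The polling half uses the real `/v2/transactions` indexer endpoint with a round window (which also sidesteps the unwindowed-query timeouts from item 2); the subscription half is purely hypothetical: no `ws://.../v2/subscribe` endpoint exists in the indexer today, it only illustrates the shape being asked for.

```python
import asyncio
import json
import time

import requests
import websockets  # pip install websockets

INDEXER_URL = "http://localhost:8980"  # default indexer REST port


def handle(txn):
    """Placeholder consumer for incoming transactions."""
    print(txn.get("id"), txn.get("confirmed-round"))


def poll_transactions(min_round, interval=4.0):
    """Today's model: scheduled HTTP polling against the REST API.

    Keeps each request bounded with a round window instead of an
    unwindowed query."""
    while True:
        resp = requests.get(
            f"{INDEXER_URL}/v2/transactions",
            params={"min-round": min_round, "limit": 100},
        )
        resp.raise_for_status()
        for txn in resp.json().get("transactions", []):
            handle(txn)
            min_round = max(min_round, txn["confirmed-round"] + 1)
        time.sleep(interval)  # every client must reinvent this schedule


async def subscribe_transactions():
    """The requested model: the daemon pushes events, nobody polls.

    NOTE: ws://.../v2/subscribe is hypothetical -- no such endpoint
    exists in the indexer today."""
    async with websockets.connect("ws://localhost:8980/v2/subscribe") as ws:
        await ws.send(json.dumps({"topic": "transactions"}))
        async for message in ws:
            handle(json.loads(message))


if __name__ == "__main__":
    poll_transactions(min_round=25_000_000)    # polling model
    # asyncio.run(subscribe_transactions())    # pub/sub model (hypothetical)
```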
Solution
The solution is already out there, in the BigData technology ecosystem (mostly found in the Apache community), including but not limited to:
1- Use universal distributed data lake technologies accompanied by data formats designed for the distributed BigData realm (Apache Arrow (IPC), Apache Parquet, Apache Avro, ...), ready for the wire.
2- Avoid serialization/deserialization as much as possible and use more wire-native/friendly formats.
3- Avoid row-oriented solutions; columnar is the ruling sovereign in BigData, plays best with distributed storage scenarios, and tabular data is just a representation to be produced when required. Note: tabular (row-oriented) data is only good for us humans and not that friendly for machines!
4- Use zero-copy and high-redundancy distributed data approaches to maintain scalability.
5- Keep it simple and stupid when it comes to big data: a `first round-last round` naming convention with all 1000 rounds in one file (sketched after this list).
6- Start planning to harness the power of GPU computation and the benefits of vectorized data (in indexing, for example) as an option for those with a GPU.
7- ETL of data from algod to PostgreSQL currently takes a long time; considering the size of the data and the time it takes, there is certainly room for improvement.
8- Harness the power of LLVM IRs and JIT compilation for maximum performance.
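As a minimal sketch of items 1, 3, 4, and 5 together, assuming pyarrow and an illustrative three-column block schema (the column names and the 1000-round partition size are assumptions for demonstration, not the indexer's actual schema):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Partition naming per item 5: first round and last round in the file
# name, 1000 rounds per file (both numbers are illustrative).
first_round, last_round = 25_000_000, 25_000_999
stem = f"blocks_{first_round}-{last_round}"

# Columnar block data (column names are assumed, not the real schema).
blocks = pa.table({
    "round": pa.array(range(first_round, last_round + 1), pa.uint64()),
    "timestamp": pa.array([1_650_000_000 + 4 * i for i in range(1000)], pa.int64()),
    "txn_count": pa.array([0] * 1000, pa.int32()),
})

# Items 1 and 3: a compressed columnar Parquet partition, ready for any
# object store or distributed query engine.
pq.write_table(blocks, f"{stem}.parquet")

# Item 4: the Arrow IPC file format allows zero-copy reads. Write the
# same table once...
with pa.OSFile(f"{stem}.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, blocks.schema) as writer:
        writer.write_table(blocks)

# ...then memory-map it: columns are scanned in place, with no
# deserialization step between disk and the compute layer.
with pa.memory_map(f"{stem}.arrow") as source:
    table = pa.ipc.open_file(source).read_all()
    print(table.num_rows, table.column("round")[0])
```

Parquet gives compact partitions that existing query engines already understand, while the Arrow IPC copy serves hot zero-copy scans; a real design would pick one, or keep both as cold and hot tiers.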
Dependencies
1- BigData experts and the technology ecosystem.
2- Data lake and object storage experts and toolsets.
3- Distributed/parallel processing experts and creative yet simple ideas.
4- Decentralized storage experts, to solve the problem of making fast catchups decentralized as well (this is not a problem of a technical nature, but from the users' point of view the centralized storage of these fast catchup points on S3 somewhat resembles centralization).