Indexer cumulative strategic constraints

emg110 commented 3 years ago

The Algorand indexer is a great product that serves the progress and growth of Algorand very well, and like all good software solutions, it needs continuous improvement. This issue tries to point out some of those improvements.

Problem

These are, in order of criticality, the concerns most often expressed by the developer and enthusiast user community, as far as I could aggregate them from different channels (comments are most welcome):

1- Size: The total volume of an archival node (node plus indexer) now exceeds one terabyte, and a large portion of this data exists in both the node and the indexer PG database.

2- Performance: SELECT query performance is not satisfactory and leads to timeouts in some cases (especially with unwindowed queries, big aggregations and search ops).

3- Breaking changes: Having two sources of truth and two schemas has led, and will keep leading, to service breaks during change cycles.

4- Population time: For archival nodes it is already beyond the borders of reason, so a rapid solution is paramount.

5- Scalability: With the current code base and its use of PG, only a small fraction of the dev community can scale horizontally with ease. For ordinary users that is nearly impossible!

6- PUBSUB: Instead of only HTTP polling via REST endpoints, a PUB/SUB daemon and endpoints are strongly recommended and demanded by the community, so that users are not forced to implement schedules for HTTP polling.
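
To make the contrast in point 6 concrete, here is a minimal in-process sketch of the publish/subscribe pattern: subscribers register once and are called when a new round is published, instead of polling on a schedule. The `RoundBroker` class and the `"new-round"` topic name are hypothetical, made up for illustration; nothing here is an existing indexer API.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

# A handler receives one message (here: a plain dict describing a round).
Handler = Callable[[dict], None]


class RoundBroker:
    """Tiny in-process pub/sub broker (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a handler to be called for every message on a topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> int:
        """Push a message to all subscribers; return how many were notified."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


broker = RoundBroker()
received: List[dict] = []

# The subscriber registers once; no polling loop or schedule is needed.
broker.subscribe("new-round", received.append)

# The producer side (assumed to be the indexer daemon) pushes each new round.
delivered = broker.publish("new-round", {"round": 1000})
```

A real daemon would deliver over a network transport rather than in-process callbacks, but the shape of the interaction is the same: the producer pushes, the consumer reacts.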

Solution

The solution already exists in the BigData technology ecosystem (mostly found in the Apache community), including but not limited to:

1- Use universal distributed data lake technologies, accompanied by data formats designed for the distributed BigData realm (Apache Arrow (IPC), Apache Parquet, Apache Avro, ...) that are ready for the wire.

2- Avoid serialization/deserialization as much as possible and use more wire-native/friendly formats.

3- Avoid row-oriented solutions. Columnar is the ruling sovereign in BigData and plays best with distributed storage scenarios; tabular data is just a representation, used when required. Note: Tabular (row-oriented) data is only good for us humans and not that friendly for machines!

4- Use zero-copy and high-redundancy distributed data approaches to maintain scalability.

5- Keep it simple and stupid when it comes to big data:

Simple human-readable data structures (file system or NAS partitions (directories), or S3 API object storage buckets, with meaningfully named files). E.g. every 1000 rounds go in a folder or bucket named by a first round-last round convention, with all 1000 rounds in one file.
Backup/restore is as simple as copy and paste or any other file backup.
What you see is what you compute (WYSIWYC) is the best! Minimum overhead in the data's journey from disk to memory and CPU.
Keep your SQL close but your data closer: SQL abilities and features are not the first priority in BigData; the data and how to summon it are. That's why we see many SQL-like grammars and dialects on the turf. Algorand, in general, has a very simple yet powerful data schema, and this becomes a strategic advantage when combined with a simple yet effective querying service module that echoes that simplicity and power with an unrestricted yet simple syntax that gets the job done blazing fast.

6- Start planning to harness the power of GPU computation and the benefits of vector data (in indexing, for example) as an option for those with a GPU.

7- ETL of data from Algod to PG currently takes a long time; considering the size of the data and the time it takes, there is certainly room for improvement.

8- Harness the power of LLVM IRs & JIT for maximum performance.
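
As a rough illustration of the naming convention in point 5 (every 1000 rounds in a folder or bucket named first round-last round), here is a minimal sketch. It assumes partitions are aligned to multiples of 1000 starting at round 0, which the proposal does not specify; `partition_bounds` and `partition_name` are hypothetical helper names, not part of the indexer.

```python
from typing import Tuple

PARTITION_SIZE = 1000  # rounds per partition, as in the example in point 5


def partition_bounds(round_number: int, size: int = PARTITION_SIZE) -> Tuple[int, int]:
    """Return (first_round, last_round) of the partition holding a round.

    Assumption: partitions are aligned to multiples of `size`, starting at 0.
    """
    if round_number < 0:
        raise ValueError("round_number must be non-negative")
    first = (round_number // size) * size
    return first, first + size - 1


def partition_name(round_number: int, size: int = PARTITION_SIZE) -> str:
    """Name the folder or bucket using the first round-last round convention."""
    first, last = partition_bounds(round_number, size)
    return f"{first}-{last}"


# Round 12345 falls in the partition covering rounds 12000 through 12999.
example = partition_name(12345)  # "12000-12999"
```

With a layout like this, locating the data for any round is pure arithmetic on the round number, with no lookup table or database query involved.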

Dependencies

1- BigData experts and technology ecosystem.

2- Data lake and object storage experts and toolsets.

3- Distributed | Parallel processing experts and creative yet simple ideas.

4- Decentralized storage experts to solve the problem of making fast catchups decentralized as well (this is not a technical problem, but from the users' point of view it somewhat resembles centralization, because these fast catchup points are stored centrally on S3).

emg110 commented 3 years ago

Any comments, suggestions, criticism or objections are appreciated and encouraged!

emg110 commented 3 years ago

Quick summary:

Problems: 1- Size 2- Performance 3- Breaking changes 4- Population time 5- Scalability 6- REST only (No pub/sub)

Solutions: 1- BigData Technologies 2- Apache Arrow Technologies 3- Data Lakes of Arrow/Parquet partitions 4- Simplification of scalability and management operations 5- Adding pub/sub daemon

algoanne commented 2 years ago

thanks for the contribution, @emg110 ! These are great problems to point out, we've noted your feedback and we're actively thinking about a lot of this ourselves. I'm closing this issue since it's not actionable as a ticket, but look forward to getting your thoughts as we make progress on some of the above areas.

emg110 commented 2 years ago

Indexter is coming, addressing everything mentioned in this ancient ticket. This is how it all started; I am noting it here to keep a record of when it all began.

algorand / indexer