HathorNetwork / hathor-explorer-service


[Design] Time Series Data #149

Closed lucas3003 closed 2 years ago

lucas3003 commented 2 years ago

Summary

The TimeSeries project will introduce a new way for users to deal with data. They will be able to check dashboards with Token, NFT, Transaction, Block, and Hash Rate information. HTR deposited for minting will not be covered in this document.

This project, however, brings challenges that are beyond its initial scope. For a long-term solution, we would need to rearrange how data is stored and how services are organized in order to get the most out of our data. This rearrangement would need many more dev days than expected for this project. Therefore, we are proposing a short-term solution, changing the data only as much as necessary for this implementation.

Currently, two stacks are candidates for managing the data: ElasticSearch and Prometheus. Both are part of our stack, can be used for time series data, and integrate easily with the UI (using Kibana and Grafana, respectively).

Acceptance Criteria

  1. The Explorer must show the Time Series data of Transactions, Blocks, Tokens, NFTs, and Hash Rates.
  2. Users can check the data from its beginning until the current date (from 12/31/2019 on testnet and from 1/3/2020 on mainnet).
  3. Users can choose a time range of data.
  4. For Tokens, Transactions, NFTs, and Blocks, two graphs must be rendered: the accumulated sum of items over time and the number of items created per period of time. Depending on the time range, the aggregation must change automatically. See the image below for a visual representation.
  5. For Hash Rate, one graph must be rendered: the average hash rate per period of time.
  6. In the header, we should render a message stating "This screen is updated until block at height X and the last update was on timestamp Y", so users can know the last time the graph was updated.
  7. By default, we must render data from last week until now.
  8. This document must result in a short-term solution while we work in parallel on a scalable solution for this demand.

Visual representation of acceptance criterion 4 (shown only for Tokens; consider 2 more dashboards for NFTs, Transactions, and Blocks, and 1 more for Hash Rate): Screen Shot 2022-05-10 at 11 35 29

Alternatives

Prometheus

Prometheus is currently used for monitoring our services and full-nodes. We already use a Grafana dashboard to help us visualize the data. Prometheus works by pulling data from exporters. In our case, there is an exporter on the full-node and another on the Hathor Wallet Service; either can be used as a source of truth for the domain of this project.

This is the high-level diagram if we use Prometheus:

TimeSeries-ShortTerm-Prometheus drawio

Description of steps:

  1. Prometheus pulls data from the exporter. There are two options for getting the data; check the Pulling the Data section below for more information.
  2. Prometheus stores the data obtained from the exporter. There are two options for data storage; check the Data Storage section below for more information.
  3. The user, accessing the Explorer, requests Time Series data.
  4. Grafana, embedded on the page, gets the data from Prometheus.

Pulling the data

We might get data directly from the full node exporter or use the Wallet Service exporter. Both would require development.

Data Storage

The first option is to keep the data in local storage, the default mode provided by Prometheus and what we currently use for monitoring our systems. However, Prometheus itself states that “Prometheus's local storage is not intended to be durable long-term storage”[1][2]. Prometheus's default retention policy is 15 days (which can be changed), and we have more than 2 years of data, so we consider that we need long-term storage.

Alternatively, it is possible to attach remote storage to the Prometheus service. One of the best-known options is InfluxDB[3]. Whether we chose InfluxDB or another option, we would need to introduce a new tech component into our stack, which is not the intention of this short-term solution and would add extra costs. A performance comparison[4] concluded that InfluxDB outperforms ElasticSearch on ingest performance, but that ElasticSearch's mean query response time is 11.54x faster than InfluxDB's.

Backfilling Data

As exporters only provide the latest data, we would need to create a script to backfill the old data into Prometheus. According to this article[5], we would need to:

  1. Create a file following the OpenMetrics format.
  2. Run the backfill command inside Prometheus passing the file we created as a parameter.
  3. Restart Prometheus server.

Steps 2 and 3 should be quick, but step 1 requires script development and reading data from the Wallet Service SQL database. This would add 2 dev days to the project.
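As an illustration only, here is a minimal sketch of what the step 1 script could look like, assuming a hypothetical wallet-service table `transaction` with `timestamp` and `height` columns and the pymysql client; the generated file would then be loaded with promtool (step 2):

```python
import pymysql

# Assumed connection details and schema; purely illustrative.
conn = pymysql.connect(host="wallet-service-db", user="reader",
                       password="...", database="wallet_service")

with conn.cursor() as cursor, open("backfill.om", "w") as out:
    out.write("# TYPE hathor_blocks_total gauge\n")
    # One OpenMetrics sample per block: cumulative block count over time
    # (metric name, value, unix timestamp).
    cursor.execute(
        "SELECT timestamp, height FROM `transaction` "
        "WHERE height IS NOT NULL ORDER BY timestamp"
    )
    for ts, height in cursor.fetchall():
        out.write(f"hathor_blocks_total {height} {ts}\n")
    out.write("# EOF\n")  # OpenMetrics files must end with an EOF marker

# Step 2 would then be something like:
#   promtool tsdb create-blocks-from openmetrics backfill.om <prometheus data dir>
```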

Pros and Cons

As advantages of using Prometheus, we have:

As disadvantages, we have:

Conclusion

Although Prometheus is an excellent tool for monitoring systems, it does not meet the criteria of this project, especially because of the data storage concerns we presented above.

ElasticSearch

ElasticSearch is a search engine that was recently introduced into Hathor's tech stack. External sources (like Logstash) must push data into the cluster, where indexes are created and data is mapped. On top of the ElasticSearch data, we can use Kibana to create dashboards and embed them directly in Hathor Explorer (similarly to Grafana).

This is the high-level diagram if we use ElasticSearch:

TimeSeries-ShortTerm-Elastic drawio

Description of steps:

  1. Logstash runs a select statement on both the Transaction and Token tables to transfer the data to ElasticSearch. Note: 1a is already implemented.
  2. Logstash handles the data and sends it to the correct index. In the case of the Transaction pipeline, we will need to evaluate whether the transaction is a block. If it is, send the information to the Block index, including a new field for Hash Rate; if it is not a block, send it to the Transaction index (a sketch of this routing decision follows the list). Note: 2a is already implemented.
  3. Users, via Hathor Explorer, access a Kibana iframe.
  4. Kibana directly accesses the cluster, querying for information.
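As an illustration of the routing decision in step 2 (the field names here are assumptions about the wallet-service schema, not confirmed by the design), the per-row logic is roughly:

```python
# Illustrative only: assumes blocks can be told apart from regular transactions
# by a non-null height column in the wallet-service Transaction table.
def target_index(row: dict) -> str:
    """Pick the ElasticSearch index for one row read from the Transaction table."""
    if row.get("height") is not None:
        return "block"  # the hash_rate field is added for these rows (see the discussion below)
    return "transaction"
```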

Backfilling weight data

Currently, the Hathor Wallet Service does not store the weight of Transactions and Blocks. We will need to:

  1. Change the Hathor Wallet Service database to add a weight column to the Transaction table.
  2. Make the sync algorithm get the weight from the full node and insert it into the table.
  3. Create a script to backfill the weight of old Transactions and Blocks (see the sketch after this list).
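A rough sketch of what the step 3 backfill script could look like; the full-node endpoint, table, and column names are assumptions for illustration only:

```python
import pymysql
import requests

FULLNODE = "http://localhost:8080/v1a"  # assumed base URL of a local full node

conn = pymysql.connect(host="wallet-service-db", user="writer",
                       password="...", database="wallet_service")

with conn.cursor() as cursor:
    cursor.execute("SELECT tx_id FROM `transaction` WHERE weight IS NULL")
    for (tx_id,) in cursor.fetchall():
        # Assumed endpoint returning the transaction details, including its weight.
        resp = requests.get(f"{FULLNODE}/transaction", params={"id": tx_id})
        weight = resp.json()["tx"]["weight"]
        cursor.execute(
            "UPDATE `transaction` SET weight = %s WHERE tx_id = %s",
            (weight, tx_id),
        )
conn.commit()
```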

Pros and Cons

As advantages of using ElasticSearch, we have:

As disadvantages, we have:

Conclusion

Our application will not be ingest-heavy at the beginning, and we can tolerate data taking seconds or even minutes to be ingested. On the other hand, query response time is much more important, as many users may be accessing the data from Hathor Explorer. Therefore, we consider ElasticSearch to be the best short-term solution for this use case.

Task Breakdown

Considering ElasticSearch as the best tool for the solution, we will have the tasks:

Wallet Service - Total: 2.1 dev/days

Logstash - Total: 2 dev/days

ElasticSearch - Total: 1.6 dev/days

Kibana - Total: 0.8 dev/days

Explorer - Total: 1.6 dev/days

Explorer Service - 0.7 dev/days

Total: 8.8 dev/days.

References:

[1] - https://prometheus.io/docs/prometheus/latest/storage/#operational-aspects
[2] - https://stackoverflow.com/questions/68891824/why-prometheus-is-not-suitable-for-long-term-storage
[3] - https://www.influxdata.com/
[4] - https://jolicode.com/blog/influxdb-vs-elasticsearch-for-time-series-and-metrics-data
[5] - https://medium.com/tlvince/prometheus-backfilling-a92573eb712c

luislhl commented 2 years ago

have basic aggregation features.

Can you give one example of an aggregation feature? Like aggregation by time periods like month, week, etc?

The Explorer must show the Time Series data of Transactions, Blocks, Tokens and Hash Rates.

Just the total number of each one by time?

For the tokens, will we want to have the number of NFTs as well? Or just any kind of token?

Using Wallet Service is preferred because it has treatments for reorgs or other errors that can happen inside the full-node.

If the main concern is reorgs, I think the full-nodes will handle them just as well, for the sake of the metrics we need to collect.

I don't see any real difference between them in this field. If there is one, maybe you should detail the description a little more.

Thinking through a reorg flow, it seems to me that both solutions would behave exactly the same. For instance, consider a reorg that changes the number of blocks in the accepted chain:

  1. Prometheus collects the current number of blocks from the [full-node|wallet-service]
  2. The reorg happens, making the [full-node|wallet-service] update its DB
  3. Prometheus collects the new number of blocks after the reorg from the [full-node|wallet-service]

Both the full-node and the wallet-service should return the same information in each of the steps, right?

Extra development for backfilling the data

Maybe you could explain this a little more. You only mention this one time without explanation.

I know what this is about just because you talked to me on Slack, but for others it may not be so clear.

make a hash rate calculation and send the result to the Hash Rate index

Do we have confirmation that this is really feasible with Logstash? And what would be the calculation?

Users, via Hathor Explorer, access a Kibana iframe.

I think there is a concern here about someone trying to abuse the iframe to DDoS the service, right?

Do they have any built-in protection against this? Or can we do something about it?

I see that you have a task "Make sure security, mapping, caching, and throttling configuration on ElasticSearch will support this new use case." But have these security and throttling measures already been mapped?



lucas3003 commented 2 years ago

Can you give one example of an aggregation feature? Like aggregation by time periods like month, week, etc?

I changed the design to include two visualizations: accumulated data over a period of time and the number of items created per period of time. I included a visual representation of how the Tokens visualization would look. The user will be able to change the time range, and the Kibana dashboard will automatically adjust the aggregation (by hour, day, week, month, ...).

Just the total number of each one by time?

For the tokens, will we want to have the number of NFTs as well? Or just any kind of token?

It will be the accumulated sum and the total number of each by time. I included a visual representation to help with understanding this part.

We won't have NFTs at this time, just any kind of token. NFT information is quick to get, though, so we may be able to include it without extra effort.

Both the full-node and the wallet-service should return the same information in each of the steps, right?

Yes, you are right. I removed the part where I stated that getting the info from wallet-service is better than the full-node. Both are equal on this part.

Maybe you could explain this a little more. You only mention this one time without explanation.

I added extra sections about backfilling data on both Prometheus and ElasticSearch.

Do we have confirmation that this is really feasible with Logstash? And what would be the calculation?

It is feasible, and I will test our use case in the PoC. This is possible using the Ruby Filter Plugin. I will use it in Logstash to add a field called hash_rate to the Block index, and the calculation will be 2**(block.weight - log(self.avg_time_between_blocks, 2)), where avg_time_between_blocks is 30.
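For reference, the same calculation written out in Python (a sketch of the formula only, not the actual Logstash filter code):

```python
from math import log2

AVG_TIME_BETWEEN_BLOCKS = 30  # seconds

def hash_rate(block_weight: float) -> float:
    """Estimated hashes per second implied by a block's weight."""
    return 2 ** (block_weight - log2(AVG_TIME_BETWEEN_BLOCKS))

# Example: a block with weight 60 implies roughly 3.8e16 hashes per second.
print(hash_rate(60))
```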

Do they have any built-in protection against this? Or can we do something about it?

Elastic has built-in API rate limiting, which we will test during the PoC. Also, this is currently not a big concern, as this is a short-term solution and few features are running inside our ES cluster (the Tokens page and now the feature being introduced in this design).

would add a PoC as one of the tasks, to make sure things work as we expect.

I added the PoC as a task.

pedroferreira1 commented 2 years ago

My main concern with the design is the security of using Kibana's iframe. Did you do any PoC with that? Can users access the data somehow? Or change the Kibana dashboard? How are caching and rate limits handled for it?

For the tokens, will we want to have the number of NFTs as well? Or just any kind of token?

From Luis question here, I feel we could have a dashboard with the number of NFTs, if possible.

Pulling the Data: We might get data directly from the full node exporter, which is already built, or use the Wallet Service exporter, which would require development

From full node exporter we also would need development because we don't have the tokens metrics there.

It consumes much more disk space. (4.49x more disk than what InfluxDB needs, for comparison purposes). Our current use case running on ElasticSearch consumes 894MB out of 60GB we have combined in two regions.

What happens if the storage is fully consumed? Do we have alerts for that?

Our application will not be ingest-heavy at the beginning. Also, we tolerate data to take seconds or even minutes to be ingested. On the other hand, response time is much more important, as many users may be accessing the data from Hathor Explorer. Therefore, we consider ElasticSearch to be the best short-term solution for this use case.

I agree with the conclusion.

It's still not clear to me what the explorer UI will look like. We are going to have two charts each for Tokens, Transactions, and Blocks, and one chart for Hash Rate, is that correct? If we add NFTs separately, that's two more. Are we going to include orphan blocks in this blocks index, or just the height of the blockchain? (I think we should have the height.)

Also about the chart, what's the default UI? Last week of data would be my first choice.


This is one more feature that will depend on the wallet service sync (which will later be changed to the data service sync). Should we maybe use one more dev day to create a mechanism in ElasticSearch to get the latest update timestamp? That way we can show in the explorer screen "This screen is updated until block at height X and the last update was on timestamp Y".

I feel it's more interesting than just removing the feature like we are doing in the Tokens API (which is working perfectly before we release). What do you think?

lucas3003 commented 2 years ago

My main concern with the design is about the security of using Kibana's iframe. Did you do any PoC with that? Can users access the data somehow? Or change the Kibana dashboard? How is the cache of that and rate limits?

I will confirm that in the PoC, but according to this ElasticSearch PR, we will need to set up an anonymous user that Kibana will use to log in, and then define read-only privileges on the indexes we want.

From Luis question here, I feel we could have a dashboard with the number of NFTs, if possible.

Yes, I agree. We should not have extra development. I added that to the design.

From full node exporter we also would need development because we don't have the tokens metrics there.

Thanks! I updated the design with this information.

What happens if the storage is fully consumed? Do we have alerts for that?

We do have alerts configured here. We are currently using 894MB out of 60GB we have available.

We are going to have two charts for Tokens, Transactions and Blocks, and one chart for Hash Rate, is that correct?

Yes, in total we will have 9 charts (2 for tokens, 2 for transactions, 2 for NFTs, 2 for blocks, 1 for hash rate). I updated the visual representation description.

Should we use maybe one more dev day to create a mechanism in the Elastic Search, so we can get the latest timestamp update? So we can share in the explorer screen "This screen is updated until block at height X and the last update was on timestamp Y"?

It is possible by just adding a new API on the Explorer Service that calls ElasticSearch for the last @timestamp and transaction_timestamp. One point is feasible for that. I added it to our design.
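A minimal sketch of such an endpoint, assuming the elasticsearch Python client and an illustrative index name and cluster address:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # illustrative cluster address

def last_update(index: str = "transaction"):
    """Fetch @timestamp and transaction_timestamp of the most recent document."""
    resp = es.search(
        index=index,
        body={
            "size": 1,
            "sort": [{"@timestamp": {"order": "desc"}}],
            "_source": ["@timestamp", "transaction_timestamp"],
        },
    )
    return resp["hits"]["hits"][0]["_source"]
```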

pedroferreira1 commented 2 years ago

I will confirm that in the PoC, but according to https://github.com/elastic/kibana/pull/79985, we will need to set up an anonymous user that Kibana will use to log in, and then define read-only privileges on the indexes we want.

That's great. I feel we should start with this PoC just to make sure everything will work as we expect.

For me this is approved.

luislhl commented 2 years ago

✔️ for me as well