HathorNetwork / hathor-explorer-service

MIT License

[Design] Tokens API #123

Closed lucas3003 closed 2 years ago

lucas3003 commented 2 years ago

Problem and Opportunity

On Hathor Explorer, it is not trivial to find a particular token or a list of them. The only way to do this is by finding the token in a transaction. Then, the user can navigate to the token's detail page (Example here).

This creates an opportunity to enhance the searchability of tokens. Giving users the option to list all of them in an ordered, paginated way will give those tokens greater visibility. Ultimately, this can lead to more transactions and better usage of the platform.

Solution

There is a short-term and a long-term solution. The long-term solution consists of creating a new service that would be the authoritative source of token data, which is currently provided by the Wallet Service. The short-term solution consists of relying on the Wallet Service as the owner of token data, bringing this data into the Explorer Service, and ingesting it into ElasticSearch, as we will demonstrate below.

High Level Design


Except for the Tokens RDS database, the Metadata S3 bucket, the Error Queue, and the Slack trigger, all components will be newly created.

Steps 1 through 5 describe the synchronization process. Once a day (this frequency can be changed), Logstash will pull changes from two sources, the Tokens DB, which is the authoritative source of tokens, and the Token Metadata S3 bucket, which knows which tokens are NFTs, and send them to ElasticSearch.
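As a rough sketch, the RDS side of this sync could be a Logstash pipeline along the lines below. The connection string, credentials, schedule, table name, and index name are illustrative assumptions, not final values:

```conf
input {
  jdbc {
    # Assumed connection details; the real host, database and table names may differ.
    jdbc_connection_string => "jdbc:mysql://<wallet-service-rds>:3306/wallet_service"
    jdbc_user => "${JDBC_USER}"
    jdbc_password => "${JDBC_PASSWORD}"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    schedule => "0 3 * * *"  # once a day; the frequency can be changed
    # Only pick up tokens changed since the last run, using the new column.
    statement => "SELECT * FROM token WHERE modification_time > :sql_last_value"
    use_column_value => true
    tracking_column => "modification_time"
    tracking_column_type => "timestamp"
  }
}
output {
  elasticsearch {
    hosts => ["${ELASTICSEARCH_URL}"]
    index => "tokens"        # assumed index name
    document_id => "%{uid}"  # upsert by token UID so re-runs are idempotent
  }
}
```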

Steps 6 through 9 describe how the client (Explorer) will communicate with the API and how each request will be passed on to ElasticSearch.

Caching

ElasticSearch will handle caching for us. We will set up an additional layer of caching on API Gateway, with a TTL of 30 minutes.

ElasticSearch vs DynamoDB as Search Engine

For the search engine, both ElasticSearch and DynamoDB were considered; ElasticSearch was chosen.

Wallet Service

On the Wallet Service, two new columns will be added to the Tokens DB: modification_time and insertion_time. In the migration file, all existing tokens will receive the current datetime as their initial value. This is necessary for Logstash to know which records to process.

Task Breakdown

| Task | Effort (dev-days) |
| --- | --- |
| Create migration file on Sequelize to include two new columns | 2 |
| Total | 2 |

Costs

No cost will be added to the Wallet Service.

Hathor Explorer Service (Backend)

The Hathor Explorer Service will receive most of the back-end changes. As explained above, we will create the following resources:

API (Hathor Explorer Service - API Gateway):

GET /tokens?from=:from&query=:query&sort=:sort

- from: offset; start the results at the nth record.
- query: the query made by the customer (searching by UID, name, symbol, or custom token/NFT).
- sort: which field to sort by, and whether ascending or descending.
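To make the parameter handling concrete, here is a hedged sketch of how the TokensAPI handler might translate these parameters into an ElasticSearch request body. The field names (uid, name, symbol), the sort format "field:order", and the default page size are assumptions based on the description above, not the final implementation:

```python
def build_search_body(from_=0, query=None, sort=None, page_size=10):
    """Translate /tokens query-string parameters into an ElasticSearch body."""
    body = {"from": int(from_), "size": page_size}
    if query:
        # Match the search text against UID, name, and symbol at once.
        body["query"] = {
            "multi_match": {
                "query": query,
                "fields": ["uid", "name", "symbol"],
            }
        }
    else:
        body["query"] = {"match_all": {}}
    if sort:
        # sort is assumed to arrive as "<field>:<asc|desc>", e.g. "name:asc".
        field, _, order = sort.partition(":")
        body["sort"] = [{field: {"order": order or "asc"}}]
    return body
```

For example, `build_search_body(20, "HTR", "name:desc")` produces a body that skips the first 20 records, searches for "HTR", and sorts by name descending.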

Task breakdown

| Task | Effort (dev-days) |
| --- | --- |
| Create new endpoint on API Gateway that will be called by Hathor Explorer (Frontend) | 2 |
| Create TokensAPI Handler | 2 |
| Create, configure, and test ElasticSearch service | 2 |
| Configure Logstash with JDBC and S3 input plugins | 3 |
| Create error flow for Logstash | 2 |
| Update Explorer Service documentation | 0.5 |
| Configure throttling and caching | 0.5 |
| Total | 12 |

Costs

ElasticSearch

Different quotations were made for ElasticSearch.

Using AWS OpenSearch:

| Instance | Hardware | Quantity | Total cost (eu-central-1) | Total cost (us-east-1) |
| --- | --- | --- | --- | --- |
| Data Instance | t2.small.search | 2 | 61.32 USD/month | 52.56 USD/month |
| Dedicated Master Instance | t2.small.search | 3 | 91.98 USD/month | 78.84 USD/month |
| Free tier (12 months) | 1 t2.small.search and 10GB storage | 1 | -30.66 USD/month | -26.28 USD/month |
| Total | - | - | 122.64 USD/month | 105.12 USD/month |

Using ElasticSearch Service (Calculator on https://cloud.elastic.co/pricing)

| Service | Zones | Storage | RAM | Total cost (us-east-1) | Total cost (eu-central-1) |
| --- | --- | --- | --- | --- | --- |
| ElasticSearch | 2 | 30GB | 1GB | 64.28 USD/month | 81.99 USD/month |
| Integrations Server | 1 | 12GB | 1GB | 0 USD/month | 0 USD/month |
| Kibana | 1 | - | 1GB | 0 USD/month | 0 USD/month |
| Enterprise Search | 1 | - | 2GB | 0 USD/month | 0 USD/month |
| Total | - | - | - | 73.36 USD/month | 93.52 USD/month |

Obs. This is under the Gold subscription plan. See more information about the plans here.

For cost estimation, we will consider the value of 73.36 USD/month, using Elastic Cloud on us-east-1.

For comparison purposes, a DynamoDB table with cache (DAX) would cost approximately 30 USD/month using a t3.small DAX node.

Lambda

Considering that the Lambda function will handle 100,000 requests per month, we will have a total cost of 0.11 USD/month.

API Gateway

API Gateway charges 3.70 USD per million requests. Considering 100,000 requests/month, we will have an estimated total cost of 0.37 USD/month.
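The API Gateway figure is simple arithmetic over the per-request price:

```python
# Per-million price taken from the design; request volume is the estimate above.
price_per_million = 3.70     # USD per million requests
requests_per_month = 100_000
monthly_cost = price_per_million * requests_per_month / 1_000_000
print(f"{monthly_cost:.2f} USD/month")  # prints 0.37 USD/month
```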

Logstash

Logstash will run as AWS Fargate tasks, using Amazon EFS as storage. To give customers a better experience, it will run continuously, syncing the latest data from RDS and the S3 bucket.

Two task definitions will be created: one that will connect to RDS and another that will connect to S3. If a task fails, the Logstash DLQ will be populated and the information will be sent to an SNS topic. When a DLQ message reaches the SNS topic, a CloudWatch alarm will fire, triggering a Lambda function that sends a message to Slack, where the dev team tracks errors.
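As an illustration, the Slack-notifying Lambda at the end of this chain could look like the sketch below. The message format and the SLACK_WEBHOOK_URL environment variable are assumptions; the team's actual alerting setup may differ:

```python
import json
import os
import urllib.request


def build_slack_payload(sns_message: str) -> dict:
    """Turn the CloudWatch alarm JSON delivered via SNS into a Slack message."""
    alarm = json.loads(sns_message)
    return {
        "text": (
            f"Logstash DLQ alarm: {alarm.get('AlarmName', 'unknown')} - "
            f"{alarm.get('NewStateReason', '')}"
        )
    }


def handler(event, context):
    # The Lambda is subscribed to the SNS topic fed by the CloudWatch alarm.
    for record in event.get("Records", []):
        payload = build_slack_payload(record["Sns"]["Message"])
        req = urllib.request.Request(
            os.environ["SLACK_WEBHOOK_URL"],  # assumed incoming-webhook URL
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)
```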

The continuously running container will cost approximately 36 USD/month.

Hathor Explorer (Frontend)

Hathor Explorer will not be impacted by the choice between the short-term and long-term solutions. It will simply fetch data from the Hathor Explorer Service and render it to the customer. This is the expected user interface:

(Screenshot: mock-up of the proposed Tokens page)

Disclaimer: This is just a mock-up. The final result might not be exactly like this.

A new menu item, called "Tokens", was introduced in the navigation bar. Once the user clicks on it, a table of tokens will be rendered (initially sorted by name, ascending). The user can then search, sort, and paginate through the results.

Tasks

| Task | Effort (dev-days) |
| --- | --- |
| Create config for feature toggle | 0.5 |
| Create UI structure (menu link, table, components for searching/pagination) | 1 |
| Retrieve data from Explorer Service | 1 |
| Implement pagination and sorting | 1 |
| Total | 3.5 |

Costs

No cost will be added to the Explorer, as it already runs on S3 as a static website.

Tasks Consolidation

| Service | Effort (dev-days) |
| --- | --- |
| Hathor Wallet Service | 2 |
| Hathor Explorer Service | 12 |
| Hathor Explorer | 3.5 |
| Total | 17.5 |

Stages

Except for the ElasticSearch cluster and the Logstash instances, all other resources will have separate stages for testnet and mainnet. The cost increase will be only 0.48 USD/month, assuming 100k requests per month per stage.
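The per-stage increase is just the sum of the resources duplicated for the second stage (API Gateway and Lambda), since ElasticSearch and Logstash are shared:

```python
# Per-stage resources duplicated for a second stage (testnet):
api_gateway = 0.37  # USD/month, from the API Gateway estimate above
lambda_fn = 0.11    # USD/month, from the Lambda estimate above
extra_stage_cost = api_gateway + lambda_fn
print(f"{extra_stage_cost:.2f} USD/month")  # prints 0.48 USD/month
```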

Cost Consolidation

| Service | Resource | Estimated monthly price |
| --- | --- | --- |
| Hathor Explorer Service | Elastic Cloud | 73.36 USD (1 stage) |
| Hathor Explorer Service | Logstash Container | 36 USD (1 stage) |
| Hathor Explorer Service | AWS API Gateway | 0.74 USD (2 stages) |
| Hathor Explorer Service | AWS Lambda | 0.22 USD (2 stages) |
| Total | - | 110.32 USD |

Monitoring

ElasticSearch will be monitored using the built-in performance metrics provided by Elastic Cloud:

(Screenshot: Elastic Cloud performance metrics)

Logstash will be monitored by enabling the Dead Letter Queue and sending its contents to Slack, using infrastructure already built by the team.

Infrastructure

The new infrastructure will be defined as code using Terraform, which gives us traceability when the proposed components need to change.

Conclusion

The result of this design will be a page where users can search and find custom tokens with less effort. The short-term solution mixes contexts, but enables us to deliver this feature quicker. A long-term solution was also proposed and must have its own design document.

lucas3003 commented 2 years ago

This Issue replaces this one: https://github.com/HathorNetwork/hathor-explorer-service/issues/73

luislhl commented 2 years ago

In addition to writing to a MySQL table, the daemon will now write to a SNS topic.

We need to have a way to detect errors in the process.

We will have alerts for errors in the explorer-service Lambdas, but is there any other component where errors could happen and we need to monitor? Like the daemon, maybe we want to alert if it is unable to send messages to the SNS topic.

Cache and other performance enhancements are given by ElasticSearch.

Although ElasticSearch has its internal caching to speedup things, we apparently could benefit from having a cache in the API Gateway layer as well, since some requests may repeat a lot (like just getting the most recent tokens)

What do you think?

For the data from Wallet Service, we will need to create a script that get all data from the MySQL table and transform it in a way that the SNS can handle.

At what moment will this script be run? Before we deploy the new sync process? Or After?

How will you deal with tokens that may get created in the edge moments? Like, if you run the script after deploying the sync, some tokens may already have been processed by the sync. Can this generate any problem?

For the data from S3, we will need to create a new script that query data already created, and put directly on the ElasticSearch.

You mentioned before that we only need the "is nft?" data from S3, right? So what this script will do is to edit the documents in ElasticSearch created by the other script? Or will it create new documents?

Create new lambda function that is triggered by API Gateway and call Wallet Service

I thought explorer-service would only react to the SNS topic. Why does it need to call Wallet Service?

Different quotations were made for ElasticSearch. Using AWS CloudSearch:

AWS CloudSearch is not Elasticsearch, is it?

I think you meant AWS OpenSearch

Dedicated Master Instance | t2.small.search | 2

It is not possible to run 2 master instances in OpenSearch, only an odd number (like 3). It is also not possible to run only 1.

Obs. This is under the Gold subscription plan

Why do you think we need it?


Other questions:

andreabadesso commented 2 years ago

I like the overall design and the architecture you described, but there is a problem that I don't think you considered when using the wallet service daemon:

If a transaction that created a token is voided for any reason (maybe a reorg), the daemon will not emit any event to undo it, today we handle it on the wallet service, on the onHandleReorg lambda

We could implement a method on the daemon to detect what transactions were voided when a reorg happens, but it is not straightforward, as you can see on the wallet service

That's why I suggested using the already processed database from the wallet service

Also, some notes:

Only the Wallet Service Daemon, that syncs the new transactions with the Full Node, will have changes. In addition to writing to a MySQL table

The wallet service daemon does not write to a MySQL table directly, it calls the onNewTxRequest lambda on every new transaction

Considering that each Lambda will handle 100,000 requests, we will have a total cost of 0.31USD/month.

We just have to be careful here with adjusting the timeouts for the lambdas, we've had a bad experience in the past on the wallet service on a lambda that had an incorrect timeout value (of 6 minutes) and was timing out on a big amount of requests

It seems to me that this is also a possibility as we are interacting with elastic search and it might be unavailable

Would also suggest to setup client timeouts lower than the lambda timeout so we can see it on the logs

lucas3003 commented 2 years ago

@luislhl

We need to have a way to detect errors in the process.

I have simplified the architecture, relying now mostly on Logstash. I will enable the Dead Letter Queue and forward any failed execution to our Slack channel, so we can act on it. The maximum impact of this is that the delay to show custom tokens will be higher.

we apparently could benefit from having a cache in the API Gateway layer as well, since some requests may repeat a lot

Great idea. My suggestion: we cache requests until Logstash syncs new tokens (once a day). What do you think?

At what moment will this script be run? Before we deploy the new sync process? Or After?

I removed the need for this script. Logstash will run once a day, but we still need to determine at what time. I think it is best to run Logstash at the lowest-traffic time of the day.

You mentioned before that we only need the "is nft?" data from S3, right? So what this script will do is to edit the documents in ElasticSearch created by the other script? Or will it create new documents?

I plan to edit the documents created by the other script. Do you think it is better to create new documents?

I thought explorer-service would only react to the SNS topic. Why does it need to call Wallet Service?

I forgot to remove this effort. Thanks for noticing it.

AWS CloudSearch is not Elasticsearch, is it?

I think you meant AWS OpenSearch

Right, I fixed it.

It is not possible to run 2 master instances in OpenSearch, only an odd number (like 3). It is also not possible to run only 1.

I updated the costs using three instances. Now, I am considering Elastic Cloud the best option.

Why do you think we need it?

I considered the Gold plan because it offers same-day support. On the basic plan, the SLA is 3 business days.

lucas3003 commented 2 years ago

@andreabadesso

I simplified the architecture, removing the changes on the Daemon and getting the data directly from the RDS using Logstash.

I adjusted the lambda timeout to be 30 seconds.

luislhl commented 2 years ago

I have simplified the architecture, relying now mostly on Logstash. I will enable the Dead Letter Queue and forward any failed execution to our Slack channel, so we can process. The maximum impact of this is that the delay to show custom token will be higher.

Great. We have SNS topics already that send messages to Slack, you can use them if you need.

Great idea. I had this idea: We cache requests until Logstash syncs new tokens (Once a day). What do you think?

Seems good to me

I plan to edit the documents created by the other script. Do you think it is better to create new documents?

I think it is better to edit them.

I considered Gold Plan because we have support on the same day. On the basic plan, the SLA is 3 business days.

I have used Elastic Cloud before and I never needed to reach them for support. Most questions I had were about how Elasticsearch works, which I could get from the docs.

I could be wrong, but I would say we will be fine with the basic support. But you can ask for opinions from more people on this.

luislhl commented 2 years ago

We will set-up an additional layer of cache on API Gateway, that will be invalidated at the same time Logstash runs the daily sync.

How will you achieve this? If you just set a cache with 24h TTL, each key will be invalidated at its own time, right?

You would need to run some periodic task that triggers the cache invalidation in API Gateway to make sure everything is invalidated at the same time Logstash runs.

Maybe we could just use lower TTLs for the cache in API Gateway, like 30 minutes.

If a task fails, the Logstash DLQ will be populated and the information will be sent to a SNS topic. If a DLQ message is sent to the SNS, a CloudWatch alarm will be triggered, which will trigger a Lambda function that will send a message to Slack, where dev team is tracking errors.

We have a similar structure already setup with SNS Topics and AWS Chatbot.

Each channel in our Slack (alerts-critical, alerts-major, alerts-minor and alerts-warning) has a corresponding SNS Topic in regions eu-central-1 and us-east-1 (maybe others)

Anything sent to these topics is forwarded to Slack by AWS Chatbot.

I think for your case this structure would be enough, right? You would just need to setup one of our SNS topics as destination.

Two task definitions will be created. One that will connect to RDS and other that will connect to the S3.

I suspect it's possible to do everything you need with only one Logstash task. Logstash has some advanced data manipulation capabilities.

I think a single task could grab data from RDS, then from S3, and combine them into a single document to be sent to Elasticsearch.

Maybe you should start implementation by doing a PoC with Logstash, which seems to be the most uncertain thing in the design.

ElasticSearch will be monitored using the built-in performance metrics provided by Elastic Cloud:

The only drawback of this is that it will be difficult to create alerts based on the metrics. I think Elastic Cloud doesn't have any built-in alerting tool.

They usually recommend setting up another Elasticsearch to act as monitor of the main one, which would allow us to use Kibana alerting tools, but this would increase costs a lot for a small case like ours.

I think there is also the possibility of making an Elasticsearch deployment act as its own monitor, by sending the metrics to itself. This is better in terms of cost, but worse in terms of stability: if the deployment becomes unavailable we also lose access to its metrics and logs, and it generates additional load on the deployment.

Anyway, I think you could just mention this as a possible future improvement, but we should be ok with the built-in metrics for now.

https://www.elastic.co/guide/en/cloud/current/ec-enable-logging-and-monitoring.html

https://www.elastic.co/blog/monitoring-elastic-cloud-deployment-logs-and-metrics

Logstash will run as a AWS Fargate task and it will run once a day to sync data to ElasticSearch, using Amazon EFS as storage.

You should consider creating this infrastructure using Terraform.

We have been using it for some more complex cases, and this one seems to be a good candidate.

It will also be useful to perhaps deploy the infra in your AWS sandbox account, then easily reproduce exactly the same infra in our main account.

I can help you with this if needed.

There is also the possibility of running Logstash in Kubernetes. I think it would also work and may be easier to setup. But I'm not really sure. AWS Fargate seems ok as well if it's easier for you.

pedroferreira1 commented 2 years ago

I like the design in general, just some thoughts about it:

  1. We need to consider the cost for the testnet as well. We have the explorer service running for both mainnet and testnet. Can we use the same Elastic Cloud for both or the cost will be 2x?
  2. Having the tokens updated only once a day doesn't seem good. Imagine a user creating a new token and opening the explorer. Is it possible for the wallet service to trigger this update when the token table changes? Or just reduce this interval; what's the cost of doing it every 30 minutes? Or every minute?
  3. How are you going to get S3 metadata updates? For the wallet service you added a new column to know which tokens to fetch, but from S3 this might not be easy. Does the S3 input plugin have features to help with that?
lucas3003 commented 2 years ago

@luislhl

I have used Elastic Cloud before and I never needed to reach them for support. Most questions I had were about how Elasticsearch works, which I could get from the docs.

I could be wrong, but I would say we will be fine with the basic support. But you can ask for opinions from more people on this.

I agree that we can start using the basic support, then move to a better plan if needed.

How will you achieve this? If you just set a cache with 24h TTL, each key will be invalidated at its own time, right?

You would need to run some periodic task that triggers the cache invalidation in API Gateway to make sure everything is invalidated at the same time Logstash runs.

Maybe we could just use lower TTLs for the cache in API Gateway, like 30 minutes.

I agree with you. I changed the API Gateway cache TTL to 30 minutes.

We have a similar structure already setup with SNS Topics and AWS Chatbot. [...]

I updated the design to indicate that we will use this already-built structure.

I suspect it's possible to do everything you need with only one Logstash task. Logstash has some advanced data manipulation capabilities.

I think in only one task It could grab data from RDS, then from S3, and combine them in a single document that would be sent to Elasticsearch.

Maybe you should start implementation by doing a PoC with Logstash, which seems to be the most uncertain thing in the design.

As we discussed on Slack, since the RDS and S3 sources will be triggered at different moments, we will keep them as two different tasks. I agree with making the Logstash PoC the first part of the implementation.

The only drawback of this is that it will be difficult to create alerts based on the metrics. I think Elastic Cloud don't have any built-in alerting tool. [...]

I agree we will not have the best monitoring system at this moment, and I have included that in the design as a follow-up.

You should consider creating this infrastructure using Terraform.

Great. I think the infrastructure needs to be defined in code, which brings many benefits. I have included that on the design.

Please let me know if there is any other pending point.

lucas3003 commented 2 years ago

@pedroferreira1

We need to consider the cost for the testnet as well. We have the explorer service running for both mainnet and testnet. Can we use the same Elastic Cloud for both or the cost will be 2x?

As we do not have too much data now (250k entries), I think we can use the same cluster at this moment, but we must keep this in mind for future scalability.

Having the tokens updated once a day seems not good. Imagine a user creating a new token and opening the explorer. Is it possible for the wallet service to trigger this update when the token table is changed? Or just reduce this time, what's the cost for doing it every 30 minutes? Or every minute?

It would be much better for user experience. We would basically need a container running continuously, which would increase our cost by approximately 36 USD/month. Do you think this is too high for the improved user experience?

How are you going to get S3 metadata updates? For the wallet service you added a new column to know which tokens to get but from the S3 this might not be easy. This S3 input plugin has some features to help you with that?

There is a discussion here where an Elastic Team Member details it:

The way the S3 input plugin works today is basically this:

  1. List all objects in the given prefix
  2. Process any objects found by that listing.
  3. Update sincedb each time an object completes processing.
  4. Go back to step 1, ignoring any already-processed objects.
andreabadesso commented 2 years ago

This design is approved by me.

luislhl commented 2 years ago

The continuously-running container will cost approximately 36 USD/month

What resources did you consider in this estimation? 1 vCPU and 2 GB? Is it possible to use fractions of a vCPU in Fargate? Maybe Logstash won't need a full vCPU.

I think I will change my opinion about where we should run Logstash.

By running it in Kubernetes we would remove the need to create the definitions in Terraform and would only need to create them in Kubernetes, which will be simpler. I actually have an example of how to do this already.

And maybe we will be able to run it with less costs, or at least similar costs.

I suggest you try a PoC of running it in Kubernetes, with my help, and if you think it is really simple enough, we could stick with that solution. Otherwise, we keep the Fargate solution.

What do you think?


But this is my last consideration, everything else is approved

lucas3003 commented 2 years ago

I agree! Let's try making this PoC on Kubernetes, and then decide if it is the best solution.