ethereum/sourcify

Decentralized Solidity contract source code verification service
https://sourcify.dev
MIT License

Researching the queueing/ticketing system in the GCP context and in Cloud Run Services #1541

Open marcocastignoli opened 3 months ago

marcocastignoli commented 3 months ago

Context

Solutions

We are exploring two solutions:

Solution 1: a queue service sits between the HTTP server and the verification service

```mermaid
graph TD
    User -->|/verify| sourcify-http-server
    sourcify-http-server -->|Push Pending Contracts| queue-service
    queue-service -->|Read Pending Contracts| sourcify-verification-service
    sourcify-verification-service -->|Mark as Completed| sourcify-database
    sourcify-http-server -->|Read Status| sourcify-database
    sourcify-http-server -->|Read Status| queue-service
```

Solution 2: the HTTP server calls the verification service directly

```mermaid
graph TD
    User -->|/public_api_verify| sourcify-http-server
    sourcify-http-server -->|/internal_api_verify| sourcify-verification-service
    sourcify-verification-service -->|Write Status| sourcify-database
    sourcify-http-server -->|Read Status| sourcify-database
```
kuzdogan commented 3 months ago
manuelwedler commented 3 months ago

Could you explain what benefits the second proposal would bring compared to our current setup? Wouldn't just one server also scale well with GCP?

marcocastignoli commented 3 months ago

@manuelwedler

> Could you explain what benefits the second proposal would bring compared to our current setup? Wouldn't just one server also scale well with GCP?

With our current setup we cannot unbind requests from verifications: a request stays pending until the verification process is over. We need some way to separate verification from HTTP requests if we want to support receipts in API v2.

We could potentially separate them within the same "sourcify-server" process, but then we would not be taking advantage of GCP scaling based on the number of requests.
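
To make the unbinding concrete, here is a minimal sketch (all names are hypothetical, not the actual Sourcify code) of a handler that records a pending verification and immediately returns a receipt id, leaving the actual verification to a separate service:

```typescript
import express from "express";
import { randomUUID } from "crypto";

// In-memory stand-in for the queue-service / sourcify-database of the diagrams.
type Status = "pending" | "verified" | "failed";
const statuses = new Map<string, Status>();

const app = express();
app.use(express.json());

app.post("/verify", (req, res) => {
  const receiptId = randomUUID();
  // Record the job as pending; a separate verification service picks it up.
  statuses.set(receiptId, "pending");
  // The HTTP request closes immediately; verification happens elsewhere.
  res.status(202).json({ receiptId });
});

app.get("/verify/:receiptId", (req, res) => {
  // The "Read Status" edge of the diagrams: clients poll until completion.
  res.json({ status: statuses.get(req.params.receiptId) ?? "unknown" });
});

app.listen(3000);
```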

marcocastignoli commented 3 months ago

@kuzdogan

> Do we need "granular control on what's in the queue" or "priority systems"?

I cannot think of any real use case for this, other than prioritizing some chains.

> How do we send "tickets" in the second case?

In the diagram I wrote "Read Status" from sourcify-http-server to sourcify-database.

manuelwedler commented 1 month ago

To be able to proceed here, some more feedback:

Second approach

If I get it right, GCP Cloud Run scales the number of instances based on the number of pending HTTP requests, or when CPU utilization gets above some percentage. So the sourcify-verification-service receives requests to verify, but these are closed after returning the receipt id, and therefore it does not necessarily scale up while verifying. Is this how you meant it to be? This would mean the sourcify-verification-service would need to spin up some workers internally that handle the verification. This could be implemented in different ways.

Maybe I am wrong here with my assumptions, so happy to hear your opinion on this.
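
To illustrate that assumption, a minimal sketch (invented names, not actual Sourcify code): once the service answers the request, the remaining work is invisible to request-based autoscaling.

```typescript
import express from "express";
import { randomUUID } from "crypto";

const app = express();
app.use(express.json());

// Hypothetical stand-in for compiling and verifying a contract.
async function verifyInBackground(receiptId: string, payload: unknown): Promise<void> {
  // ...fetch the compiler, compile, compare bytecode, write the result...
}

app.post("/internal_api_verify", (req, res) => {
  const receiptId = randomUUID();
  res.status(202).json({ receiptId }); // the request closes here

  // Unawaited background task: Cloud Run's request-based autoscaling cannot
  // see it, and with the default CPU allocation ("CPU only during request
  // processing") it may even be throttled once no request is in flight.
  verifyInBackground(receiptId, req.body).catch((err) =>
    console.error(`verification ${receiptId} failed:`, err)
  );
});

app.listen(3001);
```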

First approach

We should look into what options we have for such a queue-service. There is, for example, the Google-managed Cloud Tasks. We should look into the ups and downs of the different approaches. For example, I think such an external queue service can also provide some benefits for logging and debugging. Maybe it would be good to have a small list.

In general, I think we need to look a bit closer at the two approaches and also define the internal structure of the components to decide which option is best.

kuzdogan commented 1 month ago

Summarizing the call and the next steps:

We agreed on 3 viable options:

1. Queueing Service + Verification Service + HTTP Server

Similar to option 1 above, having a Queueing Service and a separate Verification Service. Keeping this short as we did not discuss the details.

I guess in this case the scaling will be handled by the Queueing Service itself?

Leaving it here to keep this option open.
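
As a hedged sketch of that assumption, using the Google-managed Cloud Tasks mentioned above (project, location, queue, and URL are placeholders): Cloud Tasks delivers each queued contract as an HTTP push to the verification service, so Cloud Run's request-based autoscaling would apply to the verification side as well.

```typescript
import { CloudTasksClient } from "@google-cloud/tasks";

const client = new CloudTasksClient();

// Called by sourcify-http-server to push a pending contract onto the queue.
async function enqueueVerification(receiptId: string, payload: object): Promise<void> {
  const parent = client.queuePath("my-project", "europe-west1", "verification-queue");
  await client.createTask({
    parent,
    task: {
      httpRequest: {
        httpMethod: "POST",
        // Cloud Tasks pushes this request to the verification service, which
        // lets Cloud Run scale it on incoming requests again.
        url: "https://sourcify-verification-service.example.com/internal_api_verify",
        headers: { "Content-Type": "application/json" },
        body: Buffer.from(JSON.stringify({ receiptId, ...payload })).toString("base64"),
      },
    },
  });
}
```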

2. Verification Service + HTTP Server

Similar to option 2 above, just having a separate Verification Service.

In this case the rough flow is as follows:

Scaling: In this case, the Verification Services will be scaled by their CPU usage. Once a certain utilization is hit (60% in GCP Cloud Run), a new instance is spun up and new requests from the HTTP server will be routed to it. This should also be compatible with other scalable deployments, e.g. Kubernetes.

3. Only one HTTP Server

In the call, a third option was proposed that requires no separate service (just an HTTP server) but outsources the async task to workers.

In this case the rough flow is as follows:

Scaling: Here the server instances get scaled with CPU usage, similar to how it's done at the moment. Since the server instances are stateless, this is easily possible.

Next steps

We'd like to create simple sequence diagrams of the last two proposals to make them easily understandable. After that we'll contact Markus from the DevOps team for his feedback.

manuelwedler commented 1 month ago
> Meanwhile the Verification Service compiles and processes the verification. It writes the result to the Database upon (successful or unsuccessful) compilation.

This also implies a "worker" in the second option. A worker is just a term for any background task that is being processed. We could also just call it a background task or something similar, but I imagine it to be a class that gets instantiated with the request and then handles the verification in the background. This class could be called VerificationWorker, for example. So the only actual difference between the second and third options is whether we split the server into two services or not.
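
A rough sketch of what such a VerificationWorker could look like (purely illustrative; none of these names exist in Sourcify today):

```typescript
type VerificationStatus = "pending" | "verified" | "failed";

interface VerificationRequest {
  receiptId: string;
  chainId: string;
  address: string;
  sources: Record<string, string>;
}

// Instantiated with the request by the HTTP handler; runs in the background.
class VerificationWorker {
  constructor(private readonly request: VerificationRequest) {}

  // Fire-and-forget entry point: the handler calls this and returns at once.
  start(): void {
    this.run().catch(() => this.persist("failed"));
  }

  private async run(): Promise<void> {
    await this.persist("pending");
    // ...fetch the compiler, compile, compare the deployed bytecode...
    await this.persist("verified");
  }

  private async persist(status: VerificationStatus): Promise<void> {
    // Write the status to sourcify-database so the HTTP server can answer
    // status polls for this.request.receiptId.
  }
}

// Usage in the handler:
// new VerificationWorker({ receiptId, chainId, address, sources }).start();
```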

I also updated your comment to make this clear.

manuelwedler commented 4 weeks ago

Here are the sequence diagrams for options 2 and 3:

2. Verification Service + HTTP Server

[Sequence diagram: queue_option2.drawio]

3. Only one HTTP Server

[Sequence diagram: queue_option3.drawio]

marcocastignoli commented 4 weeks ago

Recap of my conversation with Markus.

The monolithic option 3 is fine; the only downside is that we would scale parts of our service that don't need scaling.

Option 2 is ideal because it separates the HTTP and verification scaling concerns, but it comes with additional effort.

kuzdogan commented 4 weeks ago

I'm not sure I get the second point:

> this makes the deployment process incredibly more complex.

I understand we'll have two components to build and deploy but is there something I'm missing that'll make it incredibly complex?

> Instead of having the verification service handle database operations, designate the HTTP service as the sole component responsible for writing to the database. The verification service would then return the necessary verification information to the HTTP server, eliminating the need for it to directly interact with the database.
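
Concretely, the quoted proposal would mean something like this (a sketch with hypothetical names and URLs):

```typescript
import express from "express";

interface VerificationResult {
  receiptId: string;
  status: "verified" | "failed";
  error?: string;
}

// sourcify-verification-service: stateless, never touches the database.
const verificationService = express();
verificationService.use(express.json());
verificationService.post("/internal_api_verify", (req, res) => {
  // ...compile and compare bytecode...
  const result: VerificationResult = { receiptId: req.body.receiptId, status: "verified" };
  res.json(result); // return the result instead of writing it to the DB
});
verificationService.listen(3001);

// sourcify-http-server side: the only component writing to sourcify-database,
// so only one service carries the DB module and runs migrations.
async function verifyAndStore(payload: { receiptId: string }): Promise<void> {
  const response = await fetch("http://verification-service:3001/internal_api_verify", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(payload),
  });
  const result = (await response.json()) as VerificationResult;
  // ...write `result` to sourcify-database here...
}
```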

I don't get why it is favorable to have the HTTP server do the DB operations instead of the Verification Service.

Overall, to me the downsides of option 3 are not a big concern compared to the development effort that option 2 would need. I think we can just increase the request count limit high enough so that we mostly scale on CPU instead.

marcocastignoli commented 4 weeks ago

In response to

> I understand we'll have two components to build and deploy but is there something I'm missing that'll make it incredibly complex?

I'm citing Markus:

> you'd need to keep both services aligned and deploy changes at the same time. So they get coupled, and then it's basically a monolith, as you can't develop both services independently. And you'd need to make sure only one is doing migrations, or do migrations out of band, aka manually, so the two services won't fight each other when starting and applying migrations. Also, if you always have to update both at the same time, then you can just as well have everything in one codebase.

I honestly also didn't fully get this point. It's not a huge deal to keep everything synchronized. This probably becomes a problem when you have to keep different versions online, deploy with zero downtime, or run more than two services.

manuelwedler commented 3 weeks ago

> > you'd need to keep both services aligned and deploy changes at the same time. So they get coupled, and then it's basically a monolith, as you can't develop both services independently. And you'd need to make sure only one is doing migrations, or do migrations out of band, aka manually, so the two services won't fight each other when starting and applying migrations. Also, if you always have to update both at the same time, then you can just as well have everything in one codebase.

> I honestly also didn't fully get this point. It's not a huge deal to keep everything synchronized. This probably becomes a problem when you have to keep different versions online, deploy with zero downtime, or run more than two services.

I still think Markus makes some valid points here. For example, maintaining a database module for two services increases the maintenance burden. I agree that deploying at the same time does not seem like a big issue for us at the moment, but to be future-proof and keep a clean architecture, decoupling the services seems the better option to me. So if we go with option 2, I would also integrate Markus' proposals.

Overall, I think option 3 is very easy for us to implement, and option 2 just means reduced costs compared to 3. As costs are not a priority at the moment, I would also go with 3 for now. It should also be possible to upgrade the architecture from 3 to 2 if we feel the need later.

kuzdogan commented 3 weeks ago

I'm also in favor of option 3 for its ease.

marcocastignoli commented 3 weeks ago

Then we all agree!