marcocastignoli opened 3 months ago
Could you explain what benefits the second proposal would bring compared to our current setup? Wouldn't just one server also scale well with GCP?
@manuelwedler
> Could you explain what benefits the second proposal would bring compared to our current setup? Wouldn't just one server also scale well with GCP?
With our current setup we cannot unbind requests from verifications: a request stays pending until the verification process is over. We need some way to separate verification from HTTP requests if we want to support receipts in API v2.
We could potentially separate them within the same "sourcify-server" process, but then we would not be taking advantage of GCP scaling, which optimizes on the number of requests:
@kuzdogan
Do we need "granular control on what's in the queue" or "priority systems"?
I cannot think of any real use case for this, other than prioritizing some chains
How do we send "tickets" in the second case?
In the diagram I wrote "Read Status" from `sourcify-http-server` to `sourcify-database`:

- `sourcify-http-server` calls `sourcify-verification-service`, triggering a new verification
- `sourcify-verification-service` stores a new receipt in the database as "pending" and returns the receipt id to `sourcify-http-server`, then starts the verification process, then marks the receipt as completed once done
- `sourcify-http-server` can always read the status directly from the database

To be able to proceed here, some more feedback:
If I get it right, GCP Cloud Run scales the number of instances based on pending HTTP requests or when the CPU utilization gets above some percentage. So the `sourcify-verification-service` receives requests to verify, but these are closed after returning the receipt id, and therefore it does not necessarily scale up while verifying. Is this how you meant it to be? This would mean the `sourcify-verification-service` would need to spin up some workers internally that handle the verification. This could be implemented in different ways:
Maybe I am wrong here with my assumptions, so happy to hear your opinion on this.
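To make the internal-workers idea concrete, here is a minimal sketch of an in-process pool the verification service could run: jobs are accepted immediately (so the submitting HTTP request can close), and at most N run concurrently. All names here (`VerificationPool`, the `Job` shape) are illustrative, not Sourcify's actual code.

```typescript
// Illustrative sketch: an in-process worker pool inside the verification service.
type Job = { id: string; run: () => Promise<void> };

class VerificationPool {
  private queue: Job[] = [];
  private active = 0;

  constructor(private readonly concurrency: number) {}

  // Accept a job immediately; the actual verification happens in the background,
  // so the HTTP request that submitted it can be closed right away.
  enqueue(job: Job): void {
    this.queue.push(job);
    this.drain();
  }

  // Start queued jobs until the concurrency limit is reached; when one
  // finishes, pull the next one from the queue.
  private drain(): void {
    while (this.active < this.concurrency && this.queue.length > 0) {
      const job = this.queue.shift()!;
      this.active++;
      job
        .run()
        .catch(() => { /* here we would mark the receipt as failed in the DB */ })
        .finally(() => {
          this.active--;
          this.drain();
        });
    }
  }
}
```

One design note: because the work lives in the same process, a scale-down of the Cloud Run instance can kill in-flight verifications, which is exactly the concern raised above.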
We should look into what options we have for such a `queue-service`. There is, for example, the Google-managed Cloud Tasks. We should look into the ups and downs of the different approaches. For example, I think such an external queue service can also provide some benefits for logging and debugging purposes. Maybe it would be good to have a small list.
In general, I think we need to look a bit closer at the two approaches and also define the internal structure of the components to decide which option is best.
Summarizing the call and the next steps:
We agreed on 3 viable options:
1. Similar to option 1 above, having a Queueing Service and a separate Verification Service. Keeping this short as we did not discuss the details. I guess in this case the scaling will be handled by the Queueing Service itself? Leaving it here to keep this option open.
2. Similar to option 2 above, just having a separate Verification Service. In this case the rough flow is as follows:
   - The HTTP server calls the Verification Service, which writes a `verificationJob` to the DB and responds to the HTTP server with the job ID
   - Once the verification is done, the job is marked `isCompleted: true` with the verification result
   - Scaling: In this case, the Verification Services will be scaled by their CPU usage. Once a certain usage is hit (in GCP Cloud Run, 60%), a new instance is spun up and new requests from the HTTP server will be routed to the new instance. This should also be compatible with other scalable deployments, e.g. Kubernetes.
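The option 2 flow can be sketched roughly like this, with an in-memory `Map` standing in for `sourcify-database`. All function names here are illustrative only:

```typescript
// Illustrative sketch of the option 2 flow; a Map stands in for sourcify-database.
type VerificationJob = { id: string; isCompleted: boolean; result?: string };

const db = new Map<string, VerificationJob>();

// Verification Service: store the job, return its ID right away,
// and finish the actual work in the background.
function startVerification(contract: string): string {
  const id = `job-${db.size + 1}`;
  db.set(id, { id, isCompleted: false });
  void (async () => {
    // compilation + bytecode matching would happen here
    await new Promise((resolve) => setTimeout(resolve, 10));
    db.set(id, { id, isCompleted: true, result: `verified:${contract}` });
  })();
  return id;
}

// HTTP server: read the status directly from the database at any time.
function readStatus(id: string): VerificationJob | undefined {
  return db.get(id);
}
```

Calling `startVerification` returns immediately with a pending job ID; polling `readStatus` later shows `isCompleted: true` once the background work is done.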
3. In the call, a third option has been proposed that requires no separate service (just an HTTP server) but outsources the async task to Workers. In this case the rough flow is as follows:
   - The HTTP server writes a `verificationJob` to the DB and sends the job ID back to the user. Finally, it spins up a Worker with this job ID.
   - Once the verification is done, the Worker marks the job `isCompleted: true` with the verification result.
   - Scaling: Here the server instances get scaled with the CPU use, similar to how it's done at the moment. Since the server instances are stateless, this is easily possible.
We'd like to create simple sequence diagrams of the last 2 proposals to make them easily understandable. After that we'll contact Markus from the DevOps team for his feedback.
> - Meanwhile the Verification Service compiles and processes the verification. It writes the result to the Database upon (successful or unsuccessful) compilation.
This, in the second option, also implies a "worker". A worker is just a term for any background task that is being processed. We could also call it a background task or something similar, but I imagine it to be a class that gets instantiated with the request and then handles the verification in the background. This class could be called `VerificationWorker`, for example. So the only actual difference between the second and third option is whether we split the server into two services or not.
I also updated your comment to make this clear.
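Under these assumptions, such a `VerificationWorker` class could look roughly like this. The request shape and the completion callback are made up for illustration, not Sourcify's actual types:

```typescript
// Illustrative sketch of a VerificationWorker: instantiated with the request,
// it runs the verification in the background.
interface VerificationRequest {
  jobId: string;
  address: string;
}

class VerificationWorker {
  constructor(
    private readonly request: VerificationRequest,
    private readonly onDone: (jobId: string, ok: boolean) => void,
  ) {}

  // Fire-and-forget: the HTTP handler calls start() and returns the job ID
  // to the user without awaiting the verification.
  start(): void {
    this.verify()
      .then(() => this.onDone(this.request.jobId, true))
      .catch(() => this.onDone(this.request.jobId, false));
  }

  private async verify(): Promise<void> {
    // compilation + bytecode matching would happen here
    await new Promise((resolve) => setTimeout(resolve, 0));
  }
}
```

The `onDone` callback is where the job would be marked `isCompleted: true` (or failed) in the database.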
Here are the sequence diagrams for option number 2 and 3:
Recap of my conversation with Markus.
The monolithic option 3 is fine: the only downside is that we would scale parts of our service that don't need scaling:
Option 2 is ideal because it separates http and verification scaling concerns but it comes with additional effort:
I'm not sure if I get the 2nd point:

> this makes the deployment process incredibly more complex.

I understand we'll have two components to build and deploy, but is there something I'm missing that'll make it incredibly complex?
Instead of having the verification service handle database operations, designate the HTTP service as the sole component responsible for writing to the database. The verification service would then return the necessary verification information to the HTTP server, eliminating the need for it to directly interact with the database.
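A minimal sketch of this split, with made-up function names: the verification service becomes a pure computation, and only the HTTP service touches the database, so the DB module and its migrations live in a single codebase.

```typescript
// Illustrative sketch of the suggestion: only the HTTP service writes to the
// database; the verification service just computes and returns the result.
type VerificationResult = { matches: boolean; runtimeMatch?: string };

// Verification service: no DB access, just returns the result.
async function verifyContract(address: string): Promise<VerificationResult> {
  // compilation + bytecode comparison for `address` would happen here
  return { matches: true, runtimeMatch: "perfect" };
}

// HTTP service: the single writer, so the DB module and migrations
// only have to live in one codebase.
const resultsDb = new Map<string, VerificationResult>();

async function handleVerifyRequest(address: string): Promise<void> {
  const result = await verifyContract(address);
  resultsDb.set(address, result);
}
```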
I don't get why it is favorable to have the HTTP server do the DB operations instead of the Verification Service.
Overall, to me the downsides of number 3 are not a big concern compared to the development effort that'll be needed. I think we can just set the request count limit high enough that we mostly scale on CPU instead.
In response to:

> I understand we'll have two components to build and deploy, but is there something I'm missing that'll make it incredibly complex?
I'm citing Markus:
> you'd need to keep both services aligned and need to deploy changes at the same time. so they get coupled and then it's basically a monolith, as you can't develop both services independently. and you'd need to make sure only one is doing migrations, or do migrations out of bounds aka manually, so the two services won't fight each other when starting and applying migrations. also, if you have to update "always" both at the same time, then you can also just have everything in one codebase.
I honestly also didn't fully get this point. It's not a huge deal to keep everything synchronized. Probably this becomes a problem when you have to keep different versions online or deploy with 0 downtime or you have more than 2 services.
> > you'd need to keep both services aligned and need to deploy changes at the same time. so they get coupled and then it's basically a monolith, as you can't develop both services independently. and you'd need to make sure only one is doing migrations, or do migrations out of bounds aka manually, so the two services won't fight each other when starting and applying migrations. also, if you have to update "always" both at the same time, then you can also just have everything in one codebase.
>
> I honestly also didn't fully get this point. It's not a huge deal to keep everything synchronized. Probably this becomes a problem when you have to keep different versions online, deploy with 0 downtime, or have more than 2 services.
I still think Markus makes some valid points here. For example, maintaining a database module for two services increases the maintenance burden. I agree that deploying at the same time does not seem like a big issue for us at the moment, but for being future-proof and having a clean architecture, decoupling the services seems the better option to me. So if we go with option 2, I would also integrate Markus' proposals.
Overall, I think option 3 is very easy to implement for us and option 2 just means reduced costs compared to 3. As costs are not a priority at the moment, I would also go with 3 for now. It should also be possible to upgrade the architecture from 3 to 2 if we feel like there is the need later.
I'm also in favor of option 3 for its ease.
Then we all agree!
Context

We want to return a receipt in the `/v2/verify` response; this will allow us to separate the HTTP request from the verification status.

Solutions

We are exploring two solutions:

1. Splitting into `sourcify-http-server` and `sourcify-verification-service`: `sourcify-http-server` will push pending contracts to the `queue-service`, and `sourcify-verification-service` will read pending contracts from the `queue-service`, verifying them and marking them as completed. This solution involves setting up a queue service, adding more complexity to our architecture, but we get granular control over what's in the queue, enabling us to potentially implement priority systems.
2. `sourcify-http-server` and `sourcify-verification-service` will be deployed as Google Cloud Run Services. `sourcify-http-server` will receive the /verify request from the internet and call `sourcify-verification-service` directly, without passing through a queue. The verification status is going to be saved in `sourcify-database`.
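As a rough illustration of solution 1, here is a sketch with a plain in-memory array standing in for the external `queue-service` (which in practice would be something like Cloud Tasks). All names are hypothetical:

```typescript
// Illustrative sketch of solution 1; an array stands in for the queue-service.
type PendingContract = { receiptId: string; address: string };

const pendingQueue: PendingContract[] = [];
const completedReceipts = new Set<string>();

// sourcify-http-server side: push the pending contract and answer immediately
// with a receipt id the client can poll.
function pushPending(address: string): string {
  const receiptId = `receipt-${completedReceipts.size + pendingQueue.length + 1}`;
  pendingQueue.push({ receiptId, address });
  return receiptId;
}

// sourcify-verification-service side: pull pending contracts, verify them,
// and mark them as completed.
function drainQueue(): void {
  let item: PendingContract | undefined;
  while ((item = pendingQueue.shift()) !== undefined) {
    // verification would happen here
    completedReceipts.add(item.receiptId);
  }
}
```

The point of the external queue is exactly this split: the producer and consumer only share the queue, so they can scale (and fail) independently.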