backdrop-ops / backdrop-community

A queue for all the things that are not directly related to another project.

Find an alternative to frequently problematic zen.ci #1

Closed: alanmels closed this issue 2 years ago

alanmels commented 3 years ago

Every single time we create PRs, we have to close and re-open them multiple times to get them past the quite frequently problematic zen.ci. Sometimes we have to wait until something resets on the zen.ci side and only then try to pass its validation, and that can take hours or even days. This is becoming a bit annoying.

Not sure why this service is so unstable, but my guess is that it's because their servers are located in Russia: zen.ci is a product of a Russian company, IT Patrol inc, which went through a major incident recently. And though for English-speaking users they've put the following announcement on http://it-patrol.com:

> Dear clients and visitors!

> We have significantly altered our website and logically divided our services. Our corporate website will contain only general information about the company and a list of our services.

the real reason can be found at: http://drupalhosting.ru/newsletters/%D0%BF%D1%80%D0%BE%D0%B8%D1%81%D1%88%D0%B5%D1%81%D1%82%D0%B2%D0%B8%D0%B5-10-%D0%BC%D0%B0%D1%80%D1%82%D0%B0-%D0%B8-%D0%BD%D0%B0%D1%88%D0%B8-%D0%B2%D1%8B%D0%B2%D0%BE%D0%B4%D1%8B

Here is the Google translation:

> Dear clients, on the night of March 10 we ran into a problem whose possibility and scale we could not even have imagined.

> On the night of March 9-10, the data center building where 4 of our servers were housed burned down. Namely, the servers called Malleus, TH3, TH4 and TH5.

> These servers were irrecoverably destroyed. The situation was further aggravated by the incident happening at night and by the magnitude of the disaster, since, according to some reports, several tens of thousands of this DC's clients were affected besides us. Among other things, this affected the ability to order new servers: delivery times were almost everywhere pushed back to terms unacceptable for us. We still managed to find a way to order new servers within just one day.

> With no information from the data center on when the servers would be recovered, we began, in emergency mode, to restore affected clients' data to new servers from our backup servers. All of this was done in 2 days (in some cases the work is still ongoing), which is fairly fast. At the moment most clients' sites open normally, all services are working, and we plan to stabilize the situation further. It is also known that work is still underway in the damaged data center: nothing there has been restored yet, and according to the plans it will not be restored earlier than in a week. So we believe that, given the circumstances, we have done the job quickly enough.

> However, we are aware that this is a very unpleasant situation for site owners, so all orders hosted on the affected servers have been extended by a month.

> Our further plans for improving the situation are as follows. We are collecting statistics on what is currently running on the new servers and how. We understand that, due to the haste and urgency of the situation, they were not set up entirely correctly, so in the near future we plan to configure new servers properly in terms of stability, channel speed, correct mail delivery, and the like. After that, the affected accounts will be transferred to them in stages to ensure the stability of the services. Not the least factor is the quality of technical support in the new DC, which does not suit us; unfortunately, this is the situation with almost any hosting in the Russian Federation. Some time ago we had already started preparing a new server infrastructure to support the development of our services and improve their quality. These events have only strengthened our conviction that a new infrastructure is needed, and once the situation stabilizes we will accelerate its implementation. We have also compiled an internal document with fairly detailed information on what happened, on the basis of which we have drawn conclusions about what can be improved, what to take into account, and what could be done differently if similar situations occur in the future (let's hope they don't).

> Once again, we sincerely apologize to our customers. We remind you that we always try to help our clients; it is very important to us that your sites work well and stably. And we went through these events together with you.

Regardless of the incident and whatever degree of negative effect it may have had on our operations, zen.ci was always problematic, even before the incident. So it would be better to find a more reliable alternative.

herbdool commented 3 years ago

I agree it might be time to move away from Zen CI. I'm grateful for all the work Gor put into it and for providing free hosting for automated tests. I didn't realize the servers were located in Russia. Gor, at least last I heard from him, lives in Canada.

What other options do we have? Go back to Travis CI? We'd need to investigate the database optimization that Zen CI had specifically for Backdrop and see if it can be replicated in other services as well.

alanmels commented 3 years ago

> I agree it might be time to move away from Zen CI. I'm grateful for all the work Gor put into it and for providing free hosting for automated tests. I didn't realize the servers were located in Russia. Gor, at least last I heard from him, lives in Canada.

He may well be in Canada. That was my guess based on the fact that they work with a Russian-speaking clientele. Also, not to offend any Russian-speaking person (as I myself am originally from that part of the world), but data centers are more likely to burn down in Russia than in Canada.

herbdool commented 3 years ago

Options:

stpaultim commented 3 years ago

Just to be clear, for folks like myself who might have been a bit confused: we are talking about automated tests now, not sandbox sites. I was confused by this issue because I thought we were already moving away from Zen.ci, but that is just for sandboxes: https://github.com/backdrop/backdrop-issues/issues/4351

Automated tests are another issue. I believe this has been discussed a number of times and, as far as I know, the only thing blocking us from changing to another service is: 1) a decision on what to use as an alternative, and 2) a volunteer willing to take the initiative to make this happen.

ghost commented 3 years ago

See https://github.com/backdrop/backdrop-issues/issues/4753 re. using GH Actions.

herbdool commented 3 years ago

I appreciate that this one is about looking at a bunch of alternatives and not just GitHub Actions. Though the latter might be the best option, it's still good to compare a few of them at some level.

klonos commented 3 years ago

> the only thing blocking us from changing to another service is 1) ... 2) ...

And 3) Gor has done some serious magic to get the test times down to ~5 min for PHP 7 and under 10 min for PHP 5. These times have increased slightly over time, but it's still 6-7 min for PHP 7 and 10-15 min for PHP 5, which is still far better than what Drupal tests take to run.

herbdool commented 3 years ago

That's what I recall too. I think a part of it was having a cached copy of the fresh db for testing. That might be in core already, I'm not sure.

But there's something else too; I'd have to look at the old issues first.
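For reference, a rough sketch of that "cached copy of the fresh db" idea, assuming a plain mysqldump-based approach (this is not Zen CI's or core's actual implementation; the database name, credentials, and dump path below are invented):

```python
# Hypothetical illustration: dump a freshly installed test database once,
# then restore it before each run instead of reinstalling from scratch.
import subprocess
from pathlib import Path

DUMP = Path("/tmp/backdrop_fresh_install.sql")   # invented path
DB = "backdrop_test"                             # invented database name
MYSQL_ARGS = ["-u", "testrunner", "-ptestpass"]  # invented credentials

def cache_fresh_install():
    """Dump the just-installed site so later runs can skip the installer."""
    with DUMP.open("w") as fh:
        subprocess.run(["mysqldump", *MYSQL_ARGS, DB], stdout=fh, check=True)

def restore_fresh_install():
    """Reload the cached dump, which is much faster than a full reinstall."""
    with DUMP.open() as fh:
        subprocess.run(["mysql", *MYSQL_ARGS, DB], stdin=fh, check=True)

if __name__ == "__main__":
    restore_fresh_install() if DUMP.exists() else cache_fresh_install()
```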

klonos commented 3 years ago

I thought I should link these 2 issues from here:

herbdool commented 3 years ago

If you think we're having problems with tests running, it's nothing compared to https://github.com/backdrop/backdrop-issues/issues/1422.

Also, here's where Gor made a bunch of improvements. Luckily they're not specific to Zen CI: https://github.com/backdrop/backdrop-issues/issues/1835

hosef commented 3 years ago

Currently, the big problem is that we are constrained by database processing speed. The big optimization Gor did was to convert the cached profile tables to MyISAM, which is not well supported in newer versions of MySQL. MyISAM is much faster than InnoDB when creating or dropping tables, because in InnoDB any query that alters the schema takes a lock that affects the whole database.
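For anyone unfamiliar with that optimization, here is a minimal sketch of the idea (this is not Gor's actual patch: the connection details and the `cache%` table pattern are placeholders, and in practice the conversion targeted the cached installation-profile tables that the test runner clones):

```python
# Illustrative sketch only: switch a set of test-support tables to MyISAM
# so that the CREATE/DROP-heavy test setup avoids InnoDB DDL overhead.
# Host, credentials, database name, and the table pattern are made up.
import pymysql

conn = pymysql.connect(host="localhost", user="testrunner",
                       password="testpass", database="backdrop_test")
try:
    with conn.cursor() as cur:
        # Find the candidate tables (placeholder pattern).
        cur.execute("SHOW TABLES LIKE 'cache%'")
        tables = [row[0] for row in cur.fetchall()]
        for table in tables:
            # Creating/dropping copies of MyISAM tables is much cheaper,
            # which is where the test-time savings came from.
            cur.execute(f"ALTER TABLE `{table}` ENGINE=MyISAM")
    conn.commit()
finally:
    conn.close()
```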

I have been trying to work on optimizing the tests for quite a while, but there is a limit to how many database queries I can remove from the system. https://github.com/backdrop/backdrop-issues/issues/4353 https://github.com/backdrop/backdrop-issues/issues/4943

klonos commented 3 years ago

https://circleci.com/open-source

> We support the open source community. Organizations on our free plan get 400,000 credits per month for open source builds.

...their https://circleci.com/pricing page mentions that their "Performance" plan "Starts at 50,000 credits" per month, which translates to "25,000 credits/month for the first 3 users. 25,000 credits/month for each additional user." ...so if my math is right, those free 400k credits for open source would translate to 3+15=18 users/month, which would be more than enough for the size of our community as it is now.

...they mention "Scale up to 80x concurrency" for this "Performance" plan, but I'm not sure if we could get that on the free plan (most likely not, but we should ask).

PS: I'm not sure how fast those credits would be spent, but we could give the free plan a go and see if it works for us. For an initial phase, we should have both ZenCI and any other solution running in parallel; then, once enough time has passed, we'd be able to make a decision based on how the new one performs.
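For what it's worth, here is that credit math spelled out, using only the figures quoted from the pricing page above (and assuming each extra "user" really does cost 25,000 credits/month):

```python
# Back-of-the-envelope check of the 3 + 15 = 18 users estimate above.
free_oss_credits = 400_000   # CircleCI free open-source grant per month
base_credits = 25_000        # covers the first 3 users on the Performance plan
per_extra_user = 25_000      # each additional user per month

extra_users = (free_oss_credits - base_credits) // per_extra_user  # 15
total_users = 3 + extra_users                                      # 18
print(total_users)
```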

herbdool commented 3 years ago

@klonos I don't think we can assume that a "user" in Circle CI will translate to an individual GitHub user. Instead, I suspect the Backdrop core project would have only 1 automated "user".

ghost commented 2 years ago

We're not using Zen.ci anymore. We've moved our sandboxes to Tugboat and our tests to GitHub Actions. So closing this issue as complete.