conda-forge / conda-forge.github.io

The conda-forge website.
https://conda-forge.org

Procuring more macOS workers #306

Closed wesm closed 5 years ago

wesm commented 7 years ago

Hello -- what is the current status vis-a-vis OS X build capacity for CF? I opened a build a little over 4 hours ago and it still hasn't processed: https://github.com/conda-forge/parquet-cpp-feedstock/pull/10

By the time that's merged, then master builds, we could be talking 12-24 hour turnaround time to update a package only requiring a change in build and version number.

What's the status of fiscal sponsorship for CF and Travis CI in particular? Can I help with procuring funds to obtain more workers? Have you considered joining NumFOCUS to provide another conduit of tax-deductible donations? We should also put a big blue DONATE button on the conda-forge homepage (but we would need a legal entity set up to collect the donations).

I'm willing to invest a bunch of my energy in helping fund raise because of the amount of time this has been costing me, so let me know how I can help.

ocefpaf commented 7 years ago

@wesm I contacted Travis-CI some time ago and, surprisingly, paying for it would not improve our situation that much. With that said, I believe that @conda-forge/core agrees with what you said. However, we lack manpower to take action on the points you raised.

PS: Would you be willing to participate in one of our meetings? We did not schedule the next one yet, but when we do so it will be announced in https://conda-forge.hackpad.com/conda-forge-meetings-2YkV96cvxPG

ChrisBarker-NOAA commented 7 years ago

And yes, we have talked about NumFocus as a conduit for donations. I think we all agree that it's a good idea, but as @ocefpaf said, someone needs to be willing to take the lead and get the hassling done.

Also, it still isn't clear if $$ will solve the problems anyway.

In short, great ideas, and we welcome any input and assistance with the effort.

-CHB

jakirkham commented 7 years ago

Should add that one of the things that we have discussed over the past few months is having some sort of BYOC (bring your own compute) type build system where people could contribute computers they have for building. Some sort of VM would be installed on them where the builds run. These would be leveraged by some sort of CI system to farm out the builds to different workers. It could also be used for things requiring longer build times. IIUC this repo houses the current work towards this goal. @msarahan shared his work on this during one of our meetings near the end of last year.

jakirkham commented 7 years ago

Also, as another thing to consider, we could look to Travis CI's supplier of OS X infrastructure. Their supplier is MacStadium. Though there may still be more work to be done. Travis CI provides us a nice interface that would take time to recreate and maintain.

SylvainCorlay commented 7 years ago

A volunteer computing CI service (a build@home similar to Folding@home or SETI@home) would be awesome! Donating compute for CI of a whitelist of GitHub organizations would give me the impression of having more direct impact than the equivalent amount of computing donated to SETI@Home :)

With 4 pending builds (xsimd, cryptopp, ipympl, and an update to zeromq) I find this already quite frustrating. I can't imagine how core conda-forge devs must feel!

jakirkham commented 7 years ago

In an attempt to try and improve the situation with Travis CI, have written up PR ( https://github.com/conda-forge/staged-recipes/pull/2257 ). This includes a similar strategy as is used on AppVeyor to fast finish old PR builds. Also tries to cache some things like conda packages and the Miniconda installer to speed things up. There are some other changes of lesser note. Would appreciate if people interested in this issue gave that PR a look over.

SylvainCorlay commented 7 years ago

I have the impression that travis is not building anything on conda forge at some points in time, even when global backlog is empty.

jakirkham commented 7 years ago

The backlog was low last night and it seemed to be building things then. Though admittedly with a sizeable delay. Certainly when the backlog is large, they can no longer guarantee their 5 workers per user/org, which is where I think things go off the rails. It is possible there is additional throttling in the equation that we don't know of.

Still, I think if we can use some of the time we are given to help cut down the queue size, that should improve the situation. It is not as great as, say, avoiding unneeded builds being queued at all and/or canceling them without using CIs. ( https://github.com/conda-forge/conda-forge-webservices/issues/79 ) Though this has fewer limitations and is something we are able to do immediately without waiting for changes from Travis CI.

SylvainCorlay commented 7 years ago

I have been canceling a few builds that were in the queue but not valid anymore (rebased PRs etc).

Right now, the conda-forge queue is about 23 - 24 hours long...

SylvainCorlay commented 7 years ago

By the way, just like @wesm, if there is anything I can do to help fund more os x compute, I would love to know!

SylvainCorlay commented 7 years ago

The backlog was low last night and it seemed to be building things then. Though admittedly with a sizeable delay. Certainly when the backlog is large, they can no longer guarantee their 5 workers per user/org, which is where I think things go off the rails. It is possible there is additional throttling in the equation that we don't know of.

I don't think that any build was done by travisci for conda-forge over the past couple of hours. :unamused:

jakirkham commented 7 years ago

This build ran ~20 mins ago and this one an hour ago. Things are running albeit slowly. However Travis' GUI unlike AppVeyor's is not great at displaying this information and people have asked them to fix it in various issues.

jakirkham commented 7 years ago

If anyone would like to help, I'd appreciate more feedback on PR ( https://github.com/conda-forge/staged-recipes/pull/2257 ) to help cull the queue on Travis.

SylvainCorlay commented 7 years ago

This build ran ~20 mins ago and this one an hour ago. Things are running albeit slowly. However Travis' GUI unlike AppVeyor's is not great at displaying this information and people have asked them to fix it in various issues.

That was after my comment :smile: (and more than 24 hours after the commit)

jakirkham commented 7 years ago

Only trying to point out the situation is not totally hopeless. 😄

In any event, the long term fix is our own build infrastructure.

The short term fix will probably be plugging the holes in the dam. That said, I think I found a pretty serious problem with Travis (running outdated builds). Fixing that problem with AppVeyor has been crucial for getting to where we are today. While this may seem like extra or wasted work, it is not really as we will still require similar solutions for our own build infrastructure too. So we can likely reuse lessons learned when we migrate.

jakirkham commented 7 years ago

Have written up a CI fast finish script in PR ( https://github.com/conda-forge/conda-forge-build-setup-feedstock/pull/52 ). It checks if a build is out of date for the given PR and fails it immediately if it is. Tested it on CircleCI, Travis CI, and AppVeyor. The script has no dependencies outside of a Python interpreter and works on 2 or 3. As such, it can safely be downloaded on all CIs and just run as all of them have some Python interpreter available (either 2 or 3). Since it is written in Python, it is much cleaner than it would be in shell and is a bit more web friendly. Please take a look.
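
For readers who want a concrete picture of the fast-finish idea, here is a minimal sketch. It is not the script from the PR above. It assumes Python 3 only, and the environment variable names `CI_REPO_SLUG`, `CI_PR_NUMBER`, and `CI_BUILD_SHA` are hypothetical placeholders (each CI service exposes its own variables). The only external call is the public GitHub REST endpoint for a single pull request, whose JSON response includes `head.sha`.

```python
import json
import os
import sys
from urllib.request import Request, urlopen


def pr_head_sha(owner, repo, number):
    """Ask the GitHub REST API for the current head commit of a pull request."""
    url = "https://api.github.com/repos/{}/{}/pulls/{}".format(owner, repo, number)
    req = Request(url, headers={"Accept": "application/vnd.github+json"})
    with urlopen(req) as resp:
        return json.load(resp)["head"]["sha"]


def is_outdated(build_sha, head_sha):
    """A build is outdated when the PR head has moved past the commit being built."""
    return bool(build_sha) and bool(head_sha) and build_sha != head_sha


def main(env):
    # Hypothetical variable names; map them to whatever your CI actually provides.
    slug = env.get("CI_REPO_SLUG")       # e.g. "owner/repo"
    number = env.get("CI_PR_NUMBER")     # pull request number
    build_sha = env.get("CI_BUILD_SHA")  # commit this build was started for
    if not (slug and number and build_sha):
        return 0  # not a PR build (or nothing to compare), so let it run
    owner, repo = slug.split("/", 1)
    head = pr_head_sha(owner, repo, number)
    if is_outdated(build_sha, head):
        print("Build for {} is out of date; PR head is now {}.".format(build_sha, head))
        return 1  # non-zero exit fails the job right away and frees the worker
    return 0


if __name__ == "__main__":
    code = main(os.environ)
    if code:
        sys.exit(code)
```

Run as the first step of a CI job, a non-zero exit would stop a stale build before it spends any real time on a worker. Unauthenticated GitHub API requests are rate limited, so a real script would likely need a token or some fallback.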

patricksnape commented 7 years ago

I would also be interested in any solutions we might have for improving our Travis builds - I would also be willing to donate if we felt like some agreement (including payments) might be possible with Travis themselves.

wesm commented 7 years ago

I can help with soliciting corporate donations to pay for additional Travis CI build capacity. Please contact me offline if this is possible.

SylvainCorlay commented 7 years ago

Same here. I would love to help with funding if this is something you are pursuing.

jakirkham commented 7 years ago

The only way ATM that I can see paying for Travis CI helping is if we can get our own dedicated queue with a few workers. FWICT it doesn't seem like this is something they currently offer. However, we can always ask. Would someone be willing to take this on?

scopatz commented 7 years ago

@jakirkham do you know who to talk to?

scopatz commented 7 years ago

That is, do we have a contact?

jakirkham commented 7 years ago

We don't have a contact currently, no. Though I have emailed them at their support email ( support travis-ci com ) before and have gotten a pretty good response. Would recommend the same here unless there are other suggestions.

scopatz commented 7 years ago

Ok I will email them now.

scopatz commented 7 years ago

Email sent, I'll let you know what I hear back

jakirkham commented 7 years ago

Awesome! Thanks so much @scopatz. 😄

scopatz commented 7 years ago

Hello All, I just received a reply from Travis.

So the good news is that they do supply dedicated hardware and queues for mac infrastructure. The technical specifications are:

  • Mac Pros
  • 12 core
  • 32 GB RAM
  • 5 concurrent jobs per machine

They only offer groups of two machines. Each machine is $1500 / month, so the minimum cost is $3000 / month.

I am planning on having a call to touch base with them sometime during the week of March 12th. If you have specific questions you'd like me to ask please let me know. Or maybe we should have a document to list some of the issues to bring up. I know I have a couple of my own.

ChrisBarker-NOAA commented 7 years ago

  • Mac Pros
  • 12 core
  • 32 GB RAM
  • 5 concurrent jobs per machine

5 jobs on a 12 core machine?

They only offer groups of two machines. Each machine is $1500 / month, so the minimum cost is $3000 / month.

that seems pricey to me, even for dedicated machines. But what do I know? A few folks have offered to put up some cash -- but anywhere near that much?

How steady is the Vonda-forge load? It seems we don't really want dedicated machines, but rather, priority on the queue/load balancer.

But I guess they don't offer that.

I wonder if we could make it a numfocus account and share it with others --- that would steady the load.

Who might want to share?

The MacWheel folks might, anyone else?

-CHB

scopatz commented 7 years ago

@ChrisBarker-NOAA, yeah one of my questions is about how many orgs you can share this across, how long you have to buy in for, etc.

I probably missed something, but what is vonda-forge?

And I actually do think that we want a dedicated resource. Higher queue priority still wouldn't allow us to extend the max job time, which is something that we'll need if we want to build gcc, clang, etc.

ocefpaf commented 7 years ago

I probably missed something, but what is vonda-forge?

A typo. (c is way too close to v :wink:)

I agree that ~3000.00 is pricey :confused: In 2-3 years that would be equal to buying our own hardware and hiring someone to do it...

jakirkham commented 7 years ago

Thanks @scopatz. That's very useful information. Would be interested in hearing what comes out of the call. Will try to come up with some questions to send to you in the interim.

I mentioned MacStadium is Travis CI's provider for Mac machines (unless that has changed). It is probably worth comparing their offerings.

MacStadium's high end cloud offering (they have lower end ones too) runs $2799/mo and provides 36 cores and 192 GB RAM. They advertise having 1 VM per core (or better). The "better" part sounds fishy to me, but 1 VM per core sounds potentially reasonable, especially given that there would be a little over 5 GB of RAM per instance (unclear on what overhead the infrastructure has).
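
To make the comparison concrete, the per-slot numbers can be worked out from the figures quoted in this thread. This is only a rough sketch: it assumes one build slot per VM at 1 VM per core for MacStadium (36 slots) and uses the 5 concurrent jobs per machine quoted for Travis's dedicated Mac Pros. The function name `per_slot` is made up for illustration.

```python
def per_slot(monthly_usd, slots, cores, ram_gb):
    """Split one machine's (or plan's) monthly price and resources across its build slots."""
    return {
        "usd_per_slot": monthly_usd / slots,
        "cores_per_slot": cores / slots,
        "ram_gb_per_slot": ram_gb / slots,
    }


# Travis CI dedicated Mac Pro: $1500/month, 12 cores, 32 GB RAM, 5 concurrent jobs.
travis = per_slot(monthly_usd=1500, slots=5, cores=12, ram_gb=32)

# MacStadium high end cloud plan: $2799/month, 36 cores, 192 GB RAM,
# assuming 1 VM per core and one build slot per VM.
macstadium = per_slot(monthly_usd=2799, slots=36, cores=36, ram_gb=192)

for name, figures in (("Travis dedicated", travis), ("MacStadium", macstadium)):
    print(
        "{}: ${:.2f}/slot, {:.1f} cores/slot, {:.2f} GB RAM/slot".format(
            name,
            figures["usd_per_slot"],
            figures["cores_per_slot"],
            figures["ram_gb_per_slot"],
        )
    )
```

Under these assumptions the Travis option comes to $300 per concurrent job and MacStadium to about $78 per VM, though the two are not directly comparable: the Travis price includes the CI service itself, while the MacStadium price covers infrastructure only.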

The challenge is I have no idea what sort of software MacStadium provides to manage this. Travis CI is a known quantity that we can use and plug in to our existing infrastructure. Will MacStadium fit this bill or will we need to put some software of our own on the system? In particular, think about provisioning, sharing log files, possibly caching, queuing builds, notifying of build completion. If the latter, who has time to help maintain this? Perhaps the cost differential in the end is having a product ready to go vs. taking on some of the work on our own. We certainly could schedule a call with them too.

It might be helpful as we start to explore paying for this service to field some other suggestions about infrastructure providers for Mac. Either Googling options or reaching out to friends in tech generally (probably DevOps specifically) for suggestions/recommendations. Once we get a better idea of what is out there, we can assess what is feasible for us to use from both financial and maintenance perspectives.

scopatz commented 7 years ago

My personal opinion is that if we are going to fund raise and shell out that kind of money, we'd be better off going with the full stack solution, like travis or circle. If we are only going to pay for infrastructure, we'd still be setting up something special to manage it. In this case it would be better to do that as an "@home" solution. I think it would allow better scaling of infrastructure over our diverse group.

I know that HT Condor does run on mac and would provide a mechanism for mac hosts.

ocefpaf commented 7 years ago

My personal opinion is that if we are going to fund raise and shell out that kind of money, we'd be better off going with the full stack solution, like travis or circle

I agree. (Even though the price scares me.)

tkelman commented 7 years ago

How much of the queue comes from the staged-recipes repo vs everything else? Has the option been considered of moving feedstocks and as many other repos as possible (aside from staged-recipes, assuming that's the busiest?) to different github organizations? It may complicate some code and permissions handling, but would multiply the effective travis concurrency if the load is distributed across many repos.

jakirkham commented 7 years ago

I don't think Travis gives us the stats to figure that out. Though TBH since they added more workers to the pool and fixed up some other issues, it seems the queue has been doing much better.

tkelman commented 7 years ago

It's possible to ask. They have a slack channel where other projects have asked about that kind of data, though I think it's invite only (mostly community language maintainers there). Might be able to ask for someone affiliated with conda-forge to be invited there.

scopatz commented 7 years ago

Hey All, I just got off the phone with Travis and I think there are some better options available than the dedicated infrastructure proposed before.

Basically, for $250 per month we can purchase 5 more concurrent jobs on the travis.org infrastructure. This can be purchased as many times as we want. That is, for $1000 per month we could get 20 more concurrent jobs.

Additionally, Travis offers a 30% discount to non-profits. So, for instance, if Travis was able to bill Numfocus, then we could knock the $250 down to $175.

This seems much more reasonable overall and could eliminate many of our potential queue issues. I am not sure how many concurrent jobs we'd really need, but this framework would allow us to scale up or down depending on our needs and funding.
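
As a rough illustration of how this pricing scales, here is a small sketch using only the numbers above ($250 per block of 5 concurrent jobs, 30% non-profit discount). The function name and the assumption that the discount applies evenly to every block are mine, not something Travis confirmed.

```python
import math


def monthly_cost(extra_jobs, block_size=5, block_price_usd=250, discount_pct=0):
    """Monthly price for extra concurrent jobs, sold in whole blocks."""
    blocks = math.ceil(extra_jobs / block_size)  # partial blocks are not assumed to exist
    return blocks * block_price_usd * (100 - discount_pct) / 100


for jobs in (5, 10, 20):
    full = monthly_cost(jobs)
    discounted = monthly_cost(jobs, discount_pct=30)
    print(
        "{:>2} extra jobs: ${:.0f}/month, or ${:.0f}/month with the non-profit discount".format(
            jobs, full, discounted
        )
    )
```

That works out to $50 per concurrent job per month at list price, or $35 with the discount, compared with $300 per job for the dedicated machines quoted earlier.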

Are we talking with Numfocus already about support / fiscal sponsorship? Sorry if I have been out of it for the past month or so. I am happy to bring this Mac issue up with them if no one else has.

ocefpaf commented 7 years ago

Are we talking with Numfocus already about support / fiscal sponsorship?

Nope. The conversation is stalled.

Sorry if I have been out of it for the past month or so. I am happy to bring this Mac issue up with them if no one else has.

Yes please :grimacing:

(Thanks for looking into this!)

scopatz commented 7 years ago

Nope. The conversation is stalled.

Who was chatting to them before?

ocefpaf commented 7 years ago

Who was chatting to them before?

@jjhelmus

jakirkham commented 7 years ago

FWIW I think Travis CI has largely fixed the massive queues as noted in this comment. Though looking at their status page for the past couple weeks is also informative. That doesn't mean we wouldn't want to increase our concurrency, but it is just something to factor in before making a financial decision.

patricksnape commented 7 years ago

Well that's interesting thanks @scopatz! Would be good to have a centralized place for payments if we go that route (such as numfocus or something) so that we delegate that responsibility.

wesm commented 7 years ago

Would you accept an implicit donation of funds from other NumFOCUS sponsored projects (i.e. we would pay the invoice)? pandas and other projects derive enough value from conda-forge, that a number of us together could foot the bill (assuming there is suitable consensus) for the extra Travis CI capacity.

scopatz commented 7 years ago

@wesm - a discussion of NF supporting all orgs did come up. It appears that for every 2 github orgs under the same bill, they will throw in an extra 5 concurrent jobs. I think this would be a great thing to bring up so that all NF projects could get a tangible benefit from being in NF.

For purposes here, NF would still have to be happy with allowing conda-forge to be part of that.

tkelman commented 7 years ago

JuliaLang has been paying for extra Travis capacity (with funding routed through NumFocus) for a while, same basic plan described above - though I don't think we've been getting a nonprofit discount. If it meant more workers for someone and otherwise didn't hurt anything, we could look into combining somehow.

patricksnape commented 7 years ago

@scopatz that would be awesome, seems like grouping resources would be most cost effective. @wesm that would be very generous and it sounds like some support from numfocus would be very valuable for conda forge

jakirkham commented 6 years ago

Probably time to revisit some options on this front in light of Travis CI's proposed changes.

wesm commented 6 years ago

I would be surprised if we couldn't put together at least $50K or $100K annually in support of CI capacity for Travis CI. I can help with the fundraising, but we need a fiscal conduit (e.g. NumFOCUS)

ocefpaf commented 6 years ago

@wesm I am in touch with NumFOCUS and hopefully soon we'll be a fiscally sponsored project. I'll keep you informed on how this goes.

ChrisBarker-NOAA commented 6 years ago

@wesm I am in touch with NumFOCUS and hopefully soon we'll be a fiscally sponsored project.

Thanks!

Anything the rest of us can do to help?

-CHB