CocoaPods / CocoaPods

The Cocoa Dependency Manager.
https://cocoapods.org/

Issues Cloning Spec repo - GitHub taking a very long time to download changes to the Specs Repo #4989

Closed jlubeck closed 8 years ago

jlubeck commented 8 years ago

Note from @orta -

If you are here because your Specs repo isn't updating, run: cd ~/.cocoapods/repos/master && git fetch --depth=2147483647 - this will convert your local repository of Podspecs to be a full clone, as opposed to a shallow copy.


What did you do?

Run pod setup

What did you expect to happen?

Clone Spec repo master

What happened instead?

It only downloads a few bytes and then throws an error:

fatal: unable to access 'https://github.com/CocoaPods/Specs.git/': transfer closed with outstanding read data remaining

Podfile

No Podfile yet

I also tried cloning the repo manually and with the GitHub desktop app, to no avail. I'm having no issues cloning any other repo on GitHub, only this one. Is it possible there is something wrong with it?

Thanks

jlubeck commented 8 years ago

Tried again, new error:

error: RPC failed; result=18, HTTP code = 200
fatal: The remote end hung up unexpectedly
[!] /usr/bin/git clone https://github.com/CocoaPods/Specs.git master --depth=1

Cloning into 'master'...
error: RPC failed; result=18, HTTP code = 200
fatal: The remote end hung up unexpectedly

Very weird...

ecch531 commented 8 years ago

I got the same issue, too.

mangofever commented 8 years ago

Same. Cannot clone spec repo.

art-divin commented 8 years ago

+1, same issue: git clone https://github.com/CocoaPods/Specs.git takes forever

stringsanbu commented 8 years ago

+1, been messing around with this for a while. I doubled the buffer, didn't work. Uninstalled and reinstalled pods, didn't work. Tried to clone manually, no cigar. It actually seems to be getting "something" but then fails. Using verbose didn't say much, just that it had issues accessing the repo.

I tried accessing my other repos and it seemed to be OK, but it was definitely slower than normal.

huinme commented 8 years ago

+1

I got the same issue, too. My pod version was "0.39.0".

I tried cloning the master repo directly (with git clone git@github.com:CocoaPods/Specs.git master --depth=1 --verbose), but that also failed.

oronbz commented 8 years ago

+1

aceontech commented 8 years ago

+1

pedrocid commented 8 years ago

+1

MarkMolina commented 8 years ago

+1. No success after increasing buffer / reinstall / manual clone

stringsanbu commented 8 years ago

Temporary workaround which might work (I haven't tested it, though): https://github.com/CocoaPods/Specs/archive/master.zip

wget https://github.com/CocoaPods/Specs/archive/master.zip

samuel-mellert commented 8 years ago

https://github.com/CocoaPods/Specs/archive/master.zip would be the correct link, I guess?

fraxool commented 8 years ago

Same error here. What should we do with the file at https://github.com/CocoaPods/Specs/archive/master.zip ?

stringsanbu commented 8 years ago

My bad, the wget link is correct. Just edited my first link.

stringsanbu commented 8 years ago

Not sure what to do with the file yet. Trying to see if we can manually run the commands to have pod setup work.

aceontech commented 8 years ago

Yeah, it merely downloads the repo's contents. The .git/ directory is missing, so it's not recognized as a git repo.

samuel-mellert commented 8 years ago

Yes, same here. It always tries to clone the master repo. Even when I run it with --no-repo-update I get "Creating shallow clone of spec repo master-1 from https://github.com/CocoaPods/Specs.git"

MarkMolina commented 8 years ago

Did anyone try this with 1.0.0 beta 4?

huinme commented 8 years ago

@MarkMolina I tried, but same result.

czechboy0 commented 8 years ago

Try to cd into ~/.cocoapods/repos/master, then git clean -fd to clean up the working copy, git checkout -- . to discard any local modifications to tracked files, and then git pull manually. This took ages but worked for me.

SoundBlaster commented 8 years ago

+1

aceontech commented 8 years ago

Thx, but I removed my master spec repo before I realized something was up with the Github repo ^.^

stringsanbu commented 8 years ago

Got a temp workaround! Tested with my app and everything is working. This is really only needed if you deleted the master repo. If the master folder is still in your ~/.cocoapods/repos folder with contents then you should be ok to just use pod install --no-repo-update.

And you should be good to go!

So in short, here is the basic list of commands I used:

pod setup (in a separate tab)
mv ~/.cocoapods/repos/master/.git ~/tempSpecsGitFolder
^C on pod setup tab
wget https://github.com/CocoaPods/Specs/archive/master.zip
open master.zip (unzipping it)
mv Specs-master ~/.cocoapods/repos/master
mv ~/tempSpecsGitFolder ~/.cocoapods/repos/master/.git
cd [project folder]
pod install --no-repo-update

aceontech commented 8 years ago

Is this a Cocoapods or a wider GitHub issue?

stringsanbu commented 8 years ago

@aceontech Pretty sure it is a GitHub issue, but my other repos are working fine so perhaps only certain repos on certain servers (on their backend) are affected.

aceontech commented 8 years ago

I was just able to do a successful pod setup. Don't know if it's repeatable.

AlexMacBookPro:repos alex$ pod setup --verbose

Setting up CocoaPods master repo

Creating shallow clone of spec repo `master` from `https://github.com/CocoaPods/Specs.git` (branch `master`)
  $ /usr/bin/git clone https://github.com/CocoaPods/Specs.git master --depth=1
  Cloning into 'master'...
  Checking out files: 100% (74393/74393), done.
  $ /usr/bin/git checkout master
  Already on 'master'
  Your branch is up-to-date with 'origin/master'.

SoundBlaster commented 8 years ago

Github is very very slow: ~ 40-50KB/s

pedrocid commented 8 years ago

I was able to do pod setup --verbose right now.

segiddins commented 8 years ago

This is a GitHub issue rather than a cocoapods issue -- you're best off reporting it to their support rather than us, since there's nothing we can do about it.

tychop commented 8 years ago

And github refers to Cocoapods... Great...

segiddins commented 8 years ago

There's nothing we can do -- github are the ones who host the repo and are responsible for serving it. The only commits in the past day have been changing files via their REST API, so the idea that a bad commit got in is very unlikely. In the meantime, installing using --no-repo-update if you already have the master repo cloned is probably the best bet.

orta commented 8 years ago

I've contacted support about it, hopefully everything should be pretty easy to fix

jcampbell05 commented 8 years ago

I was able to clone the repo manually, do we know what is special about the way CocoaPods clones it that we could use to help github ?

orta commented 8 years ago

CocoaPods uses this git command line API, it's calling git clone https://github.com/CocoaPods/Specs.git - I wonder if the problems are location specific, as these commands aren't working for me in NYC.

art-divin commented 8 years ago

I've also contacted GitHub support today, no answer still. I think that the issue is not location-based since there's plenty of distance between NYC & Munich.

Issue was "coming and going" during morning hours, GitHub status page did not reveal any problems during outage.

What would be a really good addition to CocoaPods is an option to change the CocoaPods specs repository URL, so that an in-house mirror of it can be used.

jcampbell05 commented 8 years ago

Potentially mirrors could be hosted at Bitbucket or other providers?

segiddins commented 8 years ago

@jcampbell05 right now, Pod::Source doesn't know how to deal with mirrors, so it wouldn't be much help

mhagger commented 8 years ago

Hey all,

I'm one of the engineers on GitHub's Git infrastructure team. I'd like to start by apologizing for not responding more quickly to this thread. We've been investigating the issues that the CocoaPods community has been experiencing, and I wanted to give you an update on what we have found out so far.

The slow fetches and clones (which sometimes time out) that the CocoaPods community is experiencing are caused by automatic rate limiting on our servers, which is done to make sure that extremely high levels of load in one repository cannot impact other GitHub users. The CocoaPods/Specs repository is more or less permanently being rate limited. Why? There are several factors coming together:

  1. This repository experiences a huge volume of fetches (multiple fetches per second on average). We understand that part of the CocoaPods workflow is that its end users (i.e., not just the people contributing to CocoaPods/Specs) fetch regularly from GitHub, but the results of this are painful for our infrastructure: there have been approximately 1.1 Million clones/fetches from CocoaPods/Specs in the past week. This activity has kept, on average, more than 5 server CPUs permanently pegged, and used several terabytes of bandwidth out of our datacenters. There are only a handful of other repositories in all of GitHub that even come close to this level of activity. As far as I know, this level of activity is not new, but has been going on for many months and probably longer. Suffice it to say that the name CocoaPods/Specs is quite well known within our team :wink:
  2. Apparently, most of the initial clones are shallow, meaning that not the whole history is fetched, but just the top commit. But then subsequent fetches don't use the --depth=1 option. Ironically, this practice can be much more expensive than full fetches/clones, especially over the long term. It is usually preferable to pay the price of a full clone once, then incrementally fetch into the repository, because then Git is better able to negotiate the minimum set of changes that have to be transferred to bring the clone up to date.
  3. Moreover, you seem to be hitting an edge case in Git's shallow fetch support, which is causing a significant fraction of your users' fetches to consume disproportionate CPU time (i.e., 100+ seconds each) on our servers. When this happens, the shallow clones are being converted into nearly-full clones, in a way that is much more expensive than doing a full clone from the start.
  4. Finally, the layout of the repo itself doesn't help. Specifically, the Specs directory, which contains 16k+ subdirectories, causes some Git operations to be unexpectedly expensive, further driving up CPU usage.

All of these factors combine to make CocoaPods/Specs one of the top five most resource-costly repositories that we host on all of GitHub.com. And that is why it is rate-limited; otherwise it would consume even more resources and cause service interruptions for other GitHub users. The symptoms of the rate limiting for you and your users are that your repository accesses (clones, fetches, pushes) have to wait in a queue on our end, sometimes for a long time, before being processed. This causes fetches/clones to take much longer than they would otherwise, and might cause timeouts at your end. Moreover, if the load on our servers becomes too overwhelming, a fraction of the accesses might be rejected altogether.

So, what can we do about it?

First and foremost, let me reiterate our commitment to hosting Open Source projects for free, forever. Our platform doesn't have "hard limits" or monthly traffic quotas. But the same commitment we have towards CocoaPods we also have towards all the other OSS projects that share their storage hosts with your project, and that simply wouldn't be able to operate if our automatic monitoring didn't throttle access to the CocoaPods/Specs repository.

That said, we're working in the open-source Git project on patches to fix the pathological behavior you're experiencing (e.g., see http://thread.gmane.org/gmane.comp.version-control.git/288403). We think Git's handling of shallow clones can be improved, but this might take a while. If the Git client needs to be changed, it wouldn't help until the new client is in the hands of the majority of your users.

The remaining issues, however, are mostly in the hands of the CocoaPods project. I have the feeling that the easiest possible first step would be to address point 2, by changing CocoaPods to use full rather than shallow clones. I assume that the typical clone is updated many times during its lifetime, in which case the initial cost of the larger clone should easily pay off over time while significantly decreasing the load on our servers. Existing clones can be converted from shallow to deep by running

git fetch --depth=2147483647

within the repository.
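
For anyone scripting that conversion, a minimal sketch follows. It assumes a standard non-bare clone, where Git keeps the list of cut-off commits in .git/shallow; the function names are illustrative only and not part of CocoaPods or Git.

```shell
# Succeeds if the clone at $1 is shallow. Assumes a non-bare clone, where a
# shallow history is recorded in the file .git/shallow.
is_shallow_clone() {
  [ -f "$1/.git/shallow" ]
}

# Converts the clone at $1 into a full clone, but only when it is shallow.
deepen_if_shallow() {
  if is_shallow_clone "$1"; then
    # Same command as quoted above; 2147483647 is the largest 32-bit signed
    # integer, so the requested depth is effectively unlimited.
    git -C "$1" fetch --depth=2147483647
  else
    echo "already a full clone: $1"
  fi
}
```

It could be called as deepen_if_shallow ~/.cocoapods/repos/master. A clone that is already full is left alone, so the call is safe to repeat.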

I believe that the change to using non-shallow clones will start reducing the cost of fetches, which will automatically cause the rate limits imposed by our systems to be loosened, ultimately giving a much better experience to the users of CocoaPods.

Longer-term, you should also consider points 1 and 4. Using GitHub as your CDN is not ideal, for anybody involved. I would urge you to consider how CocoaPods could be distributed without using Git operations, which are intrinsically hard to scale. I'm confident that you could come up with a more reliable approach for serving packages. Perhaps a method that is more similar to the approaches used by other packaging systems would work better.

I hope this information is helpful. Please let us know if you have any questions!

jcampbell05 commented 8 years ago

@mhagger would HTTP fetching be easier to scale ?

mhagger commented 8 years ago

@jcampbell05: unfortunately, HTTPS vs SSH wouldn't make a noticeable difference. The expensive part is figuring out which Git objects the client already has, which ones it needs, computing deltas for those objects, and compressing the deltas. When the client has a non-shallow history, the first two steps become much cheaper and the last two steps can often be optimized away entirely.

jcampbell05 commented 8 years ago

@mhagger What I meant is that you can directly link to files via HTTP using the raw.githubusercontent.com domain. If we were to download some things via HTTP directly rather than via git, would that help?

orta commented 8 years ago

I've removed a post noting that I wish we could have been told about the burden earlier so we could have helped out before hitting a ceiling, however, I can imagine it's difficult on your side to keep people in the loop about things like this. Sorry, don't want to de-rail!

jcampbell05 commented 8 years ago

I've left some ideas here to help with the above but I'm not sure if they will help https://github.com/CocoaPods/CocoaPods/issues/5000

I'm very passionate about us getting a deploy command at some point (Works just like bundler's bundle install --deployment).

MikeMcQuaid commented 8 years ago

Hi, another GitHub employee here from the Platform (i.e. API) team and Homebrew maintainer (so I feel the pain of both sides).

If we were to download some things via HTTP directly rather than git would that help ?

It would help if you were using e.g. master.tar.gz tarballs as they can be more easily cached and served without hitting the Git layer every time. The problem from your side is that you'd need to do a ~60MB download every time so I can see this being undesirable.

As well as the shallow changes @mhagger suggested this new, preview API should help: https://developer.github.com/changes/2016-02-24-commit-reference-sha-api/. It's helped Homebrew dramatically reduce the number of no-op git fetches, which will also make things better for your users, as a no-op API HTTP call is significantly faster for you (and less expensive for GitHub) than a no-op git fetch. Feel free to @mention me directly on any pull request implementing it so I can help you ensure you're caching it nicely.
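
A rough sketch of such a no-op check, for illustration only. It assumes the endpoint from the linked announcement accepts the application/vnd.github.v3.sha media type and treats the commit SHA as the ETag, so that a request carrying If-None-Match is answered with 304 Not Modified when nothing has changed; the announcement remains the authority on the exact headers. The function names are hypothetical.

```shell
# Maps the HTTP status of the API call to a decision.
remote_status_from_http_code() {
  case "$1" in
    304) echo "up-to-date" ;;  # the remote SHA matches the one we sent
    200) echo "changed" ;;     # the remote branch points at another commit
    *)   echo "unknown" ;;     # network error, API rate limit, etc.
  esac
}

# Arguments: owner/repo, branch, SHA that the local clone has for that branch.
# The body runs in a subshell so the variable does not leak to the caller.
check_remote() (
  code=$(curl --silent --max-time 5 --output /dev/null \
    --write-out '%{http_code}' \
    --header 'Accept: application/vnd.github.v3.sha' \
    --header "If-None-Match: \"$3\"" \
    "https://api.github.com/repos/$1/commits/$2")
  remote_status_from_http_code "$code"
)
```

A caller would then run git fetch only when the answer is not "up-to-date", for example: [ "$(check_remote CocoaPods/Specs master "$(git rev-parse origin/master)")" = "up-to-date" ] || git fetch. Treating "unknown" like "changed" keeps the plain fetch as the fallback.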

jcampbell05 commented 8 years ago

@mikemcquaid That looks like it will be a huge help, thank you! I'm sure @segiddins, @alloy or @orta will get in touch with their thoughts :) :rocket:

For me a three-tier approach may be best:

alloy commented 8 years ago

@mhagger

I'm one of the engineers on GitHub's Git infrastructure team. I'd like to start by apologizing for not responding more quickly to this thread.

No worries and thanks for jumping on this at all :+1:

I knew that GitHub must spend a sizeable amount of resources on making a repo like CocoaPods/Specs available for ‘free’ to all our users before, but some of the information you’ve now given makes that even clearer.

So in the name of all CP users, first of all, thanks for all that :clap:

With all the hugs and kisses out of the way, let’s get onto sorting this all out. I’ll try to focus on what I think is important for this discussion, but please do point it out if I overlooked important information from your message!


Longer-term, you should also consider points 1

It’s unclear to me what it is in point 1 specifically that we should consider. Can you make that more explicit?

and 4.

This point seems an interesting tidbit, but it’s not clear to me at all why this is the case. Do you have links for us to read-up on this?

Using GitHub as your CDN is not ideal, for anybody involved. I would urge you to consider how CocoaPods could be distributed without using Git operations, which are intrinsically hard to scale. […]

There are a few reasons why we decided to go this route:

Perhaps a method that is more similar to the approaches used by other packaging systems would work better.

For the ‘HR’ and funding reasons listed above, I think we’re actually being ‘smarter’ than various other packaging systems. I’m not going to name them, but I’m sure you can think of examples.

I'm confident that you could come up with a more reliable approach for serving packages.

I’m not at all afraid that we as devs can’t come up with all sorts of solutions :wink:, but I’d like to stay away from immediately assuming that things cannot work at all with the current design and ending up building a cathedral.

I.e. I’d like us to continue this discussion, at first, from the notion of us maintaining the existing architecture. Where things are absolutely impossible, it would be great if you can include more links to docs/source that explain why things are impossible.

Maybe we could host a snapshot of the git repo as a ‘release’ and initially download that?

In addition, reading the linked bug report, I’m not entirely sure I understand if shallow clones are or are not able to work in any feasible way right now. Could you expand on that? E.g. the bug report thread mentions various options, such as “--deepen, --shallow-since and --shallow-exclude”, could any of these be helpful to us in any way?

alloy commented 8 years ago

@mikemcquaid

It would help if you were using e.g. master.tar.gz tarballs as they can be more easily cached and served without hitting the Git layer every time. The problem from your side is that you'd need to do a ~60MB download every time so I can see this being undesirable.

You are referring to these, yeah?

[Screenshot, 2016-03-08 15:48:20]

Yeah that kinda sounds like my idea, except I’d like that to be a one time thing.

I should have stated in my earlier comment that my idea of hosting a snapshot was meant as a way for users to more easily get a full clone, which, as I understand it, would take the shallow/server-side CPU usage burden away?

As well as the shallow changes @mhagger suggested this new, preview API should help: https://developer.github.com/changes/2016-02-24-commit-reference-sha-api/. It's helped Homebrew dramatically reduce the number of no-op git fetches

This looks very interesting, thanks for sharing!

Just to be clear, is the number of no-op git fetches currently a burden that’s leading to the rate-limiting as well?

MikeMcQuaid commented 8 years ago

You are referring to these, yeah?

@alloy I am, yep.

Yeah that kinda sounds like my idea, except I’d like that to be a one time thing.

Sure. Unfortunately that archive is the output of git archive so does not include any .git directory/metadata.

Just to be clear, is the number of no-op git fetches currently a burden that’s leading to the rate-limiting as well?

That's something that's hard for me to identify exactly. I guess it's a question of how often you think users are running git fetch (or equivalent) when there's nothing new to download. My experience locally is that a no-op git fetch for this repository is extremely slow so it's probably worth implementing just for that case and it definitely will decrease load for GitHub rather than increase it.

mhagger commented 8 years ago

and 4.

This point seems an interesting tidbit, but it’s not clear to me at all why this is the case. Do you have links for us to read-up on this?

@alloy: In the Git object model, each version of each directory is stored as a "tree" object. Whenever something changes under the directory, a whole new, modified copy of the tree object has to be written to the object database. The Specs directory has 16k+ entries, and is about 450kb in size (compressed). Every single commit requires a new version of this giant tree.

This superficially doesn't seem so bad, because usually only a single entry in the tree changes each time. So successive versions of the tree delta well against each other, and the repository doesn't explode in size.

The problem is that many Git operations have to traverse the tree, which means that internally the 450kb object has to be recreated from its deltas (usually through multiple steps of deltas, each of which has to be found and decompressed). And your repository has nearly 100k commits, so operations that need to traverse the whole history become extremely expensive.

If, for example, this directory were sharded into subdirectories based on the first and second letters of the package name like so

a/_/A
a/f/A-Framework
a/2/A2DynamicDelegate
a/2/A2StoryboardSegueContext
a/3/A3GridTableView
...
a/u/authorizenet-sdk
a/u/autoAutoLayout
b/6/B68UIFloatLabelTextField
b/a/BABAudioPlayer
b/a/BABCropperView
b/a/BABFrameObservingInputAccessoryView
...
z/i/zipzap
z/l/zldtest
z/l/zldzhang
z/x/zxcvbn-ios

then the Specs directory and its subdirectories would only have 26ish entries, and the next level of directories would all have fewer than a few hundred entries. A modification in such a directory layout would have to rewrite three trees instead of one, but each tree is so much smaller than the current Specs tree that it would nevertheless be a big win.

Such a layout is also a big win for many other reasons. For example, when computing diffs, if two Specs trees have identical a/ subdirectories, then that can be seen without looking inside the subdirectory's tree at all (because the SHA-1s of the trees would be identical). So computing the diff between two successive versions in the sharded scheme probably only requires a few (small) trees to be opened and a few dozen SHA-1s to be compared, whereas today it requires two gigantic trees to be opened and 16k SHA-1s to be compared.
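
The sharded layout in the listing above can be sketched as a small path function. The rule used here is inferred from the listed examples and is an assumption, not a stated specification: lowercase the name, ignore characters that are not letters or digits, use the first two remaining characters as directory names, and substitute _ when one is missing. The name shard_path is hypothetical.

```shell
# Prints the sharded path for the pod name given as $1.
# The body runs in a subshell so its variables do not leak to the caller.
shard_path() (
  # Lowercase the name and keep only letters and digits (assumed rule).
  key=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -cd '[:alnum:]')
  first=$(printf '%s' "$key" | cut -c1)
  second=$(printf '%s' "$key" | cut -c2)
  # Fall back to "_" when the name has no first or second character.
  printf '%s/%s/%s\n' "${first:-_}" "${second:-_}" "$1"
)
```

With this rule, shard_path A-Framework prints a/f/A-Framework and shard_path A prints a/_/A, matching the listing.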

vmg commented 8 years ago

Thanks for your thoughtful reply, @alloy. Hope @mhagger has cleared up the question about your large trees. Regarding your other points:

It’s unclear to me what it is in point 1 specifically that we should consider. Can you make that more explicit?

Point 1 basically refers to using GitHub as a CDN. We totally understand this is convenient for you, and we work hard around the clock to make this a viable option, but Git, by design, is not suited to act as a CDN. You're burning weeks of CPU time and gigabytes of bandwidth from our infrastructure that could be replaced with very little CPU and very little bandwidth if CocoaPods were using a more traditional design for a package management system.

Maybe we could host a snapshot of the git repo as a ‘release’ and initially download that?

This would not be a strict improvement. If you use the tarballs that we offer for download, you will not have the Git metadata for the repository, so further fetches won't be possible. It'd be just as cheap to perform a full clone through Git -- GitHub has a special implementation on the server-side that can make serving a full clone particularly cheap as long as it is not a shallow clone. And obviously, you can continue fetching on top of the original clone.

To reiterate: the major performance issue is not on doing an initial clone of the CocoaPods repository, but in performing a shallow clone and then repeatedly fetching into it, like the CocoaPods client is currently doing.

I’m not entirely sure I understand if shallow clones are or are not able to work in any feasible way right now. Could you expand on that? E.g. the bug report thread mentions various options, such as “--deepen, --shallow-since and --shallow-exclude”, could any of these be helpful to us in any way?

Our advice would be for CocoaPods to stop using any kind of shallow feature from Git altogether. Users should perform a full clone of the repository, and then fetch into it as usual. Simply performing that change should significantly soften the load on our fileservers.

You may be led to believe that this is inefficient (in bandwidth or disk storage), but it actually ends up being significantly cheaper than your current approach. Git is not very good at shallow data, and one pattern we've found (and that we're trying to fix upstream in Git itself) is that merging a branch and fetching that into a shallow repository can cause Git to send an unreasonable amount of objects when that merge crosses the grafted shallow-point of the repository. You can read the investigation in the Git ML here: http://thread.gmane.org/gmane.comp.version-control.git/288403

Besides dropping the shallow clones, I would still urge you to implement @mikemcquaid's suggestion regarding the preview API for no-op updates. At this point, most of the throttling comes from expensive fetches, but every small bit helps.

At the end of the day, any Git pattern will "work" in practice: we have a unique in-house monitoring system that ensures the full availability of our Git platform no matter the circumstances. But this obviously leads to issues like the current thread. If the operations you're performing are not as optimal as they could be (or are pathological like in this case), they will be automatically throttled or cancelled on our servers, and this is a poor experience for the users of CocoaPods.

We cannot force you to change the design of your package manager, but we'd like to reiterate that Git (the version control system itself -- nothing to do with GitHub as a platform) is unsuited for what you're trying to do here. We're here to help you soften the pain, and we'll continue improving the performance of our platform and of the OSS Git client to make pathological workflows work in practice, but this is hard work. We can't assure an ideal user experience with CocoaPods' design choices. :crying_cat_face: