gb vendor fetch: do not check out same remote repository for different import paths

When one runs gb vendor fetch, gb calls main.fetch to acquire the remote repository, copy a portion of it, and then discard its copy. After that, so long as its -no-recurse flag is false, it proceeds to fetch the missing transitive dependencies of the source it has acquired thus far.

The problem arises when one requests fetching of one import path from a repository that yields files that in turn import alternate paths within that same repository. Consider a hypothetical repository:

example.com/org/repo/.git
example.com/org/repo/p1
- file1.go

package p1

import "example.com/org/repo/p2"

var P p2.Something

example.com/org/repo/p2
- file2.go

package p2

type Something string

If one runs

gb vendor fetch example.com/org/repo/p1

then gb will fetch the repository example.com/org/repo, copy the p1 path within it, then proceed to fetch the same repository again, then copy the p2 path within it.
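To make the duplicate download concrete, here is a minimal model of the flow described above. It is only a sketch and not gb's actual code: the `fetcher` type, the `imports` table, and the `downloads` counter are all hypothetical stand-ins for cloning a repository and scanning the vendored source.

```go
package main

import "fmt"

// imports is a hypothetical stand-in for scanning vendored source:
// it maps an import path to the import paths its files reference.
var imports = map[string][]string{
	"example.com/org/repo/p1": {"example.com/org/repo/p2"},
	"example.com/org/repo/p2": nil,
}

// fetcher models the flow described above; it is not gb's actual code.
type fetcher struct {
	downloads int             // how many times a repository was acquired
	vendored  map[string]bool // import paths already copied
}

// fetch acquires the repository, copies one path, discards the checkout,
// then recurses into any imports that are still missing.
func (f *fetcher) fetch(path string, recurse bool) {
	f.downloads++           // acquire the whole remote repository
	f.vendored[path] = true // copy only the requested path
	// The checkout is discarded here, before dependencies are examined.
	if !recurse {
		return
	}
	for _, dep := range imports[path] {
		if !f.vendored[dep] {
			f.fetch(dep, recurse)
		}
	}
}

func main() {
	f := &fetcher{vendored: make(map[string]bool)}
	f.fetch("example.com/org/repo/p1", true)
	fmt.Println("downloads:", f.downloads) // prints "downloads: 2"
}
```

Both downloads in this model are of example.com/org/repo; the second happens only because the first checkout was thrown away before p2 was found to be missing.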

This doesn't matter much for small repositories, but for large ones it can take many hours, wasting bandwidth and churning the disk unnecessarily. Consider augmenting main.fetch to remember the set of repositories it's downloaded from its initial top-level invocation, and to destroy them all only when unwinding back up to the top-level. Intermediate recursive invocations could share that repository cache to avoid downloading the same repository more than once.
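One way the suggested cache might look is sketched below. Everything in it is an assumption made for illustration: `repoCache`, `newRepoCache`, `get`, and `destroyAll` are hypothetical names, and the `download` and `cleanup` callbacks stand in for whatever gb really does to clone and remove a checkout.

```go
package main

import "fmt"

// repoCache remembers the repositories downloaded during one top-level
// fetch. All names here are hypothetical; this is not gb's actual code.
type repoCache struct {
	download  func(root string) (string, error) // returns a local checkout dir
	cleanup   func(dir string) error            // removes a local checkout dir
	checkouts map[string]string                 // repository root -> checkout dir
}

func newRepoCache(download func(string) (string, error), cleanup func(string) error) *repoCache {
	return &repoCache{download: download, cleanup: cleanup, checkouts: make(map[string]string)}
}

// get returns the checkout for root, downloading it only on the first request.
func (c *repoCache) get(root string) (string, error) {
	if dir, ok := c.checkouts[root]; ok {
		return dir, nil
	}
	dir, err := c.download(root)
	if err != nil {
		return "", err
	}
	c.checkouts[root] = dir
	return dir, nil
}

// destroyAll discards every checkout and reports the first cleanup error.
// It is meant to run once, when the top-level fetch unwinds.
func (c *repoCache) destroyAll() error {
	var first error
	for root, dir := range c.checkouts {
		if err := c.cleanup(dir); err != nil && first == nil {
			first = err
		}
		delete(c.checkouts, root)
	}
	return first
}

func main() {
	downloads := 0
	cache := newRepoCache(
		func(root string) (string, error) {
			downloads++
			return "/tmp/checkout/" + root, nil // stand-in for a real clone
		},
		func(dir string) error { return nil }, // stand-in for removing the dir
	)
	defer cache.destroyAll()

	// p1 imports p2, and both live in the same repository.
	for _, path := range []string{"example.com/org/repo/p1", "example.com/org/repo/p2"} {
		if _, err := cache.get("example.com/org/repo"); err != nil {
			fmt.Println("fetch failed:", err)
			return
		}
		fmt.Println("copied", path)
	}
	fmt.Println("downloads:", downloads)
}
```

Keying the cache on the repository root rather than on the import path is what lets p1 and p2 share one checkout. The top-level invocation would create the cache and defer `destroyAll`, while recursive invocations only ever call `get`.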

Yes, this is something I need to fix. It's not just inefficient, it's actually wrong to cherry pick parts of a repo.

On Fri, Sep 23, 2016 at 5:32 AM, Steven E. Harris notifications@github.com wrote:

