Dont fetch git origin multiple times per run

njam commented 10 years ago

I noticed that librarian fetches the origin multiple times for the same repo, because both the git-"source" and the git-"repository" are new objects for every dependency.

With attached code I introduced a git-"source"-wide cache for which "uri+remote+ref" combinations have been fetched and what the current "sha" is.

This speeds up installs and updates substantially if multiple dependencies come from the same git repo.

Please review.

yfeldblum commented 10 years ago

I like the idea. However the cache should not be on the class object. Rather, the cache should somehow be scoped to the environment (the semiglobal context object).

njam commented 10 years ago

A-ha, makes sense! Should there be a environment.fetch_git_hash(repository, uri, ref) or a generic key-value-store for the environment like environment.cache_get(key, lambda)? Other ideas?

njam commented 10 years ago

I went ahead and introduced a environment.cache(key, getter). Let me know what you think about this solution (and the naming).

njam commented 10 years ago

I renamed the function to environment.runtime_cache(key, getter), might be a bit clearer?

njam commented 10 years ago

Something like this? :)

njam commented 10 years ago

@yfeldblum I merged in upstream, and added remote-fetching if the given sha is not found in the current repo's log. What do you think?

(somehow travis doesn't post the build status, here it is: https://travis-ci.org/njam/librarian/builds/18083075)

yfeldblum commented 10 years ago

Sweet.

Please use the class name as keyspace, and use the keyspace method (new).

def rtc
  @rtc ||= environment.runtime_cache.keyspace(self.class.name)
end

Please use once instead of memo (new) when we don't need the return value of the block.

def fetch(remote)
  rtc.once { fetch_real(remote) } # or inline with do...end
end

(fetch_hash should still use memo because we need the return value of the block.)

Please avoid increasing the git ops count in all cases.

Please rebase & squash.

njam commented 10 years ago

Thanks for the feedback, I pushed a new version!

I introduced a "keyspaced" method runtime_cache, or did you want it to be called rtc actually?

Regarding the change in git ops: I think in these two cases (sha is provided and already checked out) I can't keep the 3 instead of 4 ops. But: These are only local operations, the far slower fetch will be executed less often instead. So overall the speed improvements should be significant. Previously when a given sha was not checked out, it would just fetch from the remote. Now in that scenario I first want to check whether that sha is already in the repo's history, to know whether it makes sense to fetch or not. Do you agree, or am I missing something?

Before:

/usr/local/bin/git reset --hard --quiet
/usr/local/bin/git clean -x -d --force --force
/usr/local/bin/git rev-parse HEAD --quiet

After:

/usr/local/bin/git reset --hard --quiet
/usr/local/bin/git clean -x -d --force --force
/usr/local/bin/git log -1 --no-color --format=tformat:%H ebd4a74e26c18b65cd966137dc3296c4091cbba5
/usr/local/bin/git rev-parse HEAD --quiet

njam commented 10 years ago

... looks like public_send was introduced only in ruby 1.9.1. I now included this "backports" gem as suggested by this stackoverflow answer for ruby 1.8.7 compatibility. What do you think? (again, travis: https://travis-ci.org/njam/librarian/builds/18103140)

yfeldblum commented 10 years ago

Fixed.

yfeldblum commented 10 years ago

Obviously I should have had a test case for that so Travis would catch it early!

yfeldblum commented 10 years ago

reset & clean can be done just once per process to reduce ops count.

njam commented 10 years ago

Thank you for the great work and all the tests, much appreciated!

applicationsonline / librarian

Dont fetch git origin multiple times per run #163