RichiH / conference_proceedings

Proceedings of various conferences

git annex find --not --in web #22

Open joeyh opened 10 years ago

joeyh commented 10 years ago

Finds a few files. This would be a good regression test for this repo.

RichiH commented 10 years ago

I am somewhat unclear on how this works... Does it read the remote webpages and tell me what else is available? How else does it get at that data? And as git-annex apparently knows new files exist: what is the canonical way to just add that data to the annex?

joeyh commented 10 years ago

Richard Hartmann wrote:

I am somewhat unclear on how this works... Does it read the remote webpages and tell me what else is available? How else does it get at that data? And as git-annex apparently knows new files exist: what is the canonical way to just add that data to the annex?

No, this is just finding files in the annex that do not have a url recorded, so git annex get is not going to be able to get them when someone clones this repository.
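A minimal sketch of how this could be turned into the regression test mentioned above, assuming it is run from inside a clone of the repository. The function name `check_all_files_have_url` is made up for illustration; only `git annex find --not --in web` comes from this thread.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if every annexed file has a URL
# recorded for the web special remote.
check_all_files_have_url() {
    # Files listed here could not be fetched from the web by a fresh clone.
    missing=$(git annex find --not --in web) || return 2
    if [ -n "$missing" ]; then
        printf 'Files with no URL recorded:\n%s\n' "$missing" >&2
        return 1
    fi
    return 0
}

# Usage, from the top of the repository:
#   check_all_files_have_url || echo "regression: some files have no URL"
```

An empty result means the check passes; any listed path is either a file that should not be annexed or one whose URL was never recorded.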

see shy jo

Millak commented 10 years ago

I ran this on my repository and got:

    efraim@debian-netbook:~/conference_proceedings$ git annex find --not --in web
    FOSDEM/2007/md5sum.txt
    FOSDEM/2008/devrooms/debian/LICENCE
    FOSDEM/2008/devrooms/debian/MIRROR_ON_YOUR_RISK___READ_THE_README
    FOSDEM/2008/devrooms/debian/ogg_theora/720x576/MD5SUMS.txt
    FOSDEM/2008/devrooms/opensuse/MD5SUMS
    FOSDEM/2008/devrooms/opensuse/README
    FOSDEM/2008/devrooms/xorg/README

All files that should be unannexed.

Then I checked out upstream/master, found 2 more files:

    Linux_Conference_Australia/2014/Monday/lca2014_monday_keynote.mp4
    Linux_Conference_Australia/2014/Tuesday/lca2014_tuesday_keynote.mp4

Neither of these files are in my repo, but based on the other files those were probably renamed on the LCA side.

while 'git annex find --not --in web' will find files with no web remote, the only thing I can think of that would make sure that the web remote actually contained the file would be something like 'git annex fsck --from web', but currently we're well over 500GB. Is there some way to ping the files to check if there are actually files on the other end that match the filesize (at least for those added with --fast and not --relaxed) without downloading the whole file?

joeyh commented 10 years ago

Efraim Flashner wrote:

while 'git annex find --not --in web' will find files with no web remote, the only thing I can think of that would make sure that the web remote actually contained the file would be something like 'git annex fsck --from web', but currently we're well over 500GB. Is there some way to ping the files to check if there are actually files on the other end that match the filesize (at least for those added with --fast and not --relaxed) without downloading the whole file?

git annex fsck --fast --from web should do that, checking only that the web has the files and that they're of the expected size.
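As a rough sketch, the check can be limited to part of the tree by passing paths, which keeps a first run short. The wrapper name `verify_web_urls` and the example path are placeholders, not something from this repository's tooling.

```shell
#!/bin/sh
# Hypothetical wrapper: ask git-annex to verify the web remote without
# downloading file contents. Any arguments are passed through as paths.
verify_web_urls() {
    git annex fsck --fast --from web "$@"
}

# Usage, from inside the repository:
#   verify_web_urls                 # whole repository
#   verify_web_urls FOSDEM/2008     # only one subdirectory
```

It is probably safest to try this in a scratch clone first: as noted later in this thread, a failed check can leave git-annex believing that files have disappeared from the web.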

Note that it might be a good idea for this repository to always use git annex addurl --relaxed; this generates keys that have no defined size, so if a video is later edited or re-encoded, git-annex won't care.
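For comparison, a sketch of the two ways of adding a URL without downloading it that this thread contrasts: `--fast` records the size the server reports, while `--relaxed` records no size and so skips the size check. The function names, the file path and the example.org URL are placeholders.

```shell
#!/bin/sh
# Hypothetical helpers; both take a URL and the file name to create.

# Record the URL together with the size reported by the server.
add_fast() {
    git annex addurl --fast --file="$2" "$1"
}

# Record the URL only; the key carries no size, so later edits or
# re-encodes of the remote file are not treated as a mismatch.
add_relaxed() {
    git annex addurl --relaxed --file="$2" "$1"
}

# Usage:
#   add_relaxed https://example.org/talk.mp4 talks/talk.mp4
```

The trade-off is that a relaxed key gives up the only cheap integrity signal available for a remote file, so it suits upstreams that re-encode their videos.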

see shy jo

RichiH commented 8 years ago

@joeyh Your reply seems to be empty. Can you resend?

Ideally, there would be an "open" get which gets all content and offers to change everything that's changed. I.e. a global web remote update function. Could you reasonably implement that?

RichiH commented 8 years ago

@joeyh: @clacke sent in two PRs.

clacke commented 8 years ago

I don't think my two PRs are related to this, but I'll chime in and say that find --not --in web is great for discovering that somebody merged master without also merging (the right) git-annex. That's probably the only situation this catches, but it's an important one.

clacke commented 8 years ago

I have been trying to use fsck --fast --from web, but ran into more trouble than it's been worth. First of all, never run it on the repo that has the hack that uses the web uuid as uuid. :-)

But I think it has surprising behaviors even when running on a "normal" repo. For FOSDEM it always returns false even though the files are there, not sure why. Maybe it's the redirect? (videos.fosdem.org redirects in a round-robin fashion to one of 8-9 mirrors)

Maybe fsck --from web works better, but then you are looking at gigabytes of downloads.

EDIT: And now I realized I've basically restated what @millak said two years ago. I'll let it stay, because I think I said it slightly differently, which may help. :-)

clacke commented 8 years ago

Possibly find --not --in web also catches when somebody has accidentally made git-annex think that all files disappeared from upstream, by running fsck --fast --from web.

joeyh commented 8 years ago

Claes Wallin (韋嘉誠) wrote:

But I think it has surprising behaviors even when running on a "normal" repo. For FOSDEM it always returns false even though the files are there, not sure why. Maybe it's the redirect? (videos.fosdem.org redirects in a round-robin fashion to one of 8-9 mirrors)

fsck --fast --from web notices when urls are no longer accessible (after following redirects), or when the size of the content at the url differs from the recorded size of the file. addurl --relaxed avoids the latter check.
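To help narrow down a case like the FOSDEM one by hand, here is a rough approximation of those two checks using curl. This is not how git-annex implements them; the function names are invented for illustration, and the expected size has to be supplied by the caller.

```shell
#!/bin/sh
# Read HTTP response headers on stdin (possibly several responses, one per
# redirect hop) and print the Content-Length of the last response, if any.
last_content_length() {
    tr -d '\r' | awk '
        toupper($1) ~ /^HTTP\// { size = "" }          # new hop: forget earlier size
        tolower($1) == "content-length:" { size = $2 }
        END { if (size != "") print size }'
}

# Print the size the server reports for URL after following redirects.
remote_size() {
    curl -sIL "$1" | last_content_length
}

# Succeed only if URL reports exactly EXPECTED_SIZE bytes.
size_matches() {
    actual=$(remote_size "$1")
    [ -n "$actual" ] && [ "$actual" = "$2" ]
}

# Usage (placeholder URL and size):
#   size_matches https://example.org/talk.mp4 123456789 || echo "mismatch"
```

Running `remote_size` several times against a round-robin redirector would show whether individual mirrors report different sizes, which is one possible explanation for checks that always fail.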

see shy jo

clacke commented 8 years ago

Does that mean I should ask for a new feature, fsck --relaxed?

(I think those URLs were relaxed, but I can't be sure now)