christianspecht / scm-backup

Makes offline backups of your cloud hosted source code repositories
https://scm-backup.org/
GNU General Public License v3.0
58 stars 20 forks source link

Adding basic LFS support #62

Open evolu1 opened 3 years ago

evolu1 commented 3 years ago

Concerning issue #21

Basic solution to backup a repository including the real LFS files (and not just the pointer files).

I would love to see LFS support implemented. It's my first contribution to a bigger repository. I'm happy to improve this pull request.

christianspecht commented 3 years ago

Hi,

thanks for your contribution. I have zero experience with Git LFS, only a basic idea what it does...so a few questions:

You wrote that lfs fetch --all works for LFS and non-LFS repos, but what happens if the local Git installation doesn't have LFS installed/enabled?

Would the call throw an exception? Would the user somehow be able to notice that SCM Backup skipped their large files?

If not, it would be nice if SCM Backup could detect this somehow and warn the user. Any idea if that's possible?
I did a quick Google search and getting the information "yes, this repo contains at least one LFS file" is apparently not that easy.

evolu1 commented 3 years ago

Thanks for the reply.

I thought about this...

Let's say we only want to backup LFS files (with git lfs fetch --all) if the repo really contains these files. The only way I found out to do is in a bare repo, is to get the list of lfs files with git lfs ls-files. The successful result (containing the list of lfs files) is either:

You wrote that lfs fetch --all works for LFS and non-LFS repos, but what happens if the local Git installation doesn't have LFS installed/enabled?

On Windows git lfs is being installed with the git itself. I tested the code in this pull request on Ubuntu and Windows specifically without LFS: If git lfs is not installed scm-backup will end with an error message (InvalidOperationException). One can probably test for a correct installation beforehand when testing for git installation. This worked for me:

string cmd = string.Format("lfs version");
var result = this.ExecuteCommand(cmd);
if (!result.Successful)
{
    //warn the user: git lfs is not installed and possibly existing lfs files are not backed up
    //disable lfs and do not run lfs commands later
} else 
    //git lfs found: enable lfs and run lfs commands
}

In the end it could all look something like this inside the PullFromRemote method in GitScm.cs:

if ( -lfs is installed- )
{
    //test if repo contains lfs files
    string cmd = string.Format("-C \"{0}\" lfs ls-files", directory);
    var result = this.ExecuteCommand(cmd);
    //the program can cause a deadlock here, see below
    if (!result.Successful)
    {
        throw new InvalidOperationException(result.Output);
    }
    if (String.IsNullOrWhiteSpace(result.Output))
    {
        //no lfs files found, continuing
    }
    else
    {
        //lfs files found, backing them up
        string cmd2 = string.Format("-C \"{0}\" lfs fetch --all {1}", directory, remoteUrl);
        var result2 = this.ExecuteCommand(cmd2);
        if (!result2.Successful)
        {
            throw new InvalidOperationException(result2.Output);
        }
    }
}
else
{
    //do nothing
}

This works fine for repos without any LFS files (having an empty LFS file list). But apparently getting a large list of LFS files as a response causes a deadlock. The program waits indefinitely.

One short fix for me was to make the following change in the ExecuteCommand method in CommandLineScm.cs I switch out the original:

...
var proc = Process.Start(info);
var result = new CommandLineResult();
result.StandardError = proc.StandardError.ReadToEnd();
result.StandardOutput = proc.StandardOutput.ReadToEnd();
proc.WaitForExit();

result.ExitCode = proc.ExitCode;
return result;

For this:

...
var proc = Process.Start(info);
var result = new CommandLineResult();
var output = new System.Text.StringBuilder();
var error = new System.Text.StringBuilder();

proc.OutputDataReceived += (sender, eventArgs) => output.AppendLine(eventArgs.Data);
proc.ErrorDataReceived += (sender, eventArgs) => error.AppendLine(eventArgs.Data);
proc.BeginOutputReadLine();
proc.BeginErrorReadLine();
proc.WaitForExit();

result.StandardOutput = output.ToString();
result.StandardError = error.ToString();
result.ExitCode = proc.ExitCode;
return result;

This made a large list of LFS files as response possible and no deadlock occured. But maybe it causes other issues since it modifies the ExecuteCommand method. I have no idea about deadlocks! This solution is somewhat copied from the internet and maybe there is a better solution. Some links I found regarding this: https://csharp.today/how-to-avoid-deadlocks-when-reading-redirected-child-console-in-c-part-2/ https://docs.microsoft.com/en-us/dotnet/api/system.diagnostics.process.standardoutput?view=net-5.0#remarks

Thanks for taking the time. What is your preferred way to continue with this PR?

christianspecht commented 3 years ago

Thanks for the extensive write-up. Sorry for taking so long to answer, but spare time is a precious good at the moment :-/

In general:
You write that you aren't sure about the implications of the "deadlock fix". Without investigation, I'm not sure either.
Maybe we should make LFS support optional at first, so users who need it can can enable it via config setting.
(I have some improvements to the config in the making in this branch, which I'm planning to merge into master soon)

Let's say we only want to backup LFS files (with git lfs fetch --all) if the repo really contains these files. The only way I found out to do is in a bare repo, is to get the list of lfs files with git lfs ls-files.

[...]

On Windows git lfs is being installed with the git itself. I tested the code in this pull request on Ubuntu and Windows specifically without LFS: If git lfs is not installed scm-backup will end with an error message (InvalidOperationException).
One can probably test for a correct installation beforehand when testing for git installation.

I'd like to have methods in IScm for both (checking if the local git installation has LFS enabled, and if repo X contains LFS files(, so we can check this from other places in the code as well (for example, to show a warning to the user).
(There's also a MercurialScm which isn't used at the moment, just make it return false.

Thanks for taking the time. What is your preferred way to continue with this PR?

Thanks for contributing!
How about implementing the additional IScm methods first (and some tests for them, if you have experience with tests). For the tests, we'd need one test repo with 1 or 2 large files.

evolu1 commented 3 years ago

In commit 5323b32a44c6a494eb943c252c4c4cbb7b78bd85 I have added these three methods to the files IScm, CommandLineScm, GitScmand MercurialScm.

bool LFSIsOnThisComputer()
bool RepositoryContainsLFS(string directory)
void PullLFSFromRemote(string remoteUrl, string directory, ScmCredentials credentials)

I executed them inside PullFromRemote for now. When pulling for the normal backup this will be run too:

if (LFSIsOnThisComputer())
{
    if (RepositoryContainsLFS(directory))
    {
        PullLFSFromRemote(remoteUrl, directory, credentials);
    }
}

LFSIsOnThisComputer() can probably be executed once and stored globally. RepositoryContainsLFS(directory)should only be executed if LFSIsOnThisComputer() = true since it is a LFS command.

ExecuteCommand in CommandLineScm is changed to implement the deadlock "fix" I talked about above. Something is not 100% right there. When running scm-backup it eats the "F" in one line showing "ound Git 2.27...". This was the only problem I noticed so far and the backing up is fine.

I am not experienced with testing at all, hence I left it out completely.

The idea to have LFS enabled via config is brilliant. Once enabled by the user the program could use the three LFS methods as well as the experimental deadlock fix.

christianspecht commented 3 years ago

Thank you for the work so far! No problem, I can create the tests.

I created this test repo with two large files: https://github.com/scm-backup-testuser/git-lfs-files

...by doing this:

fsutil file createnew test1.lfs 1048576
fsutil file createnew test2.lfs 5242880

(source: https://stackoverflow.com/a/986041/6884)

git lfs track "*.lfs"

The most basic test for your changes would be cloning this repo with SCM Backup, and verifying that the two large files are actually in the working directory.

But I tested it with the old version of SCM Backup (without your changes) as well. As far as I understand it, the actual large files should be missing in the backup, but there should be some "pointer files" instead.
Correct?

SCM Backup creates a bare repo when backing up...so to view its content, I need to clone it again.
When I try that, I get this:

Cloning into 'C:/Dev/Code/repo'...                              
done.                                                           
Downloading test1.lfs (1.0 MB)                                  
Error downloading object: test1.lfs (30e1495): Smudge error: Err
or downloading test1.lfs (30e14955ebf1352266dc2ff8067e68104607e7
50abb9d3b36582b8af909fcb58): EOF                                

Errors logged to C:\Dev\Code\repo\.git\lfs\logs\20210331T005902.
2214644.log                                                     
Use `git lfs logs last` to view the log.                        
error: external filter 'git-lfs filter-process' failed          
fatal: test1.lfs: smudge filter lfs failed                      
warning: Clone succeeded, but checkout failed.                  
You can inspect what was checked out with 'git status'          
and retry with 'git restore --source=HEAD :/'                   

After that, the LFS files are missing in the working directory, but there are no pointer files either.

So, did I do anything wrong? Did I create the LFS files in the test repo the correct way? If not, could you create a test repo with some LFS files? (with reasonable size, because it will be cloned each time the tests are running)

christianspecht commented 3 years ago

I wrote this:

Maybe we should make LFS support optional at first, so users who need it can can enable it via config setting. (I have some improvements to the config in the making in this branch, which I'm planning to merge into master soon)

--> In the meantime, I merged that branch into master. There's now a ConfigOptions, which you can use for the config setting.

It's still empty in master, but I have more changes in the making in another branch, where you can see how it's intended to be used:

You can copy & paste the GitOptions subclass from there, and instead of the Implementation property, you can create something like public bool UseLfs { get; set; }

And then you can create a new section in the config:

options:
    git:
        useLfs: true

...and get the value from the config like this:

bool lfs = config.Options.Git.UseLfs;
evolu1 commented 3 years ago

Sorry for the delay. I tried a lot of stuff.


Concerning the options file

I merged your current master into this lfs-support fork and added the public bool UseLfs { get; set; } to ConfigOptions.cs But I am not sure how to and where exactly to use it in the different files.

This current code in GitScm.cs is

if (LFSIsOnThisComputer())
      {
          if (RepositoryContainsLFS(directory))
          {
              PullLFSFromRemote(remoteUrl, directory, credentials);
          }
      }

In the end, it should work like this?

Is this the best approach? How do I use/access the bool UseLfs in the GitScm.cs?


Concerning the LFS test repo and files

The most basic test for your changes would be cloning this repo with SCM Backup, and verifying that the two large files are actually in the working directory. But I tested it with the old version of SCM Backup (without your changes) as well. As far as I understand it, the actual large files should be missing in the backup, but there should be some "pointer files" instead. Correct?

Yes and No :) I would expect the same behavior. So to assure you, you did everything correct!

The mentioned error (Smudge error: Error downloading) will be shown when scm-backup creates a backup of a repo including LFS files and the user tries to make a local clone of the backup to restore files.
(This is what you would do normally, but it does not work. From my point of view it is due to LFS and its behavior in general. Below I described the working procedure I tested and use with scm-backup.)

Despite the error and the pointer files not being there in the file system after cloning, you can see the pointer files are included in the commit: test1.lfs looks something like this:

version https://git-lfs.github.com/spec/v1
oid sha256:30e14955ebf1352266dc2ff8067e68104607e750abb9d3b36582b8af909fcb58
size 1048576

For completeness, this is how one can view the desired files you created, cloning directly from a live bare repo:
(here, it is the test repo with your 2 LFS files)

Clone to get real LFS files:

Clone to get pointer files only:

LFS is not used to the scm-backup procedure with creating a bare clone. LFS works with cloning from bare to normal (classic git use case), but it does not like to use bare to bare cloning. In the beginning, I had to test around with this part a lot to make the LFS backup work with the bare clone from scm-backup. Also, I totally forgot to mention how to restore the LFS backup... Sorry about that.
Restoring a backup including LFS files by just locally cloning the backup does not work. So, here is how it's done. (This would need to be included in the docs later as well.)

Fully restoring a "BackupRepo" created by scm-backup which includes LFS files:

  1. create another bare repo locally with git init --bare we call it "NewRepo"
  2. in the original "BackupRepo" that scm-backup created do:
    • git push --mirror "C:\Folder\Documents\NewRepo"
    • git lfs push --all "file:///C:\Folder\Documents\NewRepo" (be careful with the slashes)
  3. now clone the "NewRepo" locally, the local clone contains all LFS files

To be fair, I was confused a lot after cloning a repo for the tenth time... I hope these explanations are correct and somehow comprehensible :)

christianspecht commented 3 years ago

Sorry for the delay. I tried a lot of stuff.

No problem, you see that my delays are at least as long as yours. At the moment my time is very limited :-(

For a start, here are my answers to "Concerning the options file“:

In the end, it should work like this?

  • if (UseLfs) -> continue... (when false, skip everything)
  • if (LFSIsOnThisComputer()) -> continue... (when false, throw exception and warn the user the machine is missing LFS even though it is activated in options)
  • if (RepositoryContainsLFS(directory)) -> PullLFSFromRemote(remoteUrl, directory, credentials);

Yes, looks good. If someone explicitly enabled it in the config, IMO it should visibly fail when LFS is missing on their machine.

But I would prefer failing earlier...so that SCM Backup wouldn’t even start to backup.
The right place for this is the ScmValidator, this is the thing that checks for the existence of Git on the machine before the first backup starts ("Trying to find SCMs on this machine..." / "Found Git x.y.z")

This is called from the ScmBackup class.

In an ideal world, the call to the hoster's API (which happens before) would return the information per repo if it contains LFS files. If we had that, we could only do the check when there's actually a repo that contains LFS files, but at first glance not every hoster's API supports this. Gitlab has a lfs_objects_size property, but I can't find anything in GitHub's API docs.

So I think the only viable solution is adding code in ScmValidator.ValidateScms that fails when LFS is enabled in the config, but not installed on the machine.

Later, when we are sure that the "experimental" LFS support works, we will probably just set the config value to true by default...so people who get the "LFS not installed" warning can just disable it if they are sure they don't need it.

How do I use/access the bool UseLfs in the GitScm.cs?

The whole configuration is part of the Context , which is already injected into GitScm.

So you can access it via this.context.Config.Options.Git.UseLfs;


The next thing I'll try to do is adding some tests for backing up the LFS test repo.

christianspecht commented 3 years ago

I'm not sure if GitHub has a better way to contribute to a fork of my own repository.
I wasn't able to fork your fork 😎 , so I created a new repo based on yours and pushed some changes to it:

https://github.com/christianspecht/scm-backup-lfs

For now, I fixed the tests.
Next thing, as I said, I'll try to add some tests for LFS backups!

ItayHershko commented 2 years ago

Guys, thanks for all the work you are doing!

Any news when LFS support will be available?

christianspecht commented 2 years ago

@ItayHershko: I don't know when it will be available.
evolu1 (the Git LFS expert 😉 ) is doing all the research/work, I'm just giving advice about the inner workings of SCM Backup.
(and I'm giving it slooowly because of capacity issues / non-existing spare time right now)


@evolu1: I pushed some more changes to https://github.com/christianspecht/scm-backup-lfs, but so far I wasn't able to actually restore the large file.

There's now a test which is supposed to verify that the backup actually contains the large file (using the steps you listed), but the last clone action ("NewRepo" to final repo) fails with this error:

Cloning into 'C:\\Users\\cs\\AppData\\Local\\Temp\\_scm-backup-gitscm\\20210825211343370-finalrepo'...
done.
Downloading test1.lfs (1.0 MB)
Error downloading object: test1.lfs (30e1495): Smudge error: Error downloading test1.lfs (30e14955ebf1352266dc2ff8067e68104607e750abb9d3b36582b8af909fcb58): error transferring \"30e14955ebf1352266dc2ff8067e68104607e750abb9d3b36582b8af909fcb58\": [0] remote missing object 30e14955ebf1352266dc2ff8067e68104607e750abb9d3b36582b8af909fcb58

Errors logged to C:\\Users\\cs\\AppData\\Local\\Temp\\_scm-backup-gitscm\\20210825211343370-finalrepo\\.git\\lfs\\logs\\20210825T231351.0780564.log
Use `git lfs logs last` to view the log.
error: external filter 'git-lfs filter-process' failed
fatal: test1.lfs: smudge filter lfs failed
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'

The log file mentioned in the message is here: 20210825T231351.0780564.log

To make sure the problem is not somehow caused by the fact that I'm calling the Git command line from C# code, I tried to do the same steps in a Windows batch file:

rem the bare repo created by SCM Backup
set backuprepo=C:\Users\cs\AppData\Local\Temp\_scm-backup-tests\20210814214300765-iscm-git-pull-lfs-new

rem the temp repo
set temprepo=C:\Users\cs\AppData\Local\Temp\_scm-backup-gitscm\xx

rem the final repo
set finalrepo=C:\Users\cs\AppData\Local\Temp\_scm-backup-gitscm\final

rd %temprepo% /s /q
rd %finalrepo% /s /q

md %temprepo%
cd %temprepo%
git init

cd %backuprepo%
git push --mirror %temprepo%

git lfs push --all "file:///%temprepo%"

git clone %temprepo% %finalrepo%

pause

But this fails already in the git push --mirror step, with a similar error message:

Enumerating objects: 8, done.
Counting objects: 100% (8/8), done.
Delta compression using up to 8 threads
Compressing objects: 100% (6/6), done.
Writing objects: 100% (8/8), 1.31 KiB | 1.31 MiB/s, done.
Total 8 (delta 0), reused 0 (delta 0), pack-reused 0
remote: error: refusing to update checked out branch: refs/heads/master
remote: error: By default, updating the current branch in a non-bare repository
remote: is denied, because it will make the index and work tree inconsistent
remote: with what you pushed, and will require 'git reset --hard' to match
remote: the work tree to HEAD.
remote:
remote: You can set the 'receive.denyCurrentBranch' configuration variable
remote: to 'ignore' or 'warn' in the remote repository to allow pushing into
remote: its current branch; however, this is not recommended unless you
remote: arranged to update its work tree to match what you pushed in some
remote: other way.
remote:
remote: To squelch this message and still keep the default behaviour, set
remote: 'receive.denyCurrentBranch' configuration variable to 'refuse'.
To C:\Users\cs\AppData\Local\Temp\_scm-backup-gitscm\xx
 ! [remote rejected] master -> master (branch is currently checked out)
error: failed to push some refs to 'C:\Users\cs\AppData\Local\Temp\_scm-backup-gitscm\xx'

Does this work on your machine?
Can you share a batch file with the exact steps that work on your machine?

igor-gorjanc commented 1 week ago

Is LFS support included in the latest release version?