SCons / scons

SCons - a software construction tool
http://scons.org
MIT License
2.11k stars 318 forks source link

Bug when combining CacheDir and MSVC_BATCH #4361

Open CiderMan opened 1 year ago

CiderMan commented 1 year ago

Describe the bug We have recently switched on CacheDir in our CI machine and, while this has largely been a positive experience, we have seen the odd link failure that we could not explain. I was eventually able to reproduce the issue and produce a minimal SConstruct that illustrates the problem. It is an interaction between the MSVS batching and the cache.

Required information

To reproduce the problem, I have created two C files and a SConstruct:

src/main.c

void Foo(void);

int main()
{
   Foo();
   return 0;
}

src/foo.c

void Foo(void)
{
   //
}

SConstruct

CacheDir(r"C:\tmp\cache")

env1 = Environment(MSVC_BATCH=True)
env1.VariantDir("build1", "src", duplicate=False)
env1.Program("build1/a.exe", ["build1/main.c", "build1/foo.c"])

env2 = Environment(MSVC_BATCH=True)
env2.VariantDir("build2", "src", duplicate=False)
env2.Program("build2/a.exe", ["build2/main.c", "build2/foo.c"])

The steps to reproduce the issue are:

  1. Perform a build (with just scons) to populate the cache
  2. Delete build1/main.obj
  3. Edit src/foo.c such that it needs to be rebuilt and won’t be in the cache
  4. Perform another build

Running scons, I would expect build1/main.obj to be retrieved from the cache, foo.c to be builts to produce build1/foo.obj and build2/foo.obj before the two Programs are linked. What I actually get is a build failure:

$ scons
scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
Retrieved `build1\main.obj' from cache
cl /Fobuild1\\ /c src\main.c src\foo.c /nologo
main.c
foo.c
Generating Code...
link /nologo /OUT:build1\a.exe build1\main.obj build1\foo.obj
Retrieved `build2\main.obj' from cache
cl /Fobuild2\\ /c src\foo.c /nologo
foo.c
link /nologo /OUT:build2\a.exe build2\main.obj build2\foo.obj
LINK : fatal error LNK1181: cannot open input file 'build2\main.obj'
scons: *** [build2\a.exe] Error 1181
scons: building terminated because of errors.

At this point build2/main.obj has been removed (hence the link failure) but running scons again succeeds:

$ scons
scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
Retrieved `build2\main.obj' from cache
Retrieved `build2\foo.obj' from cache
link /nologo /OUT:build2\a.exe build2\main.obj build2\foo.obj
scons: done building targets.
mwichmann commented 1 year ago

To add what I said in the referenced email thread: can reproduce this, and believe, though don't have definitive proof, that the presence of two variantdirs is also a necessary part of the triggering - however, I've not added the variantdir label onto this issue (yet).

CiderMan commented 1 year ago

OK, I've monkey patched os.unlink and can see where the file is getting deleted:

Retrieved `build2\main.obj' from cache
Unlinking build2\main.obj
  File "C:\Python38\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Python38\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\tmp\repro\.venv\Scripts\scons.exe\__main__.py", line 7, in <module>
    sys.exit(main())
  File "c:\tmp\repro\.venv\lib\site-packages\SCons\Script\Main.py", line 1412, in main
    _exec_main(parser, values)
  File "c:\tmp\repro\.venv\lib\site-packages\SCons\Script\Main.py", line 1374, in _exec_main
    _main(parser)
  File "c:\tmp\repro\.venv\lib\site-packages\SCons\Script\Main.py", line 1154, in _main
    nodes = _build_targets(fs, options, targets, target_top)
  File "c:\tmp\repro\.venv\lib\site-packages\SCons\Script\Main.py", line 1349, in _build_targets
    jobs.run(postfunc = jobs_postfunc)
  File "c:\tmp\repro\.venv\lib\site-packages\SCons\Taskmaster\Job.py", line 123, in run
    self.job.start()
  File "c:\tmp\repro\.venv\lib\site-packages\SCons\Taskmaster\Job.py", line 221, in start
    task.execute()
  File "c:\tmp\repro\.venv\lib\site-packages\SCons\Script\Main.py", line 195, in execute
    SCons.Taskmaster.OutOfDateTask.execute(self)
  File "c:\tmp\repro\.venv\lib\site-packages\SCons\Taskmaster\__init__.py", line 242, in execute
    t.fs.unlink(t.get_internal_path())
  File "c:\tmp\repro\.venv\lib\site-packages\SCons\Node\FS.py", line 1184, in unlink
    return os.unlink(path)
  File "C:\tmp\repro\SConstruct", line 8, in _unlink
    traceback.print_stack()

In other words, it is the t.fs.unlink in the following piece of code:

        try:
            cached_targets = []
            for t in self.targets:
                if not t.retrieve_from_cache():
                    break
                cached_targets.append(t)
            if len(cached_targets) < len(self.targets):
                # Remove targets before building. It's possible that we
                # partially retrieved targets from the cache, leaving
                # them in read-only mode. That might cause the command
                # to fail.
                #
                for t in cached_targets:
                    try:
                        t.fs.unlink(t.get_internal_path())
                    except (IOError, OSError) as e:
                        SCons.Warnings.warn(SCons.Warnings.CacheCleanupErrorWarning,
                            "Failed copying all target files from cache, Error while attempting to remove file %s retrieved from cache: %s" % (t.get_internal_path(), e))
                self.targets[0].build()
            else:
                for t in cached_targets:
                    t.cached = 1
        except SystemExit:

Dumping out the values of cached_targets and self.targets at the time of the erroneous deletion:

cached_targets = ['build2\\main.obj']
self.targets = ['build2\\main.obj', 'build2\\foo.obj']

I'm guessing that this intends to rebuild main.obj but, because of batching and because main.obj is not part of the batch due to it being up-to-date, it doesn't get rebuilt?

I hope that this is of some help...

bdbaddog commented 1 year ago
I'm not sure batching can be compatible with cachedir.. How much speedup do you see from using MSVC_BATCH?
CiderMan commented 1 year ago

Hi @bdbaddog,

Doing a clean build of one of our build products as an example:

We could certainly disable MSVC_BATCH when building with a cache but we would need to analyse the results to be sure that the cache provides more speed up than the removal of batching causes slow down.

Would it be possible to implement the following logic for each batch:

While it is sub-optimal, I suspect that it would give a lot of the performance benefit of each mechanism while simplifying the work that needs to be done...

P.S. For reference, I used the non-MSVC_BATCH build to --cache-populate and then performed a clean build with the cache enabled (so everything was in the cache) and it took 3m18.218s. Note that all the builds are parallel builds on a 8 (logical) core i7-6700.