SCons / scons

SCons - a software construction tool
http://scons.org

possible race condition in msvc tools cache #4268

Open mwichmann opened 1 year ago

mwichmann commented 1 year ago

This is not a well investigated issue, just dropping something here to get it "written down".

A couple of builds on Windows have failed with what looks like corruption in the msvc cache file. This isn't confirmed to be consistently reproducible, but it has been seen at least twice now, with an error that looks like the one below. (This one is a real capture; the other one I saw wasn't saved - I just retried and it worked. If it was a race, that would make sense: unlike the CI setup, where the cache is rebuilt for each new image spun up, mine stays around.)

[00:17:38] 605/1266 (47.79%) C:\Python36\python.exe test\MSVC\MSVC_SPECTRE_LIBS.py
[00:17:38] C:\projects\scons\scripts\scons.py returned 2
[00:17:38] STDOUT =========================================================================
[00:17:38]
[00:17:38] STDERR =========================================================================
[00:17:38] JSONDecodeError: Expecting ',' delimiter: line 382 column 125 (char 24284):
[00:17:38]   File "C:\Users\appveyor\AppData\Local\Temp\1\testcmd.4240.tmk0iejo\SConstruct", line 3:
[00:17:38]     env = Environment(MSVC_VERSION='11.0', MSVC_SPECTRE_LIBS=None, tools=['msvc'])
[00:17:38]   File "C:\projects\scons\SCons\Environment.py", line 1030:
[00:17:38]     apply_tools(self, tools, toolpath)
[00:17:38]   File "C:\projects\scons\SCons\Environment.py", line 116:
[00:17:38]     _ = env.Tool(tool)
[00:17:38]   File "C:\projects\scons\SCons\Environment.py", line 1911:
[00:17:38]     tool(self)
[00:17:38]   File "C:\projects\scons\SCons\Tool\__init__.py", line 265:
[00:17:38]     self.generate(env, *args, **kw)
[00:17:38]   File "C:\projects\scons\SCons\Tool\msvc.py", line 298:
[00:17:38]     msvc_setup_env_once(env, tool=tool_name)
[00:17:38]   File "C:\projects\scons\SCons\Tool\MSCommon\vc.py", line 1110:
[00:17:38]     msvc_setup_env(env)
[00:17:38]   File "C:\projects\scons\SCons\Tool\MSCommon\vc.py", line 1259:
[00:17:38]     d = msvc_find_valid_batch_script(env,version) 
[00:17:38]   File "C:\projects\scons\SCons\Tool\MSCommon\vc.py", line 1162:
[00:17:38]     d = script_env(env, vc_script, args=arg)
[00:17:38]   File "C:\projects\scons\SCons\Tool\MSCommon\vc.py", line 1009:
[00:17:38]     script_env_cache = common.read_script_env_cache()
[00:17:38]   File "C:\projects\scons\SCons\Tool\MSCommon\common.py", line 125:
[00:17:38]     envcache_list = json.load(f)
[00:17:38]   File "C:\Python36\lib\json\__init__.py", line 299:
[00:17:38]     parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
[00:17:38]   File "C:\Python36\lib\json\__init__.py", line 354:
[00:17:38]     return _default_decoder.decode(s)
[00:17:38]   File "C:\Python36\lib\json\decoder.py", line 339:
[00:17:38]     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
[00:17:38]   File "C:\Python36\lib\json\decoder.py", line 355:
[00:17:38]     obj, end = self.scan_once(s, idx)

If it was a race, we could see one test creating the cache and another test coming along, seeing that the cache exists, and proceeding to read it before it's all written. But I think many tests would already have run that would have caused the cache to be created, so maybe it's a different kind of race.

Perhaps the code in Tool/MSCommon/common.py should be catching JSONDecodeError and doing something better with it than dying?
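
A minimal sketch of that fallback (a hypothetical reshaping of read_script_env_cache; the 'key'/'data' entry layout matches the write snippet quoted later in this thread):

    import json

    def read_script_env_cache(cache_path) -> dict:
        """Sketch: treat a missing or corrupt cache file as an empty cache."""
        envcache = {}
        try:
            with open(cache_path) as f:
                envcache_list = json.load(f)
        except FileNotFoundError:
            return envcache
        except json.JSONDecodeError:
            # A partially written file from a racing process lands here;
            # the cache only exists to save time, so ignore it and let it
            # be regenerated rather than aborting the build.
            return envcache
        for d in envcache_list:
            envcache[tuple(d['key'])] = d['data']
        return envcache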

mwichmann commented 1 year ago

@jcbrill is this anything you've ever seen?

jcbrill commented 1 year ago

@mwichmann I have seen it before myself.

I suspect it can be corrupted when running multiple processes, since the cache is read once but written every time a new entry is added, possibly exacerbated by buffered I/O. It is likely worse if the default cache location is the user folder rather than per-project.

Project-local cache files might be worth testing. Periodically, I wonder what it would take to use Windows API synchronization primitives to synchronize access.

It is probably better just to clear the cache and carry on rather than exit as the real purpose is just to save some time.

jcbrill commented 1 year ago

Perhaps multiple writes stepping on each other which corrupts the file for the next read?

mwichmann commented 1 year ago

One could effectively "lock" by writing to a tempfile and then moving it to the real name... though on Windows that could trigger the infamous problem of failing because the file is busy, I suppose.
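
A minimal sketch of that tempfile-then-rename idea (hypothetical helper, not current SCons code; os.replace() is atomic on both POSIX and Windows, though on Windows it can still fail transiently if the destination is held open):

    import json
    import os
    import tempfile

    def write_cache_atomic(cache_path: str, envcache_list: list) -> None:
        # Write to a temp file in the same directory (so the rename stays
        # on one filesystem), then atomically move it over the real name;
        # readers see either the old file or the new one, never a torn write.
        fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(cache_path) or '.')
        try:
            with os.fdopen(fd, 'w') as f:
                json.dump(envcache_list, f, indent=2)
            os.replace(tmp_path, cache_path)
        except OSError:
            os.remove(tmp_path)  # don't leave the temp file behind
            raise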

Maybe just ignoring the cache if we get an error is enough, as you say - that policy already exists somewhere in that code.

jcbrill commented 1 year ago

Another problem with writes from multiple processes is information loss: the cache is only read once but written multiple times. A write could cause the cache file to "lose" entries written by another process (i.e., the file cache is not "merged" when updated/written; it is written only from the memory of the current process).

mwichmann commented 1 year ago

Sigh. This thing started its life as a quick hack because running the full set of tests was too slow (we fire off an scons "run" something like 3,000 times; that's a lot of times to pay the tax of detecting the MSVC bits - in many cases x2, because of the default environment as well). Most normal uses are to start a build... once. It feels like the Hydra is growing ever more heads here...

jcbrill commented 1 year ago

It is still a good idea for running the test suite.

One possibility is to generate the cache file based on all versions/combinations of visual studio command-line invocations outside of SCons and then change the cache to be "read-only" in SCons (i.e., don't write the cache file out even if entries are not in the cache).

This would only require generating a new cache file whenever any visual studio installations are updated. Effectively, pre-populate the cache file outside of SCons.

mwichmann commented 1 year ago

Encountered another case of this. I fiddled with my installs - I was running a bit low on disk space and wanted to reduce the number of SDKs installed, and ended up removing one that had a path stored for finding headers (this would be a Windows Kits path) for the default compiler, so all builds failed due to missing headers or libs. Not sure it's worth adding yet another check for that.

After removing the cachefile, a full test run worked, but one test hit a similar issue. The details are probably unimportant - it was a -j 12 build, so lots of chances for any concurrency problems that could exist. I'll paste the failure here anyway:

610/1266 (48.18%) C:\Users\mats\AppData\Local\Programs\Python\Python311\python.exe test\MSVC\MSVC_UWP_APP.py
C:\Users\mats\Documents\github\scons\scripts\scons.py returned 2
STDOUT =========================================================================

STDERR ========================================================================= 
JSONDecodeError: Expecting value: line 1 column 1 (char 0):
  File "C:\Users\mats\AppData\Local\Temp\testcmd.16320.7o_sotpq\SConstruct", line 3:
    env = Environment(MSVC_VERSION='14.2', MSVC_UWP_APP=True, tools=['msvc'])
  File "C:\Users\mats\Documents\github\scons\SCons\Environment.py", line 1030:
    apply_tools(self, tools, toolpath)
  File "C:\Users\mats\Documents\github\scons\SCons\Environment.py", line 116:
    _ = env.Tool(tool)
  File "C:\Users\mats\Documents\github\scons\SCons\Environment.py", line 1911:
    tool(self)
  File "C:\Users\mats\Documents\github\scons\SCons\Tool\__init__.py", line 265:
    self.generate(env, *args, **kw)
  File "C:\Users\mats\Documents\github\scons\SCons\Tool\msvc.py", line 298:
    msvc_setup_env_once(env, tool=tool_name)
  File "C:\Users\mats\Documents\github\scons\SCons\Tool\MSCommon\vc.py", line 1110:
    msvc_setup_env(env)
  File "C:\Users\mats\Documents\github\scons\SCons\Tool\MSCommon\vc.py", line 1259:
    d = msvc_find_valid_batch_script(env,version)
  File "C:\Users\mats\Documents\github\scons\SCons\Tool\MSCommon\vc.py", line 1162:
    d = script_env(env, vc_script, args=arg)
  File "C:\Users\mats\Documents\github\scons\SCons\Tool\MSCommon\vc.py", line 1009:
    script_env_cache = common.read_script_env_cache()
  File "C:\Users\mats\Documents\github\scons\SCons\Tool\MSCommon\common.py", line 125:
    envcache_list = json.load(f)
  File "C:\Users\mats\AppData\Local\Programs\Python\Python311\Lib\json\__init__.py", line 293:
    return loads(fp.read(),
  File "C:\Users\mats\AppData\Local\Programs\Python\Python311\Lib\json\__init__.py", line 346:
    return _default_decoder.decode(s)
  File "C:\Users\mats\AppData\Local\Programs\Python\Python311\Lib\json\decoder.py", line 337:
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\mats\AppData\Local\Programs\Python\Python311\Lib\json\decoder.py", line 355:
    raise JSONDecodeError("Expecting value", s, err.value) from None

FAILED test of C:\Users\mats\Documents\github\scons\scripts\scons.py
        at line 673 of C:\Users\mats\Documents\github\scons\testing\framework\TestCommon.py (_complete)
        from line 775 of C:\Users\mats\Documents\github\scons\testing\framework\TestCommon.py (run)
        from line 458 of C:\Users\mats\Documents\github\scons\testing\framework\TestSCons.py (run)
        from line 109 of C:\Users\mats\Documents\github\scons\test\MSVC\MSVC_UWP_APP.py

Test execution time: 58.3 seconds

jcbrill commented 1 year ago

For your amusement (or horror).

There are two issues with the msvc cache: exclusive access and information loss.

The recent file locking changes target exclusive access.

The information loss problem is exacerbated by the cache discipline: read-once, write-many.

In the abstract, consider the following cache behavior of multiple processes with varying execution durations, simulating two simultaneous jobs:

  R-------W--------W--------W  [Process A]
  R--W                         [Process B]
       R----W                  [Process C]
              R--W             [Process D]
                     R---W     [Process E]

Assuming all processes write unique msvc entries, the cache contains only the 3 entries from process A (i.e., the cached information from processes B, C, D, and E has been overwritten by the last process A write).

Instead of exclusive-read and exclusive-write, the discipline should be exclusive-update (a read immediately followed by write, if necessary).

The same example:

  U-------U--------U--------U  [Process A]
  U--U                         [Process B]
       U----U                  [Process C]
              U--U             [Process D]
                     U---U     [Process E]

Assuming all processes write unique msvc entries, the cache contains 7 entries as expected.
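
A minimal sketch of that exclusive-update discipline, using the SCons.Util.FileLock helper quoted later in this thread and the 'key'/'data' entry layout from common.py (the helper function itself is hypothetical):

    import json

    import SCons.Util

    def update_cache(cache_file: str, cache: dict, key, data) -> None:
        # Hold the writer lock across both the re-read and the write, so
        # entries added by other processes since our first read survive.
        with SCons.Util.FileLock(cache_file, timeout=5, writer=True):
            try:
                with open(cache_file) as f:
                    for d in json.load(f):
                        cache.setdefault(tuple(d['key']), d['data'])
            except (FileNotFoundError, json.JSONDecodeError):
                pass  # missing or torn file: just rewrite it below
            cache[key] = data
            envcache_list = [
                {'key': list(k), 'data': v} for k, v in cache.items()
            ]
            with open(cache_file, 'w') as f:
                json.dump(envcache_list, f, indent=2)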

The json format as currently employed has a number of limitations that can be overcome with a very simple custom cache reader and a record-oriented json data file format with a json header.

For example:

{ ... }  <- header record
#        <- single character record delimiter
{ ... }  <- cache entry
#        <- single character record delimiter
{ ... }  <- cache entry
#        <- single character record delimiter

A json structure can be loaded every time a # is seen.

The header can contain file format version information and the number of records (though the count needs to be back-patched), and cache entries can contain a variable listing paths to verify upon load:

{"cache_version": 1, "cache_count": 2}
#
{ ... "VERIFY_PATHS": [...] ...}
#
{ ... "VERIFY_PATHS": [...] ...}
#
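
A minimal sketch of a reader for that record-oriented format (function name hypothetical); a torn trailing record never sees its closing '#', so it is simply ignored instead of poisoning the whole load:

    import json

    def load_record_cache(cache_path: str):
        header, entries, buf = None, [], []
        with open(cache_path) as f:
            for line in f:
                if line.rstrip('\n') == '#':  # single-character record delimiter
                    record = json.loads(''.join(buf))
                    if header is None:
                        header = record  # first record is the header
                    else:
                        entries.append(record)
                    buf = []
                else:
                    buf.append(line)
        return header, entries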

Notes:

My own "stress-tests" for MSVC code changes involve excruciatingly long run times in some cases, which is a vastly different scenario from the scons tests.

Due to the combinatorial explosion of msvc installations and possible argument combinations, a test run this morning created 1744 environments with a cache count of 1697 entries, and took more than 50 minutes to build the cache and compile a simple "hello, world" program for all 1744 environments. That scenario used only the major sdk versions (e.g., for windows 11, 10, 8.1), not all sdk versions installed (although it did have at least one of every msvc version going back to 6.0).

mwichmann commented 1 year ago

Well, in my defense, I did mention the information loss issue (much more briefly) in the locking PR... just don't know any way to deal with it. We have the same potential issue (which occasionally does bite people who try concurrent use of the same build tree) with the sconsign signature "database": we read everything on startup, the file then remains in that state while the in-memory copy changes, and it's written out at the end. So you can have two divergent in-memory copies; somebody's write is going to win and somebody's is going to lose. For sconsign of a very large build tree there's also a performance hit from "read everything even if you don't need it" and "write everything even if (almost?) nothing changed". I've suggested that one should become a real database someday, but I've not pushed the change to use pysqlite for that, although it worked in limited trials when I prototyped it. I think that may be overkill for the less formal msvc cache.

jcbrill commented 1 year ago

Well, in my defense, I did mention the information loss issue (much more briefly) in the locking PR... just don't know any way to deal with it.

I was not being critical. I apologize.

The solution is to read the cache file before every write while the lock is held. A cache file format change makes this easier.

Code blob with load, dump, update, version check, and path verification protected by file locking: https://github.com/jcbrill/mswindev/blob/0abd9e276e420212cf8ec906a4634aa1040f5a7f/mswindev/vsdetect.py#L12794-L13395

_cache_update_file (line 13314) is the implementation of the exclusive read and write.

Ignoring logging and timing code, it is pretty straightforward: load the file and dump the file (subject to internal bookkeeping).

Tested during development with 4 processes running simultaneously with the same stress-test sconstruct (i.e., running in lock-step).

jcbrill commented 1 year ago

... for large cache files, there is a performance hit in reading the file, converting the json to objects, and then discarding the records that are already in the cache. Separating the key from the full entry would allow discarding the full entry without conversion if the key was already in the process memory cache, at the cost of still reading the file.

For msvc, the cost tradeoff is file processing versus running one more msvc batch file, which can be considered really expensive (or there would be no need for a cache in the first place).
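
A sketch of that key/entry separation (format and function name hypothetical): each record is stored as a cheap-to-parse key line followed by the full entry line, so a known key lets the reader skip the expensive conversion of the entry:

    import json

    def load_new_entries(cache_path: str, known_keys: set) -> dict:
        new_entries = {}
        with open(cache_path) as f:
            while True:
                key_line = f.readline()
                if not key_line:
                    break
                entry_line = f.readline()  # full entry always follows its key
                key = tuple(json.loads(key_line))
                if key in known_keys:
                    continue  # already in process memory: skip the conversion
                new_entries[key] = json.loads(entry_line)
        return new_entries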

bdbaddog commented 1 year ago

... for large cache files, there is a performance hit in reading the file, converting the json to objects, and then discarding the records that are already in the cache. Separating the key from the full entry would allow discarding the full entry without conversion if the key was already in the process memory cache, at the cost of still reading the file.

For msvc, the cost tradeoff is file processing versus running one more msvc batch file, which can be considered really expensive (or there would be no need for a cache in the first place).

Can you upload your largest cache json file? So we can do a little perf testing with it?

jcbrill commented 1 year ago

Can you upload your largest cache json file? So we can do a little perf testing with it?

I would be happy to generate a large cache file.

To be clear, this does not represent a real-world situation nor is it representative of the scons test suite.

I use my own research tool to generate all possible combinations of arguments: MSVC_VERSION, HOST_ARCH, TARGET_ARCH, MSVC_TOOLSET_VERSION, MSVC_SPECTRE_LIBS, MSVC_UWP_APP, and MSVC_SDK_VERSION to test scons detection changes, argument validation, script errors, and that all expected combinations actually compile a simple hello world application.

Note: HOST_ARCH is employed to make sure every accessible batch file is called, even though it won't be used directly in SCons environment construction, as an end-user should not be changing HOST_ARCH. A new variable (MSVC_TOOL_ARCH) would not be the worst thing in the world. This could be useful if a compiler generation bug is suspected.

I started a run using ALL combinations, which is going to result in an astronomical number of environments due to the number of SDK versions I have installed on my development box and the number of msvc toolsets for each modern msvs installation.

At present, the test sconstruct is still in 2019 (14.2) and has generated 2783 environments (fewer cache entries); the cache file on disk is 10.1MB and growing.

By the time this finishes, the cache file will likely be enormous. I'm running against a personal branch containing msvc "fixes" and enhancements.

Is this what you really want?

bdbaddog commented 1 year ago

@jcbrill - yes. Your huge json would represent the unrealistically large cache file. So if we want to see how fast/slow read/update/write is vs read...... write... it'd be a reasonable example.

jcbrill commented 1 year ago

Ok will do.

FWIW, in the case of a single process (i.e., no contention), read/update/write is going to be significantly faster, as there really is no reason to load the entire file (the file modification timestamp and record count in the header have not changed since the last write), and the one new record each write cycle can be appended to the file rather than dumping the entire data structure. There is a non-trivial bookkeeping cost, but it should be far less than dumping the entire data structure to disk for each new entry. Reads are cheap (no need to read the whole file) and writes are far less expensive.
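
A sketch of that fast path under the record format proposed earlier (illustrative, not existing SCons code; back-patching the header's record count is omitted):

    import json
    import os

    def try_append_record(cache_path: str, entry: dict, last_mtime: float) -> bool:
        # Caller is assumed to hold the writer lock around the check-then-append.
        if os.path.getmtime(cache_path) != last_mtime:
            return False  # another process wrote: fall back to read/merge/write
        with open(cache_path, 'a') as f:
            f.write(json.dumps(entry) + '\n#\n')  # one record plus '#' delimiter
        return True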

jcbrill commented 1 year ago

@mwichmann There may be a benefit to explicitly calling flush on the cache file handle, and possibly calling os.fsync as well, after the json.dump call and prior to the context manager taking over.

Annotated snippet:

        with SCons.Util.FileLock(CONFIG_CACHE, timeout=5, writer=True), p.open('w') as f:
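            # Context assumed from the enclosing module: CONFIG_CACHE is the
            # cache file path, p is a pathlib.Path for it, and os, json, and
            # SCons.Util are already imported.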
            # Convert the cache dictionary to a list of cache entry
            # dictionaries. The cache key is converted from a tuple to
            # a list for compatibility with json.
            envcache_list = [
                {'key': list(key), 'data': data} for key, data in cache.items()
            ]
            json.dump(envcache_list, f, indent=2)
            f.flush()                   # <- explicit flush
            os.fsync(f.fileno())        # <- explicit sync

Not sure about os.sync. In my own work, the os.fsync() call is based on a configuration value and is not currently enabled. I don't remember why; either it didn't seem to make a difference, or it was expensive, or both.

mwichmann commented 1 year ago

sigh.

this stuff has its history in 40-year-old (and older) UNIX systems...

you don't need the flush, which has to do with getting the data out of Python and into the OS... https://docs.python.org/3/library/io.html#io.IOBase.close

fsync might be a different story, that has to do with getting the data out of OS buffers and onto a disk.

jcbrill commented 1 year ago

If using fsync, it is recommended to flush buffered files first: https://docs.python.org/3/library/os.html#os.fsync

mwichmann commented 1 year ago

Well, the point was that closing the file assures the flush(), which the context manager around the open/json.dump sequence should guarantee. But hey... who ever knows.

jcbrill commented 1 year ago

Cache file (22.7mb json, 477kb zip): scons_msvc_cache.zip

Log output:

Running scons (16:24:48.36) ...
...
Total: builds: 6644, cache: 6477
...
Scons finished (20:46:40.12)