boto / botocore

The low-level, core functionality of boto3 and the AWS CLI.
Apache License 2.0
1.45k stars 1.06k forks source link

Race condition in botocore's use of JSONFileCache for credential loading #3213

Open mikelambert opened 4 days ago

mikelambert commented 4 days ago

Describe the bug

I am getting the a rare crash when using botocore to load files.

It's a JSONDecodeError: Expecting value: line 1 column 1 (char 0) stemming from JSONFileCache.__getitem__, which looks to imply the json file is empty (didn't find any value at char 0). Someone helped point out it might be a race condition in the JSONFileCache.__set__ , which appears to do:

            f.truncate()
            f.write(file_content)

We have multiple concurrent processes starting on a box that each are using botocore, so maybe this is just a race condition if one of them happens to look at the file post-truncate-pre-write? Not sure if a flock, or write-then-rename, or something else ends up a proper solution here?

Expected Behavior

It shouldn't crash

Current Behavior

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
  File "botocore/utils.py", line 3518, in __getitem__
    return json.load(f)
  File "__init__.py", line 293, in load
    return loads(fp.read(),
  File "__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None

  File "botocore/client.py", line 553, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "botocore/client.py", line 989, in _make_api_call
    http, parsed_response = self._make_request(
  File "botocore/client.py", line 1015, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "botocore/endpoint.py", line 119, in make_request
    return self._send_request(request_dict, operation_model)
  File "botocore/endpoint.py", line 198, in _send_request
    request = self.create_request(request_dict, operation_model)
  File "botocore/endpoint.py", line 134, in create_request
    self._event_emitter.emit(
  File "botocore/hooks.py", line 412, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "botocore/hooks.py", line 256, in emit
    return self._emit(event_name, kwargs)
  File "botocore/hooks.py", line 239, in _emit
    response = handler(**kwargs)
  File "botocore/signers.py", line 105, in handler
    return self.sign(operation_name, request)
  File "botocore/signers.py", line 186, in sign
    auth = self.get_auth_instance(**kwargs)
  File "botocore/signers.py", line 301, in get_auth_instance
    frozen_credentials = credentials.get_frozen_credentials()
  File "botocore/credentials.py", line 634, in get_frozen_credentials
    self._refresh()
  File "botocore/credentials.py", line 522, in _refresh
    self._protected_refresh(is_mandatory=is_mandatory_refresh)
  File "botocore/credentials.py", line 538, in _protected_refresh
    metadata = self._refresh_using()
  File "botocore/credentials.py", line 685, in fetch_credentials
    return self._get_cached_credentials()
  File "botocore/credentials.py", line 693, in _get_cached_credentials
    response = self._load_from_cache()
  File "botocore/credentials.py", line 711, in _load_from_cache
    creds = deepcopy(self._cache[self._cache_key])
  File "botocore/utils.py", line 3520, in __getitem__
    raise KeyError(cache_key)

Reproduction Steps

Hard to repro, and haven't tried myself. I assume thousands of processes recreating the botocore cached credentials file would do it.

Possible Solution

Perhaps a flock or a write-to-temp-file-then-rename-to-destination-file-address, instead of truncate-then-write?

Additional Information/Context

No response

SDK version used

1.34.42

Environment details (OS name and version, etc.)

AWS instance running x86_64 GNU/Linux

Laurent-Dx commented 3 days ago

The following does reproduce the issue instantly:

from threading import Thread
from botocore.utils import JSONFileCache

cache = JSONFileCache()

def f():
    for i in range(100000):
        cache["key"] = 10
        cache["key"]

threads = []
for i in range(2):
    thread = Thread(target=f, name=f"Thread-{i}")
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()

Adding a lock in each of __getitem__, __delitem__, __setitem__ as you suggested does appear to solve the issue.

tim-finnigan commented 2 days ago

Thanks for reaching out. The error you referenced was also reported here: https://github.com/boto/botocore/issues/3106. I was advised that an error message like this could help clarify the behavior here: https://github.com/boto/botocore/pull/3183/files. Does deleting the cache file fix this for you?

mikelambert commented 2 days ago

Yeah wrapping this in a retry in our app code does work (and is what I did in parallel to filing this bug). Apologies for not realizing it was previously reported. And thank you Laurent for the multi-threaded repro! (Our problematic setup is multi-process, which is why I didn't propose in-process locks)