airbnb / streamalert

StreamAlert is a serverless, real-time data analysis framework which empowers you to ingest, analyze, and alert on data from any environment, using data sources and alerting logic you define.
https://streamalert.io
Apache License 2.0

"Too many open files" in debug mode #587

Closed 0xdabbad00 closed 6 years ago

0xdabbad00 commented 6 years ago

When running in debug mode, the call to df -h /tmp is executed from https://github.com/airbnb/streamalert/blob/cf12209aa071cd4b2f85bfc5b77446bbb696d451/stream_alert/rule_processor/payload.py#L207

This may result in an exception being thrown:

[Errno 24] Too many open files: OSError
Traceback (most recent call last):
File "/var/task/stream_alert/rule_processor/main.py", line 39, in handler
StreamAlert(context).run(event)
File "/var/task/stream_alert/rule_processor/handler.py", line 148, in run
self._process_alerts(payload)
File "/var/task/stream_alert/rule_processor/handler.py", line 356, in _process_alerts
for record in payload.pre_parse():
File "/var/task/stream_alert/rule_processor/payload.py", line 155, in pre_parse
s3_file = self._get_object()
File "/var/task/stream_alert/rule_processor/payload.py", line 245, in _get_object
return self._download_object(region, bucket, key)
File "/var/task/stream_alert/rule_processor/payload.py", line 206, in _download_object
LOGGER.debug(os.popen('df -h /tmp | tail -1').read().strip())
OSError: [Errno 24] Too many open files

This only presents itself when debug output is on, and it may be masking another issue on my end, but perhaps this case should be checked somehow (where /tmp may have too many files in it?). I also don't know why my /tmp would have too many files, or whether I should clear it out somehow. Based on the Lambda limits, the /tmp directory is capped at 512 MB anyway ( https://docs.aws.amazon.com/lambda/latest/dg/limits.html ), and the maximum number of file descriptors is 1024.
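
Since that debug line only reports how full /tmp is, one way to avoid this particular failure is to gather the same information without shelling out. A minimal sketch follows (log_tmp_usage is a hypothetical helper, not part of StreamAlert): os.statvfs reads the filesystem stats directly, so it neither forks a shell nor opens a pipe descriptor the way os.popen('df -h /tmp') does, and resource.getrlimit reports the descriptor limit mentioned above.

import os
import resource

def log_tmp_usage(logger):
    """Log /tmp usage and the process file descriptor limit (hypothetical helper)."""
    stats = os.statvfs('/tmp')
    free_mb = stats.f_frsize * stats.f_bavail / 1024 / 1024
    total_mb = stats.f_frsize * stats.f_blocks / 1024 / 1024
    soft_limit, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    logger.debug('/tmp: %d MB free of %d MB, fd soft limit: %d',
                 free_mb, total_mb, soft_limit)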

0xdabbad00 commented 6 years ago

Tracing this back up in the logs, it looks like everything is running smoothly with the lambda, and then it tries to download an S3 object and errors with this:

[Errno 24] Too many open files: SSLError
Traceback (most recent call last):
File "/var/task/stream_alert/rule_processor/main.py", line 39, in handler
StreamAlert(context).run(event)
File "/var/task/stream_alert/rule_processor/handler.py", line 148, in run
self._process_alerts(payload)
File "/var/task/stream_alert/rule_processor/handler.py", line 356, in _process_alerts
for record in payload.pre_parse():
File "/var/task/stream_alert/rule_processor/payload.py", line 155, in pre_parse
s3_file = self._get_object()
File "/var/task/stream_alert/rule_processor/payload.py", line 245, in _get_object
return self._download_object(region, bucket, key)
File "/var/task/stream_alert/rule_processor/payload.py", line 217, in _download_object
client.download_fileobj(bucket, key, data)
File "/var/runtime/boto3/s3/inject.py", line 570, in download_fileobj
return future.result()
File "/var/runtime/s3transfer/futures.py", line 73, in result
return self._coordinator.result()
File "/var/runtime/s3transfer/futures.py", line 233, in result
raise self._exception
SSLError: [Errno 24] Too many open files

Then it just keeps erroring with the OSError from the initial comment, a few hundred times.

jacknagz commented 6 years ago

Thanks for reporting @0xdabbad00 !!

0xdabbad00 commented 6 years ago

I'm digging into this to find the root cause. I'll update this as I figure things out, but I suspect the issue might be isolated to my environment.

ryandeivert commented 6 years ago

hey @0xdabbad00, I'm fairly certain I know what the issue is. Taking a look at this code:

https://github.com/airbnb/streamalert/blob/e90b574c86f4c046e02d1dd05c208ec54d722fcb/stream_alert/rule_processor/payload.py#L262-L271

Those open calls are not using context managers, which leaves the file handles open indefinitely (for the entire life of the container). If a container is being reused a ton of times, this could result in the error you're seeing. I was planning a fix similar to:

        if extension == '.gz':
            with gzip.open(s3_object, 'r') as s3_file:
                for num, line in enumerate(s3_file, start=1):
                    yield num, line.rstrip()
        else:
            with open(s3_object, 'r') as s3_file:
                for num, line in enumerate(s3_file, start=1):
                    yield num, line.rstrip()

        # AWS Lambda apparently does not reallocate disk space when files are
        # removed using os.remove(), so we must truncate them before removal
        with open(s3_object, 'w'):
            pass

You're welcome to make these changes and put a PR up, since I'm not in a state to do so right now. If you have other thoughts let me know!
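
A quick way to check whether a change like this actually stops the leak, assuming /proc is readable (as it is on the Amazon Linux hosts that back Lambda), is to log the number of open descriptors at the start of each invocation and watch whether it climbs as the same container is reused:

import os

def open_fd_count():
    """Return the number of file descriptors currently open in this process."""
    return len(os.listdir('/proc/self/fd'))

# For example, at the top of the handler:
# LOGGER.debug('open file descriptors: %d', open_fd_count())

A steadily increasing count across warm invocations points at a leak; a flat count suggests the handles are being released.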

0xdabbad00 commented 6 years ago

Unfortunately, this did not resolve the issue. Still digging in.

0xdabbad00 commented 6 years ago

I had also tried to see what files are open by adding the code:

import psutil

# Log the executable path and open files for every python process visible to us
for proc in psutil.process_iter():
    try:
        if "python" in str(proc.exe()):
            LOGGER.info(proc.exe())
            LOGGER.info(proc.open_files())
    except:
        pass

Unfortunately, this only printed [] for the open_files() call, which I think means Lambda may be doing something fishy that inhibits my ability to debug.
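
One possible explanation, offered as an assumption rather than something confirmed in this thread: psutil's open_files() only reports regular files, so descriptors held by sockets or pipes (for example, HTTP connections kept alive by boto3 clients) would not appear in it. A fuller dump might look like the following sketch (dump_fd_usage is a hypothetical debug helper):

import os
import psutil

def dump_fd_usage(logger):
    """Log everything holding a descriptor in the current process."""
    proc = psutil.Process()
    logger.info('total fds: %d', proc.num_fds())
    logger.info('regular files: %s', proc.open_files())
    logger.info('sockets: %s', proc.connections())
    # /proc/self/fd lists every descriptor, including pipes and sockets
    for fd in os.listdir('/proc/self/fd'):
        try:
            logger.info('fd %s -> %s', fd, os.readlink('/proc/self/fd/' + fd))
        except OSError:
            pass  # descriptor closed between listdir and readlink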

0xdabbad00 commented 6 years ago

We disabled our firehose configuration and are no longer experiencing these errors. I believe the errors began around the time I enabled the firehose, which is what led me to try disabling it. Therefore, I believe something in the firehose configuration is holding resource handles open. We'll continue investigating.

0xdabbad00 commented 6 years ago

We were behind on some commits and it looks like this resolved our problem: https://github.com/airbnb/streamalert/blob/347c64b2e77df22b44f6caae334f9748e3060cea/stream_alert/rule_processor/firehose.py#L292

ryandeivert commented 6 years ago

we've seen this internally and can confirm this is not fixed