Graylog2 / graylog-s3-lambda

An AWS Lambda function that reads logs from S3 and sends them to Graylog
GNU General Public License v3.0

Current release causes excessive timeouts #38

Open ghost opened 2 years ago

ghost commented 2 years ago

I had a version of graylog-s3-lambda that was created via a Cloudflare ticket and updated by Graylog engineers. This version was based on version 1.0.1.

graylog-s3-lambda-060821netty4.1.65-1.0.1.jar.zip

After updating from the attached jar file, there are significant timeouts when retrieving files. The logs do appear to be sent, but because of issues with flushing, the process times out and is retried twice before shutting down. This effectively triples the cost of the Lambda function.

Any thoughts on what might be the issue? I've attached DEBUG level log files for a timeout process within AWS. debug.log

beingtmk commented 2 years ago

anything on this?

ghost commented 1 year ago

This issue is still present. There are also Exception in thread "main" java.lang.Error: java.lang.OutOfMemoryError: Metaspace errors being thrown when the memory allocated is less than 300 MB.

danotorrey commented 1 year ago

@joberthar Thanks for the report and details.

because of issues with flushing, the process times out and repeats twice before shutting down. This effectively triples the cost of the lambda function

Can you please tell us a bit more about what seems to be tripling the cost of the Lambda function? Did you observe any specific issues with flushing?

ghost commented 1 year ago

The issues seem to be around memory allocation. When I use the graylog-s3-lambda-060821netty4.1.65-1.0.1.jar file, I don't have any issues when I set memory to 271 MB. When I update to 1.0.5, I must increase memory to at least 384 MB to prevent it from producing the Exception in thread "main" java.lang.Error: java.lang.OutOfMemoryError: Metaspace error.

This is what I am currently using, which is a package that Cloudflare and the Graylog team built to support Firewall Matches details from Cloudflare logs. graylog-s3-lambda-060821netty4.1.65-1.0.1.jar.zip

This is the log that shows the timeout errors. The most notable item is the ERROR GelfSenderThread:175 - Failed to flush messages in [100] attempts. Shutting down anyway. entry. The function also times out after 90 seconds, which is already far too long.

START RequestId: 0f9e0e7d-35bd-4f2a-b8d4-3c2080e0c1ec Version: $LATEST
2022-09-22 16:14:14 0f9e0e7d-35bd-4f2a-b8d4-3c2080e0c1ec INFO  GraylogS3Function:65 - Log level is now set to [ERROR].
io.netty.channel.socket.nio.NioSocketChannel
org.graylog2.gelfclient.encoder.GelfMessageJsonEncoder
io.netty.channel.MultithreadEventLoopGroup
io.netty.util.internal.InternalThreadLocalMap
io.netty.util.internal.PlatformDependent0
io.netty.channel.AbstractChannel
io.netty.util.concurrent.AbstractEventExecutor
io.netty.util.concurrent.GlobalEventExecutor
org.graylog2.gelfclient.transport.AbstractGelfTransport
io.netty.util.internal.PlatformDependent
org.graylog2.gelfclient.transport.GelfSenderThread
com.github.joschi.jadconfig.JadConfig
io.netty.channel.AbstractChannelHandlerContext
io.netty.channel.nio.AbstractNioChannel
org.graylog.integrations.s3.CompressionType
io.netty.channel.DefaultChannelId
io.netty.channel.ChannelInitializer
io.netty.util.concurrent.SingleThreadEventExecutor
org.graylog.integrations.s3.codec.CloudflareLogCodec
io.netty.util.NetUtilInitializations
org.graylog2.gelfclient.transport.GelfTcpTransport
io.netty.util.NetUtil
io.netty.util.concurrent.DefaultPromise.rejectedExecution
io.netty.util.internal.CleanerJava6
io.netty.util.ResourceLeakDetector
io.netty.buffer.PooledByteBufAllocator
io.netty.util.internal.MacAddressUtil
io.netty.util.internal.SystemPropertyUtil
io.netty.channel.ChannelOutboundBuffer
io.netty.util.concurrent.DefaultPromise
io.netty.bootstrap.Bootstrap
io.netty.resolver.AddressResolverGroup
io.netty.buffer.ByteBufUtil
io.netty.buffer.PoolThreadCache
org.graylog.integrations.s3.S3EventProcessor
io.netty.channel.nio.NioEventLoop
org.graylog.integrations.s3.S3ScannerFactory
org.graylog.integrations.s3.GraylogS3Function
org.graylog.integrations.s3.ProtocolType
org.graylog.integrations.s3.ContentType
io.netty.util.internal.logging.InternalLoggerFactory
io.netty.channel.DefaultChannelPipeline
io.netty.channel.ChannelHandlerMask
END RequestId: 0f9e0e7d-35bd-4f2a-b8d4-3c2080e0c1ec
REPORT RequestId: 0f9e0e7d-35bd-4f2a-b8d4-3c2080e0c1ec  Duration: 11941.22 ms   Billed Duration: 11942 ms   Memory Size: 384 MB Max Memory Used: 192 MB Init Duration: 4793.01 ms   
START RequestId: abb0f30a-4e9a-4d4d-96da-8bd232b19d01 Version: $LATEST
io.netty.channel.socket.nio.NioSocketChannel
org.graylog2.gelfclient.encoder.GelfMessageJsonEncoder
io.netty.channel.MultithreadEventLoopGroup
io.netty.util.internal.InternalThreadLocalMap
io.netty.util.internal.PlatformDependent0
io.netty.channel.AbstractChannel
io.netty.util.concurrent.AbstractEventExecutor
io.netty.util.concurrent.GlobalEventExecutor
org.graylog2.gelfclient.transport.AbstractGelfTransport
io.netty.util.internal.PlatformDependent
org.graylog2.gelfclient.transport.GelfSenderThread
com.github.joschi.jadconfig.JadConfig
io.netty.channel.AbstractChannelHandlerContext
io.netty.channel.nio.AbstractNioChannel
org.graylog.integrations.s3.CompressionType
io.netty.channel.DefaultChannelId
io.netty.buffer.AbstractByteBuf
io.netty.channel.ChannelInitializer
io.netty.util.concurrent.SingleThreadEventExecutor
org.graylog.integrations.s3.codec.CloudflareLogCodec
io.netty.util.NetUtilInitializations
org.graylog2.gelfclient.transport.GelfTcpTransport
io.netty.util.NetUtil
io.netty.util.concurrent.DefaultPromise.rejectedExecution
io.netty.util.internal.CleanerJava6
io.netty.util.ResourceLeakDetector
io.netty.buffer.PooledByteBufAllocator
io.netty.util.internal.MacAddressUtil
io.netty.util.internal.SystemPropertyUtil
io.netty.util.Recycler
io.netty.channel.ChannelOutboundBuffer
io.netty.util.concurrent.DefaultPromise
io.netty.bootstrap.Bootstrap
io.netty.resolver.AddressResolverGroup
io.netty.buffer.ByteBufUtil
io.netty.buffer.PoolThreadCache
org.graylog.integrations.s3.S3EventProcessor
io.netty.util.ReferenceCountUtil
io.netty.channel.nio.NioEventLoop
org.graylog.integrations.s3.S3ScannerFactory
io.netty.util.ResourceLeakDetectorFactory
org.graylog.integrations.s3.GraylogS3Function
org.graylog.integrations.s3.ProtocolType
org.graylog.integrations.s3.ContentType
io.netty.util.internal.logging.InternalLoggerFactory
io.netty.channel.DefaultChannelPipeline
io.netty.channel.ChannelHandlerMask
2022-09-22 16:16:40 abb0f30a-4e9a-4d4d-96da-8bd232b19d01 ERROR GelfSenderThread:175 - Failed to flush messages in [100] attempts. Shutting down anyway.
END RequestId: abb0f30a-4e9a-4d4d-96da-8bd232b19d01
REPORT RequestId: abb0f30a-4e9a-4d4d-96da-8bd232b19d01  Duration: 32072.92 ms   Billed Duration: 32073 ms   Memory Size: 384 MB Max Memory Used: 194 MB 
START RequestId: 5ff803dd-ba44-4d83-881e-1a63b1719b9d Version: $LATEST
io.netty.channel.socket.nio.NioSocketChannel
org.graylog2.gelfclient.encoder.GelfMessageJsonEncoder
io.netty.channel.MultithreadEventLoopGroup
io.netty.util.internal.InternalThreadLocalMap
io.netty.util.internal.PlatformDependent0
io.netty.channel.AbstractChannel
io.netty.util.concurrent.AbstractEventExecutor
io.netty.util.concurrent.GlobalEventExecutor
org.graylog2.gelfclient.transport.AbstractGelfTransport
io.netty.util.internal.PlatformDependent
org.graylog2.gelfclient.transport.GelfSenderThread
com.github.joschi.jadconfig.JadConfig
io.netty.channel.AbstractChannelHandlerContext
io.netty.channel.nio.AbstractNioChannel
org.graylog.integrations.s3.CompressionType
io.netty.channel.DefaultChannelId
io.netty.buffer.AbstractByteBuf
io.netty.channel.ChannelInitializer
io.netty.util.concurrent.SingleThreadEventExecutor
org.graylog.integrations.s3.codec.CloudflareLogCodec
io.netty.util.NetUtilInitializations
org.graylog2.gelfclient.transport.GelfTcpTransport
io.netty.util.NetUtil
io.netty.util.concurrent.DefaultPromise.rejectedExecution
io.netty.util.internal.CleanerJava6
io.netty.util.ResourceLeakDetector
io.netty.buffer.PooledByteBufAllocator
io.netty.util.internal.MacAddressUtil
io.netty.util.internal.SystemPropertyUtil
io.netty.util.Recycler
io.netty.channel.ChannelOutboundBuffer
io.netty.util.concurrent.DefaultPromise
io.netty.bootstrap.Bootstrap
io.netty.resolver.AddressResolverGroup
io.netty.buffer.ByteBufUtil
io.netty.buffer.PoolThreadCache
org.graylog.integrations.s3.S3EventProcessor
io.netty.util.ReferenceCountUtil
io.netty.channel.nio.NioEventLoop
org.graylog.integrations.s3.S3ScannerFactory
io.netty.util.ResourceLeakDetectorFactory
org.graylog.integrations.s3.GraylogS3Function
org.graylog.integrations.s3.ProtocolType
org.graylog.integrations.s3.ContentType
io.netty.util.internal.logging.InternalLoggerFactory
io.netty.channel.DefaultChannelPipeline
io.netty.channel.ChannelHandlerMask
END RequestId: 5ff803dd-ba44-4d83-881e-1a63b1719b9d
REPORT RequestId: 5ff803dd-ba44-4d83-881e-1a63b1719b9d  Duration: 90018.11 ms   Billed Duration: 90000 ms   Memory Size: 384 MB Max Memory Used: 194 MB 
2022-09-22T16:18:10.843Z 5ff803dd-ba44-4d83-881e-1a63b1719b9d Task timed out after 90.02 seconds

Hope this helps.

danotorrey commented 1 year ago

Thanks @joberthar, that info helps a lot. The number of retries and the wait time between retries are configurable via environment variables. These can be set depending on how long you would like the function to keep trying before giving up. Perhaps the number of retries can be reduced in your case.

SHUTDOWN_FLUSH_TIMEOUT_MS (default 100)
SHUTDOWN_FLUSH_RETRIES (default 600)

The connect timeout is also configurable via CONNECT_TIMEOUT (default 10000).
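
For reference, the two shutdown settings combine multiplicatively: the worst-case shutdown flush wait is roughly SHUTDOWN_FLUSH_RETRIES x SHUTDOWN_FLUSH_TIMEOUT_MS. A minimal Java sketch of that arithmetic (illustration only, not the function's actual code; the defaults shown are the values listed above):

public class FlushBudgetSketch {
    public static void main(String[] args) {
        // Read the same environment variables the Lambda function is configured with.
        int retries = Integer.parseInt(System.getenv().getOrDefault("SHUTDOWN_FLUSH_RETRIES", "600"));
        long timeoutMs = Long.parseLong(System.getenv().getOrDefault("SHUTDOWN_FLUSH_TIMEOUT_MS", "100"));
        // Worst-case time spent waiting for the flush before the sender gives up.
        // Defaults: 600 * 100 ms = 60 s; reducing the retries shortens this budget.
        System.out.println("Worst-case flush wait: " + (retries * timeoutMs) + " ms");
    }
}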

The jar file name graylog-s3-lambda-060821netty4.1.65-1.0.1.jar does not look like one of our official release artifacts. Perhaps it was built separately by someone with other changes to the source/dependencies. Can you please test with our latest official release, version 1.0.5 (https://github.com/Graylog2/graylog-s3-lambda/releases/tag/1.0.5)? All dependencies were updated to the latest versions (including to resolve the Log4j CVE-2021-44832).

danotorrey commented 1 year ago

When I update to 1.0.5, I must increase memory to at least 384 MB to prevent it from producing the Exception in thread "main" java.lang.Error: java.lang.OutOfMemoryError: Metaspace error.

@joberthar Apologies, I see that you have already tried 1.0.5. Can you let me know if adjusting the retry and timeout parameters helps with memory usage?

danotorrey commented 1 year ago

@joberthar Can you also please try to reduce the TCP_MAX_IN_FLIGHT_SENDS value (defaults to 512) to 256 or 128 and see if that reduces memory pressure? This should effectively cause the message parse/send operation to wait until the indicated number of messages are sent before reading more from the file (should limit the number of messages in memory at any given time).
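
For context, limiting in-flight sends behaves like a counting semaphore around the send path. A rough sketch of that back-pressure idea (illustration only, with assumed names; this is not the gelfclient's actual implementation):

import java.util.concurrent.Semaphore;

class InFlightLimiterSketch {
    // TCP_MAX_IN_FLIGHT_SENDS caps how many sends can be outstanding at once.
    private final Semaphore inFlight = new Semaphore(
            Integer.parseInt(System.getenv().getOrDefault("TCP_MAX_IN_FLIGHT_SENDS", "512")));

    void beforeSend() throws InterruptedException {
        inFlight.acquire();   // the S3 reader blocks here once the cap is reached
    }

    void afterSendComplete() {
        inFlight.release();   // released from the transport's write-complete callback
    }
}

A lower cap (128 or 256) makes the reader pause sooner, so fewer parsed messages should sit in memory at any given time.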

Also, how big is each message line? Just trying to get an estimate of total message size.

ghost commented 1 year ago

Also, how big is each message line? Just trying to get an estimate of total message size.

The Cloudflare logs are sent in log.gz.json format and are a few KB in size compressed, but can be a few MB uncompressed.

This is an example line from the file:

{"ClientIP":"71.255.0.0","ClientRequestHost":"example.com","ClientRequestMethod":"GET","ClientRequestURI":"/enhanced-domains/af24fd6547bc0469b2f174466b750406.gz.aes","EdgeResponseBytes":12200,"EdgeResponseStatus":200,"RayID":"74de88307bf8300c","CacheCacheStatus":"hit","ClientDeviceType":"desktop","ClientRequestBytes":1961,"ClientRequestPath":"/enhanced-domains/af24fd6547bc0469b2f174466b750406.gz.aes","ClientRequestProtocol":"HTTP/1.1","ClientRequestUserAgent":"Mozilla/6.0","EdgeColoCode":"BOS","EdgePathingOp":"wl","EdgeRequestHost":"example.com","EdgeResponseContentType":"application/octet-stream","OriginResponseStatus":0,"ClientRequestScheme":"https","EdgePathingSrc":"macro","EdgePathingStatus":"nr","EdgeStartTimestamp":"2022-09-21T00:02:25Z","WorkerSubrequest":false,"ParentRayID":"00","ClientRequestReferer":"","ClientRequestSource":"eyeball","WAFAction":"unknown","WAFRuleID":"","WAFRuleMessage":"","OriginResponseHTTPLastModified":"","SmartRouteColoID":0,"EdgeRateLimitAction":"","EdgeRateLimitID":0,"UpperTierColoID":0,"CacheTieredFill":false,"OriginResponseDurationMs":0,"ClientSSLProtocol":"TLSv1.2","ClientSSLCipher":"ECDHE-ECDSA-AES256-GCM-SHA384","ClientXRequestedWith":"","OriginSSLProtocol":"unknown","OriginIP":"","OriginResponseHTTPExpires":"","WorkerCPUTime":1167,"WorkerStatus":"ok","WorkerSubrequestCount":0}

Setting TCP_MAX_IN_FLIGHT_SENDS to 128 doesn't help resolve the out-of-memory error. I tried increasing the memory to 320 MB, but that still didn't help.

These are the error graphs from switching from 1.0.1 to 1.0.5. https://www.screencast.com/t/x7aCqE6cYLBH https://www.screencast.com/t/DiumqAASJG9

I haven't adjusted SHUTDOWN_FLUSH_RETRIES or SHUTDOWN_FLUSH_TIMEOUT_MS because the default configuration assumes a 60-second timeout for the entire function (which is where I'd like to keep it unless another value is preferred, and even then only if memory usage can be reduced).

As a note, I am currently using Java 8 on Amazon Linux 1. Should I try using a different runtime?

ghost commented 1 year ago

I've attached a log export in case it might be helpful. What's curious is that I set the memory to 320 MB, but the function never uses anywhere near that; if I reduce it below 320 MB, it starts giving memory errors. 320 MB does seem to work, though.

I've attached a log showing failure even with a 2-minute timeout.

ghost commented 1 year ago

@danotorrey So I got many of the errors to stop. These are my configuration settings:

Environmental variables

{
    "CLOUDFLARE_LOGPUSH_MESSAGE_SUMMARY_FIELDS": "ClientRequestHost, ClientRequestPath, EdgeResponseStatus, ClientIP, EdgeResponseBytes, CacheCacheStatus, EdgeColoCode, RayID",
    "COMPRESSION_TYPE": "gzip",
    "CONTENT_TYPE": "application/x.cloudflare.log",
    "GRAYLOG_HOST": "172.22.1.40",
    "GRAYLOG_PORT": "12201",
    "LOG_LEVEL": "INFO",
    "SHUTDOWN_FLUSH_RETRIES": "600",
    "SHUTDOWN_FLUSH_TIMEOUT_MS": "300",
    "TCP_MAX_IN_FLIGHT_SENDS": "1024",
    "TCP_QUEUE_SIZE": "1024"
}

The memory is set at 304 MB and the timeout is 3 minutes. I still sometimes see a timeout (even after 3 minutes), which seems really strange.

I increased (rather than decreased) the TCP_MAX_IN_FLIGHT_SENDS setting to see if that might help send more files so that it wouldn't hit a timeout. I think this might have helped, but it's really difficult to determine.

ghost commented 1 year ago

@danotorrey What are the consequences of the SHUTDOWN FLUSH failing?

danotorrey commented 1 year ago

@joberthar Sorry for the delay. Thanks for continuing to test and for all the feedback/results.

What are the consequences of the SHUTDOWN FLUSH failing?

Any in-flight messages could potentially be discarded. We definitely want to try to get the shutdown flush to be successful. The relevant logic is here: https://github.com/Graylog2/gelfclient/blob/2a0b7bed57c022f8ce77d7def12f8bff9209e6c1/src/main/java/org/graylog2/gelfclient/transport/GelfSenderThread.java#L147-L176
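
For anyone following along, the linked shutdown logic has roughly the following shape (a simplified paraphrase with assumed names, not the exact gelfclient source): it re-checks the send queue up to the configured number of retries, waiting the configured timeout between checks, and only then shuts down anyway.

class ShutdownFlushSketch {
    static void flushOnShutdown(java.util.Queue<?> sendQueue, int maxRetries, long waitMs)
            throws InterruptedException {
        int attempts = 0;
        while (!sendQueue.isEmpty() && attempts < maxRetries) {
            Thread.sleep(waitMs);   // give the sender thread time to drain in-flight messages
            attempts++;
        }
        if (!sendQueue.isEmpty()) {
            // Corresponds to the error seen in the logs above; remaining messages may be lost.
            System.err.println("Failed to flush messages in [" + attempts + "] attempts. Shutting down anyway.");
        }
    }
}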

I increased (rather than decreased) the TCP_MAX_IN_FLIGHT_SENDS setting to see if that might help send more files so that it wouldn't hit a timeout. I think this might have helped, but it's really difficult to determine.

Interesting. How many messages are we typically receiving in each file? From a previous log file, it looked like around 200-600. That does not seem like a particularly large amount of data to send.

So, is it correct that there are two issues? One with memory usage, and another with the shutdown flush failing?

From looking at the latest log, one puzzling thing I am seeing is that the time between Sent [143] messages and the flush failure appears to be less than a second. Based on the above config values (300 ms timeout, 600 retries, i.e. up to roughly 180 seconds of flush retries), I would have expected a much longer flush wait. Do you have any idea why this might be?

17:37:18 S3EventProcessor:45 - Reading object [logpush/20220923T173354Z_20220923T173424Z_063ccfe0.log.gz] from bucket [bucket-name-redacted]
17:37:21 S3EventProcessor:94 - Sent [143] messages.
17:38:21 GelfSenderThread:175 - Failed to flush messages in [600] attempts. Shutting down anyway.

ghost commented 1 year ago

@danotorrey Increasing TCP_MAX_IN_FLIGHT_SENDS didn't actually help anything.

Interesting. How many messages are we typically receiving in each file? From a previous log file, it looked like around between 200-600. It does not seem like a particularly large amount of data to send.

The log file I provided is a typical one. Cloudflare regularly sends many small files to S3 rather than waiting to send one larger file. I would say there would never be more than 1000 messages in any given log, since Cloudflare is more likely to split them across multiple files in the Logpush batch.

So, is it correct that there are two issues? One with memory usage, and another with the shutdown flush failing?

Yes, the biggest issue is the memory issue. Ideally, I would think the function could run on as little as 128 MB; needing more than 320 MB seems like a relatively large amount for processing log messages.

The shutdown flush issue is also concerning, since it wasn't an issue in v1.0.1. Were there any changes that would have required additional memory or that would have caused these variables to need changing?

ghost commented 1 year ago

As a side note, when I return to the previous version, I see these types of errors (disconnect messages, etc.). They don't seem to be causing any issues, though, and I don't notice any missing logs.

LOGS    Name: cloudwatch_lambda_agent   State: Subscribed   Types: [platform]
Transforming org/apache/logging/log4j/core/lookup/JndiLookup (java.net.URLClassLoader@5197848c)
EXTENSION   Name: cloudwatch_lambda_agent   State: Ready    Events: [SHUTDOWN,INVOKE]
START RequestId: a80cf579-9a35-424c-bd38-710bcdb7cf63 Version: $LATEST
2022-09-28 17:10:11 a80cf579-9a35-424c-bd38-710bcdb7cf63 INFO  GraylogS3Function:87 - Log level is now set to [INFO].
org.graylog.integrations.s3.codec.CodecProcessor
org.graylog.integrations.s3.CompressionType
com.github.joschi.jadconfig.JadConfig
org.graylog.integrations.s3.GraylogS3Function
org.graylog.integrations.s3.ProtocolType
org.graylog.integrations.s3.ContentType
2022-09-28 17:10:32  INFO  GelfTcpTransport:115 - Channel disconnected!
2022-09-28 17:10:42  ERROR GelfTcpTransport:138 - Connection failed: executor not accepting a task
2022-09-28 17:10:44 a80cf579-9a35-424c-bd38-710bcdb7cf63 INFO  GraylogS3Function:64 - Processed [1] S3 records.
END RequestId: a80cf579-9a35-424c-bd38-710bcdb7cf63
REPORT RequestId: a80cf579-9a35-424c-bd38-710bcdb7cf63  Duration: 34116.29 ms   Billed Duration: 34117 ms   Memory Size: 304 MB Max Memory Used: 181 MB Init Duration: 1835.02 ms   XRAY TraceId: 1-63347fef-756931c16af5fd7521466d51   SegmentId: 0eb69e8107033dbd Sampled: true
danotorrey commented 1 year ago

@joberthar Can you please try to reduce the TCP_QUEUE_SIZE to a much smaller value (say 4, 8, or 16), and see if the memory usage is reduced?

This controls the number of messages that can be enqueued in memory before a send is attempted. This is effectively a buffer: the larger the value, the more parsed messages will be held in memory awaiting transmission. Lower values should create back-pressure and slow reading from the S3 file. The default value is 512, and I have a feeling that could be responsible for the high memory usage (a somewhat unconstrained flood of messages into memory; we'd rather have them stream in and out at a rate that keeps memory usage low, with enough messages in the queue that the sender always has something ready to send over the wire).
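
To make the buffering behavior concrete, here is a minimal sketch of the idea, assuming a blocking bounded queue between the S3 reader and the sender thread (illustration only, not the gelfclient's actual internals):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class BoundedSendQueueSketch {
    // TCP_QUEUE_SIZE bounds how many parsed messages can wait in memory at once.
    static final BlockingQueue<String> QUEUE = new ArrayBlockingQueue<>(
            Integer.parseInt(System.getenv().getOrDefault("TCP_QUEUE_SIZE", "512")));

    // Reader side: put() blocks when the queue is full, so parsing the S3 object
    // slows down instead of flooding memory.
    static void enqueue(String parsedMessage) throws InterruptedException {
        QUEUE.put(parsedMessage);
    }

    // Sender side: take() hands messages to the TCP transport as fast as it can write them.
    static String nextToSend() throws InterruptedException {
        return QUEUE.take();
    }
}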

ghost commented 1 year ago

@danotorrey Updating the TCP_QUEUE_SIZE to 4 doesn't help with the memory issue.

This is the most recent log after making the change to a reduced memory size (this size was recommended by the AWS Compute Optimizer).

LOGS    Name: cloudwatch_lambda_agent   State: Subscribed   Types: [platform]
Transforming org/apache/logging/log4j/core/lookup/JndiLookup (java.net.URLClassLoader@5197848c)
EXTENSION   Name: cloudwatch_lambda_agent   State: Ready    Events: [INVOKE,SHUTDOWN]
START RequestId: d09bb2c9-2d5f-4787-8d48-d0e99d7d4619 Version: $LATEST
2022-09-28 18:56:36 d09bb2c9-2d5f-4787-8d48-d0e99d7d4619 INFO  GraylogS3Function:65 - Log level is now set to [INFO].
io.netty.channel.socket.nio.NioSocketChannel
org.graylog2.gelfclient.encoder.GelfMessageJsonEncoder
io.netty.channel.MultithreadEventLoopGroup
io.netty.util.internal.InternalThreadLocalMap
io.netty.util.internal.PlatformDependent0
io.netty.channel.AbstractChannel
io.netty.util.concurrent.AbstractEventExecutor
io.netty.util.concurrent.GlobalEventExecutor
org.graylog2.gelfclient.transport.AbstractGelfTransport
io.netty.util.internal.PlatformDependent
org.graylog2.gelfclient.transport.GelfSenderThread
com.github.joschi.jadconfig.JadConfig
io.netty.channel.AbstractChannelHandlerContext
io.netty.channel.nio.AbstractNioChannel
org.graylog.integrations.s3.CompressionType
io.netty.channel.DefaultChannelId
io.netty.channel.ChannelInitializer
io.netty.util.concurrent.SingleThreadEventExecutor
org.graylog.integrations.s3.codec.CloudflareLogCodec
io.netty.util.NetUtilInitializations
org.graylog2.gelfclient.transport.GelfTcpTransport
io.netty.util.NetUtil
io.netty.util.concurrent.DefaultPromise.rejectedExecution
io.netty.util.internal.CleanerJava6
io.netty.util.ResourceLeakDetector
io.netty.buffer.PooledByteBufAllocator
io.netty.util.internal.MacAddressUtil
io.netty.util.internal.SystemPropertyUtil
io.netty.channel.ChannelOutboundBuffer
io.netty.util.concurrent.DefaultPromise
io.netty.bootstrap.Bootstrap
io.netty.resolver.AddressResolverGroup
io.netty.buffer.ByteBufUtil
io.netty.buffer.PoolThreadCache
org.graylog.integrations.s3.S3EventProcessor
io.netty.channel.nio.NioEventLoop
org.graylog.integrations.s3.S3ScannerFactory
org.graylog.integrations.s3.GraylogS3Function
org.graylog.integrations.s3.ProtocolType
org.graylog.integrations.s3.ContentType
io.netty.util.internal.logging.InternalLoggerFactory
io.netty.channel.DefaultChannelPipeline
io.netty.channel.ChannelHandlerMask
2022-09-28 18:56:36 d09bb2c9-2d5f-4787-8d48-d0e99d7d4619 INFO  S3EventProcessor:45 - Reading object [logpush/20220928T185559Z_20220928T185629Z_2de9dcb3.log.gz] from bucket [bucket-name-redacted]
Exception in thread "main" 
java.lang.Error: java.lang.OutOfMemoryError: Metaspace
at lambdainternal.AWSLambda.<clinit>(AWSLambda.java:72)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at lambdainternal.LambdaRTEntry.main(LambdaRTEntry.java:150)
Caused by: java.lang.OutOfMemoryError: Metaspace
END RequestId: d09bb2c9-2d5f-4787-8d48-d0e99d7d4619
REPORT RequestId: d09bb2c9-2d5f-4787-8d48-d0e99d7d4619  Duration: 61199.58 ms   Billed Duration: 60000 ms   Memory Size: 282 MB Max Memory Used: 196 MB Init Duration: 4464.41 ms   
XRAY TraceId: 1-633498de-33f9154e008b695872f70aba   SegmentId: 3e99b97a146109a8 Sampled: true   
2022-09-28T18:57:36.776Z d09bb2c9-2d5f-4787-8d48-d0e99d7d4619 Task timed out after 61.20 seconds

This is the file (contents redacted) that was being processed. 20220928T185559Z_20220928T185629Z_2de9dcb3.log.gz.json.zip

It seems strange that a file with fewer than 10 messages would time out.

danotorrey commented 1 year ago

Thanks @joberthar. Is it correct that the memory issues did not happen in version 1.0.1 and prior?

Java Metaspace contains Java class information needed for the application to run. The adjustments I talked about previously would probably not affect Metaspace.

The only changes since 1.0.1 are dependency updates. Perhaps something in the dependencies has significantly changed the size or quantity of class metadata that needs to be loaded. We should be able to diagnose what changed with some profiling. I'll be out of the office for the next few days, but I can look into this more starting mid-to-late next week.

Thanks again for all of the time and effort helping us investigate this.

danotorrey commented 1 year ago

Could be related: https://stackoverflow.com/a/38918849/1624052 Should be able to confirm with profiling.
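
One lightweight way to confirm without a full profiler (a sketch using the standard JMX management beans; logging something like this from the handler under both 1.0.1 and 1.0.5 would show whether class-metadata usage grew):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

class MetaspaceProbeSketch {
    static void logClassMetadataUsage() {
        // Number of classes currently loaded; Metaspace grows with class metadata.
        long loaded = ManagementFactory.getClassLoadingMXBean().getLoadedClassCount();
        System.out.println("Loaded classes: " + loaded);
        // On HotSpot, the Metaspace memory pool reports its current usage directly.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                System.out.println("Metaspace used: " + pool.getUsage().getUsed() / (1024 * 1024) + " MB");
            }
        }
    }
}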

ghost commented 1 year ago

Is it correct that the memory issues did not happen in version 1.0.1 and prior?

Correct. This was not an issue in version 1.0.0 or 1.0.1.

I would also be curious whether the runtime matters in this case. I am currently using Java 8 on Amazon Linux 1, but I am getting messages suggesting I update to Java 8 on Amazon Linux 2, which I believe is the equivalent of the Corretto version of Java 8. This could just be Amazon trying to get me to update (it really wants me to use Amazon Corretto 11).

ghost commented 1 year ago

Hey @danotorrey, were you able to get back to this recently?