bottlerocket-os / bottlerocket

An operating system designed for hosting containers
https://bottlerocket.dev

[BUG] Compatibility issues with v1.26.0 of Bottlerocket when running NodeJS applications #4261

Closed emanuelflp closed 5 days ago

emanuelflp commented 1 month ago

Image I'm using: bottlerocket-aws-k8s-1.30-aarch64-v1.26.0-85f0d68c

What I expected to happen: All nodes using the latest Bottlerocket AMI (1.26) should be able to run NodeJS-based pods without any issues.

What actually happened: When Karpenter rolled out new nodes using the latest Bottlerocket AMI, all the NodeJS-based pods placed on the new nodes crashed with the errors below:

#
# Fatal error in , line 0
# Check failed: 12 == (*__errno_location ()).
#
#
#
#FailureMessage Object: 0xffffe447dab0
----- Native stack trace -----

 1: 0xfb5f38  [/usr/local/bin/node]
 2: 0x257c370 V8_Fatal(char const*, ...) [/usr/local/bin/node]
 3: 0x2586f08 v8::base::OS::SetPermissions(void*, unsigned long, v8::base::OS::MemoryPermission) [/usr/local/bin/node]
 4: 0x143776c v8::internal::MemoryAllocator::SetPermissionsOnExecutableMemoryChunk(v8::internal::VirtualMemory*, unsigned long, unsigned long, unsigned long) [/usr/local/bin/node]
 5: 0x1437b04 v8::internal::MemoryAllocator::AllocateAlignedMemory(unsigned long, unsigned long, unsigned long, v8::internal::AllocationSpace, v8::internal::Executability, void*, v8::internal::VirtualMemory*) [/usr/local/bin/node]
 6: 0x1437d04 v8::internal::MemoryAllocator::AllocateUninitializedChunkAt(v8::internal::BaseSpace*, unsigned long, v8::internal::Executability, unsigned long, v8::internal::PageSize) [/usr/local/bin/node]
 7: 0x1437eb8 v8::internal::MemoryAllocator::AllocatePage(v8::internal::MemoryAllocator::AllocationMode, v8::internal::Space*, v8::internal::Executability) [/usr/local/bin/node]
 8: 0x14545ec v8::internal::PagedSpaceBase::TryExpand(v8::internal::LocalHeap*, v8::internal::AllocationOrigin) [/usr/local/bin/node]
 9: 0x14092a8 v8::internal::PagedSpaceAllocatorPolicy::RefillLab(int, v8::internal::AllocationOrigin) [/usr/local/bin/node]
10: 0x1407578 v8::internal::MainAllocator::AllocateRawSlow(int, v8::internal::AllocationAlignment, v8::internal::AllocationOrigin) [/usr/local/bin/node]
11: 0x13abee8 v8::internal::Factory::CodeBuilder::AllocateUninitializedInstructionStream(bool) [/usr/local/bin/node]
12: 0x13c31dc v8::internal::Factory::CodeBuilder::BuildInternal(bool) [/usr/local/bin/node]
13: 0x18e90bc v8::internal::baseline::BaselineCompiler::Build(v8::internal::LocalIsolate*) [/usr/local/bin/node]
14: 0x120cbc8 v8::internal::GenerateBaselineCode(v8::internal::Isolate*, v8::internal::Handle<v8::internal::SharedFunctionInfo>) [/usr/local/bin/node]
15: 0x1266690 v8::internal::Compiler::CompileSharedWithBaseline(v8::internal::Isolate*, v8::internal::Handle<v8::internal::SharedFunctionInfo>, v8::internal::Compiler::ClearExceptionFlag, v8::internal::IsCompiledScope*) [/usr/local/bin/node]
16: 0x12669a8 v8::internal::Compiler::CompileBaseline(v8::internal::Isolate*, v8::internal::Handle<v8::internal::JSFunction>, v8::internal::Compiler::ClearExceptionFlag, v8::internal::IsCompiledScope*) [/usr/local/bin/node]
17: 0x18da644 v8::internal::baseline::BaselineBatchCompiler::CompileBatch(v8::internal::Handle<v8::internal::JSFunction>) [/usr/local/bin/node]
18: 0x1353064 v8::internal::TieringManager::OnInterruptTick(v8::internal::Handle<v8::internal::JSFunction>, v8::internal::CodeKind) [/usr/local/bin/node]
19: 0x17c3ca4  [/usr/local/bin/node]
20: 0x17c5a24 v8::internal::Runtime_BytecodeBudgetInterruptWithStackCheck_Ignition(int, unsigned long*, v8::internal::Isolate*) [/usr/local/bin/node]
21: 0x1e1ab54  [/usr/local/bin/node]

Workaround: Rolling back the nodes to the previous version v1.25.0 fixed the issue.

bcressey commented 1 month ago

The 1.26.0 release of Bottlerocket included a change to restrict system services from mapping memory as both writable and executable (https://github.com/bottlerocket-os/bottlerocket-core-kit/pull/158).

Although intended to apply only to the host software, which does not need this capability, the restriction also erroneously applied to applications running inside containers. Software relying on just-in-time (JIT) compilation, such as Java or NodeJS, often needs to mark memory as both writable and executable, and this change caused pods running Java and NodeJS applications to fail.
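
To check whether a given container is affected, a small probe can request a mapping that is both writable and executable and report what the kernel says. The following is a minimal sketch, not what V8 itself does (the trace above shows V8 failing while changing permissions on memory it had already reserved); it only exercises the same class of request. The function name `try_wx_mapping` is made up for illustration.

```python
import errno
import mmap
import os


def try_wx_mapping(size: int = mmap.PAGESIZE):
    """Try to create an anonymous mapping that is writable and executable.

    Returns (True, None) if the mapping is allowed, or (False, errno_value)
    if the kernel or a security policy refuses the request.
    """
    try:
        region = mmap.mmap(
            -1,  # no backing file: anonymous memory
            size,
            flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
            prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC,
        )
    except OSError as exc:
        return False, exc.errno
    region.close()
    return True, None


if __name__ == "__main__":
    ok, err = try_wx_mapping()
    if ok:
        print("W+X mapping allowed")
    else:
        name = errno.errorcode.get(err, str(err))
        print(f"W+X mapping denied: {name} ({os.strerror(err)})")
```

Running this inside a pod on a 1.25.0 node and on a 1.26.0 node should show whether the restriction reaches the container; the exact errno reported on a denial depends on the enforcement mechanism and is not assumed here.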

To mitigate the impact, the 1.26.0 release has been rolled back and 1.25.0 is now marked as latest.

jemc commented 1 month ago

Please note that the bottlerocket-1.26.0-based AMIs on AWS (e.g. bottlerocket-aws-k8s-1.30.x86_64-v1.26.0-85f0d68c which bit us this morning) are still active/available so this issue will still be impacting users.

It might be worth thinking through how to propagate retracted releases downstream to whoever publishes these AMIs and has the ability to retract them as well.

jemc commented 1 month ago

Or perhaps it's worth considering that the best way to roll back a release in practice may be to cut a new release (with a higher version number) that will supersede the old/bad version in all downstream systems.

If you had done that, we wouldn't have had to version lock to an old version (and miss out on security updates until we unlock). Or perhaps we wouldn't have had the issue at all, since our badly-timed upgrade was 10 hours after you already retracted the release here in the source repo (but the AMI remained/remains active).

larvacea commented 1 month ago

@jemc, thank you for the suggestions. Bottlerocket does provide a mechanism for choosing AMI IDs that you might consider: AMI IDs as public SSM parameters (see, for instance, the QUICKSTART-EKS.md file in this repo for details). When a release is published, the latest SSM parameter is updated to the new AMI ID, and if a release is rolled back, that parameter is changed back to the previous AMI ID. This change can propagate very quickly. In your particular case, if you were updating ten hours after the issue was found, the latest SSM parameter would already have pointed to the rollback (previous version) AMI ID. I hope this helps, going forward.
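
As a rough illustration of the SSM approach, the sketch below builds the parameter name and resolves it with the AWS CLI. The parameter path layout (`/aws/service/bottlerocket/<variant>/<arch>/latest/image_id`) is an assumption here and should be checked against QUICKSTART-EKS.md; the helper names are hypothetical, and the lookup requires the `aws` CLI with credentials configured.

```python
import subprocess


def latest_ami_parameter(variant: str, arch: str) -> str:
    """Build the SSM parameter name for the latest AMI ID.

    Assumed layout; verify against QUICKSTART-EKS.md. `arch` is assumed to be
    "x86_64" or "arm64" (the latter for aarch64 images).
    """
    return f"/aws/service/bottlerocket/{variant}/{arch}/latest/image_id"


def resolve_latest_ami(variant: str, arch: str, region: str) -> str:
    """Ask SSM (through the AWS CLI) which AMI ID is currently marked latest."""
    result = subprocess.run(
        [
            "aws", "ssm", "get-parameter",
            "--region", region,
            "--name", latest_ami_parameter(variant, arch),
            "--query", "Parameter.Value",
            "--output", "text",
        ],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Only prints the parameter name; call resolve_latest_ami() to query AWS.
    print(latest_ami_parameter("aws-k8s-1.30", "arm64"))
```

Resolving the AMI ID this way at provisioning time, rather than pinning a specific AMI name, means a rollback of the latest parameter is picked up the next time nodes are launched.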

jemc commented 1 month ago

Thanks for the info - I'll take a look!

koooosh commented 5 days ago

Closing this issue as the fix for this (referenced above) was released in Bottlerocket v1.26.1: https://github.com/bottlerocket-os/bottlerocket/releases/tag/v1.26.1