ClusterLabs / resource-agents

Combined repository of OCF agents from the RHCS and Linux-HA projects
GNU General Public License v2.0
493 stars 582 forks source link

AWS IMDS Token is not being reused until it expires #1990

Open harshkiprofile opened 2 weeks ago

harshkiprofile commented 2 weeks ago

The script currently generates an IMDS token with a 6-hour (21,600 seconds) validity, but it lacks logic to reuse the token until it expires. Instead, it fetches a new token on every call, which could be optimized.

harshkiprofile commented 2 weeks ago

image

thimslugga commented 2 weeks ago

This new change introduces security concerns as it's writing the session token to a file under /var/run and care must be taken to keep this safe and ensure permissions set correctly for read/write i.e. umask, etc.

Ideally the session token ttl time should be much lower i.e. ~10 min or less and not using the max ttl of 6 hours.

I have a fork of the aws-vpc-move-ip, which introduces some changes and attempts to get the instance ID from local file and falls back to the IMDS if that file is not accessible e.g. non-nitro based EC2 instance, etc.

https://github.com/ClusterLabs/resource-agents/compare/main...thimslugga:resource-agents:aws-dev

oalbrigt commented 2 weeks ago

This new change introduces security concerns as it's writing the session token to a file under /var/run and care must be taken to keep this safe and ensure permissions set correctly for read/write i.e. umask, etc.

We can just chmod 600 the file after saving the token.

thimslugga commented 2 weeks ago

We can also retrieve the instance ID from the /sys/devices/virtual/dmi/id/board_asset_tag file on Nitro-based EC2 instances. This would prevent always having to make curl requests to retrieve this info and we can fallback to IMDS.

https://github.com/thimslugga/resource-agents/blob/cadd53c8a5c019f1679bda47d3d5b70471af4d3b/heartbeat/aws-vpc-move-ip#L286-L293

The sleep and wait time should also be bumped from the current 3 and 1 to something a bit higher in event the IMDS is temporarily unavailable e.g. ~10s.

oalbrigt commented 2 weeks ago

Nice. I'll make a patch with the suggestions next week.

oalbrigt commented 1 week ago

I've made a patch based on your suggestions: https://github.com/ClusterLabs/resource-agents/pull/1995

gguifelixamz commented 1 week ago

What's the actual difference between reading it from a file and making a request to the IMDS metadata? Before merging this I would like to hear from the requester @harshkiprofile what's the use-case and what's the level of optimization achieved from this change.

Were any issues observed by fetching a token every time?

Writing a plain text token (in fact, any kind of secret) to a file is generally a bad idea, even if you set exclusive permissions.

gguifelixamz commented 1 week ago

@oalbrigt can we revert PR1991until we have the requestor to clear up a few questions about the use-case? I don't think we should upstream this feature without some background context with a change of security and usability issues.

harshkiprofile commented 1 week ago

The decision to reuse the token by storing it locally is based on the fact that during AWS EC2 hypervisor maintenance events, the IMDS metadata API call to fetch token can sometimes become temporarily unavailable, even if the EC2 instance itself remains operational. If a new token were fetched with every request during such events, metadata token retrieval could fail. By storing the metadata token locally (temporarily), we ensure uninterrupted access to metadata, regardless of IMDS availability during maintenance windows, thus enhancing system reliability.

This approach prioritizes fault tolerance over pure performance optimization. In the event of AWS maintenance, local token storage allows continued metadata access, reducing downtime and mitigating the impact of transient IMDS issues.

While storing tokens or secrets in files does introduce potential risks, @oalbrigt has implemented restricted file permissions to minimize these concerns.

Performance-wise, no significant issues have been observed with fetching a token for each request. However, in environments where transient IMDS issues occur frequently, the local token storage method provides increased robustness, eliminating the need for constant retries to acquire a token.

That was the rationale behind this change. I would love to hear your feedback on this one! I will explore on what can we done to strengthen the security as I believe caching the token improves stability.

harshkiprofile commented 1 week ago

Another key rationale behind this change is that it helps prevent failures caused by AWS API call throttling.

image image

oalbrigt commented 1 week ago

@gguifelixamz The requests adds up when you have multiple resources. Say you have one or more fence agents (e.g. for different zones), then one or more of awseip, awsvip, aws-vpc-move-ip, and aws-vpc-route53.

We can revert it if we find a reason to undo it. If not we can do improvements based on your suggestions or pull requests.

Next release should be in january/february, so we got time to fix any issues before then.

thimslugga commented 1 week ago

By storing the metadata token locally (temporarily), we ensure uninterrupted access to metadata, regardless of IMDS availability during maintenance windows, thus enhancing system reliability.

The session token is used to make subsequent calls to the IMDS service. Having it stored locally would still result in calls to the IMDS failing if the service was unavailable...

Another key rationale behind this change is that it helps prevent failures caused by AWS API call throttling.

The IMDS != AWS EC2 API. The other functions in the script that make use of the aws-cli, will still result in the aws-cli requesting a session token from IMDS, retrieving credentials from the IMDS and then making the actual call to the AWS EC2 API.

The requests adds up when you have multiple resources. Say you have one or more fence agents (e.g. for different zones), then one or more of awseip, awsvip, aws-vpc-move-ip, and aws-vpc-route53.

See above, do not confuse the IMDS and AWS EC2 API as they are not the same thing.

harshkiprofile commented 1 week ago

The IMDS != AWS EC2 API. The other functions in the script that make use of the aws-cli, will still result in the aws-cli requesting a session token from IMDS, retrieving credentials from the IMDS and then making the actual call to the AWS EC2 API.

PPS limit != EC2 API Actually, this isn't related to AWS EC2 API throttling; by "via request throttling," I mean that certain limits (PPS), when reached, cause requests to be throttled. If you look at the highlighted text in the screenshot, you'll see it refers to the packets-per-second (PPS) limits being exceeded. This solution significantly helps reduce the consumed PPS limit as it reduce the recurring calls to IMDS to get the token.

image

As a user of resource agents, I encountered several challenges in my environment that led me to propose this suggestion. Implementing the recommendation has significantly enhanced the stability of the environment.

oalbrigt commented 5 days ago

@gguifelixamz I've added a commit to be able to refresh the token e.g. if the file is corrupt for some reason: https://github.com/ClusterLabs/resource-agents/pull/1995/commits/b8d3ecc6a8ce4baf4b28d02978dd573728ccf5fa