US-GHG-Center / ghgc-stac-ingestor

GHGC STAC Ingestion Registry

Figure out why requests are timing out in smce dev environment #6

Closed · slesaad closed this 1 year ago

slesaad commented 1 year ago

Description

In the dev environment (SMCE), any request made to the API is timing out.

Acceptance Criteria

slesaad commented 1 year ago

We've started seeing this problem in the dev environment too. The link: http://dev.ghg.center/api/publish

The ingestor is deployed in the veda-smce account; the required env vars are in Secrets Manager in the same account.

This started happening without any updates or changes to the deployment: everything was working fine, then one day requests began timing out. Since the staging env was also timing out, the problems are probably related.

I haven't had a chance to look into it deeply, but GET /ingestions (which doesn't require authentication) doesn't seem to time out, so I wonder if the issue is with Cognito authentication; /token is definitely timing out. Maybe start by getting a token from Cognito directly and trying some other requests?
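
Something like this minimal sketch could isolate that (the region, app client ID, credentials, and the use of the USER_PASSWORD_AUTH flow are all assumptions here; only the endpoints come from the API above):

```python
# Hedged sketch: fetch a token straight from Cognito, then probe endpoints
# with a short timeout so a hang is distinguishable from an auth failure.
# Region, client ID, and credentials are placeholders, and USER_PASSWORD_AUTH
# is an assumed flow; the HTTP method may differ per endpoint.
import boto3
import requests

cognito = boto3.client("cognito-idp", region_name="us-west-2")  # assumed region
resp = cognito.initiate_auth(
    ClientId="<app-client-id>",  # placeholder
    AuthFlow="USER_PASSWORD_AUTH",
    AuthParameters={"USERNAME": "<username>", "PASSWORD": "<password>"},
)
token = resp["AuthenticationResult"]["IdToken"]

base = "http://dev.ghg.center/api"
for path in ("/ingestions", "/token"):
    try:
        r = requests.get(
            f"{base}{path}",
            headers={"Authorization": f"Bearer {token}"},
            timeout=10,
        )
        print(path, r.status_code)
    except requests.exceptions.Timeout:
        print(path, "timed out")
```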

anayeaye commented 1 year ago

I didn't figure this out, but I did look at the ghgc dev logs and compared the behavior to the current veda dev stac ingestor, so I'm dropping some notes here to pick up later.

ranchodeluxe commented 1 year ago

Some notes on the steps for debugging and solving this issue:

* deployed a simple Lambda in the same VPC and private subnets that just made a public request, and saw it still timed out (a minimal sketch of this test follows the list)

* that meant something about the NAT gateway was wrong

* I looked at the private subnet route tables and made sure they had rules targeting the NAT ENI

* I made sure the ENI Source/Destination check was disabled

* I checked the rules on the network ACL

* I enabled flow logs and saw that traffic to our ENI was getting REJECT records
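
A minimal sketch of the connectivity-test Lambda mentioned above (the handler and target URL are illustrative; it assumes the function is deployed into the same VPC and private subnets):

```python
# Hedged sketch of the VPC connectivity test: a Lambda in the same private
# subnets that simply makes an outbound public request. If this times out,
# outbound routing through the NAT is broken. The target URL is arbitrary.
import urllib.request

def handler(event, context):
    try:
        with urllib.request.urlopen("https://example.com", timeout=10) as resp:
            return {"status": resp.status}
    except Exception as exc:  # a timeout here points at the NAT path
        return {"error": str(exc)}
```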

Somewhere in the above workflow I noticed the NAT had a strangely named SG attached to it called SMCE-Isolate. Knowing that SMCE adds permission boundaries to IAM users and probably implements other security measures, it made sense that they might have provisioned a security rule, and that this is why things suddenly started breaking. This SG didn't have any inbound traffic rules, so I added one: `IPv4 | All traffic | All | All | 10.41.0.0/16`. But this has potential security implications, and I'm not sure what to do here.
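
For the record, the stopgap above amounts to something like this (the security group ID is a placeholder):

```python
# Hedged sketch: allow all inbound traffic from the VPC CIDR on the NAT's
# SMCE-Isolate security group. The group ID is a placeholder.
import boto3

ec2 = boto3.client("ec2")
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # placeholder for the SMCE-Isolate SG
    IpPermissions=[
        {
            "IpProtocol": "-1",  # all traffic
            "IpRanges": [{"CidrIp": "10.41.0.0/16", "Description": "VPC CIDR only"}],
        }
    ],
)
```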

More info:

CloudTrail tells me someone added this group on July 11th. Do @slesaad or @amarouane-ABDELHAK know if this was them? I think this was SMCE.

I was about to write to SMCE but I'll wait for someone to confirm or deny before we reach out

amarouane-ABDELHAK commented 1 year ago

To prevent similar security risks from recurring and to improve overall network security management, I would recommend asking for a shared VPC architecture, provisioned either by us in the UAH accounts (probably using this repo) or by the SMCE administrators on our behalf. A shared VPC can serve as a central hub for hosting common resources and services, including the NAT instance, and can be referred to by all the other custom services and resources we deploy. This approach lets us enforce consistent security measures across all the resources in the VPC.
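
As a rough illustration, consuming such a shared VPC from the ingestor's CDK app could look like this (the VPC ID and construct ID are placeholders):

```python
# Hedged sketch: look up a centrally provisioned shared VPC instead of
# creating one per deployment. The VPC ID and construct ID are placeholders;
# Vpc.from_lookup also requires the stack to have an explicit account/region
# env configured.
from aws_cdk import aws_ec2 as ec2
from constructs import Construct

def shared_vpc(scope: Construct) -> ec2.IVpc:
    return ec2.Vpc.from_lookup(scope, "SharedVpc", vpc_id="vpc-0123456789abcdef0")
```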

ranchodeluxe commented 1 year ago

Okay, putting a theory out into the universe here to see what comes back:

* CDK must, by default, add a security group to the NAT with the inbound rule: `IPv4 | All traffic | All | All | 0.0.0.0/0`

* SMCE must run an EC2 security audit on these types of instances, which flagged the SG, changed it, and removed the inbound rules, because an allow-all rule is probably a security risk

* My hunch here is that we should be provisioning our NAT SG with inbound rules that allow traffic only from the VPC CIDR range: `IPv4 | All traffic | All | All | 10.41.0.0/16` (a CDK sketch of this follows below)

My hunch was correct, and SMCE staff confirmed it.
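
For reference, a minimal CDK sketch of provisioning the NAT security group this way (the construct IDs are illustrative, and the CIDR is the range used above):

```python
# Hedged sketch: a NAT security group that only allows inbound traffic from
# the VPC CIDR, instead of CDK's permissive 0.0.0.0/0 default.
from aws_cdk import aws_ec2 as ec2
from constructs import Construct

def nat_security_group(scope: Construct, vpc: ec2.IVpc) -> ec2.SecurityGroup:
    sg = ec2.SecurityGroup(scope, "NatSg", vpc=vpc, allow_all_outbound=True)
    sg.add_ingress_rule(
        peer=ec2.Peer.ipv4("10.41.0.0/16"),  # VPC CIDR only
        connection=ec2.Port.all_traffic(),
        description="Allow all traffic from within the VPC",
    )
    return sg
```

The resulting group could then be attached to the NAT instance (for example via the `security_group` parameter of `ec2.NatProvider.instance`), though the exact wiring depends on how the VPC is defined.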