GSA / data.gov

Main repository for the data.gov service
https://data.gov

[AC-4/SC-7] Manage egress traffic from SSB Broker Apps #4118

Open nickumia-reisys opened 1 year ago

nickumia-reisys commented 1 year ago

User Story

In order to increase our security posture, the Data.gov SSB Team wants to implement an egress-proxy in conjunction with our SSB Apps (ssb-solr, ssb-eks and ssb-smtp).

Acceptance Criteria

Background

Historical Issues/References

Security Considerations (required)

To satisfy POAM

Sketch

jbrown-xentity commented 1 year ago

my bad, wrong ticket

nickumia-reisys commented 1 year ago

Having explored a few options, I've decided that my original sketch is too complicated. It might be more logical to keep the same deployment procedure that we've been using for our other apps. Irrespective of this fact, the following steps need to be taken (admin -- @btylerburton @Jin-Sun-tts @jbrown-xentity):


Since we already have the infrastructure set up and optimized in this repo, I would suggest adding the ssb apps to this repo, but the two options are:


Create space-deployer accounts in egress spaces (@btylerburton @Jin-Sun-tts)

Each key needs to be added to the GitHub Secrets in datagov-ssb.


The next step (regardless of the last decision) is to:

*Note: I would consolidate these lists if practical.


Deploy all egress apps:


Once the apps are deployed and functioning, I believe all that's left to do is:

nickumia-reisys commented 1 year ago

As a foreboding warning, we may have memory issues following this deployment. We have 39G of constant app allocation, and with rolling restarts and recurring tasks, usage gets really close to our 50G limit. This work will add ~1G to our constant app allocation. I fear it may be just enough to cause more random failures.
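A quick way to sanity-check the headroom is to add the new allocation to the constant baseline and compare the result with the quota. The sketch below uses the 39G, ~1G and 50G figures from the comment above; `transient_gb` (memory briefly needed by rolling restarts and recurring tasks) is an assumption that would have to be measured, and the example values passed for it are hypothetical.

```python
def memory_headroom_gb(constant_gb: float, added_gb: float, limit_gb: float) -> float:
    """Return the quota left after adding new constant allocation."""
    return limit_gb - (constant_gb + added_gb)


def fits_quota(constant_gb: float, added_gb: float, limit_gb: float,
               transient_gb: float) -> bool:
    """True if constant + added + transient memory stays within the quota.

    transient_gb is a hypothetical estimate of memory used temporarily by
    rolling restarts and recurring tasks; it is not measured here.
    """
    return memory_headroom_gb(constant_gb, added_gb, limit_gb) >= transient_gb


# Figures from the comment: 39G constant, ~1G added by egress apps, 50G limit.
print(memory_headroom_gb(39, 1, 50))             # 10
print(fits_quota(39, 1, 50, transient_gb=9.5))   # True (hypothetical estimate)
print(fits_quota(39, 1, 50, transient_gb=10.5))  # False (hypothetical estimate)
```

In other words, the egress apps leave roughly 10G for everything transient, so the real question is how large the restart and task peaks actually get.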

nickumia-reisys commented 1 year ago

Documenting here, but I'm sure it will get lost again. For the current egress deployments to work, the ci-deployer key that is created in the non-egress space needs to be given additional permission to the egress space. @btylerburton was able to assign the correct permissions and we did it through the cloud.gov dashboard UI.
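For the record, the same permission can probably also be granted from the cf CLI instead of the dashboard, using `cf set-space-role USERNAME ORG SPACE ROLE`. The sketch below only builds (and optionally runs) that command. The username (the ci-deployer key's username), org and space names are hypothetical placeholders, and `SpaceDeveloper` is an assumption about which role the deployer needs in the egress space.

```python
import shlex
import subprocess


def set_space_role_cmd(username: str, org: str, space: str,
                       role: str = "SpaceDeveloper") -> list[str]:
    """Build a `cf set-space-role` invocation granting `role` in `space`."""
    return ["cf", "set-space-role", username, org, space, role]


def grant(username: str, org: str, space: str, dry_run: bool = True) -> list[str]:
    """Print (dry run) or execute the role assignment; returns the argv used."""
    cmd = set_space_role_cmd(username, org, space)
    if dry_run:
        # Show the command without running it.
        print(shlex.join(cmd))
    else:
        # Requires a logged-in cf CLI session allowed to manage space roles.
        subprocess.run(cmd, check=True)
    return cmd


# Hypothetical values: substitute the ci-deployer key's username and real names.
grant("ci-deployer-username", "example-org", "development-ssb-egress")
```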

robert-bryson commented 1 year ago

After the huddle yesterday, I poked around a bit and found that (for Terraform Enterprise at least) there is a list of domains that need egress. I added those to the allow.acl but it still had the same issue creating the service.

I have still been able to ping domains not on the allow list from the ssb-solrcloud app, so I doubt that egress is being restricted correctly.

nickumia-reisys commented 1 year ago

As a summary of where this stands...

General Workflow/Design

The ssb egress deployment is supposed to be an identical copy of the data.gov egress deployment. While the deployment process was not formally documented anywhere (including the historical issue mentioned above), the following statements highlight the key points:

SSB Specific Details

Each broker app lives in cloud.gov and creates services that live outside of cloud.gov, which are then made available to other apps in cloud.gov. For the following explanation, I will use the ssb-solrcloud broker app in the development-ssb space as an example, but the same details apply to the rest of the brokers and all of the other spaces.

Appendix A

The egress-proxy uses Caddy as a reverse proxy to control egress traffic. There are a few modules and plugins that help Caddy do its job. The five main components that are relevant are:

If you attempt to access something on the allowed list, you should get a 200. If you attempt to access something on the denied list, you should get a 403. When we say that the egress proxy is not doing what it was designed to do, essentially all domains are returning a 200.
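The intended behavior can be written down as a small table of expected status codes, which makes the "everything returns 200" symptom easy to spot when probing through the proxy. This is a Python model of the expected outcome, not Caddy's actual matching logic; subdomain matching, deny-over-allow precedence and default-deny for unlisted hosts are assumptions, and the domain lists are illustrative.

```python
def matches(host: str, domain: str) -> bool:
    """True if host equals domain or is a subdomain of it."""
    host, domain = host.lower().rstrip("."), domain.lower().rstrip(".")
    return host == domain or host.endswith("." + domain)


def expected_status(host: str, allowed: list[str], denied: list[str]) -> int:
    """Status code the egress proxy is expected to return for `host`.

    Denied entries win over allowed ones; a host on neither list is denied.
    """
    if any(matches(host, d) for d in denied):
        return 403
    if any(matches(host, a) for a in allowed):
        return 200
    return 403


# Illustrative lists, not the real allow.acl / deny contents.
allowed = ["registry.terraform.io", "releases.hashicorp.com"]
denied = ["example.com"]
for host in ["registry.terraform.io", "example.com", "google.com"]:
    print(host, expected_status(host, allowed, denied))
```

Comparing these expected codes with what a request through the proxy actually returns for each domain shows directly whether the proxy is enforcing its lists.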

Appendix B

Through trial and error, the farthest we've gotten is that the ssb-solrcloud app needs to talk to registry.terraform.io to download the vpc module. This most likely occurs during a terraform init. We've added all of the terraform-specific routes, but have not seen the process get any further. This could be a compound issue with the problem highlighted in Appendix A, but it may or may not be related. We can access everything from the ssb-solrcloud app (container). But as I'm writing this, I'm wondering if there is a problem with the cloud-service-broker binary abstraction where it needs to be configured to talk to the egress proxy properly. The terraform process runs in the ssb-solrcloud app itself; it's not like a second layer of virtualization (at least I don't think so), so it shouldn't have any additional layers of abstraction to jump through (although it is a possibility). The reason I think this is unlikely is that sometimes the service retrieves some bytes and then the connection just gets closed early... this is technically a lead, but not much to work with.

Given that we have not made it past the terraform init, we still don't know what's necessary for the terraform apply and, equally important, the terraform destroy. For eks, there may be additional dependencies, since it uses helm and kubectl, which download charts and other resources. @robert-bryson's comment above highlights part of a solution, but it is not the entire solution.
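For reference, Terraform's module registry protocol fetches a registry module in several hops, which may explain why allowing registry.terraform.io alone is not enough. The sketch below only builds the URLs involved; it assumes the vpc module is the public `terraform-aws-modules/vpc/aws` module (check the brokerpak's `source` attribute) and that discovery has already returned the `modules.v1` base path `/v1/modules/`. The version number is hypothetical.

```python
from urllib.parse import urljoin


def module_urls(registry_host: str, modules_base: str, address: str,
                version: str) -> dict[str, str]:
    """URLs terraform init requests for a registry module.

    address is "<namespace>/<name>/<provider>", e.g. "terraform-aws-modules/vpc/aws".
    modules_base is the "modules.v1" value from the discovery document.
    """
    namespace, name, provider = address.split("/")
    base = urljoin(f"https://{registry_host}/", modules_base)
    module = f"{base.rstrip('/')}/{namespace}/{name}/{provider}"
    return {
        "discovery": f"https://{registry_host}/.well-known/terraform.json",
        "versions": f"{module}/versions",
        # Answers with an X-Terraform-Get header naming the real source
        # location (for public modules, typically on GitHub), which is a
        # different host that also needs egress.
        "download": f"{module}/{version}/download",
    }


# Hypothetical version; the brokerpak pins its own.
for step, url in module_urls("registry.terraform.io", "/v1/modules/",
                             "terraform-aws-modules/vpc/aws", "5.0.0").items():
    print(step, url)
```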

All in all, this is a very complex technology stack with a lot of peculiarities. I will not call it elegant, but it is a wonder how it all works. I wish I could offer more help, but this is all that I have about the problem currently.

nickumia-reisys commented 1 year ago

List of possible issues (none of which I have been able to verify or deny)

nickumia-reisys commented 1 year ago

Investigated possible TF environment variable workarounds. Nothing seems to inhibit or support the use of an egress proxy. I have a hunch that terraform is not using the egress proxy properly. The egress proxy itself now seems to be working properly, in that domains on the allowed list are allowed and everything else is denied.

nickumia-reisys commented 1 year ago

I fear we may have to implement something like this 😢 😭 ... https://github.com/jasonwbarnett/terraform-registry-proxy

nickumia-reisys commented 1 year ago

Although looking through the code, it doesn't seem to have anything special in it... For the sake of trying the only known option found above, the following steps will be taken as a proof-of-concept:

nickumia-reisys commented 1 year ago

Ssoooooo.... super complicated proof-of-concept (shoutout to @btylerburton for helping me focus through a critical part!)

~The terraform-registry proxy app is set up only for http use, so I tried to get the terraform code to send http requests to it, rather than https requests, through the insecure = true configuration. It did not work. If we get terraform to send http requests, I have pretty high confidence that this will work.~ However, I figured I would document this much for now while I still try other ways of getting it to work. All in all, if this does eventually work, we'll have to come up with a deployment framework for this new terraform-registry app, which will be a separate headache altogether...

~This seems so close...!~

A list of references:

nickumia-reisys commented 1 year ago

Well... learning about terraform internals... It seems like the discovery service can only be used with https.

It is strongly recommended to provide the discovery document for a hostname on the standard HTTPS port 443. However, in development environments this is not always possible or convenient, so Terraform allows a hostname to end with a port specification consisting of a colon followed by one or more decimal digits.

When a custom port number is present, the service on that port is expected to implement HTTPS and respond to the same fixed discovery path.

For day-to-day use it is strongly recommended not to rely on this mechanism and to instead provide the discovery document on the standard port, since this allows use of the most user-friendly hostname form.
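The rule quoted above can be restated as a tiny function: a hostname may carry an optional port of one or more decimal digits, but the discovery URL is always HTTPS. This is a model of the documented rule, not Terraform's own code, and the example hostname is hypothetical.

```python
import re


def discovery_url(hostname: str) -> str:
    """Discovery document URL for a Terraform hostname.

    hostname may end in ":<digits>". Either way the scheme is https, so a
    registry proxy that only serves plain http cannot act as a discovery host.
    """
    _host, sep, port = hostname.partition(":")
    if sep and not re.fullmatch(r"[0-9]+", port):
        raise ValueError(f"invalid port in hostname: {hostname!r}")
    return f"https://{hostname}/.well-known/terraform.json"


# Hypothetical self-hosted registry on a custom port.
print(discovery_url("terraform-registry.example:8443"))
```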

With this information, it seems like the terraform-registry must be reworked to run on https... I don't think this is worth the effort (but I'll still try for now I guess). We would not only be incorporating another third-party service, but we would need to fork it to get it to work for us.

I don't know where to start with the alternative path of trying to include the vpc module so that terraform doesn't need to download anything. From my preliminary analysis, it seems like modules cannot be packaged in the brokerpak in the same way as terraform sources and binaries.

nickumia-reisys commented 1 year ago

As more context, what we are trying to get to terraform is shown in the following screenshots. Unlike the providers, which are all packaged in the broker, the modules are not.
[screenshots]

Also note: this is just for the solr brokerpak. The same would need to be done for the eks brokerpak too, which has more modules.
[screenshot]

nickumia-reisys commented 1 year ago

We could manually install the modules directory and disable terraform's calls to HashiCorp-provided network services using the .terraformrc. This seems a bit hacky, though. And then I'm not sure whether there will be other things terraform tries to talk to and gets blocked on.
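A minimal sketch of the .terraformrc half of that idea, assuming the broker lets us set environment variables for the terraform process. `disable_checkpoint = true` is the CLI configuration setting that turns off the upgrade and security-bulletin checks against HashiCorp-provided network services, and `TF_CLI_CONFIG_FILE` points terraform at the file. This does not vendor the modules by itself; the manually installed modules directory would still be needed.

```python
import os
import tempfile
from pathlib import Path

TERRAFORMRC = "disable_checkpoint = true\n"


def write_cli_config(directory: str) -> dict[str, str]:
    """Write a .terraformrc into `directory` and return an env pointing at it."""
    path = Path(directory) / ".terraformrc"
    path.write_text(TERRAFORMRC)
    env = dict(os.environ)
    # terraform reads its CLI configuration from this file instead of ~/.terraformrc.
    env["TF_CLI_CONFIG_FILE"] = str(path)
    return env


with tempfile.TemporaryDirectory() as tmp:
    env = write_cli_config(tmp)
    print(env["TF_CLI_CONFIG_FILE"])
```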

nickumia-reisys commented 1 year ago

As a summary of possible next steps:

mogul commented 1 year ago

From way above:

But as I'm writing this, I'm wondering if there is a problem with the cloud-service-broker binary abstraction where it needs to be configured to talk to the egress proxy properly.

From googling, it looks like Terraform understands and uses HTTPS_PROXY. I'm wondering if this could be resolved by ensuring the HTTPS_PROXY env var is propagated all the way into the CSB's invocation of Terraform. (It looks like no env vars are propagated currently; I asked about that in the CF Slack.)
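To illustrate the concern: a child process only sees the environment its parent hands it. If the CSB launches terraform with an explicit, stripped-down environment (an assumption based on the observation that no env vars are currently propagated), HTTPS_PROXY never reaches terraform even though it is set on the app. The sketch uses a Python child process as a stand-in for terraform; the proxy URL is hypothetical.

```python
import os
import subprocess
import sys

PROXY_URL = "http://egress-proxy.example.internal:8080"  # hypothetical

# Stand-in for terraform: report the HTTPS_PROXY value it was given.
CHILD = [sys.executable, "-c",
         "import os; print(os.environ.get('HTTPS_PROXY', '<unset>'))"]


def child_sees(env: dict[str, str]) -> str:
    """Run the stand-in child with exactly `env` and return what it reports."""
    return subprocess.run(CHILD, env=env, capture_output=True, text=True,
                          check=True).stdout.strip()


def with_proxy(base: dict[str, str], proxy_url: str) -> dict[str, str]:
    """Copy `base` and add the proxy variables (both spellings)."""
    env = dict(base)
    env["HTTPS_PROXY"] = proxy_url
    env["https_proxy"] = proxy_url
    return env


stripped = {"PATH": os.environ.get("PATH", "")}  # what a broker might pass
print(child_sees(stripped))                          # <unset>
print(child_sees(with_proxy(stripped, PROXY_URL)))   # prints PROXY_URL
```

The fix being suggested amounts to the second call: whatever environment the CSB builds for terraform has to include the proxy variables explicitly.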

hkdctol commented 1 year ago

Awaiting some information from upstream on vendoring modules

nickumia-reisys commented 1 year ago

Ongoing discussion with upstream CSB Team: https://cloudfoundry.slack.com/archives/C0164KKEZGX/p1675103503770689

nickumia-reisys commented 1 year ago

As an update on discoveries and current sticking points:

None of those solutions seem enticing to me, but I'll leave it up to the team to come up with more or decide from these.

jbrown-xentity commented 1 year ago

Summary Doc @hkdctol

hkdctol commented 1 year ago

ISSO advises that we start work on an AOR at https://docs.google.com/document/d/11iNXSSk556VA1fhx6KaeYmw-qCktypjh/edit?usp=sharing&ouid=113511966954817069922&rtpof=true&sd=true

hkdctol commented 1 year ago

Draft AOR sent to ISSO

hkdctol commented 1 year ago

Met with ISSM on 8/9; this is pending the CISO's return from leave in a few weeks.

rpalmer-gsa commented 1 year ago

@nickumia-reisys

Couple of questions:

  1. Is this still going through the proxy with an Allow * from a domain standpoint, or not going through the proxy at all?

  2. "Accept that the module cannot be packaged and work on the actual HTTPS_PROXY issue. This requires more debugging of the CSB and analyzing what specific requests are being blocked/not handled properly or otherwise lost in translation."

If you send it through an Allow Any proxy and log the URLs, would this be a means to populate the list of all URLs that are needed? Is this due to blocks, or is it due to usage of the proxy itself? It's a little unclear from above.
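On the logging idea: if an allow-any proxy wrote Caddy-style JSON access logs, the list of destination hosts could be pulled out with something like the sketch below. The field layout (`request.host`, carrying `host:port` for CONNECT requests) is an assumption about the log format, and the sample log lines are made up.

```python
import json
from collections.abc import Iterable


def strip_port(host: str) -> str:
    """Drop a trailing ":port" from a host, handling bracketed IPv6."""
    if host.startswith("["):  # e.g. "[::1]:443"
        return host[1:host.index("]")]
    name, sep, port = host.rpartition(":")
    return name if name and sep and port.isdigit() and ":" not in name else host


def destination_hosts(log_lines: Iterable[str]) -> list[str]:
    """Sorted unique hostnames from JSON access-log lines (assumed format)."""
    hosts = set()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        host = json.loads(line).get("request", {}).get("host", "")
        if host:
            hosts.add(strip_port(host))
    return sorted(hosts)


# Made-up sample lines in the assumed format.
sample_lines = [
    '{"request": {"method": "CONNECT", "host": "registry.terraform.io:443"}}',
    '{"request": {"method": "CONNECT", "host": "github.com:443"}}',
    '{"request": {"method": "CONNECT", "host": "registry.terraform.io:443"}}',
]
print(destination_hosts(sample_lines))  # ['github.com', 'registry.terraform.io']
```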

nickumia-reisys commented 1 year ago

@rpalmer-gsa

Is this still going through the proxy with an Allow * from a domain standpoint, or not going through the proxy at all?

Currently, there are no egress applications set up. Is there a desire to have an Allow * proxy app?

If you send it though an Allow Any proxy and log the URLs would this be a means to populate the list of all URLS that are needed?

The issue is not "what are the domains/urls that need to get through".

Is this due to blocks, or is it due to usage of the proxy itself? It's a little unclear from above.

The issue is that the CSB is not recognizing the proxy environment variables, so it doesn't know to send traffic through the proxy. Because all egress is disallowed except through the proxy, the requests do not leave the container at all. I assume they get stuck at some network level "waiting to be sent"... or the container thinks it sent them and is waiting for a response that never comes.
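One way to test the "not recognizing the proxy environment variables" theory from inside the container is to print the proxy settings a process actually sees. The sketch uses Python's `urllib.request.getproxies()`, which reads the `*_proxy` environment variables on Linux; it is only a stand-in, since the CSB and terraform are Go programs and may be started with a different environment than an interactive shell.

```python
import os
import urllib.request


def visible_proxy_config() -> dict[str, str]:
    """Proxy settings this process would use, keyed by URL scheme."""
    return urllib.request.getproxies()


def report() -> None:
    proxies = visible_proxy_config()
    if "https" not in proxies:
        print("no https proxy configured; HTTPS requests would try to connect directly")
    for scheme, url in sorted(proxies.items()):
        if scheme != "no":  # the "no" key comes from NO_PROXY, printed below
            print(f"{scheme}: {url}")
    print("NO_PROXY:", os.environ.get("NO_PROXY") or os.environ.get("no_proxy") or "<unset>")


report()
```

Running the same check in the broker app's shell and, if possible, from whatever process the CSB spawns would show whether the variables are lost between the two.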

hkdctol commented 8 months ago

Met with ISSM on 1/22 - he will follow up on AOR status

gujral-rei commented 4 months ago

Is this item up for review as part of the assessment?