apify / got-scraping

HTTP client made for scraping based on got.
526 stars 40 forks source link

SSRF protection #47

Closed JaneJeon closed 3 years ago

JaneJeon commented 3 years ago

Hi, first of all, thanks for the library.

I'm trying to implement SSRF protection in-place (without relying on proxies), which requires modifying the http/https agents and passing the modified instances into got.

However, that's not really possible due to two things:

  1. got's extend functionality doesn't seemingly allow you to somehow get/override existing agent instances: https://github.com/sindresorhus/got/blob/9abdd06d72d90d034299dabd22544d14c897643b/test/merge-instances.ts
  2. The TransformHeadersAgent is never exported from this library, which makes it impossible for me to instantiate agents myself and pass it in.

This unfortunately prevents me from being able to use the library. Is there any way around this?

Thanks

JaneJeon commented 3 years ago

I closed it thinking I can just use a beforeRequest hook; however, there is no way for a hook to tap into options.dnsCache or options.dnsLookup (i.e. the same thing that got uses - https://github.com/sindresorhus/got/blob/08d7a0f3285d5ac535aac873cd9090937cf008b6/source/core/options.ts#L2290), so this unfortunately prevents DNS lookups using what options users have supplied to Got (I really don't want to re-create CacheableLookup or dns.lookup just for the purposes of looking up DNS rather than using what's already in the options).

So, unfortunately, this still means there is no way to provide SSRF protection w/o getting direct access to the agents (which, by the way, I can't access from hook options either) :(

szmarczak commented 3 years ago
  1. got's extend functionality doesn't seemingly allow you to somehow get/override existing agent instances
instance.defaults.options.agent.http;
instance.defaults.options.agent.https;
instance.defaults.options.agent.http2;

const newInstance = instance.extend({
    agent: {
        http: ...,
        https: ...,
        http2: ...,
    },
});

Disclaimer: Got maintainer here

  1. The TransformHeadersAgent is never exported from this library, which makes it impossible for me to instantiate agents myself and pass it in.

I'm not sure if we need to expose this but it seems like this issue could be fixed in Got?

/cc @mnmkng

however, there is no way for a hook to tap into options.dnsCache or options.dnsLookup

Can be done via overriding dnsCache.lookupAsync or specifying custom DNS servers. Opened https://github.com/szmarczak/cacheable-lookup/issues/41 as this would be a nice feature.


Alternatively you can do something like this:

import dns from 'dns';

const instance = gotScraping.extend({
    hooks: {
        beforeRequest: [
            async options => {
                const {address} = await dns.promises.lookup(options.url.hostname);
                if (isIpPrivate(address)) {
                    throw new Error('Forbidden');
                }
            }
        ]
    }
});

The cost is probably +20ms per request.

mnmkng commented 3 years ago

I'm not sure if we need to expose this but it seems like this issue could be fixed in Got?

Which part could be fixed in Got?

I'm also not a fan of exporting the agent because it's internal implementation. On the other hand it's true that it not being exported basically makes this library unusable with custom agents, since a lot of the important logic is in the agents and you have no way of keeping that. Adding an API that would allow agent chaining or some sort of agent middleware sounds like a crazy overkill for this library, but it could be something that makes sense in Got. If it's possible to do it with a nice interface.

JaneJeon commented 3 years ago

Oh wow, thanks for the quick response. I ended up just using the builtin dns.lookup, but I agree that the alternatives don’t look so pretty either.

Overriding agents is definitely ugly as @mnmkng points out as it’s internal implementation, getting opts.dnsLookup || opts.dnsCache[‘lookup’] seems to be not possible as of now as @szmarczak points out, and the solution I ended up on (just using builtin dns and relying on the system to cache DNS lookups) is also ugly for obvious reasons.

Welp, feel free to close this issue as we’ve obviously reached some possible solutions here, but I think that we have a conversation to have about exposing got’s guts (ha ha) in a manner that doesn’t require hacky solutions (or non-solutions) like this.

szmarczak commented 3 years ago

I'm not sure if we need to expose this but it seems like this issue could be fixed in Got?

I was thinking about forcing agents to use the lookup function provided via options.dnsLookup. Given this some thought this is not a good idea, as the agent may implement its own logic in regards to DNS.

Setting up a custom DNS server seems to be the easiest solution. I recommend https://coredns.io/

JaneJeon commented 3 years ago

Setting up a custom DNS server seems to be the easiest solution.

Ah, I was thinking of setting up OS-level DNS caching, but I’m guessing what you’re suggesting is to setup an external DNS server to 1. basically proxy to an ACTUAL dns server like 1.1.1.1/8.8.8.8, 2. keep a cache of lookups/rejections? @szmarczak

szmarczak commented 3 years ago

Correct. Alternatively you can set up a Got handler, attach a request listener on the promise/stream, on that request attach a socket listener and check the IP address there.

JaneJeon commented 3 years ago

Okay, I’ll close this for now, but it still would be good to have access to options.dnsLookup || options.dnsCache[‘lookup’] (what got uses internally), but that’s more of a got issue.