everywall / ladder

Self-hosted alternative to 12ft.io and 1ft.io. Bypass paywalls with a proxy ladder and remove CORS headers from any URL.
GNU General Public License v3.0

Add option to use random IPs for trusted bots #53

dxbednarczyk closed 10 months ago

dxbednarczyk commented 10 months ago

Probably not the best way to read the response data, but I just wanted to make a proof of concept first. I would much rather not have type inference later down the line. Works fine in my initial testing.

dxbednarczyk commented 10 months ago

@deoxykev

dxbednarczyk commented 10 months ago

Might be a good idea to cache the IPs to disk somewhere (~/config/ladder?). I'll work on that later.

deoxykev commented 10 months ago

Some feedback:

First, nice work on the initial draft. It's a good start. If you don't mind, I have some feedback on how to make the code cleaner.

So first is the design pattern called dependency injection.

Here's what a basic instantiation of a proxychain looks like:

proxychain.
    NewProxyChain().
    SetFiberCtx(c).
    SetRequestModifications(
        rx.MasqueradeAsGoogleBot(),
        rx.ForwardRequestHeaders(),
        rx.SpoofReferrerFromGoogleSearch(),
    ).
    AddResponseModifications(
        tx.DeleteIncomingCookies(),
        tx.RewriteHTMLResourceURLs(),
    ).
    Execute()

The random IP pool for MasqueradeAsGoogleBot() ideally should only be updated once, right? If we update it inside of the MasqueradeAsGoogleBot() request modifier, then the Googlebot IP pool will be refreshed every single time a new URL is proxied.

The same goes for loading any other resource, such as the rulesets. Only load it once. Since this is an optional feature, we can "inject" it into MasqueradeAsGoogleBot() instead, so that we can create the Googlebot IP pool from the outside.

_ = helpers.UpdateGoogleBotIPs() // plain assignment: "_ :=" does not compile

...
SetRequestModifications(
    rx.MasqueradeAsGoogleBot(helpers.RandomGoogleIP),
    rx.ForwardRequestHeaders(),
    rx.SpoofReferrerFromGoogleSearch(),
).
...

In order to do this, you'd have to modify the signature of MasqueradeAsGoogleBot():

// old signature
func MasqueradeAsGoogleBot() proxychain.RequestModification {
// new signature
// randIP is a parameter of type function. This function returns a string.
func MasqueradeAsGoogleBot(randIP func() string) proxychain.RequestModification {

This idea is called "dependency injection".

By changing the function to accept a function that provides a random IP, you've decoupled the IP generation logic from the proxychain configuration.

This allows for more flexible and testable code, as well as good separation of concerns, an important concept when working with other people: they don't need to know exactly how it's implemented in order to use it.
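To make the testability point concrete, here is a minimal, self-contained sketch. `RequestModification` below is a simplified stand-in (an assumption, not ladder's real `proxychain.RequestModification`), and writing the IP into an `X-Forwarded-For` header is likewise assumed for illustration. Only the idea of passing `randIP func() string` comes from the comment above.

```go
package main

import "fmt"

// RequestModification is a simplified stand-in for
// proxychain.RequestModification (assumption: the real type does not
// operate on a plain header map).
type RequestModification func(headers map[string]string)

// MasqueradeAsGoogleBot receives its IP source from the caller instead of
// building or refreshing one itself.
func MasqueradeAsGoogleBot(randIP func() string) RequestModification {
	return func(headers map[string]string) {
		// Assumed header name, used here only to make the effect visible.
		headers["X-Forwarded-For"] = randIP()
	}
}

func main() {
	// In a test we can inject a deterministic stub instead of a live IP pool.
	// 192.0.2.1 is a documentation-range placeholder, not a Googlebot IP.
	stub := func() string { return "192.0.2.1" }

	headers := map[string]string{}
	MasqueradeAsGoogleBot(stub)(headers)
	fmt.Println(headers["X-Forwarded-For"])
}
```

Because the modifier only sees a `func() string`, swapping the stub for a real pool lookup needs no change inside the modifier.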


But there are multiple MasqueradeAs*Bot functions, all with their own IP pool, and perhaps different implementations. However, we will use them all in a very similar manner. If we wanted to decouple the implementation from the downstream usage, we could use an interface:

type IPPool interface {
    // GetRandom gets a random IP
    GetRandom() string
    // Update fetches a fresh set of IPs
    Update() error
}

I'll show you the advantage of doing it this way in just a moment, hang on.


Let's implement that interface.

First, notice the structure of the JSON served at "https://developers.google.com/static/search/apis/ipranges/googlebot.json".

We have to deserialize this into some Go type anyway, so we'll have to create a struct. Why not just make this struct implement our IPPool interface?

Protip: use https://mholt.github.io/json-to-go/ for this.

type GooglebotIPPool struct {
    CreationTime string `json:"creationTime"`
    Prefixes     []struct {
        Ipv6Prefix string `json:"ipv6Prefix,omitempty"`
        Ipv4Prefix string `json:"ipv4Prefix,omitempty"`
    } `json:"prefixes"`
}

This is a good example of encapsulation. No need for an intermediary struct!
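As a rough illustration of that mapping, the sketch below decodes a small sample document into the struct. The sample JSON is reconstructed only from the struct tags above; its timestamp and prefixes are placeholders from documentation ranges, not real Googlebot data, and `decodePool` is a hypothetical helper name.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

type GooglebotIPPool struct {
	CreationTime string `json:"creationTime"`
	Prefixes     []struct {
		Ipv6Prefix string `json:"ipv6Prefix,omitempty"`
		Ipv4Prefix string `json:"ipv4Prefix,omitempty"`
	} `json:"prefixes"`
}

// sampleJSON mirrors the field names in the struct tags. All values are
// placeholders (assumption), chosen from documentation address ranges.
const sampleJSON = `{
  "creationTime": "2024-01-01T00:00:00.000000",
  "prefixes": [
    {"ipv6Prefix": "2001:db8::/64"},
    {"ipv4Prefix": "192.0.2.0/27"}
  ]
}`

// decodePool (hypothetical helper) decodes straight into the struct,
// with no intermediate map or type assertions.
func decodePool(raw string) (*GooglebotIPPool, error) {
	pool := &GooglebotIPPool{}
	err := json.NewDecoder(strings.NewReader(raw)).Decode(pool)
	return pool, err
}

func main() {
	pool, err := decodePool(sampleJSON)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for _, p := range pool.Prefixes {
		// Each entry carries either an IPv4 or an IPv6 prefix; the other is empty.
		fmt.Printf("v4=%q v6=%q\n", p.Ipv4Prefix, p.Ipv6Prefix)
	}
}
```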


Now we have our struct, we can implement the interface just by defining public methods on it.

func (g *GooglebotIPPool) Update() error {
func (g *GooglebotIPPool) GetRandom() string {

Now to implement Update():

const googlebotAPI = "https://developers.google.com/static/search/apis/ipranges/googlebot.json"

func (g *GooglebotIPPool) Update() error {
    client := &http.Client{Timeout: 10 * time.Second}

    resp, err := client.Get(googlebotAPI)
    if err != nil {
        return err
    }
    // Always close the body so the underlying connection can be reused.
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("GoogleBot IP Pool Update Error: Got %s from %s", resp.Status, googlebotAPI)
    }

    return json.NewDecoder(resp.Body).Decode(g)
}

Notice how we deserialize the JSON straight into the struct? We don't need type assertions, or an intermediary struct. Also I've omitted the timestamp parsing; we don't need that.


And to implement GetRandom():

It could look something like this:

func (g *GooglebotIPPool) GetRandom() string {
    if len(g.Prefixes) == 0 {
        return googlebotIPFallback
    }

    idx := rand.Int() % len(g.Prefixes)
    randCIDR := g.Prefixes[idx]

    switch {

    case randCIDR.Ipv4Prefix != "":
        randIPv4, err := randomIPv4FromCIDR(randCIDR.Ipv4Prefix)
        if err != nil {
            return googlebotIPFallback
        }
        return randIPv4

    case randCIDR.Ipv6Prefix != "":
        randIPv6, err := randomIPv6FromCIDR(randCIDR.Ipv6Prefix)
        if err != nil {
            return googlebotIPFallback
        }
        return randIPv6

    default:
        return googlebotIPFallback
    }
}

You can probably just pull correct randomIPv4FromCIDR and randomIPv6FromCIDR implementations from StackOverflow or ChatGPT. IP addressing is very standard and has been solved many times before.

Don't worry about not being "smart" enough to understand the bit shuffling here; it's not important now. It's a level of abstraction that's not necessary to solve this problem. So approach it like cryptography: don't roll your own. Use a library, or another implementation you find, rather than re-inventing the wheel. Save your brainpower for other things.

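func randomIPv4FromCIDR(cidr string) (string, error) {
    _, ipnet, err := net.ParseCIDR(cidr)
    if err != nil {
        return "", err
    }

    // Start from the network address, not the (possibly host) address
    // written in the CIDR string.
    base := ipnet.IP.To4()
    if base == nil {
        return "", fmt.Errorf("not an IPv4 CIDR: %s", cidr)
    }

    maskSize, _ := ipnet.Mask.Size()
    numAddresses := 1 << (32 - maskSize)

    // We need at least one address between the first and the last;
    // rand.Intn panics if its argument is 0.
    if numAddresses <= 2 {
        return "", fmt.Errorf("invalid CIDR block size")
    }

    // Generating a random address, excluding the first and last address
    randomOffset := rand.Intn(numAddresses-2) + 1

    start := binary.BigEndian.Uint32(base)
    randomIP := start + uint32(randomOffset)

    resultIP := make(net.IP, 4)
    binary.BigEndian.PutUint32(resultIP, randomIP)
    return resultIP.String(), nil
}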

func randomIPv6FromCIDR(cidr string) (string, error) {
    _, ipnet, err := net.ParseCIDR(cidr)
    if err != nil {
        return "", err
    }
    // Start from the network address so the offset stays inside the block.
    start := new(big.Int).SetBytes(ipnet.IP.To16())

    // Calculate the range of the network
    ones, bits := ipnet.Mask.Size()
    max := new(big.Int).Lsh(big.NewInt(1), uint(bits-ones))
    max.Sub(max, big.NewInt(2)) // keeps the offset below the all-ones address

    // Generate a random offset in [0, max) and add it to the network address
    randInt := new(big.Int).Rand(rand.New(rand.NewSource(rand.Int63())), max)
    randIP := new(big.Int).Add(start, randInt)

    // FillBytes left-pads with zeros; copying Bytes() would misalign
    // any address whose leading bytes are zero.
    resultIP := make(net.IP, 16)
    randIP.FillBytes(resultIP)
    return resultIP.String(), nil
}

Ok, now that we have implemented IPPool for GoogleBot IPs, let's circle back to the interface we made earlier. If we modify the MasqueradeAsGoogleBot() function to instead accept the IPPool interface, we can pass in an instance of GooglebotIPPool.

type IPPool interface {
    GetRandom() string
    Update() error
}

func MasqueradeAsGoogleBot(ipPool IPPool) proxychain.RequestModification {
    const botUA string = "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; http://www.google.com/bot.html) Chrome/79.0.3945.120 Safari/537.36"

    botIP := ipPool.GetRandom()

    const ja3 string = "769,49195-49199-49196-49200-52393-52392-52244-52243-49161-49171-49162-49172-156-157-47-53-10,65281-0-23-35-13-5-18-16-11-10-21,29-23-24,0"
    return masqueradeAsTrustedBot(botUA, botIP, ja3)
}

Now, here is the magic of doing it by interface:

func MasqueradeAsYahooBot(ipPool IPPool) proxychain.RequestModification {
    const botUA string = "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; YahooBot/2.1; http://www.yahoo.com/bot.html) Chrome/79.0.3945.120 Safari/537.36"

    botIP := ipPool.GetRandom()

    const ja3 string = "789,47185-43199-49196-49200-52393-52392-52244-52243-49161-49171-49162-49172-156-157-47-53-10,65281-0-23-35-13-5-18-16-11-10-21,29-23-24,0"
    return masqueradeAsTrustedBot(botUA, botIP, ja3)
}

You can pass in another implementation of IPPool for the other MasqueradeAs*Bot(s) without changing the way it's used. Now it does not matter whether you pull the valid IPs from a random number generator, a local file, a database or an HTTP API. The underlying consumer doesn't care.

All it cares about is that it gets a valid IP back.

This idea is called "interface segregation". By ensuring that any IPPool (regardless of source: Yahoo, Yandex, Google) adheres to a common way of interacting with it (aka an interface), we can make them modular and interchangeable. Another idea shown here is the open/closed principle. The system we've devised here is "open" for extension (you can make it cached, randomly generated, anything you'd like), but "closed" for modification (we don't need to change the downstream request modifiers later).
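A small sketch of that interchangeability: `StaticIPPool` is a hypothetical second implementation backed by a fixed list, and `pickBotIP` stands in for any consumer such as a MasqueradeAs*Bot modifier. Neither name exists in the PR; the IPs are documentation-range placeholders.

```go
package main

import (
	"fmt"
	"math/rand"
)

// IPPool is the interface from the comment above.
type IPPool interface {
	GetRandom() string
	Update() error
}

// StaticIPPool (hypothetical) serves IPs from a fixed in-memory list,
// for example in tests or when no upstream list is available.
type StaticIPPool struct {
	IPs []string
}

// GetRandom returns a random entry, or "" if the list is empty.
func (s *StaticIPPool) GetRandom() string {
	if len(s.IPs) == 0 {
		return ""
	}
	return s.IPs[rand.Intn(len(s.IPs))]
}

// Update is a no-op: a static list has nothing to refresh.
func (s *StaticIPPool) Update() error { return nil }

// pickBotIP stands in for a consumer; it depends only on the interface.
func pickBotIP(pool IPPool) string {
	return pool.GetRandom()
}

func main() {
	var pool IPPool = &StaticIPPool{IPs: []string{"192.0.2.10", "192.0.2.11"}}
	if err := pool.Update(); err != nil {
		fmt.Println("update failed:", err)
	}
	fmt.Println(pickBotIP(pool))
}
```

The consumer compiles unchanged whether it is handed this pool or one that fetches from an HTTP API.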


Now, incorporating it all together:


func NewProxySiteHandler(opts *ProxyOptions) fiber.Handler {

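    // create our IP pool outside of the handler below,
    // so that we only instantiate it once.
    // It must be a pointer: the IPPool methods have pointer receivers.
    ipPool := &GooglebotIPPool{}
    if err := ipPool.Update(); err != nil {
        // GetRandom() returns googlebotIPFallback while the pool is empty.
        log.Printf("could not update Googlebot IP pool: %v", err)
    }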

    return func(c *fiber.Ctx) error {
        proxychain := proxychain.
            NewProxyChain().
            SetFiberCtx(c).
            SetRequestModifications(
                rx.MasqueradeAsGoogleBot(ipPool), // <- "inject" our ipPool as a "dependency"
            ).
            AddResponseModifications(
                tx.ForwardResponseHeaders(),
            ).
            Execute()

        return proxychain
    }
}

Now, if you wanted to make the GooglebotIPPool.Update() method pull from a local cache, you could do that, without affecting other parts of the system.

I recommend using https://pkg.go.dev/os#UserCacheDir, to make your caching path cross-platform compatible, rather than hardcoding a directory.

Hope that helps you on your coding journey and thanks for the contribution!

dxbednarczyk commented 10 months ago

The Random IP pool for MasqueradeAsGoogleBot() ideally should only be updated once right?

The only thing that happens in that function is picking a random index from the list first instantiated in main.go. If --random-googlebot is passed as a flag, the list of valid IPs is retrieved and saved for the lifetime of that ladder instance. That's why I brought up caching them for a certain amount of time: the timestamp of when they were last updated is provided. I should probably also add that to the docker config.

func MasqueradeAsGoogleBot(randIP func() string) proxychain.RequestModification {

This seems unnecessary? I guess you can just inject it as MasqueradeAsGoogleBot(helpers.RandomGoogleBotIP()), but why wouldn't you just call it directly at that point? I think if there were a change in how a random IP is accessed, it wouldn't be through a global variable but through some config struct.

type IPPool interface {
  // GetRandom gets a random IP
  GetRandom() string
  // Update fetches a fresh set of IPs
  Update() error
}

Good idea. I'll probably hunt down some other public-facing IP lists for Yahoo, DDG, etc. bots

I'm omitting the rest of my responses in this comment at least, considering yours was written up well enough to be self-explanatory (and almost like a whole presentation in and of itself).

I promise I'm not dumb, I just made that commit at midnight 😆 Just wanted to make the draft so I knew where to pick up where I left off.

deoxykev commented 10 months ago

Hmm, thinking about it, it would be simpler to call it within the function, but it would also hide the details by having the IPPool being populated in a far away place (somewhere after CLI parsing).

However, the advantage of doing it that way would be that you could add and remove the MasqueradeAsGoogleBot() request modifier dynamically, without having to create and pass in an IPPool, which isn't serializable. (This might be important if we want to configure the response/request modifiers dynamically via API or via YAML config.)

I think to that end, we should nix the idea of dependency injection and just call it directly as you said.
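The serializability argument can be sketched as follows. With argument-free constructors, a list of modifier names (as it might appear in a YAML config) is enough to build the chain. Everything here is hypothetical: the `registry` map, `build`, the simplified `RequestModification` type, and the fixed-IP helper that stands in for a pool populated at startup.

```go
package main

import "fmt"

// RequestModification is a simplified stand-in for ladder's real type.
type RequestModification func(headers map[string]string)

// googlebotIPs stands in for a package-level pool filled once at startup.
// The value is a documentation-range placeholder.
var googlebotIPs = []string{"192.0.2.1"}

// randomGoogleBotIP is kept deterministic here for clarity;
// a real helper would pick a random entry.
func randomGoogleBotIP() string { return googlebotIPs[0] }

// MasqueradeAsGoogleBot takes no arguments and calls the helper directly.
func MasqueradeAsGoogleBot() RequestModification {
	return func(headers map[string]string) {
		headers["X-Forwarded-For"] = randomGoogleBotIP() // assumed header
	}
}

// registry maps config names to constructors. This only works because
// the constructors need no non-serializable arguments.
var registry = map[string]func() RequestModification{
	"MasqueradeAsGoogleBot": MasqueradeAsGoogleBot,
}

// build turns a list of names (e.g. parsed from YAML) into modifiers.
func build(names []string) ([]RequestModification, error) {
	mods := make([]RequestModification, 0, len(names))
	for _, name := range names {
		ctor, ok := registry[name]
		if !ok {
			return nil, fmt.Errorf("unknown modifier %q", name)
		}
		mods = append(mods, ctor())
	}
	return mods, nil
}

func main() {
	mods, err := build([]string{"MasqueradeAsGoogleBot"})
	if err != nil {
		fmt.Println(err)
		return
	}
	headers := map[string]string{}
	for _, mod := range mods {
		mod(headers)
	}
	fmt.Println(headers)
}
```

The cost, as noted above, is that the pool is populated somewhere far from where it is used.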

dxbednarczyk commented 10 months ago

I think I'm going to stick with just Google and Bing for now; we can add support for others down the line. The next thing is to find a way to cache the IPs, and use that pool if the timestamp didn't change. For now, this PR might be ready to merge after a final review.
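One way the timestamp check could look, as a sketch only: `pool` and `selectPool` are hypothetical names, and comparing the `creationTime` strings for equality is an assumption about how "didn't change" would be decided.

```go
package main

import "fmt"

// pool holds only the fields this sketch needs (hypothetical type).
type pool struct {
	CreationTime string
	Prefixes     []string
}

// selectPool keeps the cached pool when upstream reports the same
// creationTime. The boolean reports whether the cache should be rewritten.
func selectPool(cached, fetched *pool) (*pool, bool) {
	if cached != nil && cached.CreationTime == fetched.CreationTime {
		return cached, false
	}
	return fetched, true
}

func main() {
	// Placeholder timestamps and prefixes, not real Googlebot data.
	cached := &pool{CreationTime: "2024-01-01T00:00:00.000000", Prefixes: []string{"192.0.2.0/27"}}
	fetched := &pool{CreationTime: "2024-01-02T00:00:00.000000", Prefixes: []string{"192.0.2.32/27"}}

	chosen, rewrite := selectPool(cached, fetched)
	fmt.Println(chosen.Prefixes, "rewrite cache:", rewrite)
}
```

Note that comparing timestamps still requires fetching the upstream file; skipping the request entirely would need a time-to-live check on the cached copy instead.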