etix / mirrorbits

Mirrorbits is a geographical download redirector written in Go for distributing files efficiently across a set of mirrors.

http: avoid counting very close range downloads from a same source. #128

Closed · Jehan closed this 5 months ago

Jehan commented 2 years ago

The idea is that if the same machine downloads two ranges of a given file within a very short time span, it's probably 1 download (just done in pieces). For GIMP we particularly have such use cases because we also serve torrent files and use our mirrors as web seeds, so a single torrent client may end up requesting partial bits of files.

I'm using the Redis feature of temporary (expiring) keys and went with 10 minutes for the expiration. Each time a new piece is requested, the timer is reset. If 2 requests come with a longer silence between them, let's assume these are different downloads. The logic is worth what it is, but it's better than re-counting partial downloads made within seconds of each other.

Finally, I identify source clients through their (IP, user-agent) pair, hashing this ID for privacy and not storing unnecessary private data.
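
To give a rough idea of the logic, here is a simplified sketch (not the exact code of this PR; the key format and function names are made up for illustration) using the redigo client:

```go
package stats

import (
	"crypto/sha256"
	"encoding/hex"

	"github.com/gomodule/redigo/redis"
)

// downloaderKey builds an anonymized identifier for a (file, client) pair:
// the raw IP and user-agent are hashed so they are never stored as-is.
func downloaderKey(filePath, ip, userAgent string) string {
	h := sha256.Sum256([]byte(ip + userAgent))
	return "RANGE_DL_" + filePath + "_" + hex.EncodeToString(h[:])
}

// shouldCountRangeDownload reports whether a range request should be counted
// as a new download. The temporary key expires after ttlSeconds and its TTL
// is refreshed on every new piece, so the "same download" window slides.
func shouldCountRangeDownload(conn redis.Conn, filePath, ip, userAgent string, ttlSeconds int) (bool, error) {
	key := downloaderKey(filePath, ip, userAgent)
	seen, err := redis.Int(conn.Do("EXISTS", key))
	if err != nil {
		return false, err
	}
	// Create or refresh the temporary key so the timer restarts.
	if _, err := conn.Do("SETEX", key, ttlSeconds, 1); err != nil {
		return false, err
	}
	return seen == 0, nil
}
```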


Note: we started to use Mirrorbits for GIMP a few days ago: https://download.gimp.org/pub/gimp/v2.10/gimp-2.10.32.tar.bz2?mirrorstats

I was wondering if the stats were taking range downloads into consideration. Looking at the code, apparently they are not.

In GIMP, we also serve torrent files and we use our mirrors as web seeds, which means we may have torrent clients sending "GET" requests asking for a "Range". The same client may typically request several ranges from the HTTP server, within minutes if not seconds. Right now, each request counts as 1 download, which therefore over-estimates the download count.

The proposed logic is not perfect (perfect is likely not possible here anyway), but it gets much closer to the real download count.

lazka commented 2 years ago

Wouldn't it be easier (or good enough) to just count requests that have a range starting with 0? Though I don't know how easy it is to parse that header or if golang provides some helpers for that.
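
For what it's worth, a naive check might just be a prefix test on the header. A rough sketch (as far as I know, the standard library parses ranges internally, e.g. in http.ServeContent, but does not export a helper for it):

```go
package stats

import (
	"net/http"
	"strings"
)

// rangeStartsAtZero reports whether the request carries a Range header whose
// first byte range starts at offset 0 (e.g. "bytes=0-1023").
func rangeStartsAtZero(r *http.Request) bool {
	return strings.HasPrefix(r.Header.Get("Range"), "bytes=0-")
}
```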

Jehan commented 2 years ago

Wouldn't it be easier (or good enough) to just count requests that have a range starting with 0?

I don't really understand what you mean, or what you are proposing to solve with this specific change? :-)

In any case, no, adding a special case for ranges starting at 0 is not "easier", since it would only add an extra test (so it would just be more code; I would not call that more "difficult" per se, but it is obviously not easier either). Yet once again, I'm not sure what you are trying to achieve with this additional test.

Note: there are, by the way, improvements to be made by checking ranges, such as improving the transferred-data stats. But none of the changes I have in mind cares about whether or not the range starts at 0, which is just one particular case (and not such a special one).

Jehan commented 2 years ago

Ok, so trying to guess what you might have meant: you assume that we'd have exactly 1 GET call with a range starting at 0 per download, so we could just count each GET with a 0-starting range as 1 download (and ignore the rest). This is a false assumption, for a number of reasons:

  1. First, there are cases where a single download could request two 0-starting ranges. Software that requests download ranges usually handles download corruption (torrent clients do, for instance). So yes, it could happen. I agree this case is not necessarily the most common.
  2. But even more, a downloader could request zero 0-starting ranges! At least in our particular case of handling torrent downloads, a single torrent client will request separate ranges from separate seeders. It could request the range 0-10000 from a random seeder and the range 10000-20000 from the download server (acting as a web seed). With your proposed logic, this download would never be counted even though it's clearly a proper download. So we'd go from the situation where we count too many downloads (currently) to one where we ignore many downloads.

My proposed logic instead simply counts unique downloads, with "unique" defined as a single (file, downloader IP, downloader user-agent) triplet within a close time range. I don't think we can get much more accurate than this (we probably could, but the added complexity is likely not worth it), because there are too many possible cases and we can't know what is really happening (i.e. why a specific client is requesting a range).

lazka commented 2 years ago

I see, never mind then. I wasn't aware of how web seeds work.

(another thing I could think of is counting as float with count+=range/total)

ott commented 2 years ago

The changes seem logically correct to me: Range requests are only counted once within a time frame. However, the efficiency of the Redis operations could be doubled and some minor stylistic changes would be advisable. Otherwise, this looks good to me.

jbkempf commented 2 years ago

LGTM

ott commented 2 years ago

@Jehan Another idea would be to count the requested amount of data, to sum it up and finally to divide the sum by the file size. I can explain it in more detail but as far as I know this method is used to estimate the number of podcast downloads.

Jehan commented 2 years ago

@ott

However, the efficiency of the Redis operations could be doubled

I assume you mean that we can do the GET and the SET at the same time. So I looked at the redis commands and saw the GETSET command, which has been deprecated in favor of SET with the GET option since redis 6.2.0. But since the current minimum requirement is redis 3.2.0, I thought it would be a bad idea to bump this requirement so high.

Using the deprecated GETSET (which will probably stay around for quite some time) is also not a great idea because it doesn't support the expiration options. So instead I added a commit (eed732034f548fa70b880a1a164178d88eec4ce9) to check the redis server version at connection time and store this information (requesting it on each request would make no sense if the goal is fewer redis requests!). Then I use this info (through a new h.redis.IsAtLeastVersion() function) before getting the temporary key. This way, it works in a single request on newer redis servers while still working on older redis servers, though with 2 requests.
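
Roughly, the resulting logic looks like this (a simplified sketch with illustrative names, not the exact code of the commit):

```go
package stats

import "github.com/gomodule/redigo/redis"

// seenRecently refreshes the temporary key and reports whether it already
// existed. On redis >= 6.2 this takes one round trip (SET ... EX ... GET);
// older servers fall back to EXISTS followed by SETEX.
func seenRecently(conn redis.Conn, serverIsAtLeast62 bool, key string, ttlSeconds int) (bool, error) {
	if serverIsAtLeast62 {
		// The GET option returns the previous value, or nil if the key was absent.
		prev, err := conn.Do("SET", key, 1, "EX", ttlSeconds, "GET")
		if err != nil {
			return false, err
		}
		return prev != nil, nil
	}
	seen, err := redis.Int(conn.Do("EXISTS", key))
	if err != nil {
		return false, err
	}
	if _, err := conn.Do("SETEX", key, ttlSeconds, 1); err != nil {
		return false, err
	}
	return seen == 1, nil
}
```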

some minor stylistic changes would be advisable

You didn't say which ones exactly, even though I took care to reuse the same style as I was seeing in nearby code. I only found one place where I had a space between a function name and its opening parenthesis. I removed it and amended it directly into the first commit. If there is anything else, I'd welcome the details, because I believe I follow the existing code style.

Another idea would be to count the requested amount of data, to sum it up and finally to divide the sum by the file size. I can explain it in more detail but as far as I know this method is used to estimate the number of podcast downloads.

This won't do, for the same reasons I gave earlier to @lazka. In a situation with full downloads only, it makes sense. But with partial downloads, the total amount of data divided by the file size doesn't reflect the number of downloads at all.

A torrent downloader could very well download only part of the file from our server while downloading the rest from other seeding machines. With your idea, if 2 completely different users (different IPs, different machines, even days apart!) each downloaded half of the file, we'd count it as a single download, which obviously makes absolutely no sense here.

Now I agree that a further improvement would be to sum up the exact requested amount of data (this is what I meant in an earlier comment when saying some improvements can be made) and I will propose to look at this in a later MR (we have already diverged a bit in this MR, where I also added some logic for more efficient redis version checks). In any case, this data must absolutely not be used to count the number of downloads, which is a completely different thing. 2 partial downloads by 2 people are still 2 downloads. We only try to gather the small partial downloads from a single person into a single bigger partial download, not to gather partial downloads from completely different sources (which would be the consequence of basing download counts on the size of transferred data).

Jehan commented 2 years ago

For the record, the reason why I propose this patch is that, in GIMP's case for instance, mirrorbits currently shows more than 26 million downloads just for our current stable version of the Windows installer, in less than a month of usage (we started using mirrorbits on August 29).

Though I'd love to pretend that's true, and we definitely know for a fact that GIMP is massively downloaded anyway, we also know that these numbers are over-estimated because of the torrents' partial GET requests.

So unfortunately, mirrorbits stats are currently not really usable for the binary files (Windows installer and macOS DMG), which are also officially provided with a torrent file. I'd love to get a new release of mirrorbits where we'd have more usable stats. :-D

Jehan commented 1 year ago

Any chance of seeing this merged? I'd love for us to have reliable download stats! :-D

Jehan commented 11 months ago

@jbkempf Rebased and force-pushed. :-)

garfvl commented 11 months ago

Hello @Jehan, I am Simon, part of the VLC mirrors management team.

This looks like a very pragmatic way of detecting partial downloads and avoiding skewed stats! I like the idea.

I have a few questions though:

The hardcoded 600 seconds seems very much related to your custom needs. Do you think it would be possible to add this variable to the mirrorbits configuration? (even with 600 as a default value)

I may be completely wrong on this one, but adding a sha256 call on pretty much every request received seems to add some non-negligible CPU usage. Did you get the chance to perform some benchmark on it? I know there was some debate in the golang community regarding the performance of the standard implementations of these hash functions in the past, but I haven't had a chance to check the real impact recently...

jbkempf commented 11 months ago

@jbkempf Rebased and force-pushed. :-)

And reviewed by @garfvl

Jehan commented 11 months ago

The hardcoded 600 seconds seems very much related to your custom needs. Do you think it would be possible to add this variable to the mirrorbits configuration? (even with 600 as a default value)

To be fair, I don't even think it's about our custom needs. I wondered:

What could be a reasonable time between which we'd consider that if a client sends 2 range requests, we may think "it's probably the same download"?

I came up with 10 minutes, but I could just as well have come up with 2 minutes or 30 minutes. In the end, all of this is just an approximate heuristic, because we can't know the real answer to what is in fact the real question: "are these 2 range requests part of a single download, or 2 separate downloads?"

In reality we could theoretically also have 2 clients with the same IP and the same user-agent requesting the same file twice in a row a few seconds apart, just as a single client could make a new range request after an hour (someone behind a very slow connection, or maybe a torrent client "fixing" a broken file afterwards). So it was really just a duration that felt reasonable to me for catching the most common use case.

Anyway sure, that could definitely be a configuration option, I agree. We could imagine that some people might even want to set the value to 0, meaning that every range download is still counted as a full download on its own (as it is now). I believe that's cheating the numbers up a bit, but whatever…
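
For illustration, such an entry in mirrorbits.conf could look like this (the option name below is a placeholder, not necessarily what I'll end up implementing):

```yaml
## Number of seconds during which range requests for the same file coming
## from the same client (hashed IP + user-agent) are merged into a single
## download in the stats. Set to 0 to disable the merging and count every
## range request as a separate download.
#SameDownloadMergeWindow: 600
```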

I'll have a look at this, but probably not this week. I'm preparing a stable GIMP release so this will wait a bit.

I may be completely wrong on this one, but adding a sha256 call on pretty much every request received seems to add some non-negligible CPU usage.

First, it's not every request, but every range request. I think most requests are still non-range requests anyway. But once in a while, you can have range requests. Not just the torrent case, but nowadays I believe that most browsers have the ability to continue a download which was stopped in the middle.

This being said, you may be right that it's still more CPU usage than we want (even though the data to hash is quite small). To this question:

Did you get the chance to perform some benchmark on it?

No, I have not done any benchmark. I may try later. Our most downloaded file is likely whatever our latest Windows installer is at a point t in time, which seems to get about 500k requests a day (according to the mirrorbits stats command). Of course not all of these are range requests, so we would not have 500k SHA256 checksums computed a day, but it would still be interesting to verify how heavy this would be.

Note that we could also use MD5 for instance, which would be much faster and less CPU intensive. I went with the assumption that we are hashing private data (IP and user-agent), so the less broken the algorithm, the better. On the other hand, this is very short-lived data, not the most sensitive data of all (this same data is actually usually stored in web server logs in plain text), and the attack vector is probably very limited here. So an older and faster algorithm like MD5 might just be enough when weighing the benefits and costs.

Anyway I'll look into it too.
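
For whoever wants to measure it, a minimal comparison with Go's testing package could look like this (illustrative only, placed in any _test.go file; I have not run it yet):

```go
package stats

import (
	"crypto/md5"
	"crypto/sha256"
	"testing"
)

// A typical identifier to hash: an IP concatenated with a user-agent string.
var sample = []byte("2001:db8::1234:5678" +
	"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")

func BenchmarkSHA256(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = sha256.Sum256(sample)
	}
}

func BenchmarkMD5(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = md5.Sum(sample)
	}
}
```

Running it with `go test -bench=.` should give a rough per-call cost for both algorithms.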

I know there was some debate in the golang community regarding the performance of the standard implementations of these hash functions in the past, but I haven't had a chance to check the real impact recently...

I don't really know what went on in the golang community. This MR might be the first and only time I have ever looked at Go code (or maybe not, but I don't really remember). 😅

But from a wider perspective, regardless of the programming language used, it's indeed a good comment. The cost of a cryptographic algorithm can be far from negligible when it's run a lot. You are right, it's better to check; I should have tested this.

garfvl commented 11 months ago

In reality we could theoretically also have 2 clients with the same IP and the same user-agent requesting the same file twice in a row a few seconds apart, just as a single client could make a new range request after an hour (someone behind a very slow connection, or maybe a torrent client "fixing" a broken file afterwards). So it was really just a duration that felt reasonable to me for catching the most common use case.

Anyway sure, that could definitely be a configuration option, I agree. We could imagine that some people might even want to set the value to 0, meaning that every range download is still counted as a full download on its own (as it is now). I believe that's cheating the numbers up a bit, but whatever…

The delay is clearly related to how the user is supposed to download the package, indeed.

Thinking again about the problem, I really hope this will not "cheat down" the numbers if a lot of users download files from behind a NAT - like pretty much any university, company, etc. I still have some trouble determining whether the user-agent is distinctive enough to discriminate range requests coming from the same IP.

I'll have a look at this, but probably not this week. I'm preparing a stable GIMP release so this will wait a bit.

I may be completely wrong on this one, but adding a sha256 call on pretty much every request received seems to add some non-negligible CPU usage.

First, it's not every request, but every range request. I think most requests are still non-range requests anyway. But once in a while, you can have range requests. Not just the torrent case, but nowadays I believe that most browsers have the ability to continue a download which was stopped in the middle.

My bad for the confusion, yes, I was talking about range requests. And correct, pausing/resuming downloads is common now.

Note that we could also use MD5 for instance, which would be much faster and less CPU intensive. I went with the assumption that we are hashing private data (IP and user-agent), so the less broken the algorithm, the better. On the other hand, this is very short-lived data, not the most sensitive data of all (this same data is actually usually stored in web server logs in plain text), and the attack vector is probably very limited here. So an older and faster algorithm like MD5 might just be enough when weighing the benefits and costs.

I reached the same conclusion: MD5 or SHA1 collisions are probably not a big issue for this particular use, and those functions are well optimized nowadays. Of course, if the CPU usage difference is negligible, there is no reason to use anything other than a more "modern" hash function like SHA256 - which is why I asked the question.

But from a wider perspective, regardless of the programming language used, it's indeed a good comment. The cost of a cryptographic algorithm can be far from negligible when it's run a lot. You are right, it's better to check; I should have tested this.

Well, I also asked because in other projects I have experienced similar issues in terms of scaling... I can probably try to check on my side as well, this should be fast (to check).

garfvl commented 11 months ago

Ok, I think you can forget about my concerns about the hash functions. I just did a quick check on my laptop (Intel i7-7700HQ): with something very similar to your code (hashing a concatenation of an IPv6 address and a user-agent), I ended up with something like:

Let's keep the sha256 then :D

Jehan commented 11 months ago

Thinking again about the problem, I really hope this will not "cheat down" the numbers if a lot of users download files from behind a NAT - like pretty much any university, company, etc.

If you are thinking about things like a company/univ massively installing software on all their computers/servers, I sure hope that they don't re-download the program every time. I.e. they download it once and make it available on an internal server, which makes the download part a lot faster/more reliable. As far as I know, that's often what univs/companies really do, with an internal server containing all the "allowed software".

And even if they did massively download the software from the source server, they would likely just use a simple wget/curl through a script, which would not do a range request anyway (these tools are likely able to do range requests, but that's not how you typically set them up by default for a script). I believe that range requests are mostly happening either in specific peer-to-peer protocols such as torrent, or within interactive download interactions. So even with this use case in mind, the numbers would not get caught by this part of the code.

I'm not saying that the counting error cannot happen in the "cheating down" direction, but the numbers are so wrong for us right now on the "cheating up" side (to the point of being useless), and while proposing this patch I thought a lot about how the heuristic could break the other way around (I did imagine the "massively installing software in big organizations" use case too, and came to the conclusion that it won't be a problem). In any case, I'm pretty sure we are all aware that these stats only give ballpark figures, and that's it. Most FLOSS is in fact already downloaded through dozens of other channels (Linux distributions, various software-listing websites, "stores"…) and we know that 1 download doesn't really equate to 1 user (some people re-download the software for random reasons; and the use case you mention with big organizations is already a typical download-once-install-many situation, with or without this code change), and so on. If people need an exact count of users, they have other means (most of which we don't agree with and won't use, because they often imply tracking users or going for proprietary licenses).

So yeah, bottom line, I believe that this added heuristic will largely fix the "cheating up" part but won't really make MirrorBits "cheat down" in any meaningful way.

Let's keep the sha256 then :D

Ok. So I add the delay as a setting in the MirrorBits configuration and we are good then?

I'll try to do this within the next 2 weeks. Really looking forward to the day when we'll finally have meaningful stats! :-)

garfvl commented 11 months ago

Thinking again about the problem, I really hope this will not "cheat down" the numbers if a lot of users download files from behind a NAT - like pretty much any university, company, etc.

If you are thinking about things like a company/univ massively installing software on all their computers/servers, I sure hope that they don't re-download the program every time. I.e. they download it once and make it available on an internal server, which makes the download part a lot faster/more reliable. As far as I know, that's often what univs/companies really do, with an internal server containing all the "allowed software".

Very true for companies.

And even if they did massively download the software from the source server, they would likely just use a simple wget/curl through a script, which would not do a range request anyway (these tools are likely able to do range requests, but that's not how you typically set them up by default for a script). I believe that range requests are mostly happening either in specific peer-to-peer protocols such as torrent, or within interactive download interactions. So even with this use case in mind, the numbers would not get caught by this part of the code.

I'm not saying that the counting error cannot happen in the "cheating down" direction, but the numbers are so wrong for us right now on the "cheating up" side (to the point of being useless), and while proposing this patch I thought a lot about how the heuristic could break the other way around (I did imagine the "massively installing software in big organizations" use case too, and came to the conclusion that it won't be a problem). In any case, I'm pretty sure we are all aware that these stats only give ballpark figures, and that's it. Most FLOSS is in fact already downloaded through dozens of other channels (Linux distributions, various software-listing websites, "stores"…) and we know that 1 download doesn't really equate to 1 user (some people re-download the software for random reasons; and the use case you mention with big organizations is already a typical download-once-install-many situation, with or without this code change), and so on. If people need an exact count of users, they have other means (most of which we don't agree with and won't use, because they often imply tracking users or going for proprietary licenses).

So yeah, bottom line, I believe that this added heuristic will largely fix the "cheating up" part but won't really make MirrorBits "cheat down" in any meaningful way.

Agreed.

Let's keep the sha256 then :D

Ok. So I add the delay as a setting in the MirrorBits configuration and we are good then?

I'll try to do this within the next 2 weeks. Really looking forward to the day when we'll finally have meaningful stats! :-)

Yes! Absolutely.

jbkempf commented 11 months ago

So, we’re waiting for a configuration patch. :)

elboulangero commented 10 months ago

@Jehan bouncing back on what you said:

First, it's not every request, but every range request. I think most requests are still non-range requests anyway. But once in a while, you can have range requests. Not just the torrent case, but nowadays I believe that most browsers have the ability to continue a download which was stopped in the middle.

I would add download accelerators. People use them to speed up big downloads, especially in countries where internet connections are slow or unreliable.

I think most requests are still non-range requests anyway. But once in a while, you can have range requests.

From what we see for the Kali Linux images (http://cdimage.kali.org/), looking at the last 3 months of logs, around 66% of the requests are range requests, so 2/3. The majority of the traffic is made up of range requests.

I don't know if you can measure that easily on your side. My guess is that this ratio depends on how big the files are. A Kali image is around 4 GB, so users are more likely to use download accelerators.

I also wonder a bit how this change might affect performance, but it's easy to patch it out if needed. We'll see.

Jehan commented 10 months ago

I don't know if you can measure that easily on your side.

Unfortunately right now, we can't, because we don't have access to the logs. We have access to a stats tool called "Splunk" but it doesn't contain the "Range" information. The GNOME admins said they could eventually have a look at configuring it in, but it's very low on their priority list. Anyway, in our case we are pretty sure we have a lot of range downloads (at the very least because we use our download servers as web seeds for torrents, though maybe it could be the case even without this? I don't know).

I also wonder a bit how this change might affect performance, but it's easy to patch it out if needed. We'll see.

With the setting I'm going to add (when I have time), you will be able to disable this by setting the check delay to 0. No need to patch the change out.

Though it would still be interesting to know if anyone has performance issues. Moving to MD5 is still a possibility. @garfvl made tests earlier which established that SHA256 was fast enough, but MD5 is still more than 25% faster (which can make a difference when we are talking about hundreds of thousands of calls).

Jehan commented 9 months ago

Sorry for the delay. I'll try to work on adding the missing settings soon(ish). :-)

Jehan commented 5 months ago

Ok, sorry again for the delay, I had quite a lot of other stuff to handle! Anyway, I implemented the new setting (cf. the last commit). I hope it's fine. I would appreciate a review, and hopefully a merge.

Our admins from the GNOME Foundation infrastructure said they'd add this change as soon as it's validated upstream. :-)

Thanks!

jbkempf commented 5 months ago

And merged. If needed, extra work can be done before release.

This needs documentation updates, btw.

Jehan commented 5 months ago

Thanks!

As for docs, I can have a look, but where should I look? If I grep for various other settings, I see them nowhere except in mirrorbits.conf, documented through a comment and showing the default value. So that's why I just did the same in my last commit (I added the new option to this conf file, with a comment explaining what it is used for and showing the default value too).

garfvl commented 5 months ago

@Jehan,

the only "canonical" ways of documenting on this project are the mirrorbits.conf and the Wiki, indeed.

The description in the configuration file looks very nice; the only missing part would be to describe how to disable it (by setting it to zero, if I understand your code correctly).

elboulangero commented 5 months ago

Running in production at http://cdimage.kali.org !

Sharing two screenshots, one from the 9th of April (without this change), and one from the 10th, with the change. Same time of day (I believe that mirrorbits shows stats for the current UTC day? I'm not even 100% sure).

Anyway, we clearly see the decrease in the download stats. The patch is working!

I also didn't notice any increase in CPU usage after the change. Thanks @Jehan !


Before change:

kali-images-2024-04-09

After change:

kali-images-2024-04-10

Jehan commented 5 months ago

Thanks @elboulangero. Unlike you, I was a bit disappointed once the GNOME admins updated MirrorBits for GIMP downloads. Yesterday we had a full day with the new code, but the latest GIMP installer still had 2.6 million downloads according to MirrorBits. And today we have already passed 2 million downloads even though a quarter of the day is still left (server time is UTC).

Compared to previous days, it didn't go down much.

I looked further into the code and realized that MirrorBits counts any request as a download, even HEAD requests (which are probably common, since a HEAD is a good way to check the size of a file without starting the download?). So I opened #171. Hopefully this is another reason behind the over-estimated download stats we have.
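
For illustration, the kind of guard #171 asks about could be as simple as checking the method before counting (a hypothetical sketch, not the actual patch):

```go
package stats

import "net/http"

// isCountableRequest is a hypothetical filter: only count requests that can
// actually transfer file content, not HEAD requests used to probe file size.
func isCountableRequest(r *http.Request) bool {
	return r.Method == http.MethodGet
}
```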