MTG / freesound

The Freesound website
https://freesound.org
GNU Affero General Public License v3.0
311 stars 87 forks source link

Store downloads when a user downloads a sound or pack more than once #840

Closed alastair closed 6 years ago

alastair commented 7 years ago

We currently have a unique index on (user, pack_id) or (user, sound_id) in the downloads table. This means that we only store one record of a user downloading something. We've decided that we want to store records of all downloads. To do this we need to:

We may also want to consider how we show the sound attribution page when a user downloads a sound twice, but the uploader has changed the license in the meantime. We might want to store the license in the Downloads table (perhaps uses too much data), or in a history table so that we can find the correct license that the user downloaded the sound under

bdejong commented 7 years ago

Perhaps it would make sense to break the pack downloads into individual sound downloads (I.e. a single pack download creates multiple sound download rows) and have a bool "as pack" in the list. Currently we don't keep track of the changes in packs. If a sound changes from one pack to another this could create weirdness if you'd analyse the pack downloads. That said this probably need some analysis to see how badly the table would explode in size.

On Apr 4, 2017 9:52 PM, "Alastair Porter" notifications@github.com wrote:

We currently have a unique index on (user, pack_id) or (user, sound_id) in the downloads table. This means that we only store one record of a user downloading something. We've decided that we want to store records of all downloads. To do this we need to:

  • Remove the unique indexes on the download table
  • Create a new Download record each time a user downloads a sound or pack
  • Check that all uses of the Download class treat these duplicates in a reasonable way
    • Users who downloaded a pack or sound
    • Attribution list
    • Sounds downloaded by user

We may also want to consider how we show the sound attribution page when a user downloads a sound twice, but the uploader has changed the license in the meantime. We might want to store the license in the Downloads table (perhaps uses too much data), or in a history table so that we can find the correct license that the user downloaded the sound under

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/MTG/freesound/issues/840, or mute the thread https://github.com/notifications/unsubscribe-auth/AAxJlUBmxNkGr57JCTCCmhQFxSy2wG_Eks5rsp9igaJpZM4MzWDd .

ffont commented 7 years ago

Currently licenses can not be changed, however I think it is good to start thinking about it and I think its a good idea to add a "license" field in the download table. It's just a foreign key to the licenses table, don't think is a problem in terms of data. Another option is that if we allow license change at some point and store history of licenses for a sound, then we can guess by timestamps the license for a particular download. But still storing the license with the download entry will make future queries faster and easier (I guess).

Regarding pack downloads I agree with Bram, that could be a good solution. I'm also concerned about the size though. We have an open issue to discuss options for moving downloads outside of database (https://github.com/MTG/freesound/issues/585). I guess this is all related. The average number of sounds per pack is 15 (just computed it now), and pack downloads represent 2.5% of total number of downloads. Therefore, my estimation is that the size of the table would be increased by 35%.

This number is computed as: (num_total_pack_downloads*15 + num_total_sound_downloads) / num_total_downloads

ffont commented 7 years ago

For this issue we should simply remove the constraint which prevents us from storing repeated downloads of the same object-user pair.

Out of the discussion happening in this issue, we opened others:

ffont commented 6 years ago

This issue was closed because we implemented it in https://github.com/MTG/freesound/commit/6605289330a7b0b4d77a05800c94d5fb3014c70d and https://github.com/MTG/freesound/commit/91c1d206d1ec5b5b633c180eef1f18e54a393910. After that change, we used Download.objects.create(...) instead of Download.objects.get_or_create(...) both for sound and pack downloads. However this caused some problems and multiple download objects were created each time a user downloaded a sound. We hypothesise this could be because of download managers or range HTTP requests. We deployed a fix which reverts the functionality and only stores one download object per user-sound pair. This is the current code:

https://github.com/MTG/freesound/blob/4073e9430b93f201c7767eda4c24f659fd3eff3c/sounds/views.py#L309-L332

To close this issue we should investigate the problem of multiple download objects being stored for a single download and provide a fix for it. One way we thought we could fix that is to only store new download object if the latest one that exist is older than X time. This could be easily implemented by using if not Download.objects.filter(user=request.user, pack=pack, created__lt=<time threshold>).exists(): instead of if not Download.objects.filter(user=request.user, pack=pack).exists(): in here (and the same for the sounds case), but we should evaluate the impact of this in terms of performance.

andrebola commented 6 years ago

Maybe we could use an asynchronous task for checking if the download has been registered or not and create it.

ffont commented 6 years ago

The thing is we don't know the impact, so we should evaluate it and then make a decision. In any case first thing is to learn why the issue is happening, this might give us hints about different ways to fix it.

ffont commented 6 years ago

We've collected almost 100k download log messages in ~20 hours (see https://logserver.mtg.upf.edu/graylog/search?rangetype=absolute&fields=message%2Csource&width=1280&highlightMessage=&from=2017-11-30T15%3A37%3A24.000Z&to=2017-12-01T11%3A22%3A15.000Z&q=%22Download+sound%22). I stopped the logging now. With these messages we should be able to diagnose the problem and see if all occurrences of repeated sound_id/user_id messages are because of range HTTP requests. If that is the case, then we can implement the fix and only store in DB when range is 0 or there is no range (or something like that). Then we can deploy, log again and make sure the repeated download records problem disappeared.

andrebola commented 6 years ago

After looking this data I think there are two different situations in which we get multiple downloads but he don't want to store all of them. One of this is when the header range is present, which means that the download is resumed by the browser (or downloads manager). The other is when the user (accidentally) click on the download button 2 times. And this two cases could be combined (e.g filter by user_id:7169723)

As a solution I think we should ignore the downloads with the range header, but also I think we could keep a list of the last users that downloaded a sound and the session_id, so we don't store "accidental" downloads. In this way we would store downloads of users from different sessions.

ffont commented 6 years ago

I guess we could also disable the download button for some seconds to avoid the double click? Nevertheless, the data when filtering by user_id:7169723 is a bit weird because user seems to be always downloading the same sound (sound id 320117) and always two times in a row.

EDIT: this I understand now that are different request setting the range parameter. Still downloads appear to be doubled and I don't think everyone is doing double clicks (see comment below).

ffont commented 6 years ago

In fact looking closer at the data I see the log ALWAYS happens two times: screen shot 2017-12-04 at 10 17 25

Even with the range parameter, they are duplicated with the same range. This sounds like there is a problem with the logger or something we don't understand yet, does not look like double click issues...

alastair commented 6 years ago

We're not logging the http method - I wonder if the first one is OPTIONS, or something. Does this get passed to django too? Perhaps we could add this to the logging (this is why I wanted to add all of request) and check this as well.

The other possibility is that the double logs are happening because the logging system emits each log event twice, and the browser doesn't actually send 2 requests.

andrebola commented 6 years ago

We are logging the http method, they are all https.

I'm not sure why we see messages twice but note that they have always the same message id. So It's not creating multiple logs for the same event.

This is an example of two "repeated" messages: https://logserver.mtg.upf.edu/graylog/messages/downloads_0/b8c68541-d682-11e7-bd62-0242ac140004 https://logserver.mtg.upf.edu/graylog/messages/graylog_169/b8c68541-d682-11e7-bd62-0242ac140004 the only difference seems to be the index

ffont commented 6 years ago

Cool so I think now it is all quite clear:

I think what we should do is:

andrebola commented 6 years ago

Probably I did something wrong with the index. What I did was to create the index on System -> Indices. Then I created the stream and I set the Index to be the one I created.

ffont commented 6 years ago

Maybe there is an option to delete from original or something. Anyway, if this is too complex, we can just send to normal index.

andrebola commented 6 years ago

There is an option which was disabled : "Remove matches from 'All messages' stream" But if I change this it will not have effect on the old messages. So we should try again.

ffont commented 6 years ago

Check this box and for the next time we log it will be fine.

andrebola commented 6 years ago

I think that if range=0 it also has a previous log without range, see the case of user_id:8056289 AND sound_id:52293

ffont commented 6 years ago

I think that if range=0 it also has a previous log without range, see the case of user_id:8056289 AND sound_id:52293

That seems to be right, not sure if it depends on the client though.

alastair commented 6 years ago

We are logging the http method, they are all https.

By method I mean GET/HEAD/OPTIONS/POST, etc.

I think that if range=0 it also has a previous log without range,

https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests suggests that the first request may be a HEAD. We should add logging for the http method to confirm this, however I agree with @ffont that we're not sure if it's standardised that browsers first do a HEAD to get the size and then send a GET, or if they can immediately do a GET with a range header.

andrebola commented 6 years ago

After looking the new logs, it's clear that we need to store the downloads when the range header is not present. Sometimes there are still multiple entries for the same user and sound, probably because the user downloaded the sound many times in a few seconds and we want to prevent of creating multiple Downloads entries in this case as well. For doing this we should store the Download items in the last hour or so, then check if it's there before creating the entries in the database.

For storing temporal data we thought about using redis or memcache, if we dicide to use redis then we should switch also for the django cache.

alastair commented 6 years ago

If we use redis we need to work out how to make it work with multiple appservers - can we configure two redis servers that keep each other up-to-date, or is it better to have a single "redis server" that everything connects to?

ffont commented 6 years ago

I'd just use memcache. The worst thing can happen if memcache fails is that we store some duplicated download entries in case of double-clicks and things like that.