easlice / bandcamp-downloader

Download your bandcamp collection using this python script.
MIT License

Pulling links from the hidden items section #5

Closed 8N7D2o closed 4 months ago

8N7D2o commented 1 year ago

When I try to download my collection, it also pulls some links from the hidden items section.

My collection_count is only 20, so it should only download 20 items, right?

[screenshot: collection]

But it's downloading 40 links for some reason.

[screenshot: downloader]

I know that hiding items somewhat works because before hiding the downloader was pulling 170 items.

[screenshot: hidden]

easlice commented 1 year ago

Thanks for the report. This is kind of odd.

I know it has been a while since you reported this, but did you also see 40 albums after it ran? If so, were the extra ones from your hidden items?

Would you be willing to share logs (with personally identifying information removed, of course) from a run with very verbose logging (-vvv)?

8N7D2o commented 1 year ago
[screenshot: pulling_from_hidden]

The downloads with individual files (the ones that end in .m4a) are pulled from the hidden section. The others, ending in .zip, are the ones that are not hidden. I hid all the individual songs in my account (150 in total) and left the purchased albums visible (20 in total).

easlice commented 1 year ago

So, this script works by scraping the JSON data used by the website. I wonder if it still ends up pulling all entries for an artist when that artist's entries are a mix of hidden and not hidden? Just wild speculation.

It looks like 40 is the correct number, so it is reporting that correctly; it's just that some of those items should be hidden and some should not be (and some hidden files are not showing up at all). Maybe I can try hiding some files on my own account and see if there is something in the web data we can use to determine whether the files that show up are 'hidden' or not.

I need to research some more on this.

In the meantime, there have been a few changes to the script and how it identifies some things. I don't necessarily expect it to fix this, but who knows, it might.
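For anyone following along, the scraping described above boils down to something like the sketch below. The `#pagedata` div and its `data-blob` attribute are assumptions about how the collection page is currently structured, not the script's exact code:

```python
# Minimal sketch of scraping the JSON blob embedded in the collection page.
# Assumes the page still exposes it via <div id="pagedata" data-blob="...">.
import json

import requests
from bs4 import BeautifulSoup

def fetch_collection_blob(username: str, cookies: dict) -> dict:
    resp = requests.get(f'https://bandcamp.com/{username}', cookies=cookies)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, 'html.parser')
    pagedata = soup.find('div', id='pagedata')
    # The attribute holds HTML-escaped JSON; BeautifulSoup unescapes it for us.
    return json.loads(pagedata['data-blob'])
```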

cubicvoid commented 7 months ago

I came here to report a similar error, and I have more information on this issue (and a potential fix):

On my account I have 724 public items and a couple dozen hidden items. The downloader correctly recognized and downloaded 724 items; so far so good, though if I'd thought about it I would have preferred to include hidden items in my archive.

What I discovered later is that even though the number matched, the downloaded albums included some hidden items, and some of my non-hidden items were missing.

I checked the raw contents of the data blob bandcamp-downloader is using, and it seems like this is a "bug" in bandcamp itself. I put bug in quotes because bandcamp apparently never actually uses that blob, so the fact that some items are missing never matters, except to hapless python scripters trying to archive their bandcamp libraries using something that looked like complete library metadata.

The workaround is inconvenient but seems effective: use the API endpoint instead of the public URL. Here is how I fetched my whole library:

```sh
curl 'https://bandcamp.com/api/fancollection/1/collection_items' --compressed -X POST \
    -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/119.0' \
    -H 'Accept: */*' \
    -H 'Accept-Language: en-US,en;q=0.5' \
    -H 'Accept-Encoding: gzip, deflate, br' \
    -H 'Referer: https://bandcamp.com/USERNAME' \
    -H 'Content-Type: application/json' \
    -H 'X-Requested-With: XMLHttpRequest' \
    -H 'Origin: https://bandcamp.com' \
    -H 'Connection: keep-alive' \
    -H 'Cookie: [REDACTED]' \
    -H 'Sec-Fetch-Dest: empty' \
    -H 'Sec-Fetch-Mode: cors' \
    -H 'Sec-Fetch-Site: same-origin' \
    --data-raw '{"fan_id":184803,"older_than_token":"2097579887::a::","count":1000}' \
    >full-library.txt
```

Replace USERNAME as appropriate and fill in the proper cookies; I suppose you could also increase "count" for libraries bigger than 1000 items. Some of these headers are probably unnecessary, I didn't try to optimize. The only other subtlety is "older_than_token", which is effectively a unix timestamp; see this comment for details.

The resulting JSON contains everything, including hidden items, though you could optionally still skip those by checking whether the hidden field is set on each item before downloading.
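For readers who prefer Python over curl, a rough equivalent of the request above might look like the following sketch. The `items` field and the `hidden` flag are what the response appeared to contain; treat them as assumptions rather than documented API behavior:

```python
# Sketch of the same fancollection request using the requests library.
# Fill in your own fan_id, token, and session cookie.
import requests

API_URL = 'https://bandcamp.com/api/fancollection/1/collection_items'

payload = {
    'fan_id': 184803,                       # your numeric fan id
    'older_than_token': '2097579887::a::',  # effectively a unix timestamp
    'count': 1000,                          # bump this for larger libraries
}

resp = requests.post(
    API_URL,
    json=payload,
    headers={'Cookie': '[REDACTED]'},  # same logged-in session cookie as the curl example
)
resp.raise_for_status()
data = resp.json()

# Optionally skip hidden items by checking the 'hidden' field on each entry.
visible = [item for item in data['items'] if not item.get('hidden')]
```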

cubicvoid commented 7 months ago

Actually, one correction to the previous comment: the collection_items API seems to return metadata for everything, but it still only includes download links for unhidden items. (Unlike the currently used blob, though, it seems, at least for me, to return the actual unhidden ones instead of an unpredictable mix.)

cubicvoid commented 7 months ago

Oh, my mistake: when I went in to try and fix it, I realized you're using the tag attribute blob differently than I'd thought... so you are already basically doing what I was suggesting. I think the issue is probably this line:

```python
'count' : _user_info['collection_count'] - len(_user_info['download_urls']),
```

You calculate the item request count based on the number of URLs you expect, but what the API returns is the full list of items (including hidden ones) truncated to that number, plus, in redownload_urls, a smaller list covering only the subset that is actually visible/downloadable.

So I guess the fix is either to figure out a better upper bound for count, or to fetch the list in multiple pages so count doesn't have to be changed.
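A paging approach could look roughly like this sketch. The `more_available` and `last_token` fields are assumptions about the response shape, and the helper below is hypothetical, not part of the script:

```python
# Sketch: page through collection_items instead of guessing an upper bound for 'count'.
import requests

API_URL = 'https://bandcamp.com/api/fancollection/1/collection_items'

def fetch_all_items(fan_id, start_token, cookie, page_size=100):
    items, urls = [], {}
    token = start_token
    while True:
        resp = requests.post(
            API_URL,
            json={'fan_id': fan_id, 'older_than_token': token, 'count': page_size},
            headers={'Cookie': cookie},
        )
        resp.raise_for_status()
        data = resp.json()
        items.extend(data['items'])
        # Only visible/downloadable items appear in redownload_urls.
        urls.update(data.get('redownload_urls', {}))
        if not data.get('more_available'):
            break
        token = data['last_token']
    return items, urls
```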

cubicvoid commented 7 months ago

My explanation from yesterday had the right intuition but wasn't quite right in the API specifics. I put together a fix; see the linked PR description for more precise details of what was going on.