7x11x13 / free-bandcamp-downloader

Get free/name your price music from Bandcamp in lossless quality
https://pypi.org/project/free-bandcamp-downloader/

issues/questions about resolving custom domains #6

Closed yoshiyoshyosh closed 4 months ago

yoshiyoshyosh commented 7 months ago

so, bandcamp lets people have a custom domain that redirects to their bandcamp page. it's not just a simple 301 redirect from the domain to their site with everything else staying the same. it's mostly like that, except that the artwork link on album download pages points to the custom domain, not the *.bandcamp.com domain.

to see how this affects the script, consider this album: https://generalmumble.bandcamp.com/album/blimp-fortress. even if you provide the *.bandcamp.com link, the album_url that gets stored in downloaded.txt is https://mumbleetc.com/* rather than the bandcamp link. if the custom domain stops working or gets disabled in the future, the album will get re-downloaded and a "stale link" will be left in the downloaded.txt file. that could turn out especially badly if a different bandcamp account snags the same custom domain (unlikely, but possible).

for albums that require email, it just causes the script to crash if you provide it the *.bandcamp.com link, which is what one would usually provide since it's what is redirected to automatically:

$ ~/.local/python-venv/bin/bcdl-free -a https://generalmumble.bandcamp.com/album/i-n-f-i-n-i-t-e-s-u-n-s-e-t-l-o-o-p -f FLAC --no-unzip -e auto -z 12345
INFO:httpx:HTTP Request: GET https://www.1secmail.com/api/v1/?action=getDomainList "HTTP/1.1 200 OK"
INFO:free_bandcamp_downloader:https://generalmumble.bandcamp.com/album/i-n-f-i-n-i-t-e-s-u-n-s-e-t-l-o-o-p requires email
INFO:free_bandcamp_downloader:Waiting for 1 emails from Bandcamp...
INFO:httpx:HTTP Request: GET https://www.1secmail.com/api/v1/?action=getMessages&login=m1lrox33ivkt&domain=1secmail.com "HTTP/1.1 200 OK"
INFO:free_bandcamp_downloader:Received email "Your download from Mumble Etc."
INFO:httpx:HTTP Request: GET https://www.1secmail.com/api/v1/?action=readMessage&login=m1lrox33ivkt&domain=1secmail.com&id=1278094010 "HTTP/1.1 200 OK"
Traceback (most recent call last):
  File "/home/yosh/.local/python-venv/bin/bcdl-free", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/yosh/.local/python-venv/lib/python3.12/site-packages/free_bandcamp_downloader/__main__.py", line 417, in main
    downloader.wait_for_email_downloads()
  File "/home/yosh/.local/python-venv/lib/python3.12/site-packages/free_bandcamp_downloader/__main__.py", line 306, in wait_for_email_downloads
    album_url = self._download_file(
                ^^^^^^^^^^^^^^^^^^^^
  File "/home/yosh/.local/python-venv/lib/python3.12/site-packages/free_bandcamp_downloader/__main__.py", line 143, in _download_file
    album_data = self.mail_album_data[album_url]
                 ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^
KeyError: 'https://mumbleetc.com/album/i-n-f-i-n-i-t-e-s-u-n-s-e-t-l-o-o-p'

there are a few ways this can be resolved, which is why I made an issue to discuss rather than immediately starting with a PR:

  1. resolve the *.bandcamp.com domain no matter if you give it a custom domain or not, and on album download pages, trade out the custom domain for said *.bandcamp.com domain before doing anything with it
    • I like this approach the most. it keeps everything consistent in downloaded.txt and future-proofs the script in case a custom domain stops working. however, it would also make the current custom-domain links in downloaded.txt stale and redundant, which I guess isn't too bad if it means keeping things stable for the future
  2. store both the *.bandcamp.com and custom domains in downloaded.txt, and somehow resolve issues like that
  3. store the custom domain whenever possible, and always resolve to the custom domain if given a *.bandcamp.com link. the problem with this approach is that while it's easy to get the *.bandcamp.com domain from a custom domain, the only place I can see to get the custom domain from the bandcamp one is on the download page, so it'd probably cause some weird stuff that would need fixing
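a rough sketch of what the "trade out the custom domain" step in option 1 could look like, assuming the script already knows the *.bandcamp.com host for the album (how it resolves that host is left open here). `to_bandcamp_url` is a hypothetical helper name, not something that exists in the script:

```python
from urllib.parse import urlsplit, urlunsplit


def to_bandcamp_url(url, bandcamp_host):
    """Return `url` with its host swapped for `bandcamp_host`.

    Hypothetical helper: `bandcamp_host` is assumed to be known already,
    e.g. taken from the *.bandcamp.com link the user passed in.
    """
    parts = urlsplit(url)
    # Only the host changes; scheme, path, query and fragment are kept.
    return urlunsplit(parts._replace(netloc=bandcamp_host))


custom = "https://mumbleetc.com/album/i-n-f-i-n-i-t-e-s-u-n-s-e-t-l-o-o-p"
print(to_bandcamp_url(custom, "generalmumble.bandcamp.com"))
```

a URL that is already on the *.bandcamp.com host passes through unchanged, so the helper could be applied to every URL before it is stored or used as a key.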

even if I like the first approach the most, in any case, I'd like to hear your input

7x11x13 commented 7 months ago

I think the current downloaded.txt format is actually really silly. We should be storing the downloaded albums by their IDs, not their URLs: as you mentioned, custom domains may stop working, and a stored URL will also stop working if someone changes their bandcamp.com subdomain. I think we should change it to something like downloaded.csv with a comma-separated list of album IDs. For migration, we can check whether downloaded.txt exists and, if it does, create the new file based on the old one and delete the old one. However, I'm not sure there is an efficient way to get album IDs from URLs, so it may take a while. Maybe we could do some kind of 'lazy migration' that only migrates n URLs on each launch, or add a migration script and issue a warning if downloaded.txt still exists.

Similarly, the mail_album_data dict should use tralbum ids as keys instead of the url...
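A minimal illustration of why the key matters, using plain dicts as stand-ins for mail_album_data. The stored values and the `item_key` placeholder are made up for the example, and it assumes the ID is available at both request time and download time, which would need checking:

```python
# URL the user passed in, and the URL that later shows up on the download page.
requested_url = "https://generalmumble.bandcamp.com/album/i-n-f-i-n-i-t-e-s-u-n-s-e-t-l-o-o-p"
emailed_url = "https://mumbleetc.com/album/i-n-f-i-n-i-t-e-s-u-n-s-e-t-l-o-o-p"

# Placeholder key (item_type + item_id); not the real ID of this album.
item_key = "a0000000000"

# Keyed by URL: the custom-domain URL is not a key, hence the KeyError.
by_url = {requested_url: {"format": "FLAC"}}
url_lookup_fails = emailed_url not in by_url

# Keyed by ID: the lookup does not depend on which domain served the page.
by_id = {item_key: {"format": "FLAC"}}
album_data = by_id[item_key]

print(url_lookup_fails, album_data)
```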

7x11x13 commented 7 months ago

A straightforward way to get the track/album ID is to look at the bc-page-properties meta tag on the track/album page. An example for the album you linked:

<meta name="bc-page-properties" content="{'item_type':'a','item_id':1020132016,'tralbum_page_version':0}">

We can use a combination of the item_type (t for track, a for album) and the item_id (just concatenate them) as our item ID, since I'm not sure whether item_ids are unique across item types.
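A sketch of that extraction using only the standard library. `extract_item_key` is a hypothetical name; the tag content is parsed as JSON first, with a fallback to a Python literal in case it is quoted the way the example above shows it:

```python
import ast
import json
from html.parser import HTMLParser


class _PagePropertiesParser(HTMLParser):
    """Collects the content attribute of the bc-page-properties meta tag."""

    def __init__(self):
        super().__init__()
        self.content = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.content is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("name") == "bc-page-properties":
            self.content = attr_map.get("content")


def extract_item_key(html_text):
    """Return item_type + item_id (e.g. 'a1020132016'), or None if the tag is absent."""
    parser = _PagePropertiesParser()
    parser.feed(html_text)
    if parser.content is None:
        return None
    try:
        props = json.loads(parser.content)
    except json.JSONDecodeError:
        # Single-quoted form, as in the example tag above.
        props = ast.literal_eval(parser.content)
    return f"{props['item_type']}{props['item_id']}"


page = (
    '<meta name="bc-page-properties" '
    "content=\"{'item_type':'a','item_id':1020132016,'tralbum_page_version':0}\">"
)
print(extract_item_key(page))  # a1020132016
```

This still needs the page HTML, so it does not address the cost of loading the whole page.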

However, it would be nice if we could get these values without having to load the whole webpage.

7x11x13 commented 4 months ago

Should be fixed in v0.2.1. I realized we can actually just keep the current downloaded.txt file, add album IDs instead of URLs from now on, and check both the URL and the ID to see whether an album is already in the file.
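A minimal sketch of that check, assuming downloaded.txt holds one entry per line and that an entry may be either an old-style URL or a new-style ID. The function names are hypothetical and this is not the code that shipped in v0.2.1:

```python
from pathlib import Path


def load_downloaded(path):
    """Read downloaded.txt into a set of entries (URLs and/or IDs)."""
    p = Path(path)
    if not p.exists():
        return set()
    return {line.strip() for line in p.read_text().splitlines() if line.strip()}


def already_downloaded(entries, url, item_key):
    """True if either the legacy URL or the new ID is recorded."""
    return url in entries or item_key in entries


def record_download(path, item_key):
    """Append only the ID; legacy URL lines are left untouched."""
    with open(path, "a") as f:
        f.write(item_key + "\n")
```

Because old URL lines are still honoured, no migration step is needed; the file simply accumulates IDs over time.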