3jackdaws / soundcloud-lib

Soundcloud API wrapper for tracks & playlists that doesn't require API credentials. Asyncio support.
MIT License
94 stars, 24 forks

urllib.error.HTTPError: HTTP Error 403: Forbidden #26

Closed Guillaume-oso closed 3 years ago

Guillaume-oso commented 3 years ago

This code:

    from sclib import SoundcloudAPI

    api = SoundcloudAPI()
    playlist = api.resolve("https://soundcloud.com/demangio/sets/tekno")

throws:

    Traceback (most recent call last):
      File "test.py", line 5, in <module>
        playlist = api.resolve("https://soundcloud.com/demangio/sets/tekno")
      File "/home/guillaume/.local/lib/python3.8/site-packages/sclib/sync.py", line 59, in resolve
        self.get_credentials()
      File "/home/guillaume/.local/lib/python3.8/site-packages/sclib/sync.py", line 54, in get_credentials
        js_text = f'{get_page(script)}'
      File "/home/guillaume/.local/lib/python3.8/site-packages/sclib/sync.py", line 14, in get_page
        return get_url(url).decode('utf-8')
      File "/home/guillaume/.local/lib/python3.8/site-packages/sclib/sync.py", line 11, in get_url
        return urlopen(url).read()
      File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
        return opener.open(url, data, timeout)
      File "/usr/lib/python3.8/urllib/request.py", line 531, in open
        response = meth(req, response)
      File "/usr/lib/python3.8/urllib/request.py", line 640, in http_response
        response = self.parent.error(
      File "/usr/lib/python3.8/urllib/request.py", line 569, in error
        return self._call_chain(*args)
      File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
        result = func(*args)
      File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default
        raise HTTPError(req.full_url, code, msg, hdrs, fp)
    urllib.error.HTTPError: HTTP Error 403: Forbidden

DJoepie commented 3 years ago

I have a really ugly fix you can try; I don't want to make a PR for it because it's just a dumb workaround. I personally use the sync.py version of the library, and after about 1.5 hours of debugging I figured out what was causing the error, but not why.

It has to do with parsing the scripts: the library tries to fetch a newly added cookie-law script hosted on a different domain, but errors out when loading it (even though the resource is available when you visit it manually). Since it's the first script found in the DOM, I simply pop it from the script_urls list.

This is what my get_credentials() in sync.py looks like:

    def get_credentials(self):
        url = random.choice(util.SCRAPE_URLS)
        page_text = get_page(url)
        script_urls = util.find_script_urls(page_text)
        script_urls.pop(0) # to remove cookielaw.org .js from the list (first .js in DOM)
        for script in script_urls:
            if not self.client_id:
                if isinstance(script, str):
                    js_text = f'{get_page(script)}'
                    self.client_id = util.find_client_id(js_text)

I'm going to assume this also works with the async.py, but I haven't tested it.

saadze commented 3 years ago

@DJoepie Thank you so much! It worked for me.

DannyDannyDanny commented 3 years ago

The first example (from soundcloud-lib) fails:

    from sclib import SoundcloudAPI, Track, Playlist

    api = SoundcloudAPI()  # never pass a Soundcloud client ID that did not come from this library
    track = api.resolve('https://soundcloud.com/itsmeneedle/sunday-morning')
    assert type(track) is Track

    filename = f'./{track.artist} - {track.title}.mp3'
    with open(filename, 'wb+') as fp:
        track.write_mp3_to(fp)

Even after applying the patch suggested by @DJoepie, I get the same error as OP:

  File "soundcloud_dl.py", line 7, in <module>
    track = api.resolve(target_url)
  File "/Users/dth/.local/share/virtualenvs/python_experiments-VCQIX0Is/lib/python3.6/site-packages/sclib/sync.py", line 59, in resolve
    self.get_credentials()
  File "/Users/dth/.local/share/virtualenvs/python_experiments-VCQIX0Is/lib/python3.6/site-packages/sclib/sync.py", line 54, in get_credentials
    js_text = f'{get_page(script)}'
  File "/Users/dth/.local/share/virtualenvs/python_experiments-VCQIX0Is/lib/python3.6/site-packages/sclib/sync.py", line 14, in get_page
    return get_url(url).decode('utf-8')
  File "/Users/dth/.local/share/virtualenvs/python_experiments-VCQIX0Is/lib/python3.6/site-packages/sclib/sync.py", line 11, in get_url
    return urlopen(url).read()
  File "/Users/dth/.pyenv/versions/3.6.9/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/Users/dth/.pyenv/versions/3.6.9/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/Users/dth/.pyenv/versions/3.6.9/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Users/dth/.pyenv/versions/3.6.9/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/Users/dth/.pyenv/versions/3.6.9/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/Users/dth/.pyenv/versions/3.6.9/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
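
For anyone trying to pin down which of the scraped script URLs is actually returning the 403, here is a small diagnostic sketch. The `first_failing_url` helper is hypothetical (not part of sclib), and the fetcher is injected as a callable so the logic can be demonstrated without network access; the stub below just simulates the cookielaw.org 403 described in this thread.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def first_failing_url(urls, fetch=lambda u: urlopen(u).read()):
    """Return (url, status_code) for the first URL whose fetch raises
    HTTPError, or None if every URL loads successfully."""
    for url in urls:
        try:
            fetch(url)
        except HTTPError as e:
            return url, e.code
    return None


# Stub fetcher simulating the behavior reported in this issue:
def stub_fetch(url):
    if 'cookielaw.org' in url:
        raise HTTPError(url, 403, 'Forbidden', None, None)
    return b''


urls = [
    'https://cdn.cookielaw.org/consent/otSDKStub.js',  # example path, made up
    'https://a-v2.sndcdn.com/assets/app.js',           # example path, made up
]
print(first_failing_url(urls, fetch=stub_fetch))
```

Running this against the real script list (with the default `fetch`) would show whether the cookielaw.org script, or some other URL, is the one triggering the 403.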

DJoepie commented 3 years ago

Hey @DannyDannyDanny, you're right, it appears to be broken again. This time the 'https://cdn.cookielaw.org' script is no longer the first in the DOM, so something new needs to be written to remove it from script_urls (rather than just popping the first element from the list). I can't imagine it's a hard function to write; I just don't have time for it until later today.

I'll keep you all posted.

DJoepie commented 3 years ago

Alright, now I've got a more permanent fix for the problem: I filter out 'cookielaw.org' in util.py. I'll make a pull request for this small fix tomorrow or something.

Revert get_credentials() in sync.py back to its original state (remove the script_urls.pop(0) line),

then open util.py and look for the function def find_script_urls(html_text):

All I did was check for the 'cookielaw.org' string in the script URL. Make the function look like the following:

    def find_script_urls(html_text):
        dom = BeautifulSoup(html_text, 'html.parser')
        scripts = dom.findAll('script', attrs={'src': True})
        scripts_list = []
        for script in scripts:
            src = script['src']
            if 'cookielaw.org' not in src:  # filter out cookielaw.org
                scripts_list.append(src)
        return scripts_list
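
As a quick sanity check of the same filtering idea, here is a standalone sketch that uses only the standard library's html.parser instead of BeautifulSoup, so it runs with no extra dependencies. The HTML snippet and URLs are made up for illustration; only the "skip anything from cookielaw.org" rule mirrors the fix above.

```python
from html.parser import HTMLParser


class ScriptSrcCollector(HTMLParser):
    """Collect src attributes of <script> tags, skipping cookielaw.org."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            src = dict(attrs).get('src')
            if src and 'cookielaw.org' not in src:
                self.srcs.append(src)


# Made-up HTML resembling the scraped page:
html = '''
<script src="https://cdn.cookielaw.org/consent/otSDKStub.js"></script>
<script src="https://a-v2.sndcdn.com/assets/0-df9d7e6f.js"></script>
'''

parser = ScriptSrcCollector()
parser.feed(html)
print(parser.srcs)  # the cookielaw.org URL is filtered out
```

The same substring check used in the util.py fix keeps the sndcdn.com asset and drops the cookie-consent script.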