critrolesync / critrolesync.github.io

https://critrolesync.github.io
MIT License

Truly dynamic ads #118

Open jpgill86 opened 1 year ago

jpgill86 commented 1 year ago

In the last few days, I've discovered a major new problem for CritRoleSync: advertisements have become much more dynamic than before (compare #6). This issue seems to affect only the newest podcast feed (Critical Role; C2E20 and later).

Every time a user tries to stream or download a podcast episode, their device uses a URL published in the podcast feed to access an MP3 file. I discovered that this published URL always redirects to a different address, and -- here's the rub -- the new target URL changes frequently and serves up a different version of the file when it does. Different versions contain different ads and have different durations, making CritRoleSync's synchronization timestamps inconsistent for users and very difficult to debug.

I have tried to work out which factors influence which version of the MP3 file is served to the user. Some episodes appear to be affected more regularly than others. Changing the User-Agent string in the HTTP request header (which carries information about the device's web browser and operating system) frequently affects which file version is served. Even when the User-Agent string is fixed, the file version sometimes changes for other reasons. I am guessing this may be related to how much time has passed since the last request, or perhaps it depends on the user's IP address or geographic location (which can be tested using a VPN). It's possible that, on top of these factors, there is randomization as well. Without insider information on how the file-serving algorithm works, this is very difficult to analyze.

My automated GitHub Actions system for archiving the podcast feeds and checking for differences (archive-podcast-feeds.yml) can detect changes in the published URL. However, the URL redirection changes I'm describing here are entirely opaque to this automated system since they happen only when the user tries to access an episode.

All of this is very bad news for CritRoleSync. In #6, ads seemed to be changed systematically for all users, only on rare occasions, and always with podcast feed updates indicating the changes in published URLs and durations. This new issue is much more problematic. If different versions of the MP3 files with different ads and durations are being served to different users at the same time, there may be no way for CritRoleSync to predict which version a user is listening to, and so there will be no way to provide precise synchronization timestamps. Approximate timestamps could likely still be provided, with an inaccuracy of perhaps a couple of minutes.

jpgill86 commented 1 year ago

Here is some Python code for investigating this issue:

import requests
from urllib.parse import urlparse, parse_qs
from critrolesync import get_podcast_feed_from_id

episode_ids = ['C3E53']

user_agents = {
    # my installed browsers
    # 'chrome': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36',
    # 'firefox': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/113.0',

    # examples from the fake_useragent library
    'chrome': 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/1.0.154.59 Safari/525.19',
    'firefox': 'Mozilla/5.0 (X11; U; Linux i686; pl-PL; rv:1.9.0.3) Gecko/2008092700 SUSE/3.0.3-2.2 Firefox/3.0.3',
}

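def print_redirect_info(published_url, user_agent=None):
    """Follow the published URL's redirect and print which file version was served."""
    # With no override, requests sends its own default User-Agent
    headers = None
    if user_agent is not None:
        headers = {'User-Agent': user_agent}
    # stream=True defers downloading the response body; only the final URL is inspected
    with requests.get(published_url, stream=True, headers=headers) as r:
        final_url = urlparse(r.url)
        print(f'USER-AGENT:    {r.request.headers["User-Agent"]}')
        print(f'REDIRECTED TO: {r.url}')
        print(f'FILE:          {final_url.path.split("/")[-1]}')
        # parse only the query string, not the whole URL
        print(f'BYTES:         {parse_qs(final_url.query)["x-total-bytes"][0]}')
    print()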

for episode_id in episode_ids:
    print(f'=== {episode_id} ===')
    published_url = get_podcast_feed_from_id(episode_id)['URL']
    print(f'PUBLISHED URL: {published_url}')
    print()

    print('--- PYTHON REQUESTS DEFAULT ---')
    print_redirect_info(published_url)

    print('--- CHROME ---')
    print_redirect_info(published_url, user_agents['chrome'])

    print('--- FIREFOX ---')
    print_redirect_info(published_url, user_agents['firefox'])

    print()

The result I obtain (as of right now) is the following:

=== C3E53 ===
PUBLISHED URL: https://stitcher.simplecastaudio.com/a74173a8-254a-45ff-aee0-b0cd0458d7f1/episodes/3d0f5447-f24a-489b-a47b-6af21da0319c/audio/128/default.mp3?aid=rss_feed&awCollectionId=a74173a8-254a-45ff-aee0-b0cd0458d7f1&awEpisodeId=3d0f5447-f24a-489b-a47b-6af21da0319c&feed=LXz4Q9rJ

--- PYTHON REQUESTS DEFAULT ---
USER-AGENT:    python-requests/2.26.0
REDIRECTED TO: https://stitcher.simplecastaudio.com/a74173a8-254a-45ff-aee0-b0cd0458d7f1/episodes/3d0f5447-f24a-489b-a47b-6af21da0319c/audio/128/default.mp3/default.mp3_ywr3ahjkcgo_bae30afefc2710a113729268fcecbf1d_222744853.mp3?aid=rss_feed&awCollectionId=a74173a8-254a-45ff-aee0-b0cd0458d7f1&awEpisodeId=3d0f5447-f24a-489b-a47b-6af21da0319c&feed=LXz4Q9rJ&hash_redirect=1&x-total-bytes=222744853&x-ais-classified=unclassified&listeningSessionID=0CD_382_135__452d9a56c67d57b92492f24dbc6f03855302e67d
FILE:          default.mp3_ywr3ahjkcgo_bae30afefc2710a113729268fcecbf1d_222744853.mp3
BYTES:         222744853

--- CHROME ---
USER-AGENT:    Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/1.0.154.59 Safari/525.19
REDIRECTED TO: https://stitcher.simplecastaudio.com/a74173a8-254a-45ff-aee0-b0cd0458d7f1/episodes/3d0f5447-f24a-489b-a47b-6af21da0319c/audio/128/default.mp3/default.mp3_ywr3ahjkcgo_6fcddd6060603bdea1c2d7b638db617c_225260965.mp3?aid=rss_feed&awCollectionId=a74173a8-254a-45ff-aee0-b0cd0458d7f1&awEpisodeId=3d0f5447-f24a-489b-a47b-6af21da0319c&feed=LXz4Q9rJ&hash_redirect=1&x-total-bytes=225260965&x-ais-classified=streaming&listeningSessionID=0CD_382_135__db2d77abaeb729bccbe000bdf5a6d2f1649fd5f1
FILE:          default.mp3_ywr3ahjkcgo_6fcddd6060603bdea1c2d7b638db617c_225260965.mp3
BYTES:         225260965

--- FIREFOX ---
USER-AGENT:    Mozilla/5.0 (X11; U; Linux i686; pl-PL; rv:1.9.0.3) Gecko/2008092700 SUSE/3.0.3-2.2 Firefox/3.0.3
REDIRECTED TO: https://stitcher.simplecastaudio.com/a74173a8-254a-45ff-aee0-b0cd0458d7f1/episodes/3d0f5447-f24a-489b-a47b-6af21da0319c/audio/128/default.mp3/default.mp3_ywr3ahjkcgo_80cc279e697ad5b213c8b01c6c5057bd_224535389.mp3?aid=rss_feed&awCollectionId=a74173a8-254a-45ff-aee0-b0cd0458d7f1&awEpisodeId=3d0f5447-f24a-489b-a47b-6af21da0319c&feed=LXz4Q9rJ&hash_redirect=1&x-total-bytes=224535389&x-ais-classified=unclassified&listeningSessionID=0CD_382_135__15158f9b733378ddc181b04c144559682a5a37a5
FILE:          default.mp3_ywr3ahjkcgo_80cc279e697ad5b213c8b01c6c5057bd_224535389.mp3
BYTES:         224535389

Note that the file size for each result is different.
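
To get a rough sense of what these size differences mean for timestamps, here is a minimal sketch that converts a byte difference into seconds. It assumes a constant 128 kbit/s encoding, which is only a guess based on the "/audio/128/" segment in the URLs above. The function name and the bitrate constant are mine, not part of CritRoleSync, and metadata tags or variable-bitrate audio would throw the estimate off.

```python
# ASSUMPTION: constant 128 kbit/s audio, guessed from "/audio/128/" in the URL.
ASSUMED_BITRATE_KBPS = 128


def duration_delta_seconds(bytes_a, bytes_b, bitrate_kbps=ASSUMED_BITRATE_KBPS):
    """Estimate how much longer file B plays than file A, in seconds."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return (bytes_b - bytes_a) / bytes_per_second


# File sizes copied from the BYTES lines in the output above
sizes = {
    'python-requests': 222744853,
    'chrome': 225260965,
    'firefox': 224535389,
}

baseline = sizes['python-requests']
for name, size in sizes.items():
    delta = duration_delta_seconds(baseline, size)
    print(f'{name:16s} {delta:+8.1f} s relative to python-requests')
```

Under that assumption, the Chrome and Firefox versions come out roughly 157 s and 112 s longer than the Python Requests version, which is in line with an inaccuracy of a couple of minutes.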

If I repeat this using the commented-out User-Agent strings (corresponding to my locally installed browsers), or if I simply remove "en-US" from the Chrome User-Agent string, the Firefox result remains the same but the Chrome result changes to be the same as the Firefox result. (EDIT: Just noticed that the example Firefox User-Agent string I was using here had "pl-PL" for Polish language.)

When tested against most other episodes (e.g., C3E54), the Chrome and Firefox results are usually the same, but they always differ from the Python Requests library default user agent.
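
Comparing results by eye gets tedious, so here is a small sketch of how two redirect targets could be compared programmatically. It relies only on what is visible in the output above: the last path segment and the x-total-bytes and x-ais-classified query parameters. What those fields mean on the server side is unknown to me. The names ServedVersion, describe_version, and same_version are hypothetical, and the example URLs use a placeholder host with file names and parameters copied from the C3E53 output.

```python
from typing import NamedTuple
from urllib.parse import urlparse, parse_qs


class ServedVersion(NamedTuple):
    file_name: str
    total_bytes: int
    classification: str


def describe_version(redirect_url):
    """Extract the identifying parts of a redirect target URL."""
    parts = urlparse(redirect_url)
    query = parse_qs(parts.query)
    return ServedVersion(
        file_name=parts.path.split('/')[-1],
        total_bytes=int(query['x-total-bytes'][0]),
        # Not every URL is guaranteed to carry this parameter (assumption)
        classification=query.get('x-ais-classified', ['unknown'])[0],
    )


def same_version(url_a, url_b):
    """Treat two redirect targets as the same file if name and size match."""
    a, b = describe_version(url_a), describe_version(url_b)
    return (a.file_name, a.total_bytes) == (b.file_name, b.total_bytes)


# Placeholder host; file names and query values taken from the output above
requests_url = ('https://example.invalid/default.mp3/'
                'default.mp3_ywr3ahjkcgo_bae30afefc2710a113729268fcecbf1d_222744853.mp3'
                '?hash_redirect=1&x-total-bytes=222744853&x-ais-classified=unclassified')
chrome_url = ('https://example.invalid/default.mp3/'
              'default.mp3_ywr3ahjkcgo_6fcddd6060603bdea1c2d7b638db617c_225260965.mp3'
              '?hash_redirect=1&x-total-bytes=225260965&x-ais-classified=streaming')

print(describe_version(requests_url))
print(describe_version(chrome_url))
print('Same version?', same_version(requests_url, chrome_url))
```

The listeningSessionID parameter is deliberately ignored here, since it differs in every response shown above.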

jpgill86 commented 1 year ago

Test Results

When using the Python Requests library default user agent, the same version of each episode is always served, for all episodes, even when the IP address (and city) is changed using a VPN. The default user agent has been in use for all auto-syncing in the past, so these versions were likely used when obtaining timestamps.

When using a Chrome or Firefox user agent, different MP3 versions are served for all episodes (that we care about) relative to the Python Requests library default agent. This means that all timestamps obtained using auto-syncing in the past are now inaccurate relative to MP3 versions served in a browser (and presumably in most podcast managers too). Re-auto-syncing using the default user agent should not help (still needs to be tested).

When using a Chrome or Firefox user agent, keeping the User-Agent string the same but refreshing my VPN to obtain a new IP address, even in the same city, is enough for a random smattering of two dozen episodes to change versions in a matter of minutes. This means that even users using similar devices at the same time will, if their IP addresses differ, be served different versions for a minority of episodes. Consequently, time conversion no longer always has a single solution.

Changing the User-Agent string from one browser to another while keeping the IP address fixed has similar results. This means that even on the same device, different MP3 versions may be served for a minority of episodes if the app is changed.

In these tests with browser user agents, the minority of episodes affected seems to be randomly different each time.
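
For anyone wanting to repeat these tests, the bookkeeping can be sketched as below: record which file was served for each episode under each test condition, then list the episodes that were served in more than one version. The function name and the observations are hypothetical placeholders, not my actual results.

```python
from collections import defaultdict


def episodes_with_multiple_versions(observations):
    """Return {episode_id: sorted file names} for episodes seen in >1 version.

    observations: iterable of (episode_id, condition, file_name) tuples, where
    condition is a free-form label such as 'chrome / IP 1'.
    """
    versions = defaultdict(set)
    for episode_id, _condition, file_name in observations:
        versions[episode_id].add(file_name)
    return {ep: sorted(names) for ep, names in versions.items() if len(names) > 1}


# HYPOTHETICAL observations, for illustration only
observations = [
    ('C3E53', 'chrome / IP 1', 'version_a.mp3'),
    ('C3E53', 'chrome / IP 2', 'version_b.mp3'),
    ('C3E54', 'chrome / IP 1', 'version_c.mp3'),
    ('C3E54', 'chrome / IP 2', 'version_c.mp3'),
]

for episode_id, names in episodes_with_multiple_versions(observations).items():
    print(f'{episode_id}: {len(names)} versions seen -> {names}')
```

Running this across many rounds of requests would show whether the affected minority of episodes really is different each time.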

Conclusion

CritRoleSync is sunk.

OK, maybe it's not that grim, but I may have to accept significant syncing inaccuracies of up to a couple of minutes going forward, which is very unsatisfying and will make debugging issues much harder.