TeamNewPipe / NewPipe

A libre lightweight streaming front-end for Android.
https://newpipe.net
GNU General Public License v3.0

PSA: Google Takeout creates broken subscriptions.json #5806

Closed nurupo closed 3 years ago

nurupo commented 3 years ago

While not a bug in NewPipe, I think it's important to raise awareness of this as NewPipe suggests using Google Takeout for importing user's YouTube subscription list.

The issue is that you currently can't rely on Google Takeout to generate a correct subscriptions.json.

I'm subscribed to 322 unique [sic] channels. I downloaded my subscriptions.json from Google Takeout. It contains 322 channels in total -- so far so good. I checked how many unique channels it has (using `jq -r '.[] | [.snippet.resourceId.channelId] | @tsv' < subscriptions.json | sort -h | uniq | wc -l`) -- just 209 (!). That should obviously be 322 channels instead. Turns out 322-209=113 of the entries are exact, character-by-character duplicate JSON entries, and therefore 113 channels are missing from my subscriptions.json due to those duplicates.
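The same total-vs-unique check can be done without jq. A minimal Python sketch of the idea, using a made-up miniature payload in place of a real subscriptions.json (the channel IDs here are hypothetical):

```python
import json

# Hypothetical miniature subscriptions.json: three entries, two of
# which are exact duplicates, mimicking what Takeout produced.
data = json.loads("""
[
  {"snippet": {"resourceId": {"channelId": "UCaaa"}}},
  {"snippet": {"resourceId": {"channelId": "UCbbb"}}},
  {"snippet": {"resourceId": {"channelId": "UCaaa"}}}
]
""")

# Extract every channelId, then compare total count to unique count.
ids = [entry["snippet"]["resourceId"]["channelId"] for entry in data]
print("total:", len(ids))                                      # 3
print("unique:", len(set(ids)))                                # 2
print("missing due to duplicates:", len(ids) - len(set(ids)))  # 1
```

On a real export you would `json.load()` the file instead; a mismatch between the two counts means duplicated (and therefore missing) channels.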

I decided to abandon Google Takeout and instead use the YouTube API to get the list of channels I'm subscribed to. At first I goofed -- I didn't specify the sorting order when making the API call, so it used the default sorting order of "relevance", and I got a result very similar to Google Takeout's -- 322 total channels but only 218 of them unique. That's 9 more! When sorted and diffed against Google Takeout's list, there were many additions and removals:

$ cat subscriptions1.json | jq -r '.[] | [.snippet.resourceId.channelId] | @tsv' | sort -h | uniq > out1
$ cat subscriptions2.json | jq -r '.[] | [.snippet.resourceId.channelId] | @tsv' | sort -h | uniq > out2
$ git diff --stat out1 out2
 out1 => out2 | 157 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------------------------------------------------------------------
 1 file changed, 74 insertions(+), 83 deletions(-)

so the two of them are actually very different. Turns out the "relevance" sorting order is not very "stable". The API returns 50 channels per request, so I needed to paginate -- that's ceil(322/50)=7 pages in total, with some pages including duplicate channels because the "relevance" ordering changes between page requests. Once I set the sorting order to "alphabetical", I got all 322 unique channels, no duplicates -- a success.

I suspect that Google Takeout uses a similar (or the same) API call in its backend and makes the same mistake of using the default sorting order, which turns out to be about as good as a shuffle. If this theory is right, then the issue is more pronounced the more channels a user is subscribed to, as there is more pagination going on and more chances for the "relevance" ordering to mess things up; conversely, a user subscribed to only a handful of channels might not be affected at all.
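The theory is easy to demonstrate with a toy model (this is not Google's real backend, just a sketch): if the backend re-sorts the channel list non-deterministically between page requests, paginating over it yields the right total count but with duplicates and omissions.

```python
import random

random.seed(0)  # fixed seed so the demonstration is reproducible
channels = [f"channel{i:03d}" for i in range(322)]
page_size = 50

def relevance_order(items):
    # Stand-in for an unstable "relevance" sort: a fresh shuffle
    # on every request, as if the relevance scores shifted.
    items = items[:]
    random.shuffle(items)
    return items

fetched = []
for start in range(0, len(channels), page_size):
    # The list is re-ordered on *every* page request, so the slice
    # boundaries no longer line up between requests.
    ordering = relevance_order(channels)
    fetched.extend(ordering[start:start + page_size])

print("total fetched:", len(fetched))        # 322, matching the reported total
print("unique fetched:", len(set(fetched)))  # fewer than 322: some channels
                                             # appear twice, others never
```

With a stable ordering (e.g. sorting alphabetically once, before paginating), the same loop returns all 322 unique channels.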

I have reported this to Google using the feedback form for Google Takeout, so maybe they will fix this.

Here is a Python script if you want to get the correct subscriptions.json from the API. Follow [the Quick Start Python guide](https://developers.google.com/youtube/v3/quickstart/python) (the OAuth part of it) and use the Python script below. This just repeats what the Quick Start Python guide says:

```
apt-get install python3-virtualenv
virtualenv -p /usr/bin/python3 env
source env/bin/activate
pip install google-api-python-client google-auth-oauthlib google-auth-httplib2
```

[Create a project](https://console.developers.google.com/), add the YouTube Data API v3 library to it, give it only readonly access to YouTube, create an OAuth consent screen for the project (External), add the user you want to get the subscriptions for to it, create an OAuth Client Id and download its JSON, renaming it to `client_secret.json` for the Python script to use, then run:

```python
import json
import os

import google_auth_oauthlib.flow
import googleapiclient.discovery
import googleapiclient.errors

scopes = ["https://www.googleapis.com/auth/youtube.readonly"]

def main():
    api_service_name = "youtube"
    api_version = "v3"
    client_secrets_file = "client_secret.json"

    # Get credentials and create an API client
    flow = google_auth_oauthlib.flow.InstalledAppFlow.from_client_secrets_file(client_secrets_file, scopes)
    credentials = flow.run_console()
    youtube = googleapiclient.discovery.build(api_service_name, api_version, credentials=credentials)

    # Get YouTube channel Id
    request = youtube.channels().list(
        part="id",
        mine=True,
        prettyPrint=True
    )
    response = request.execute()
    my_yt_id = response['items'][0]['id']

    # Get subscriptions, sorted alphabetically to keep pagination stable
    pageToken = ''
    subscriptions = []
    while True:
        request = youtube.subscriptions().list(
            part="contentDetails,snippet",
            channelId=my_yt_id,
            order="alphabetical",
            maxResults=50,
            pageToken=pageToken,
            prettyPrint=True
        )
        response = request.execute()
        subscriptions.extend(response['items'])
        if 'nextPageToken' not in response:
            break
        pageToken = response['nextPageToken']

    with open('subscriptions.json', 'w', encoding='utf-8') as f:
        json.dump(subscriptions, f, ensure_ascii=False, indent=2)

if __name__ == "__main__":
    main()
```

Follow the prompt (it will give you a link to visit that will give you a code) and, if everything is right, you will get your `subscriptions.json`.

UPDATE 2021-04-11: the issue still persists.

UPDATE 2021-07-12: the issue is fixed.

jvvcn7 commented 3 years ago

I can confirm it's still broken. I set Takeout to generate a JSON of youtube subscriptions-only, of which I have >300. Now the JSON file does report:

"totalItemCount" : 338

But Google Takeout is generating subscription entries in the JSON file for a whole whopping FOUR SUBSCRIPTIONS!

Google Support has not responded, but it's clearly broken.

ZirconCode commented 3 years ago

Can confirm that this is still an issue.

nurupo commented 3 years ago

Looks like Google has fixed the issue, my subscriptions.json exported from Google Takeout contains all the channels now.

The channels in subscriptions.json appear to be sorted alphabetically by the "title" field now, when they were unsorted before, which kind of confirms the theory that this issue was caused by Google not specifying a sorting order when generating subscriptions.json.