Benjamin-Loison commented 2 years ago

YouTube Data API v3 Search: list endpoint is limited to 500 results:

Note: Search results are constrained to a maximum of 500 videos if your request specifies a value for the channelId parameter and sets the type parameter value to video, but it does not also set one of the forContentOwner, forDeveloper, or forMine filters.

Source: Search: list#channelId

Note that this 500-result limit seems to apply beyond the case described in the documentation.

It seems possible to fetch more than 500 results from the YT UI (confirming this would need a small tool that counts results in the page source after scrolling manually). This issue shouldn't come from my reverse-engineering code.

If achieved that would help:

Could complete this list.

import requests, json

def get(url):
    return requests.get(url).text

pageToken = ''

ids = []

while True:
    url = 'https://yt.lemnoslife.com/search?part=id&q=hololive&type=video'
    if pageToken != '':
        url += '&pageToken=' + pageToken
    content = get(url)
    #print(content)
    data = json.loads(content)
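    # Raises KeyError once a response carries no nextPageToken (see note below).
    pageToken = data['nextPageToken']
    items = data['items']
    for item in items:
        # Named id_ to avoid shadowing the built-in id().
        id_ = item['id']
        if id_ not in ids:
            ids.append(id_)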
    print(len(ids))

# reached 437 before KeyError: 'nextPageToken'

Benjamin-Loison commented 2 years ago

The YouTube UI search by query term (Test here), when filtered to retrieve only videos, stopped after 549 results (counted by searching the page for ago as a whole word; searching for view gives a similar count).

Without the videos-only filter, it stopped after 654 results (again counted by searching for ago as a whole word). Searching the page for other strings:

view gives 702 matches
VIEW FULL PLAYLIST gives 2 matches, so presumably there are that many playlists
subscriber gives 44 matches, so presumably there are that many channels
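
The whole-word counts above could also be taken with a small script instead of the browser's find feature. This is a minimal sketch, assuming the scrolled results page has been saved as text; the `count_whole_word` helper is hypothetical, and counting ago or view remains only the heuristic used above, not an exact result count:

```python
import re

def count_whole_word(text, word):
    """Count case-sensitive occurrences of `word` that are not part of a longer word."""
    pattern = r'\b' + re.escape(word) + r'\b'
    return len(re.findall(pattern, text))

# Inline sample standing in for the saved page source (hypothetical content).
sample = 'Video A 3 days ago ... Video B 2 years ago ... agony'
print(count_whole_word(sample, 'ago'))
```

On a real page, `text` would be the saved page source read from disk, and the same call with 'view', 'VIEW FULL PLAYLIST' or 'subscriber' gives the other counts.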

So this issue can't be solved easily AFAIK, as the issue (a limitation, in fact) is on YouTube's end.

Benjamin-Loison commented 2 years ago

Could try forcing the YouTube Data API v3 Search: list endpoint past the limit by providing a modified page token, after reverse-engineering the token format, as it doesn't contain randomness AFAIK.

This code snippet looks like what I am looking for.

This issue is quite similar to this Stack Overflow question.

Better understanding the pageToken may help to solve this question.

Benjamin-Loison commented 1 year ago

190 could help concerning the pagination token.

Benjamin-Loison commented 6 months ago

import requests
import json

pageToken = ''
ids = set()
url = 'https://yt.lemnoslife.com/noKey/search'
params = {
    'q': 'test',
    'type': 'video',
    'maxResults': 50,
}

while True:
    params['pageToken'] = pageToken
    data = requests.get(url, params = params).json()
    pageToken = data['nextPageToken']
    items = data['items']
    for item in items:
        id_ = item['id']['videoId']
        ids.add(id_)
    print(len(ids))

# reached 518 before KeyError: 'nextPageToken'
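
The loop above stops with a KeyError because the last page has no nextPageToken, and it raises before that page's items are counted. A minimal sketch of the same pagination that ends cleanly follows; the `collect_video_ids` helper, its `fetch_page` callback and the `max_pages` guard are hypothetical names introduced for illustration:

```python
def collect_video_ids(fetch_page, max_pages=100):
    """Follow nextPageToken until it is absent and return the set of video IDs seen.

    `fetch_page` takes a page token ('' for the first page) and returns the
    decoded JSON response as a dict.
    """
    ids = set()
    page_token = ''
    for _ in range(max_pages):
        data = fetch_page(page_token)
        # Count this page's items before looking at the next token.
        for item in data.get('items', []):
            ids.add(item['id']['videoId'])
        page_token = data.get('nextPageToken')
        if not page_token:
            break
    return ids

# With the endpoint used above, fetch_page could be (requires `requests`):
#   lambda token: requests.get(url, params={**params, 'pageToken': token}).json()
```

Passing the fetcher in as a callback keeps the pagination logic testable without network access.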

import requests
import json
import blackboxprotobuf
import base64

typedef = {
    '1': {
        'type': 'int'
    },
    '2': {
        'type': 'int'
    }
}

pageToken = ''
ids = set()
url = 'https://yt.lemnoslife.com/noKey/search'
maxResults = 50
params = {
    'q': 'test',
    'type': 'video',
    'maxResults': maxResults,
}
requestIndex = 0

while True:
    message = {
        '1': requestIndex * maxResults,
        '2': 0,
    }

    data = blackboxprotobuf.encode_message(message, typedef)
    pageToken = base64.b64encode(data).decode('utf-8')

    params['pageToken'] = pageToken
    print(pageToken)
    data = requests.get(url, params = params).json()
    items = data['items']
    for item in items:
        id_ = item['id']['videoId']
        ids.add(id_)
    print(len(ids))
    requestIndex += 1

# reaches 510 and stays stuck there
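
For reference, the tokens printed by the loop above can be reproduced without blackboxprotobuf by writing the two varint fields by hand. This is a sketch under the assumption that blackboxprotobuf emits fields 1 and 2 in numeric order as plain varints; `encode_varint` and `make_page_token` are hypothetical helper names:

```python
import base64

def encode_varint(value):
    """Encode a non-negative integer as a protobuf varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def make_page_token(offset):
    """Build a token with field 1 = offset and field 2 = 0, both varints."""
    # 0x08 is the tag for field 1 (varint), 0x10 the tag for field 2 (varint).
    message = b'\x08' + encode_varint(offset) + b'\x10' + encode_varint(0)
    return base64.b64encode(message).decode('ascii')

for offset in (0, 50, 500):
    print(offset, make_page_token(offset))
```

An offset of 50 gives CDIQAA== and an offset of 500 gives CPQDEAA=, so the token is fully determined by the offset.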

Should test with YouTube UI pagination as well.

curl -s 'https://yt.lemnoslife.com/search?part=id&q=test&type=video' | jq .items[].id.videoId
curl -s 'https://yt.lemnoslife.com/search?part=id&q=test&type=video' | jq .nextPageToken

curl -s 'https://yt.lemnoslife.com/search?part=id&q=test&type=video' | jq -r .nextPageToken | base64 -d | protoc --decode_raw

2 {
  2: "test"
  3: "EgIQAUgUggELOUJ2eVkyX3c2RG-CAQtkYmpQblhhYWNBVYIBCzdjQ3BaS2ZkN1hBggELNWN5c1BQblpFaE2CAQtCREJ5aXZtclZ1TYIBCzJhNFV4ZHk5VFFZggELZzRReUp1MDlrdE2CAQtNNy1oM0ZPLUtLb4IBC0k1OEp5dEpFZmRzggELdTB3dVlZbnFkNzSCAQt5ck45Nm1nbkVsMIIBC0t3ZXZvY2FYZktnggELbUpWV1gwdnVkLWeCAQtlamFJTTNHcWVzd4IBCzczWUcwb2xOWFdvggELMU9fZURSOGZCUlGCAQt5aFM5TG5Eb29fd4IBC3ZlUGM1VjRoX2tnggELX1RYLS1Ga3U5TlGCAQtaeFlaa3oyMGxZQbIBBgoECBcQAuoBBAgCECg%3D"
}
3: 52047873
4: "search-feed"

When repeating the command, field 3 is identical but field 2/3 (field 3 nested inside field 2) differs. Is 2/3 separated by - or CAQ or similar?

echo -n 'EgIQAUgUggELOUJ2eVkyX3c2RG' | base64 -d

H�
  9BvyY2_w6Dbase64: invalid input

echo -n 'EgIQAUgUggELOUJ2eVkyX3c2RG=' | base64 -d

H�
  9BvyY2_w6Dbase64: invalid input

echo -n 'EgIQAUgUggELOUJ2eVkyX3c2RG==' | base64 -d

H�
  9BvyY2_w6D

echo -n 'EgIQAUgUggELOUJ2eVkyX3c2RG==' | base64 -d | protoc --decode_raw

Failed to parse input.

EgIQAUgUggELOUJ2eVkyX3c2RG-
CAQtkYmpQblhhYWNBVYIBCzdjQ3BaS2ZkN1hBggELNWN5c1BQblpFaE2
CAQtCREJ5aXZtclZ1TYIBCzJhNFV4ZHk5VFFZggELZzRReUp1MDlrdE2
CAQtNNy1oM0ZPLUtLb4IBC0k1OEp5dEpFZmRzggELdTB3dVlZbnFkNzS
CAQt5ck45Nm1nbkVsMIIBC0t3ZXZvY2FYZktnggELbUpWV1gwdnVkLWe
CAQtlamFJTTNHcWVzd4IBCzczWUcwb2xOWFdvggELMU9fZURSOGZCUlG
CAQt5aFM5TG5Eb29fd4IBC3ZlUGM1VjRoX2tnggELX1RYLS1Ga3U5TlG
CAQtaeFlaa3oyMGxZQbIBBgoECBcQAuoBBAgCECg%3D
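
A possible explanation for the failures above: - and _ belong to the URL-safe base64 alphabet, which `base64 -d` rejects, so the 2/3 value may be one continuous base64url string rather than parts joined by a separator. Under that reading, the repeated ggEL decodes to the bytes 82 01 0B, a tag for field 16 (length-delimited) followed by a length of 11, and each such field holds what looks like an 11-character video ID. The earlier `protoc --decode_raw` attempt would then fail because cutting at the - truncates the first ID to 10 bytes. The sketch below is a minimal raw decoder, not protoc itself; `read_varint`, `decode_raw` and `decode_token` are hypothetical helpers, and `sample` is the first 34 bytes of the 2/3 value above, re-encoded with padding:

```python
import base64
from urllib.parse import unquote

def read_varint(data, pos):
    """Read a protobuf varint starting at `pos`; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def decode_raw(data):
    """Return (field_number, wire_type, value) for each top-level field."""
    fields = []
    pos = 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 7
        if wire_type == 0:  # varint
            value, pos = read_varint(data, pos)
        elif wire_type == 2:  # length-delimited
            length, pos = read_varint(data, pos)
            if pos + length > len(data):
                raise ValueError('truncated length-delimited field')
            value = data[pos:pos + length]
            pos += length
        else:
            raise ValueError('unsupported wire type %d' % wire_type)
        fields.append((field_number, wire_type, value))
    return fields

def decode_token(token):
    """Percent-decode, pad, base64url-decode, then parse the raw fields."""
    token = unquote(token)
    token += '=' * (-len(token) % 4)
    return decode_raw(base64.urlsafe_b64decode(token))

sample = 'EgIQAUgUggELOUJ2eVkyX3c2RG-CAQtkYmpQblhhYWNBVQ%3D%3D'
for field in decode_token(sample):
    print(field)
```

For this sample the decoder reports field 2 (two bytes), field 9 with value 20, and two field 16 entries containing 9BvyY2_w6Do and dbjPnXaacAU.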

curl -s "https://yt.lemnoslife.com/search?part=id&q=test&type=video&pageToken=`curl -s 'https://yt.lemnoslife.com/search?part=id&q=test&type=video' | jq -r .nextPageToken`" | jq .items[].id.videoId

curl -s "https://yt.lemnoslife.com/search?part=id&q=test&type=video&pageToken=`curl -s 'https://yt.lemnoslife.com/search?part=id&q=test&type=video' | jq -r .nextPageToken`" | jq .nextPageToken

null

This should not happen.

Benjamin-Loison commented 6 months ago

protoc --help

mkdir test/ && protoc test.proto --php_out test/

php a.php

PHP Fatal error:  Uncaught Error: Class "GPBMetadata\A" not found in /home/benjamin/protobuf/message.php:34
Stack trace:
#0 /home/benjamin/protobuf/a.php(7): message->__construct()
#1 {main}
  thrown in /home/benjamin/protobuf/message.php on line 34

Commenting out \GPBMetadata\A::initOnce(); leads to:

PHP Fatal error:  Uncaught InvalidArgumentException: message is not found in descriptor pool. Only generated classes may derive from Message. in /home/benjamin/protobuf/vendor/google/protobuf/src/Google/Protobuf/Internal/Message.php:74
Stack trace:
#0 /home/benjamin/protobuf/vendor/google/protobuf/src/Google/Protobuf/Internal/Message.php(55): Google\Protobuf\Internal\Message->initWithGeneratedPool()
#1 /home/benjamin/protobuf/message.php(35): Google\Protobuf\Internal\Message->__construct()
#2 /home/benjamin/protobuf/a.php(7): message->__construct()
#3 {main}
  thrown in /home/benjamin/protobuf/vendor/google/protobuf/src/Google/Protobuf/Internal/Message.php on line 74

Benjamin-Loison / YouTube-operational-API

YT operational API Search endpoint not able to fetch more than 500 results #4
