Benjamin-Loison / YouTube-operational-API

YouTube operational API works when YouTube Data API v3 fails.

YT operational API Search endpoint not able to fetch more than 500 results #4

Open Benjamin-Loison opened 2 years ago

Benjamin-Loison commented 2 years ago

YouTube Data API v3 Search: list endpoint is limited to 500 results:

Note: Search results are constrained to a maximum of 500 videos if your request specifies a value for the channelId parameter and sets the type parameter value to video, but it does not also set one of the forContentOwner, forDeveloper, or forMine filters.

Source: Search: list#channelId

Note that this 500-result limit seems to apply beyond the case described in the documentation.

It seems possible to fetch more than 500 results from the YouTube UI (checking this would need a small tool that counts results in the page source after scrolling manually); this issue shouldn't come from my reverse-engineering code.

If achieved that would help:

Could complete this list.

import requests, json

def get(url):
    return requests.get(url).text

pageToken = ''

ids = []

while True:
    url = 'https://yt.lemnoslife.com/search?part=id&q=hololive&type=video'
    if pageToken != '':
        url += '&pageToken=' + pageToken
    content = get(url)
    #print(content)
    data = json.loads(content)
    pageToken = data['nextPageToken']
    items = data['items']
    for item in items:
        # Avoid shadowing the built-in id() and only keep unseen results.
        id_ = item['id']
        if id_ not in ids:
            ids.append(id_)
    print(len(ids))

# reached 437 before KeyError: 'nextPageToken'
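
A minimal sketch of the same pagination loop that stops cleanly, instead of raising KeyError, once a response no longer carries nextPageToken. It also counts the items of that last page, which the script above drops. The names fetch_page, collect_ids and max_pages are hypothetical, the items[].id.videoId shape is assumed from the other snippets in this thread, and the fetch parameter exists only so the loop can be exercised without network access.

```python
import json
import urllib.parse
import urllib.request

# Same instance and endpoint as the script above.
BASE_URL = 'https://yt.lemnoslife.com/search'


def fetch_page(params):
    """Fetch one page of search results and return the decoded JSON."""
    url = BASE_URL + '?' + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def collect_ids(query, fetch=fetch_page, max_pages=100):
    """Follow nextPageToken until it is missing; return unique video ids in order."""
    params = {'part': 'id', 'q': query, 'type': 'video'}
    ordered = []
    seen = set()
    # max_pages is a hypothetical safety cap so the loop always terminates.
    for _ in range(max_pages):
        data = fetch(params)
        # Assumed response shape: items[].id.videoId.
        for item in data.get('items', []):
            video_id = item['id']['videoId']
            if video_id not in seen:
                seen.add(video_id)
                ordered.append(video_id)
        token = data.get('nextPageToken')
        if not token:
            # No further page: this is where the original script raised KeyError.
            break
        params['pageToken'] = token
    return ordered
```

Calling collect_ids('hololive') would query the live instance; passing a stub as fetch makes it possible to check the stopping condition offline.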

Related code: https://github.com/Benjamin-Loison/YouTube-operational-API/blob/9b5a7805834fd56f12afc1fb55e439a68a5a787f/search.php#L105-L117

Benjamin-Loison commented 2 years ago

YouTube UI search by query term (Test here), when filtered to retrieve only videos, stopped after 549 results (counted by searching for "ago" as a whole word; counting with "view" gives a similar result)...

When not filtering to retrieve only videos, it stopped after 654 results (counted by searching for "ago" as a whole word). Filtering with:

So this issue can't be solved easily AFAIK, as the issue (a limitation, in fact) is on YouTube's end.

Benjamin-Loison commented 2 years ago

Could try forcing the YouTube Data API v3 Search: list endpoint by providing a modified page token once it has been reverse-engineered, as it doesn't contain randomness AFAIK.

This code snippet looks like what I am looking for.

This issue is quite similar to this Stack Overflow question.

Better understanding the pageToken may help to solve this question.

Benjamin-Loison commented 1 year ago

Similar issue with the Community tab: https://stackoverflow.com/questions/76699812/how-do-i-get-youtube-community-posts-older-than-200#comment135264020_76699812

Benjamin-Loison commented 1 year ago

#190 could help concerning the pagination token.

Benjamin-Loison commented 6 months ago

import requests
import json

pageToken = ''
ids = set()
url = 'https://yt.lemnoslife.com/noKey/search'
params = {
    'q': 'test',
    'type': 'video',
    'maxResults': 50,
}

while True:
    params['pageToken'] = pageToken
    data = requests.get(url, params = params).json()
    pageToken = data['nextPageToken']
    items = data['items']
    for item in items:
        id_ = item['id']['videoId']
        ids.add(id_)
    print(len(ids))

# reached 518 before KeyError: 'nextPageToken'

import requests
import json
import blackboxprotobuf
import base64

typedef = {
    '1': {
        'type': 'int'
    },
    '2': {
        'type': 'int'
    }
}

pageToken = ''
ids = set()
url = 'https://yt.lemnoslife.com/noKey/search'
maxResults = 50
params = {
    'q': 'test',
    'type': 'video',
    'maxResults': maxResults,
}
requestIndex = 0

while True:
    message = {
        '1': requestIndex * maxResults,
        '2': 0,
    }

    data = blackboxprotobuf.encode_message(message, typedef)
    pageToken = base64.b64encode(data).decode('utf-8')

    params['pageToken'] = pageToken
    print(pageToken)
    data = requests.get(url, params = params).json()
    items = data['items']
    for item in items:
        id_ = item['id']['videoId']
        ids.add(id_)
    print(len(ids))
    requestIndex += 1

# reaches 510 and stays stuck there
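
For reference, the token built in the script above is small enough to encode by hand, which makes it easier to see what each request sends. The following is a sketch using only the standard library; encode_varint and make_page_token are hypothetical helper names, and the two-field layout (field 1 as the result offset, field 2 as 0) is taken from the typedef above, not from any documented format.

```python
import base64


def encode_varint(value):
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            # More bytes follow: set the continuation bit.
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def make_page_token(offset):
    """Build the base64 token for the message {1: offset, 2: 0}."""
    # Tag byte = (field_number << 3) | wire_type, with wire type 0 for varints,
    # so field 1 is 0x08 and field 2 is 0x10.
    message = b'\x08' + encode_varint(offset) + b'\x10' + encode_varint(0)
    return base64.b64encode(message).decode('utf-8')


# Print the token for the first few offsets used with maxResults = 50.
for offset in (0, 50, 100, 150):
    print(offset, make_page_token(offset))
```

Comparing this output with the pageToken values printed by the script above is a quick way to confirm that both encodings agree.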

Should test with YouTube UI pagination as well.

curl -s 'https://yt.lemnoslife.com/search?part=id&q=test&type=video' | jq .items[].id.videoId
curl -s 'https://yt.lemnoslife.com/search?part=id&q=test&type=video' | jq .nextPageToken
curl -s 'https://yt.lemnoslife.com/search?part=id&q=test&type=video' | jq -r .nextPageToken | base64 -d | protoc --decode_raw
2 {
  2: "test"
  3: "EgIQAUgUggELOUJ2eVkyX3c2RG-CAQtkYmpQblhhYWNBVYIBCzdjQ3BaS2ZkN1hBggELNWN5c1BQblpFaE2CAQtCREJ5aXZtclZ1TYIBCzJhNFV4ZHk5VFFZggELZzRReUp1MDlrdE2CAQtNNy1oM0ZPLUtLb4IBC0k1OEp5dEpFZmRzggELdTB3dVlZbnFkNzSCAQt5ck45Nm1nbkVsMIIBC0t3ZXZvY2FYZktnggELbUpWV1gwdnVkLWeCAQtlamFJTTNHcWVzd4IBCzczWUcwb2xOWFdvggELMU9fZURSOGZCUlGCAQt5aFM5TG5Eb29fd4IBC3ZlUGM1VjRoX2tnggELX1RYLS1Ga3U5TlGCAQtaeFlaa3oyMGxZQbIBBgoECBcQAuoBBAgCECg%3D"
}
3: 52047873
4: "search-feed"

When repeating the command, field 3 is identical but field 2.3 differs. Is field 2.3 made of parts separated by "-", "CAQ" or similar?

echo -n 'EgIQAUgUggELOUJ2eVkyX3c2RG' | base64 -d
H�
  9BvyY2_w6Dbase64: invalid input
echo -n 'EgIQAUgUggELOUJ2eVkyX3c2RG=' | base64 -d
H�
  9BvyY2_w6Dbase64: invalid input
echo -n 'EgIQAUgUggELOUJ2eVkyX3c2RG==' | base64 -d
H�
  9BvyY2_w6D
echo -n 'EgIQAUgUggELOUJ2eVkyX3c2RG==' | base64 -d | protoc --decode_raw
Failed to parse input.

EgIQAUgUggELOUJ2eVkyX3c2RG-
CAQtkYmpQblhhYWNBVYIBCzdjQ3BaS2ZkN1hBggELNWN5c1BQblpFaE2
CAQtCREJ5aXZtclZ1TYIBCzJhNFV4ZHk5VFFZggELZzRReUp1MDlrdE2
CAQtNNy1oM0ZPLUtLb4IBC0k1OEp5dEpFZmRzggELdTB3dVlZbnFkNzS
CAQt5ck45Nm1nbkVsMIIBC0t3ZXZvY2FYZktnggELbUpWV1gwdnVkLWe
CAQtlamFJTTNHcWVzd4IBCzczWUcwb2xOWFdvggELMU9fZURSOGZCUlG
CAQt5aFM5TG5Eb29fd4IBC3ZlUGM1VjRoX2tnggELX1RYLS1Ga3U5TlG
CAQtaeFlaa3oyMGxZQbIBBgoECBcQAuoBBAgCECg%3D
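
One thing worth checking on the value above: "-" and "_" are regular characters of the URL-safe base64 alphabet, and "%3D" is a URL-encoded "=", so the string may be a single base64 payload that was cut mid-stream in the attempts above and not a list of separated parts. The sketch below decodes the whole of field 2.3 with the standard library. decode_inner and embedded_video_ids are hypothetical names, and the byte pattern 82 01 0b in front of each 11-character id is only an observation on this one sample, not a documented format.

```python
import base64
import re
import urllib.parse

# Field 2.3 of the decoded token shown above, copied verbatim.
INNER = (
    'EgIQAUgUggELOUJ2eVkyX3c2RG-'
    'CAQtkYmpQblhhYWNBVYIBCzdjQ3BaS2ZkN1hBggELNWN5c1BQblpFaE2'
    'CAQtCREJ5aXZtclZ1TYIBCzJhNFV4ZHk5VFFZggELZzRReUp1MDlrdE2'
    'CAQtNNy1oM0ZPLUtLb4IBC0k1OEp5dEpFZmRzggELdTB3dVlZbnFkNzS'
    'CAQt5ck45Nm1nbkVsMIIBC0t3ZXZvY2FYZktnggELbUpWV1gwdnVkLWe'
    'CAQtlamFJTTNHcWVzd4IBCzczWUcwb2xOWFdvggELMU9fZURSOGZCUlG'
    'CAQt5aFM5TG5Eb29fd4IBC3ZlUGM1VjRoX2tnggELX1RYLS1Ga3U5TlG'
    'CAQtaeFlaa3oyMGxZQbIBBgoECBcQAuoBBAgCECg%3D'
)


def decode_inner(value):
    """URL-decode the value, then decode it as URL-safe base64."""
    unquoted = urllib.parse.unquote(value)  # turns '%3D' back into '='
    stripped = unquoted.rstrip('=')
    # Re-add exactly the padding that base64 requires for this length.
    padded = stripped + '=' * (-len(stripped) % 4)
    return base64.urlsafe_b64decode(padded)


def embedded_video_ids(raw):
    """Return the 11-character strings that follow the bytes 82 01 0b."""
    # Assumption from this sample only: 82 01 is the tag of field 16 with a
    # length-delimited wire type, and 0b is the length (11).
    matches = re.findall(rb'\x82\x01\x0b([0-9A-Za-z_-]{11})', raw)
    return [match.decode('ascii') for match in matches]


raw = decode_inner(INNER)
ids = embedded_video_ids(raw)
print(len(raw), 'bytes,', len(ids), 'embedded ids')
print(ids[0])
```

If this reading holds, the part that changes between two runs of the command would be a list of video ids embedded in the token, which could explain why field 2.3 differs while field 3 stays the same.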
curl -s "https://yt.lemnoslife.com/search?part=id&q=test&type=video&pageToken=`curl -s 'https://yt.lemnoslife.com/search?part=id&q=test&type=video' | jq -r .nextPageToken`" | jq .items[].id.videoId
curl -s "https://yt.lemnoslife.com/search?part=id&q=test&type=video&pageToken=`curl -s 'https://yt.lemnoslife.com/search?part=id&q=test&type=video' | jq -r .nextPageToken`" | jq .nextPageToken
null

This should not happen.

Benjamin-Loison commented 6 months ago

protoc --help
mkdir test/ && protoc test.proto --php_out test/
php a.php
PHP Fatal error:  Uncaught Error: Class "GPBMetadata\A" not found in /home/benjamin/protobuf/message.php:34
Stack trace:
#0 /home/benjamin/protobuf/a.php(7): message->__construct()
#1 {main}
  thrown in /home/benjamin/protobuf/message.php on line 34

Commenting out \GPBMetadata\A::initOnce(); leads to:

PHP Fatal error:  Uncaught InvalidArgumentException: message is not found in descriptor pool. Only generated classes may derive from Message. in /home/benjamin/protobuf/vendor/google/protobuf/src/Google/Protobuf/Internal/Message.php:74
Stack trace:
#0 /home/benjamin/protobuf/vendor/google/protobuf/src/Google/Protobuf/Internal/Message.php(55): Google\Protobuf\Internal\Message->initWithGeneratedPool()
#1 /home/benjamin/protobuf/message.php(35): Google\Protobuf\Internal\Message->__construct()
#2 /home/benjamin/protobuf/a.php(7): message->__construct()
#3 {main}
  thrown in /home/benjamin/protobuf/vendor/google/protobuf/src/Google/Protobuf/Internal/Message.php on line 74