Open Benjamin-Loison opened 2 years ago
A YouTube UI search by query term ("Test" here), when filtering to retrieve only videos, stopped after 549 results (counted by filtering "ago" as a whole word; filtering with "view" gives a similar result)...
Without the videos-only filter, it stopped after 654 results (again counted by filtering "ago" as a whole word).
Filtering with:
- "view" gives 702 matches
- "VIEW FULL PLAYLIST" gives 2 matches, so we can guess that this is the number of playlists
- "subscriber" gives 44 matches, so we can guess that this is the number of channels

So this issue can't be solved easily AFAIK, as the issue (a limitation, in fact) is on YouTube's end.
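The counts above come from searching the rendered page for a marker word. A minimal sketch of the same whole-word count in Python, assuming the page text has been copied into a string (the sample text and the function name count_whole_word are made up for illustration):

```python
import re


def count_whole_word(text: str, word: str) -> int:
    """Count occurrences of `word` in `text` delimited by word boundaries."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text))


# Hypothetical excerpt of text copied from a results page.
sample = "Test video 1.2K views 3 days ago Test channel 10 subscribers agora 1 year ago"
print(count_whole_word(sample, "ago"))
```

Each video entry shows one "... ago" label, so the whole-word count of "ago" approximates the number of videos, while "agora" and similar words are not counted.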
Could try forcing the YouTube Data API v3 Search: list endpoint by providing a modified page token after having reverse-engineered it, as it doesn't contain randomness AFAIK.
This code snippet looks like what I am looking for.
This issue is quite similar to this Stack Overflow question.
Better understanding the pageToken may help to solve this question.
Similar issue with the Community tab: https://stackoverflow.com/questions/76699812/how-do-i-get-youtube-community-posts-older-than-200#comment135264020_76699812
import requests
import json
pageToken = ''
ids = set()
url = 'https://yt.lemnoslife.com/noKey/search'
params = {
    'q': 'test',
    'type': 'video',
    'maxResults': 50,
}
while True:
    params['pageToken'] = pageToken
    data = requests.get(url, params = params).json()
    pageToken = data['nextPageToken']
    items = data['items']
    for item in items:
        id_ = item['id']['videoId']
        ids.add(id_)
    print(len(ids))
# reached 518 before KeyError: 'nextPageToken'
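The loop above ends with a KeyError once the API stops returning nextPageToken. A possible variant that stops cleanly in that case is sketched below. The names collect_video_ids and fetch_from_api and the fake pages are made up for illustration; only the endpoint URL and parameters come from the script above.

```python
from typing import Callable, Dict, Optional, Set


def collect_video_ids(fetch_page: Callable[[Optional[str]], Dict]) -> Set[str]:
    """Follow nextPageToken until it is missing and return the unique video ids."""
    ids: Set[str] = set()
    page_token: Optional[str] = None
    while True:
        data = fetch_page(page_token)
        for item in data.get("items", []):
            video_id = item.get("id", {}).get("videoId")
            if video_id:
                ids.add(video_id)
        page_token = data.get("nextPageToken")
        if not page_token:
            # No further page advertised: stop instead of raising KeyError.
            return ids


def fetch_from_api(page_token: Optional[str]) -> Dict:
    """Fetch one page from the endpoint used above (needs `requests`)."""
    import requests  # imported lazily so the offline demo below runs without it

    params = {"q": "test", "type": "video", "maxResults": 50}
    if page_token:
        params["pageToken"] = page_token
    return requests.get("https://yt.lemnoslife.com/noKey/search", params=params).json()


# Offline demo with two hypothetical pages; the second one has no nextPageToken.
fake_pages = {
    None: {"items": [{"id": {"videoId": "a"}}], "nextPageToken": "p2"},
    "p2": {"items": [{"id": {"videoId": "b"}}, {"id": {"videoId": "a"}}]},
}
print(len(collect_video_ids(fake_pages.get)))
```

Calling collect_video_ids(fetch_from_api) would run the same loop against the real endpoint.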
import requests
import json
import blackboxprotobuf
import base64
typedef = {
    '1': {
        'type': 'int'
    },
    '2': {
        'type': 'int'
    }
}
pageToken = ''
ids = set()
url = 'https://yt.lemnoslife.com/noKey/search'
maxResults = 50
params = {
    'q': 'test',
    'type': 'video',
    'maxResults': maxResults,
}
requestIndex = 0
while True:
    message = {
        '1': requestIndex * maxResults,
        '2': 0,
    }
    data = blackboxprotobuf.encode_message(message, typedef)
    pageToken = base64.b64encode(data).decode('utf-8')
    params['pageToken'] = pageToken
    print(pageToken)
    data = requests.get(url, params = params).json()
    items = data['items']
    for item in items:
        id_ = item['id']['videoId']
        ids.add(id_)
    print(len(ids))
    requestIndex += 1
# reach and stuck to 510
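To inspect the forged tokens without blackboxprotobuf, the same two-field message can be encoded by hand. This is a minimal sketch assuming the token layout used in the script above (field 1 = result offset, field 2 = 0, both varints). The function names are made up for illustration, and nothing here is confirmed by YouTube documentation.

```python
import base64


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf varint."""
    if value < 0:
        raise ValueError("only non-negative values are supported")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes, pos: int):
    """Decode a varint starting at `pos`; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_page_token(offset: int) -> str:
    """Build a token with field 1 = offset and field 2 = 0 (assumed layout)."""
    raw = b"\x08" + encode_varint(offset) + b"\x10" + encode_varint(0)
    return base64.b64encode(raw).decode("ascii")


def decode_page_token(token: str) -> dict:
    """Decode a token that contains only varint fields."""
    raw = base64.b64decode(token)
    fields = {}
    pos = 0
    while pos < len(raw):
        tag, pos = decode_varint(raw, pos)
        if tag & 0x07 != 0:
            raise ValueError("only varint fields are handled in this sketch")
        fields[tag >> 3], pos = decode_varint(raw, pos)
    return fields


for offset in (0, 50, 500):
    token = encode_page_token(offset)
    print(offset, token, decode_page_token(token))
```

Decoding a nextPageToken returned by the API with decode_page_token and comparing it with the forged one for the same offset would show whether the two layouts really match.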
Should test with YouTube UI pagination as well.
curl -s 'https://yt.lemnoslife.com/search?part=id&q=test&type=video' | jq .items[].id.videoId
curl -s 'https://yt.lemnoslife.com/search?part=id&q=test&type=video' | jq .nextPageToken
curl -s 'https://yt.lemnoslife.com/search?part=id&q=test&type=video' | jq -r .nextPageToken | base64 -d | protoc --decode_raw
2 {
2: "test"
3: "EgIQAUgUggELOUJ2eVkyX3c2RG-CAQtkYmpQblhhYWNBVYIBCzdjQ3BaS2ZkN1hBggELNWN5c1BQblpFaE2CAQtCREJ5aXZtclZ1TYIBCzJhNFV4ZHk5VFFZggELZzRReUp1MDlrdE2CAQtNNy1oM0ZPLUtLb4IBC0k1OEp5dEpFZmRzggELdTB3dVlZbnFkNzSCAQt5ck45Nm1nbkVsMIIBC0t3ZXZvY2FYZktnggELbUpWV1gwdnVkLWeCAQtlamFJTTNHcWVzd4IBCzczWUcwb2xOWFdvggELMU9fZURSOGZCUlGCAQt5aFM5TG5Eb29fd4IBC3ZlUGM1VjRoX2tnggELX1RYLS1Ga3U5TlGCAQtaeFlaa3oyMGxZQbIBBgoECBcQAuoBBAgCECg%3D"
}
3: 52047873
4: "search-feed"
When repeating the command, I get an identical field 3 but a different field 2/3.
Is 2/3 separated by "-" or "CAQ" or similar?
echo -n 'EgIQAUgUggELOUJ2eVkyX3c2RG' | base64 -d
H�
9BvyY2_w6Dbase64: invalid input
echo -n 'EgIQAUgUggELOUJ2eVkyX3c2RG=' | base64 -d
H�
9BvyY2_w6Dbase64: invalid input
echo -n 'EgIQAUgUggELOUJ2eVkyX3c2RG==' | base64 -d
H�
9BvyY2_w6D
echo -n 'EgIQAUgUggELOUJ2eVkyX3c2RG==' | base64 -d | protoc --decode_raw
Failed to parse input.
EgIQAUgUggELOUJ2eVkyX3c2RG-
CAQtkYmpQblhhYWNBVYIBCzdjQ3BaS2ZkN1hBggELNWN5c1BQblpFaE2
CAQtCREJ5aXZtclZ1TYIBCzJhNFV4ZHk5VFFZggELZzRReUp1MDlrdE2
CAQtNNy1oM0ZPLUtLb4IBC0k1OEp5dEpFZmRzggELdTB3dVlZbnFkNzS
CAQt5ck45Nm1nbkVsMIIBC0t3ZXZvY2FYZktnggELbUpWV1gwdnVkLWe
CAQtlamFJTTNHcWVzd4IBCzczWUcwb2xOWFdvggELMU9fZURSOGZCUlG
CAQt5aFM5TG5Eb29fd4IBC3ZlUGM1VjRoX2tnggELX1RYLS1Ga3U5TlG
CAQtaeFlaa3oyMGxZQbIBBgoECBcQAuoBBAgCECg%3D
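A possible explanation for the failed attempts above: "-" and "_" belong to the URL-safe Base64 alphabet and "%3D" is a percent-encoded "=", so the string is probably meant to be decoded as a whole rather than split. Decoding its first characters by hand gives a field 2 submessage, field 9 = 20, then a field 16 entry of length 11 starting with "9BvyY2_w6D", which looks like a video id. The recurring "ggEL" and "CAQt" would then be the same three bytes (tag of field 16 plus length 11) at different Base64 alignments. The sketch below applies that reading to a synthetic string built with placeholder ids; the function names are made up for illustration.

```python
import base64
from urllib.parse import quote, unquote


def read_varint(data: bytes, pos: int):
    """Decode a protobuf varint at `pos`; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_fields(data: bytes):
    """Return (field number, wire type, value) tuples for a raw protobuf message."""
    fields = []
    pos = 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:  # varint
            value, pos = read_varint(data, pos)
        elif wire_type == 2:  # length-delimited
            length, pos = read_varint(data, pos)
            value = data[pos:pos + length]
            if len(value) != length:
                raise ValueError("truncated length-delimited field")
            pos += length
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields.append((number, wire_type, value))
    return fields


def decode_continuation(encoded: str) -> bytes:
    """Undo percent-encoding, restore padding, then URL-safe Base64 decode."""
    text = unquote(encoded)
    text += "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(text)


# Synthetic message mimicking the observed layout, with placeholder ids.
raw = (
    b"\x12\x02\x10\x01"                # field 2: submessage
    + b"\x48\x14"                      # field 9: varint 20
    + b"\x82\x01\x0b" + b"AAAAAAAAAAA" # field 16: 11-byte placeholder id
    + b"\x82\x01\x0b" + b"BBBBBBBBBBB" # field 16: 11-byte placeholder id
)
encoded = quote(base64.urlsafe_b64encode(raw).decode("ascii"), safe="")
print(encoded)
for number, wire_type, value in parse_fields(decode_continuation(encoded)):
    print(number, wire_type, value)
```

If this reading is right, passing the whole 2/3 string from the dump to decode_continuation and parse_fields should list the video ids already returned, which would explain why the value changes between requests.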
curl -s "https://yt.lemnoslife.com/search?part=id&q=test&type=video&pageToken=`curl -s 'https://yt.lemnoslife.com/search?part=id&q=test&type=video' | jq -r .nextPageToken`" | jq .items[].id.videoId
curl -s "https://yt.lemnoslife.com/search?part=id&q=test&type=video&pageToken=`curl -s 'https://yt.lemnoslife.com/search?part=id&q=test&type=video' | jq -r .nextPageToken`" | jq .nextPageToken
null
This should not happen.
protoc --help
mkdir test/ && protoc test.proto --php_out test/
php a.php
PHP Fatal error: Uncaught Error: Class "GPBMetadata\A" not found in /home/benjamin/protobuf/message.php:34
Stack trace:
#0 /home/benjamin/protobuf/a.php(7): message->__construct()
#1 {main}
thrown in /home/benjamin/protobuf/message.php on line 34
Commenting out \GPBMetadata\A::initOnce(); leads to:
PHP Fatal error: Uncaught InvalidArgumentException: message is not found in descriptor pool. Only generated classes may derive from Message. in /home/benjamin/protobuf/vendor/google/protobuf/src/Google/Protobuf/Internal/Message.php:74
Stack trace:
#0 /home/benjamin/protobuf/vendor/google/protobuf/src/Google/Protobuf/Internal/Message.php(55): Google\Protobuf\Internal\Message->initWithGeneratedPool()
#1 /home/benjamin/protobuf/message.php(35): Google\Protobuf\Internal\Message->__construct()
#2 /home/benjamin/protobuf/a.php(7): message->__construct()
#3 {main}
thrown in /home/benjamin/protobuf/vendor/google/protobuf/src/Google/Protobuf/Internal/Message.php on line 74
YouTube Data API v3 Search: list endpoint is limited to 500 results:
Source: Search: list#channelId
Note that this 500-result limit seems to apply beyond the case described in the documentation.
It seems possible to fetch more than 500 results from the YT UI (this would need a small tool checking that from the page source code after having scrolled manually); this issue shouldn't come from my reverse-engineering code.
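The small tool mentioned above could look like the following sketch. It assumes the page source saved after scrolling contains links of the form watch?v= followed by an 11-character id; the function name and the sample HTML are made up for illustration.

```python
import re

# An 11-character video id following "watch?v=" (assumed link format).
VIDEO_LINK = re.compile(r"watch\?v=([A-Za-z0-9_-]{11})")


def unique_video_ids(html: str) -> set:
    """Return the set of distinct video ids linked from the given page source."""
    return set(VIDEO_LINK.findall(html))


# Hypothetical fragment of a saved results page, with one duplicated link.
sample_html = (
    '<a href="/watch?v=AAAAAAAAAAA">first</a>'
    '<a href="/watch?v=BBBBBBBBBBB&t=3s">second</a>'
    '<a href="/watch?v=AAAAAAAAAAA">first again</a>'
)
print(len(unique_video_ids(sample_html)))
```

Running unique_video_ids on the saved source of a fully scrolled results page would give a count of unique videos to compare with the 500 limit of the API.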
If achieved that would help:
Could complete this list.
Related code: https://github.com/Benjamin-Loison/YouTube-operational-API/blob/9b5a7805834fd56f12afc1fb55e439a68a5a787f/search.php#L105-L117