Benjamin-Loison opened this issue 5 months ago
Note that to simplify Protobuf messages I usually proceed level by level, as it helps me prototype.
Neither of the following works:
data = base64.b64decode(base64.urlsafe_b64decode('Q2c5RFNVTmpiVXRCUjBWTGFsWm9Wa0VRQUE='), altchars = '-/')
data = base64.urlsafe_b64decode(base64.urlsafe_b64decode('Q2c5RFNVTmpiVXRCUjBWTGFsWm9Wa0VRQUE='))
However, considering Cg9DSUNjbUtBR0VLalZoVkEQAA== as Protobuf works fine!
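A minimal, stdlib-only sketch of the nesting observed above (manual varint decoding instead of blackboxprotobuf; the field interpretations are my reading of the captured token):

```python
import base64

def read_varint(buf, i):
    # Decode a protobuf varint starting at offset i; return (value, next offset).
    value, shift = 0, 0
    while buf[i] & 0x80:
        value |= (buf[i] & 0x7F) << shift
        shift += 7
        i += 1
    return value | (buf[i] << shift), i + 1

# Level 1: the token is plain base64 whose payload is itself a base64 string,
# which is why the urlsafe_b64decode / altchars attempts are unnecessary here.
level1 = base64.b64decode('Q2c5RFNVTmpiVXRCUjBWTGFsWm9Wa0VRQUE=').decode('ascii')
print(level1)  # Cg9DSUNjbUtBR0VLalZoVkEQAA

# Level 2: that payload, once padded, is a protobuf message whose field 1 is a
# length-delimited string and whose field 2 is a varint.
level2 = base64.b64decode(level1 + '==')
assert level2[0] == 0x0A  # field 1, wire type 2 (length-delimited)
length = level2[1]
field1 = level2[2:2 + length].decode('ascii')
print(field1)  # CICcmKAGEKjVhVA

# Level 3: field 1 is yet another base64 string wrapping a protobuf whose
# field 1 is a varint that looks like a Unix timestamp.
level3 = base64.b64decode(field1 + '=')
_tag, i = read_varint(level3, 0)       # 0x08: field 1, varint
timestamp, i = read_varint(level3, i)
print(timestamp)  # 1678118400
```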
The question is where this timestamp comes from.
Note that the community post id does not seem to be just base64 encoded, nor Protobuf.
1678000128 works but 1678000127 returns "This channel hasn't posted yet".
I should check the timestamp again on the next community post publication to see if it still works.
I set up a notification system to notify me once a new community post is published on this channel:
{
'url': 'https://yt.lemnoslife.com/channels?part=community&id=UCpVm7bg6pXKo1Pr6k5kxG9A',
'message': 'New YouTube community post, consider investigating again https://github.com/Benjamin-Loison/YouTube-operational-API/issues/257',
'filterText': lambda text : json.loads(text)['items'][0]['community'][0]['id'],
},
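For context, a hypothetical sketch of how such a watch entry could be consumed (the `check` helper and the simulated response body are mine, not part of the actual notifier):

```python
import json

# Watch entry as in the snippet above; filterText extracts the value whose
# change should trigger a notification.
watch = {
    'url': 'https://yt.lemnoslife.com/channels?part=community&id=UCpVm7bg6pXKo1Pr6k5kxG9A',
    'message': 'New YouTube community post, consider investigating again',
    'filterText': lambda text: json.loads(text)['items'][0]['community'][0]['id'],
}

def check(watch, text, last_value):
    # Return (new value, whether to notify) given the latest response body.
    value = watch['filterText'](text)
    return value, value != last_value

# Simulated API response body carrying a hypothetical most recent post id.
body = json.dumps({'items': [{'community': [{'id': 'UgkxExampleId'}]}]})
value, notify = check(watch, body, last_value = None)
print(value, notify)  # UgkxExampleId True
```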
Multiple community posts may have been published, as I stopped the notification algorithm for a while.
Well, this timestamp does not work anymore and the next page timestamp is now 1679071734.
date -d @1678118400
Mon Mar 6 05:00:00 PM CET 2023
date -d @1679071734
Fri Mar 17 05:48:54 PM CET 2023
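The same conversions in Python, in UTC rather than CET (just to double-check the `date` output):

```python
from datetime import datetime, timezone

# Convert the two continuation timestamps to UTC datetimes.
first = datetime.fromtimestamp(1678118400, tz = timezone.utc)
second = datetime.fromtimestamp(1679071734, tz = timezone.utc)
print(first)   # 2023-03-06 16:00:00+00:00
print(second)  # 2023-03-17 16:48:54+00:00
```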
The latter date does not seem related to the current date, and it is not clear whether it is the post date, as we only know that the post was published 1 year ago, as channels?part=community
seems to confirm.
print(json.dumps(getCommunityPost(1678118400 + 3600 * 24 * 25)['continuationContents']['itemSectionContinuation']['contents'][0]))
print(getCommunityPost(1678118400 + 3600 * 24 * 25)['continuationContents']['itemSectionContinuation']['contents'][0]['backstagePostThreadRenderer']['post']['backstagePostRenderer']['postId'])
UgkxWv2lH-5Jn02GWCPtgSz7nieXeclHLoG4
len(communityPosts)
200
'UgkxWv2lH-5Jn02GWCPtgSz7nieXeclHLoG4' in communityPosts
True
This seems to show that using a timestamp older than the one normally provided does not help.
Would answer the Stack Overflow question 78985164.
To verify again that we are unable to get more than 200 community posts, we can list the first 200 and then, starting from the last continuation timestamp, step back one hour at a time for a significant amount of time, showing that no community post published before is retrievable. To make sure, we could go back until the channel creation date. Note that this assumes a community post was published more than one hour before the first retrievable one, which seems a reasonable assumption if some posts have been posted.
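This back-stepping strategy can be sketched as follows; `probe()` is a hypothetical stand-in for a `getCommunity()`-style call that reports whether a page of posts exists for a given continuation timestamp (here simulated with a fixed oldest retrievable post):

```python
def probe(timestamp):
    # Hypothetical stand-in for getCommunity(timestamp): returns True when the
    # continuation for this timestamp yields at least one community post.
    # Simulates a channel whose oldest retrievable post is at OLDEST.
    OLDEST = 1_678_118_400
    return timestamp >= OLDEST

def find_oldest_retrievable(start, channel_creation):
    # Step back one hour at a time from the last continuation timestamp until
    # no community post is retrievable anymore (or channel creation is reached).
    timestamp = start
    while timestamp > channel_creation and probe(timestamp - 3600):
        timestamp -= 3600
    return timestamp

oldest = find_oldest_retrievable(1_679_071_734, 1_500_000_000)
print(oldest)  # 1678121334, within one hour of the simulated oldest post
```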
Based on #295#issuecomment-2267626825:
Consider the last params['pageToken']
when iterating over the YouTube operational API channels?part=community:
import requests
import blackboxprotobuf
import base64
import time
import math
from datetime import datetime, timedelta
import re
from enum import Enum, auto
import urllib.parse as ul
CHANNEL_ID = 'UCpVm7bg6pXKo1Pr6k5kxG9A'
YOUTUBE_OPERATIONAL_API_INSTANCE_URL = 'http://localhost/YouTube-operational-API'
def getBase64Protobuf(message, typedef):
data = blackboxprotobuf.encode_message(message, typedef)
return base64.b64encode(data).decode('ascii')
def getContinuation(timestamp):
message = {
'1': timestamp,
}
typedef = {
'1': {
'type': 'int'
},
}
one = getBase64Protobuf(message, typedef)
message = {
'1': one,
}
typedef = {
'1': {
'type': 'string'
},
}
one = base64.b64encode(getBase64Protobuf(message, typedef).encode('ascii'))
message = {
'2': 'community',
'53': {
'1': one,
},
}
typedef = {
'2': {
'type': 'string'
},
'53': {
'type': 'message',
'message_typedef': {
'1': {
'type': 'string'
},
},
},
}
three = getBase64Protobuf(message, typedef)
message = {
'80226972': {
'2': CHANNEL_ID,
'3': three,
}
}
typedef = {
'80226972': {
'type': 'message',
'message_typedef': {
'2': {
'type': 'string'
},
'3': {
'type': 'string'
},
},
'field_order': [
'2',
'3',
]
}
}
continuation = getBase64Protobuf(message, typedef)
return continuation
def getCommunity(timestamp):
continuation = getContinuation(timestamp)
json_data = {
'context': {
'client': {
'clientName': 'WEB',
'clientVersion': '2.20240731.04.00',
},
},
'continuation': continuation,
}
response = requests.post('https://www.youtube.com/youtubei/v1/browse', json = json_data)
return response.json()
def getApi(url, params):
return requests.get(YOUTUBE_OPERATIONAL_API_INSTANCE_URL + f'/{url}', params).json()
params = {
'part': 'about',
'id': CHANNEL_ID,
}
channelJoinedDateTime = getApi('channels', params)['items'][0]['about']['stats']['joinedDate']
HOURS_AGO = re.compile(r'(\d+) hours? ago')
DAYS_AGO = re.compile(r'(\d+) days? ago')
MONTHS_AGO = re.compile(r'(\d+) months? ago')
class Approximation(Enum):
UPPER = auto()
LOWER = auto()
# To avoid possible time shift issue in community post date string and pagination.
MOST_TIME_SHIFT = timedelta(days = 1)
def getTimeDelta(timeDeltaStr, approximation):
hoursAgoMatch = HOURS_AGO.match(timeDeltaStr)
relativeOffsetUnit = -1 if approximation is Approximation.UPPER else 1
relativeOffset = relativeOffsetUnit * MOST_TIME_SHIFT
if hoursAgoMatch is not None:
myTimedelta = timedelta(hours = int(hoursAgoMatch[1]) + relativeOffsetUnit)
daysAgoMatch = DAYS_AGO.match(timeDeltaStr)
if daysAgoMatch is not None:
myTimedelta = timedelta(days = int(daysAgoMatch[1]) + relativeOffsetUnit)
monthsAgoMatch = MONTHS_AGO.match(timeDeltaStr)
#if monthsAgoMatch is not None:
#    # A month lasts between 28 and 31 days.
#    myTimedelta = timedelta(days = 31 * int(monthsAgoMatch[1]) + relativeOffsetUnit)
return int((myTimedelta + relativeOffset).total_seconds())
def decodeBase64Protobuf(base64Protobuf):
#print(ul.unquote_plus(base64Protobuf))
data = base64.b64decode(ul.unquote_plus(base64Protobuf) + '==', altchars = '-_')
message = blackboxprotobuf.decode_message(data)[0]
return message
def getTimestampFromPageToken(pageToken):
message = decodeBase64Protobuf(pageToken)
#print(json.dumps(message['80226972']['3'], indent = 4))
message = decodeBase64Protobuf(message['80226972']['3'])
message = decodeBase64Protobuf(base64.b64decode(message['53']['1'], altchars = '-_').decode('ascii'))
message = decodeBase64Protobuf(message['1'])
#print(json.dumps(message, indent = 4))
return message['1']
communityPosts = []
params = {
'part': 'community',
'id': CHANNEL_ID,
}
communityPostIds = set()
while True:
data = getApi('channels', params)
item = data['items'][0]
for communityPost in item['community']:
communityPosts += [{
'id': communityPost['id'],
'date': communityPost['date'],
}]
communityPostIds.add(communityPost['id'])
print(len(communityPostIds))
if not 'nextPageToken' in item:
break
nextPageToken = item['nextPageToken']
#print('received', nextPageToken)
timestamp = getTimestampFromPageToken(nextPageToken)# + 1
print(timestamp)
#nextPageToken = getContinuation(timestamp)
#print('sent', nextPageToken)
params['pageToken'] = nextPageToken
currentTimestamp = time.time()
currentTimestampCeil = math.ceil(currentTimestamp)
communityPostIds = set()
def getCommunityPosts(timestamp, approximation):
shift = 1
while True:
community = getCommunity(timestamp + (1 if approximation is Approximation.UPPER else -1) * shift)
items = community['continuationContents']['itemSectionContinuation']['contents']
content = items[0]
if 'messageRenderer' in content and content['messageRenderer']['text']['runs'][0]['text'] == "This channel hasn't posted yet":
#if shift == 1:
# print('2 community posts missing in a row!')
#print(f'{shift=}')
shift *= 2
continue
print(f'{len(items)=}')
for item in items:
if not 'backstagePostThreadRenderer' in item:
#print(json.dumps(item, indent = 4))
continuationToken = item['continuationItemRenderer']['continuationEndpoint']['continuationCommand']['token']
#print(continuationToken)
timestamp = getTimestampFromPageToken(continuationToken)
print(f'{timestamp=}')
return getCommunityPosts(timestamp, approximation)
itemId = item['backstagePostThreadRenderer']['post']['backstagePostRenderer']['postId']
communityPostIds.add(itemId)
print(f'{len(communityPostIds)=} {itemId=}')
break
getCommunityPosts(currentTimestampCeil, Approximation.UPPER)
Note that the first timestamp returned in nextPageToken
(1723118400) is not the current one (1726330724).
This seems to clearly show that even internally we cannot retrieve more than 200 results. We could be biased by pagination, but here the last page has strictly fewer than 10 results, so that does not seem to be the case. However, we may also be biased by the 200 limit, though I doubt it; to verify, we should try the timestamp just before the oldest community post one.
Webscrap_any_website/issues/29#issuecomment-2319819 and the following comments show that we can quite easily generate our own channel with more than 200 community posts in more than 48 hours.
Would answer the Stack Overflow question 76699812.
https://www.youtube.com/post/Ugkxvor5AtR4vx01XbXSqXz-I4l9Fae1mmkc
curl.sh
is the request to https://www.youtube.com/youtubei/v1/browse when reaching the final page of https://www.youtube.com/@NatGeo/community.

Python script using hardcoded continuation base64 encoded string:

```py
import requests

url = 'https://www.youtube.com/youtubei/v1/browse'
headers = {
    'Content-Type': 'application/json'
}
data = {
    'context': {
        'client': {
            'clientName': 'WEB',
            'clientVersion': '2.20240313.05.00'
        }
    },
    'continuation': '4qmFsgKdARIYVUNwVm03Ymc2cFhLbzFQcjZrNWt4RzlBGmhFZ2xqYjIxdGRXNXBkSG00QVFDU0F3Q3FBeWtLSkZFeVl6VlNSazVXVkcxd2FWWllVa05WYWtKWFZFZEdjMWR0T1ZkaE1GWlNVVlZGUFNpLUFmSUdDUW9IU2dDaUFRSUlBUSUzRCUzRJoCFmJhY2tzdGFnZS1pdGVtLXNlY3Rpb24='
}
data = requests.post(url, headers = headers, json = data).json()
print('Polar Bear Day' in str(data))
```

Python script first level decoded Protobuf:
```py
import requests
import blackboxprotobuf
import base64

def getBase64Protobuf(message, typedef):
    data = blackboxprotobuf.encode_message(message, typedef)
    return base64.b64encode(data, altchars = b'-/').decode('ascii')

message = {
    '80226972': {
        '2': 'UCpVm7bg6pXKo1Pr6k5kxG9A',
        '3': 'Egljb21tdW5pdHm4AQCSAwCqAykKJFEyYzVSRk5WVG1waVZYUkNVakJXVEdGc1dtOVdhMFZSUVVFPSi-AfIGCQoHSgCiAQIIAQ==',
        '35': 'backstage-item-section'
    }
}
typedef = {
    '80226972': {
        'type': 'message',
        'message_typedef': {
            '2': {
                'type': 'string'
            },
            '3': {
                'type': 'string'
            },
            '35': {
                'type': 'string'
            }
        },
        'field_order': [
            '2',
            '3',
            '35'
        ]
    }
}
continuation = getBase64Protobuf(message, typedef)

url = 'https://www.youtube.com/youtubei/v1/browse'
headers = {
    'Content-Type': 'application/json'
}
data = {
    'context': {
        'client': {
            'clientName': 'WEB',
            'clientVersion': '2.20240313.05.00'
        }
    },
    'continuation': continuation
}
data = requests.post(url, headers = headers, json = data).json()
print('Polar Bear Day' in str(data))
```

Python script first level decoded Protobuf simplified:
```py
import requests
import blackboxprotobuf
import base64

def getBase64Protobuf(message, typedef):
    data = blackboxprotobuf.encode_message(message, typedef)
    return base64.b64encode(data, altchars = b'-/').decode('ascii')

message = {
    '80226972': {
        '2': 'UCpVm7bg6pXKo1Pr6k5kxG9A',
        '3': 'Egljb21tdW5pdHm4AQCSAwCqAykKJFEyYzVSRk5WVG1waVZYUkNVakJXVEdGc1dtOVdhMFZSUVVFPSi-AfIGCQoHSgCiAQIIAQ==',
    }
}
typedef = {
    '80226972': {
        'type': 'message',
        'message_typedef': {
            '2': {
                'type': 'string'
            },
            '3': {
                'type': 'string'
            },
        },
        'field_order': [
            '2',
            '3',
        ]
    }
}
continuation = getBase64Protobuf(message, typedef)

url = 'https://www.youtube.com/youtubei/v1/browse'
headers = {
    'Content-Type': 'application/json'
}
data = {
    'context': {
        'client': {
            'clientName': 'WEB',
            'clientVersion': '2.20240313.05.00'
        }
    },
    'continuation': continuation
}
data = requests.post(url, headers = headers, json = data).json()
print('Polar Bear Day' in str(data))
```

Python script second level decoded Protobuf:
```py
import requests
import blackboxprotobuf
import base64

def getBase64Protobuf(message, typedef):
    data = blackboxprotobuf.encode_message(message, typedef)
    return base64.b64encode(data, altchars = b'-/').decode('ascii')

message = {
    '2': 'community',
    '23': 0,
    '50': {},
    '53': {
        '1': 'Q2c5RFNVTmpiVXRCUjBWTGFsWm9Wa0VRQUE=',
        '5': 190
    },
    '110': {
        '1': {
            '9': {},
            '20': {
                '1': 1
            }
        }
    }
}
typedef = {
    '2': {
        'type': 'string'
    },
    '23': {
        'type': 'int'
    },
    '50': {
        'type': 'message',
        'message_typedef': {},
        'field_order': []
    },
    '53': {
        'type': 'message',
        'message_typedef': {
            '1': {
                'type': 'string'
            },
            '5': {
                'type': 'int'
            }
        },
        'field_order': [
            '1',
            '5'
        ]
    },
    '110': {
        'type': 'message',
        'message_typedef': {
            '1': {
                'type': 'message',
                'message_typedef': {
                    '9': {
                        'type': 'message',
                        'message_typedef': {},
                        'field_order': []
                    },
                    '20': {
                        'type': 'message',
                        'message_typedef': {
                            '1': {
                                'type': 'int'
                            }
                        },
                        'field_order': [
                            '1'
                        ]
                    }
                },
                'field_order': [
                    '9',
                    '20'
                ]
            }
        },
        'field_order': [
            '1'
        ]
    }
}
three = getBase64Protobuf(message, typedef)

message = {
    '80226972': {
        '2': 'UCpVm7bg6pXKo1Pr6k5kxG9A',
        '3': three,
    }
}
typedef = {
    '80226972': {
        'type': 'message',
        'message_typedef': {
            '2': {
                'type': 'string'
            },
            '3': {
                'type': 'string'
            },
        },
        'field_order': [
            '2',
            '3',
        ]
    }
}
continuation = getBase64Protobuf(message, typedef)

url = 'https://www.youtube.com/youtubei/v1/browse'
headers = {
    'Content-Type': 'application/json'
}
data = {
    'context': {
        'client': {
            'clientName': 'WEB',
            'clientVersion': '2.20240313.05.00'
        }
    },
    'continuation': continuation
}
data = requests.post(url, headers = headers, json = data).json()
print('Polar Bear Day' in str(data))
```

Python script second level decoded Protobuf simplified:
```py
import requests
import blackboxprotobuf
import base64

def getBase64Protobuf(message, typedef):
    data = blackboxprotobuf.encode_message(message, typedef)
    return base64.b64encode(data, altchars = b'-/').decode('ascii')

message = {
    '2': 'community',
    '53': {
        '1': 'Q2c5RFNVTmpiVXRCUjBWTGFsWm9Wa0VRQUE=',
    },
}
typedef = {
    '2': {
        'type': 'string'
    },
    '53': {
        'type': 'message',
        'message_typedef': {
            '1': {
                'type': 'string'
            },
        },
    },
}
three = getBase64Protobuf(message, typedef)

message = {
    '80226972': {
        '2': 'UCpVm7bg6pXKo1Pr6k5kxG9A',
        '3': three,
    }
}
typedef = {
    '80226972': {
        'type': 'message',
        'message_typedef': {
            '2': {
                'type': 'string'
            },
            '3': {
                'type': 'string'
            },
        },
        'field_order': [
            '2',
            '3',
        ]
    }
}
continuation = getBase64Protobuf(message, typedef)

url = 'https://www.youtube.com/youtubei/v1/browse'
headers = {
    'Content-Type': 'application/json'
}
data = {
    'context': {
        'client': {
            'clientName': 'WEB',
            'clientVersion': '2.20240313.05.00'
        }
    },
    'continuation': continuation
}
data = requests.post(url, headers = headers, json = data).json()
print('Polar Bear Day' in str(data))
```

Getting:
Base64 decoding leads to Cg9DSUNjbUtBR0VLalZoVkEQAA, then possibly to something like CICcmKAGEKjVhVA, but then the base64 decoded result does not seem to make much sense.

Related to #256, #153, https://github.com/Benjamin-Loison/YouTube-operational-API/issues/4#issuecomment-1642445988,