JuanBindez / pytubefix

Python3 library for downloading YouTube Videos.
http://pytubefix.rtfd.io/
MIT License

yt.publish_date returning blank value #138 revisit #196

Open wyattZarkLab opened 3 weeks ago

wyattZarkLab commented 3 weeks ago

The YouTube.from_id(video_id).publish_date attribute returns different results across OSes and Python versions. This does not appear to be linked to the [pytubefix](https://github.com/JuanBindez/pytubefix) 6.13.0 code itself; it seems to come from the watch_html value set by the _execute_request call, which uses from urllib.request import Request, urlopen. The length of watch_html differs substantially between environments, even though the arguments in the Request(url, headers=base_headers, method=method, data=data) call are identical.


# from urllib.request import Request, urlopen
from pytubefix import YouTube, extract
from datetime import datetime
import re

def main():
    video_id = 'BBxdzypLalg'
    url = f"https://www.youtube.com/watch?v={video_id}"
    # extract.publish_date = __publish_date
    vidObj = YouTube.from_id(video_id)
    print(len(vidObj.watch_html))
    print(vidObj.publish_date)
    print(type(vidObj.publish_date))
    print('# # # round 2')
    extract.publish_date = __publish_date
    vidObj = YouTube.from_id(video_id)
    print(len(vidObj.watch_html))
    print(vidObj.publish_date)
    print(type(vidObj.publish_date))
    return

def __publish_date(watch_html):
    print(f'len(watch_html):  {len(watch_html)}')
    try:
        REGEX_1 = r'(?<=itemprop="datePublished" content=")\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[+-]\d{2}:\d{2}'
        REGEX_2 = r'(?<="publishDate":{"simpleText":")\w+ \d{1,2}, \d{4}'
        REGEX_3 = r'(?<="publishDate":")\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[+-]\d{2}:\d{2}'
        result = re.search(
            REGEX_1,
            watch_html)
        if result:
            print(REGEX_1)
            # return datetime.fromisoformat(result.group(0))
        result = re.search(
            REGEX_2,
            watch_html)
        if result:
            print(REGEX_2)
            # return datetime.strptime(result.group(0), '%b %d, %Y')
        result = re.search(
            REGEX_3,
            watch_html)
        if result:
            print(REGEX_3)
            # return datetime.fromisoformat(result.group(0))  # REGEX_3 captures an ISO 8601 timestamp
        return datetime.now() # # # dummy value...
    except AttributeError as e:
        print(e)
        return None

if __name__=='__main__':
    main()

yields, with pytubefix 6.13.0 on macOS Monterey 12.7.6, Python 3.11.8:


1033817
2024-08-21 07:32:12-07:00
<class 'datetime.datetime'>
# # # round 2
1035229
len(watch_html):  1035229
(?<=itemprop="datePublished" content=")\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[+-]\d{2}:\d{2}
(?<="publishDate":{"simpleText":")\w+ \d{1,2}, \d{4}
(?<="publishDate":")\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[+-]\d{2}:\d{2}
2024-08-21 13:42:29.095798
<class 'datetime.datetime'>

while pytubefix 6.13.0 on Python 3.11.2, Debian GNU/Linux 12 (bookworm), yields:


975900
None
<class 'NoneType'>
# # # round 2
941545
len(watch_html):  941545
(?<="publishDate":{"simpleText":")\w+ \d{1,2}, \d{4}
2024-08-21 20:43:20.558797
<class 'datetime.datetime'>

Expected behavior: YouTube.from_id(video_id).publish_date should yield the same value across platforms and Python versions.
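
The two runs above suggest that which pattern matches depends on the HTML variant served: all three matched on macOS, but only the simpleText pattern matched on Debian. As a rough sketch (not pytubefix's actual implementation), a parser could try each pattern from the reproduction script in turn and fall back to the next. The names parse_publish_date and _parse_simple_text are hypothetical, and the month formats tried for the simpleText form are an assumption about what YouTube serves in an English locale.

```python
import re
from datetime import datetime
from typing import Optional

# Patterns copied from the reproduction script in this issue.
ISO_ITEMPROP = r'(?<=itemprop="datePublished" content=")\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[+-]\d{2}:\d{2}'
SIMPLE_TEXT = r'(?<="publishDate":{"simpleText":")\w+ \d{1,2}, \d{4}'
ISO_JSON = r'(?<="publishDate":")\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[+-]\d{2}:\d{2}'


def _parse_simple_text(text: str) -> Optional[datetime]:
    """Parse 'Aug 21, 2024' or 'August 21, 2024' (assumed English month names)."""
    for fmt in ("%b %d, %Y", "%B %d, %Y"):
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


def parse_publish_date(watch_html: str) -> Optional[datetime]:
    """Return the first publish date found, preferring the ISO 8601 forms."""
    for pattern, parser in (
        (ISO_ITEMPROP, datetime.fromisoformat),
        (ISO_JSON, datetime.fromisoformat),
        (SIMPLE_TEXT, _parse_simple_text),
    ):
        match = re.search(pattern, watch_html)
        if match:
            parsed = parser(match.group(0))
            if parsed is not None:
                return parsed
    return None
```

Note that the simpleText fallback yields a naive, date-only datetime, while the ISO forms carry a time and UTC offset, so results would still differ in precision between the two page variants.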


Desktop: see the comparisons above.


Additional context: Noticed while setting up an automation script on AWS EC2, where identical code does not behave as it does in my local environment.

wyattZarkLab commented 3 weeks ago

Oh boy... I'm noticing that my EC2 instance is also throwing "BBxdzypLalg This request has been detected as a bot, please try again or log in to view" (see 170 and 174). It looks like the likely culprit is not Python or OS version differences but a flagged IP address...

... likely fine to mark as closed (?)
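
For anyone hitting the same thing in automation, one way to tell this failure apart from a parsing problem is to check the error text for the bot-detection wording quoted above. This is only a sketch: is_bot_detection_error and publish_date_or_explain are hypothetical helpers, the fetch argument stands in for whatever call retrieves the date (for example a small wrapper around YouTube.from_id(video_id).publish_date), and matching on the message string is an assumption rather than a documented pytubefix interface.

```python
from typing import Any, Callable

# Substring taken from the error message quoted in this thread.
BOT_MARKER = "detected as a bot"


def is_bot_detection_error(exc: BaseException) -> bool:
    """Heuristic: True if the exception text carries the bot-detection wording."""
    return BOT_MARKER in str(exc)


def publish_date_or_explain(fetch: Callable[[str], Any], video_id: str) -> Any:
    """Call fetch(video_id); re-raise bot-detection failures with a clearer hint."""
    try:
        return fetch(video_id)
    except Exception as exc:
        if is_bot_detection_error(exc):
            raise RuntimeError(
                f"{video_id}: request flagged as a bot; this points to the "
                "client IP address rather than the OS or Python version"
            ) from exc
        raise
```

Any other exception is re-raised unchanged, so genuine parsing or network errors are not masked.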