anilabhadatta / educative.io_scraper

Educative.io Course Downloader developed using Python and Selenium. Refer Readme.md for setup instructions.
MIT License

Crash #66

Closed MarkDHarris closed 11 months ago

MarkDHarris commented 1 year ago

I'm on Windows 11, using pyenv and Python 3.11.2. I just cloned the latest version and downloaded the Chrome driver and binary.

I get the following crash on every course I have tried.

ERROR - StartScraper - start: 20: ExtensionScraper:start: 49: ExtensionScraper:scrapeCourse: 87: ExtensionScraper:scrapeTopic: 102: SeleniumBasicUtility:waitWebdriverToLoadTopicPage: 57: Message:
Stacktrace:
        GetHandleVerifier [0x00007FF6535A8EF2+54786]
        (No symbol) [0x00007FF653515612]
        (No symbol) [0x00007FF6533CA64B]
        (No symbol) [0x00007FF65340B79C]
        (No symbol) [0x00007FF65340B91C]
        (No symbol) [0x00007FF653446D87]
        (No symbol) [0x00007FF65342BEAF]
        (No symbol) [0x00007FF653444D02]
        (No symbol) [0x00007FF65342BC43]
        (No symbol) [0x00007FF653400941]
        (No symbol) [0x00007FF653401B84]
        GetHandleVerifier [0x00007FF6538F7F52+3524194]
        GetHandleVerifier [0x00007FF65394D800+3874576]
        GetHandleVerifier [0x00007FF653945D7F+3843215]
        GetHandleVerifier [0x00007FF653645086+694166]
        (No symbol) [0x00007FF653520A88]
        (No symbol) [0x00007FF65351CA94]
        (No symbol) [0x00007FF65351CBC2]
        (No symbol) [0x00007FF65350CC83]
        BaseThreadInitThunk [0x00007FFDE605257D+29]
        RtlUserThreadStart [0x00007FFDE82CAA78+40]

2023-10-18 08:48:11,034 - DEBUG - StartScraper - Exiting Scraper...

anilabhadatta commented 1 year ago

@MarkDHarris Send me the topic link, and retry using Python 3.9.

MarkDHarris commented 1 year ago

First 2: https://www.educative.io/courses/grokking-coding-interview-patterns-java https://www.educative.io/courses/big-o-notation-for-interviews-and-beyond

using python 3.9.13

2023-10-18 10:27:23,101 - ERROR - StartScraper - start: 20: ExtensionScraper:start: 49: ExtensionScraper:scrapeCourse: 87: ExtensionScraper:scrapeTopic: 102: SeleniumBasicUtility:waitWebdriverToLoadTopicPage: 57: Message:
Stacktrace:
        GetHandleVerifier [0x00007FF6535A8EF2+54786]
        (No symbol) [0x00007FF653515612]
        (No symbol) [0x00007FF6533CA64B]
        (No symbol) [0x00007FF65340B79C]
        (No symbol) [0x00007FF65340B91C]
        (No symbol) [0x00007FF653446D87]
        (No symbol) [0x00007FF65342BEAF]
        (No symbol) [0x00007FF653444D02]
        (No symbol) [0x00007FF65342BC43]
        (No symbol) [0x00007FF653400941]
        (No symbol) [0x00007FF653401B84]
        GetHandleVerifier [0x00007FF6538F7F52+3524194]
        GetHandleVerifier [0x00007FF65394D800+3874576]
        GetHandleVerifier [0x00007FF653945D7F+3843215]
        GetHandleVerifier [0x00007FF653645086+694166]
        (No symbol) [0x00007FF653520A88]
        (No symbol) [0x00007FF65351CA94]
        (No symbol) [0x00007FF65351CBC2]
        (No symbol) [0x00007FF65350CC83]
        BaseThreadInitThunk [0x00007FFDE605257D+29]
        RtlUserThreadStart [0x00007FFDE82CAA78+40]

2023-10-18 10:27:23,101 - DEBUG - StartScraper - Exiting Scraper...

using python 3.9.0

2023-10-18 10:31:14,149 - ERROR - StartScraper - start: 20: ExtensionScraper:start: 49: ExtensionScraper:scrapeCourse: 87: ExtensionScraper:scrapeTopic: 102: SeleniumBasicUtility:waitWebdriverToLoadTopicPage: 57: Message:
Stacktrace:
        GetHandleVerifier [0x00007FF6535A8EF2+54786]
        (No symbol) [0x00007FF653515612]
        (No symbol) [0x00007FF6533CA64B]
        (No symbol) [0x00007FF65340B79C]
        (No symbol) [0x00007FF65340B91C]
        (No symbol) [0x00007FF653446D87]
        (No symbol) [0x00007FF65342BEAF]
        (No symbol) [0x00007FF653444D02]
        (No symbol) [0x00007FF65342BC43]
        (No symbol) [0x00007FF653400941]
        (No symbol) [0x00007FF653401B84]
        GetHandleVerifier [0x00007FF6538F7F52+3524194]
        GetHandleVerifier [0x00007FF65394D800+3874576]
        GetHandleVerifier [0x00007FF653945D7F+3843215]
        GetHandleVerifier [0x00007FF653645086+694166]
        (No symbol) [0x00007FF653520A88]
        (No symbol) [0x00007FF65351CA94]
        (No symbol) [0x00007FF65351CBC2]
        (No symbol) [0x00007FF65350CC83]
        BaseThreadInitThunk [0x00007FFDE605257D+29]
        RtlUserThreadStart [0x00007FFDE82CAA78+40]

2023-10-18 10:31:14,149 - DEBUG - StartScraper - Exiting Scraper...

MarkDHarris commented 1 year ago

I can also add that I first had to remove or comment out the two lines at lines 72 and 73 in ExtensionScraperMain.py:

        if len(courseCollectionsJson["topicApiUrlList"]) != len(courseTopicUrlsList):
            raise Exception("CourseCollectionsJson and CourseTopicUrlsList Urls are not equal")

Otherwise, it stops there with this exception.

MarkDHarris commented 1 year ago

Additionally, I seem to get the same error on RHEL 9.2 using Python 3.9.16.

2023-10-18 15:32:44,824 - INFO - ExtensionScraper - API Urls: 398 == 401 :Topic Urls
2023-10-18 15:32:44,824 - INFO - ExtensionScraper - Scraping 0-course overview: https://www.educative.io/courses/grokking-coding-interview-patterns-java?showContent=true
2023-10-18 15:32:44,824 - INFO - LoginAccount - Checking if logged in...
2023-10-18 15:32:44,829 - INFO - ApiUtility - Getting Course API Content JSON from URL: https://educative.io/api/collection/10370001/4651429556125696/page/4851680091045888?work_type=collection
2023-10-18 15:32:44,829 - INFO - ApiUtility - Executing JS to get JSON from URL
2023-10-18 15:32:44,923 - INFO - ApiUtility - Successfully fetched JSON API data
2023-10-18 15:32:45,442 - INFO - SeleniumBasicUtility - Loading page and checking if something went wrong
2023-10-18 15:32:55,480 - INFO - SeleniumBasicUtility - Waiting for webdriver to load topic page
2023-10-18 15:33:15,901 - ERROR - StartScraper - start: 20: ExtensionScraper:start: 49: ExtensionScraper:scrapeCourse: 87: ExtensionScraper:scrapeTopic: 102: SeleniumBasicUtility:waitWebdriverToLoadTopicPage: 57: Message:
Stacktrace:
        0 0x55c799fbdfb3
        1 0x55c799c914a7
        2 0x55c799cd8dd6
        3 0x55c799cd8ec1
        4 0x55c799d16354
        5 0x55c799cfa96d
        6 0x55c799d13c02
        7 0x55c799cfa713
        8 0x55c799ccd18b
        9 0x55c799ccdf7e
        10 0x55c799f838d8
        11 0x55c799f87800
        12 0x55c799f91cfc
        13 0x55c799f88418
        14 0x55c799f5542f
        15 0x55c799fac4e8
        16 0x55c799fac6b4
        17 0x55c799fbd143
        18 0x7fd2d6a9f802 start_thread

2023-10-18 15:33:15,901 - DEBUG - StartScraper - Exiting Scraper...

anilabhadatta commented 1 year ago

@MarkDHarris I found the error: you have to provide the URL of a topic from a course, but at the moment you are pasting the course URL.
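
To illustrate the difference between the two kinds of URL, here is a minimal sketch that checks whether a line from the courses text file looks like a topic URL. It assumes a topic URL has the shape /courses/<course-slug>/<topic-slug>, which matches the URLs posted in this thread; `is_topic_url` is a hypothetical helper and not part of the scraper.

```python
from urllib.parse import urlparse


def is_topic_url(url: str) -> bool:
    """Return True if the URL has a topic segment after the course slug.

    Assumption: topic URLs look like /courses/<course-slug>/<topic-slug>,
    while course URLs stop at /courses/<course-slug>.
    """
    # Split the path into non-empty segments, ignoring any query string.
    segments = [s for s in urlparse(url).path.split("/") if s]
    return len(segments) >= 3 and segments[0] == "courses"


course_url = "https://www.educative.io/courses/grokking-coding-interview-patterns-java"
topic_url = course_url + "/course-overview"

print(is_topic_url(course_url))  # False: only the course slug is present
print(is_topic_url(topic_url))   # True: a topic slug follows the course slug
```

A check like this could be run over each line of the text file before starting a long scrape.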

anilabhadatta commented 1 year ago

@MarkDHarris This should be fixed now; please check. Also refer to my previous comment. Example text file:

[image: example text file]

MarkDHarris commented 11 months ago

Thanks @anilabhadatta !

I am getting a possible new error:

2023-11-16 16:18:02,712 - INFO - ApiUtility - Getting Course API Content JSON from URL: https://educative.io/api/collection/10370001/4651429556125696/page/6527799752130560?work_type=collection
2023-11-16 16:18:02,712 - INFO - ApiUtility - Executing JS to get JSON from URL
2023-11-16 16:18:02,866 - ERROR - StartScraper - start: 20: ExtensionScraper:start: 49: ExtensionScraper:scrapeCourse: 86: ApiUtility:getCourseApiContentJson: 48: Message: javascript error: Unexpected token '<', "<!DOCTYPE "... is not valid JSON
  (Session info: chrome=119.0.6045.105)
Stacktrace:
        GetHandleVerifier [0x00007FF7BD8682B2+55298]
        (No symbol) [0x00007FF7BD7D5E02]
        (No symbol) [0x00007FF7BD6905AB]
        (No symbol) [0x00007FF7BD69509C]
        (No symbol) [0x00007FF7BD69732A]
        (No symbol) [0x00007FF7BD70B12B]
        (No symbol) [0x00007FF7BD6F20AA]
        (No symbol) [0x00007FF7BD70AAA4]
        (No symbol) [0x00007FF7BD6F1E83]
        (No symbol) [0x00007FF7BD6C670A]
        (No symbol) [0x00007FF7BD6C7964]
        GetHandleVerifier [0x00007FF7BDBE0AAB+3694587]
        GetHandleVerifier [0x00007FF7BDC3728E+4048862]
        GetHandleVerifier [0x00007FF7BDC2F173+4015811]
        GetHandleVerifier [0x00007FF7BD9047D6+695590]
        (No symbol) [0x00007FF7BD7E0CE8]
        (No symbol) [0x00007FF7BD7DCF34]
        (No symbol) [0x00007FF7BD7DD062]
        (No symbol) [0x00007FF7BD7CD3A3]
        BaseThreadInitThunk [0x00007FFE4C28257D+29]
        RtlUserThreadStart [0x00007FFE4E34AA58+40]

2023-11-16 16:18:02,866 - DEBUG - StartScraper - Exiting Scraper...

When it crashes like this and I restart it, it starts at 0 again.

Is there a way to restart it from just before the crash and to determine why it crashed (any logs)?

MarkDHarris commented 11 months ago

This time, my courses.txt contained only this: https://www.educative.io/courses/grokking-coding-interview-patterns-java/course-overview

And it collected 0-26 before crashing.

anilabhadatta commented 11 months ago

@MarkDHarris can you attach the log file?

MarkDHarris commented 11 months ago

Thanks for taking a look @anilabhadatta.

Complete log is attached: EducativeScraper.log

And this is the full content of my Courses.txt (it's just one line for the one course): https://www.educative.io/courses/grokking-coding-interview-patterns-java/course-overview

anilabhadatta commented 11 months ago

@MarkDHarris Can you redownload Chrome and also use the latest pull? I generally suggest using Chrome 116 instead of the current versions because of Cloudflare issues. Technically, the error should not occur at the point where you hit it; restarting the scraper from that URL with Chrome v116 should probably work. I have already put the Chrome v116 URL in the config, so just click on download Chrome binary and Chrome driver once again to reset the Chrome versions.

MarkDHarris commented 11 months ago

Thanks @anilabhadatta!

First, I noticed that it is downloading Chrome 119, not 116 as you requested.

2023-11-27 08:31:22,129 - INFO - DownloadUtility -   Downloading Chrome Driver and Extracting..
                                URL: https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/119.0.6045.105/win64/chromedriver-win64.zip
                                Output Path: D:\CODE\educative.io_scraper\src\ChromeDrivers\win
                                OS: win

...
 2023-11-27 08:31:23,673 - INFO - DownloadUtility -   Downloading Chrome Binary and Extracting..
                                URL: https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/119.0.6045.105/win64/chrome-win64.zip
                                Output Path: D:\CODE\educative.io_scraper\src\ChromeBinary\win
                                OS: win 

This happened even after I removed download-api = https://googlechromelabs.github.io/chrome-for-testing/last-known-good-versions-with-downloads.json from config.ini.

So I manually downloaded the archives and extracted them to the locations shown in the log output, and then noticed in the logs that it is now using v117.

chromedriver-win64 = https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/117.0.5938.149/win64/chromedriver-win64.zip
chrome-win64 = https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/117.0.5938.149/win64/chrome-win64.zip
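
As a quick way to see which Chrome version a download URL pins, here is a minimal sketch that extracts the version from a chrome-for-testing URL like the two above. `chrome_version_from_url` and `major_version` are hypothetical helpers and not part of the scraper, and the URL pattern is assumed from the URLs shown in this thread.

```python
import re
from typing import Optional


def chrome_version_from_url(url: str) -> Optional[str]:
    """Extract a four-part version such as 117.0.5938.149 from a download URL."""
    # Assumption: the version is the path segment right after "chrome-for-testing".
    match = re.search(r"/chrome-for-testing/(\d+(?:\.\d+){3})/", url)
    return match.group(1) if match else None


def major_version(version: str) -> int:
    """Return the major version number, e.g. 117 for 117.0.5938.149."""
    return int(version.split(".")[0])


url = (
    "https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/"
    "117.0.5938.149/win64/chrome-win64.zip"
)
version = chrome_version_from_url(url)
print(version)                 # 117.0.5938.149
print(major_version(version))  # 117, not the suggested 116
```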

However, I noticed I still had a Cloudflare issue when I logged in. I was able to resolve that manually after many attempts.

Second, I then let it run, again with the URL https://www.educative.io/courses/grokking-coding-interview-patterns-java/course-overview (should I be using a different one?), but it again failed at roughly the same point (#27).

The last few entries in the log are:

2023-11-27 09:08:55,323 - INFO - LoginAccount - Checking if logged in...
 2023-11-27 09:08:55,327 - INFO - ApiUtility - Getting Course API Content JSON from URL: https://educative.io/api/collection/10370001/4651429556125696/page/5359826647646208?work_type=collection
 2023-11-27 09:08:55,328 - INFO - ApiUtility - Executing JS to get JSON from URL
 2023-11-27 09:08:55,449 - ERROR - StartScraper - start: 20: ExtensionScraper:start: 49: ExtensionScraper:scrapeCourse: 86: ApiUtility:getCourseApiContentJson: 48: Message: javascript error: Unexpected token '<', "<!DOCTYPE "... is not valid JSON
  (Session info: chrome=117.0.5938.149)
Stacktrace:
        GetHandleVerifier [0x00007FF687827D12+55474]
        (No symbol) [0x00007FF6877977C2]
        (No symbol) [0x00007FF68764E0EB]
        (No symbol) [0x00007FF687652B1D]
        (No symbol) [0x00007FF68765495A]
        (No symbol) [0x00007FF6876C8584]
        (No symbol) [0x00007FF6876AF15A]
        (No symbol) [0x00007FF6876C7EF2]
        (No symbol) [0x00007FF6876AEF33]
        (No symbol) [0x00007FF687683D41]
        (No symbol) [0x00007FF687684F84]
        GetHandleVerifier [0x00007FF687B8B762+3609346]
        GetHandleVerifier [0x00007FF687BE1A80+3962400]
        GetHandleVerifier [0x00007FF687BD9F0F+3930799]
        GetHandleVerifier [0x00007FF6878C3CA6+694342]
        (No symbol) [0x00007FF6877A2218]
        (No symbol) [0x00007FF68779E484]
        (No symbol) [0x00007FF68779E5B2]
        (No symbol) [0x00007FF68778EE13]
        BaseThreadInitThunk [0x00007FFF2F4B257D+29]
        RtlUserThreadStart [0x00007FFF3082AA58+40]

 2023-11-27 09:08:55,449 - DEBUG - StartScraper - Exiting Scraper...

Can you confirm whether you are able to save this specific course? Is this "not valid JSON" error a legitimate issue when it's attempting to parse JSON, or is there still an issue with the version of Chrome I'm using, so that I should try something older than v117? Can you also please confirm whether this is the correct URL: https://www.educative.io/courses/grokking-coding-interview-patterns-java/course-overview ?

Thanks for your help!

anilabhadatta commented 11 months ago

@MarkDHarris I can confirm that when I tested the URL where you previously got the error, I was able to download the file. I got the topic URL from the log file.

What normally happens is that when you open a URL in Chrome through Selenium, the latest versions cause Cloudflare issues. Version 116 should work properly.

I think you are using an old version of the scraper. Can you tell me which version you are currently using? Also, please reclone the current repo to a new location and try again.

I can also suggest a testing method. Requirements: Chrome v116, the current repo, and a logged-in account. In that window, open the API URL in a new tab to see whether you can view any data. Also, if you get any Cloudflare captcha, please report it here.
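
The "Unexpected token '<', "<!DOCTYPE "... is not valid JSON" message means the API URL returned an HTML page rather than JSON. That is consistent with a captcha or login page, although the log alone does not confirm which. A minimal sketch of the same check in code, assuming the response body is available as a string; `parse_api_body` is a hypothetical helper and not the scraper's own ApiUtility code:

```python
import json


def parse_api_body(body: str):
    """Parse an API response body, failing clearly if it is HTML instead of JSON."""
    stripped = body.lstrip()
    # An HTML document starts with "<" (for example "<!DOCTYPE html>"),
    # which is exactly the token the JSON parser rejected in the log.
    if stripped.startswith("<"):
        raise ValueError(
            "Expected JSON but got HTML; the session may be blocked or logged out"
        )
    return json.loads(stripped)


print(parse_api_body('{"components": []}'))

try:
    parse_api_body("<!DOCTYPE html><html><body>Just a moment...</body></html>")
except ValueError as err:
    print(err)
```

The HTML string in the usage example is made up for illustration; the real page content was not captured in the log.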

MarkDHarris commented 11 months ago

I cloned to a new machine and did not encounter the need to manually downgrade to chrome 116. Things worked much better. It crashed partway through but I was able to resume from there and complete. Thanks @anilabhadatta !