@MarkDHarris please send me the topic link, and retry using Python 3.9.
First 2: https://www.educative.io/courses/grokking-coding-interview-patterns-java https://www.educative.io/courses/big-o-notation-for-interviews-and-beyond
Using Python 3.9.13:
2023-10-18 10:27:23,101 - ERROR - StartScraper - start: 20: ExtensionScraper:start: 49: ExtensionScraper:scrapeCourse: 87: ExtensionScraper:scrapeTopic: 102: SeleniumBasicUtility:waitWebdriverToLoadTopicPage: 57: Message: Stacktrace: GetHandleVerifier [0x00007FF6535A8EF2+54786] (No symbol) [0x00007FF653515612] (No symbol) [0x00007FF6533CA64B] (No symbol) [0x00007FF65340B79C] (No symbol) [0x00007FF65340B91C] (No symbol) [0x00007FF653446D87] (No symbol) [0x00007FF65342BEAF] (No symbol) [0x00007FF653444D02] (No symbol) [0x00007FF65342BC43] (No symbol) [0x00007FF653400941] (No symbol) [0x00007FF653401B84] GetHandleVerifier [0x00007FF6538F7F52+3524194] GetHandleVerifier [0x00007FF65394D800+3874576] GetHandleVerifier [0x00007FF653945D7F+3843215] GetHandleVerifier [0x00007FF653645086+694166] (No symbol) [0x00007FF653520A88] (No symbol) [0x00007FF65351CA94] (No symbol) [0x00007FF65351CBC2] (No symbol) [0x00007FF65350CC83] BaseThreadInitThunk [0x00007FFDE605257D+29] RtlUserThreadStart [0x00007FFDE82CAA78+40]
2023-10-18 10:27:23,101 - DEBUG - StartScraper - Exiting Scraper...
Using Python 3.9.0:
2023-10-18 10:31:14,149 - ERROR - StartScraper - start: 20: ExtensionScraper:start: 49: ExtensionScraper:scrapeCourse: 87: ExtensionScraper:scrapeTopic: 102: SeleniumBasicUtility:waitWebdriverToLoadTopicPage: 57: Message: Stacktrace: GetHandleVerifier [0x00007FF6535A8EF2+54786] (No symbol) [0x00007FF653515612] (No symbol) [0x00007FF6533CA64B] (No symbol) [0x00007FF65340B79C] (No symbol) [0x00007FF65340B91C] (No symbol) [0x00007FF653446D87] (No symbol) [0x00007FF65342BEAF] (No symbol) [0x00007FF653444D02] (No symbol) [0x00007FF65342BC43] (No symbol) [0x00007FF653400941] (No symbol) [0x00007FF653401B84] GetHandleVerifier [0x00007FF6538F7F52+3524194] GetHandleVerifier [0x00007FF65394D800+3874576] GetHandleVerifier [0x00007FF653945D7F+3843215] GetHandleVerifier [0x00007FF653645086+694166] (No symbol) [0x00007FF653520A88] (No symbol) [0x00007FF65351CA94] (No symbol) [0x00007FF65351CBC2] (No symbol) [0x00007FF65350CC83] BaseThreadInitThunk [0x00007FFDE605257D+29] RtlUserThreadStart [0x00007FFDE82CAA78+40]
2023-10-18 10:31:14,149 - DEBUG - StartScraper - Exiting Scraper...
I should add that I first had to comment out lines 72 and 73 in ExtensionScraperMain.py:
if len(courseCollectionsJson["topicApiUrlList"]) != len(courseTopicUrlsList):
    raise Exception("CourseCollectionsJson and CourseTopicUrlsList Urls are not equal")
Otherwise it stops there with this exception.
Additionally, I seem to get the same error on RHEL 9.2 using Python 3.9.16:
2023-10-18 15:32:44,824 - INFO - ExtensionScraper - API Urls: 398 == 401 :Topic Urls
2023-10-18 15:32:44,824 - INFO - ExtensionScraper - Scraping 0-course overview: https://www.educative.io/courses/grokking-coding-interview-patterns-java?showContent=true
2023-10-18 15:32:44,824 - INFO - LoginAccount - Checking if logged in...
2023-10-18 15:32:44,829 - INFO - ApiUtility - Getting Course API Content JSON from URL: https://educative.io/api/collection/10370001/4651429556125696/page/4851680091045888?work_type=collection
2023-10-18 15:32:44,829 - INFO - ApiUtility - Executing JS to get JSON from URL
2023-10-18 15:32:44,923 - INFO - ApiUtility - Successfully fetched JSON API data
2023-10-18 15:32:45,442 - INFO - SeleniumBasicUtility - Loading page and checking if something went wrong
2023-10-18 15:32:55,480 - INFO - SeleniumBasicUtility - Waiting for webdriver to load topic page
2023-10-18 15:33:15,901 - ERROR - StartScraper - start: 20: ExtensionScraper:start: 49: ExtensionScraper:scrapeCourse: 87: ExtensionScraper:scrapeTopic: 102: SeleniumBasicUtility:waitWebdriverToLoadTopicPage: 57: Message: Stacktrace:
2023-10-18 15:33:15,901 - DEBUG - StartScraper - Exiting Scraper...
@MarkDHarris I found the error: you have to provide the URL of a topic from a course, but at the moment you are pasting the course URL.
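For anyone hitting the same thing, here is a rough sketch of a check you could run on the URLs before putting them in the courses text file. It assumes topic URLs look like /courses/<course-slug>/<topic-slug>, which matches the URLs in this thread; `is_topic_url` is a hypothetical helper name and is not part of the scraper.

```python
from urllib.parse import urlparse


def is_topic_url(url):
    """Return True if the URL has a topic slug after /courses/<course-slug>/.

    Assumption: topic URLs follow /courses/<course-slug>/<topic-slug>.
    This helper is illustrative only and is not part of the scraper.
    """
    # Split the path into its non-empty segments, ignoring query strings.
    parts = [segment for segment in urlparse(url).path.split("/") if segment]
    return len(parts) >= 3 and parts[0] == "courses"


course_url = "https://www.educative.io/courses/grokking-coding-interview-patterns-java"
topic_url = course_url + "/course-overview"

print(is_topic_url(course_url))  # bare course URL, no topic slug
print(is_topic_url(topic_url))   # course URL followed by a topic slug
```

This only checks the shape of the URL; it does not confirm that the topic actually exists.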
@MarkDHarris it should be fixed now, please check. Also refer to the previous comment and the example text file.
Thanks @anilabhadatta !
I am getting a possible new error:
2023-11-16 16:18:02,712 - INFO - ApiUtility - Getting Course API Content JSON from URL: https://educative.io/api/collection/10370001/4651429556125696/page/6527799752130560?work_type=collection
2023-11-16 16:18:02,712 - INFO - ApiUtility - Executing JS to get JSON from URL
2023-11-16 16:18:02,866 - ERROR - StartScraper - start: 20: ExtensionScraper:start: 49: ExtensionScraper:scrapeCourse: 86: ApiUtility:getCourseApiContentJson: 48: Message: javascript error: Unexpected token '<', "<!DOCTYPE "... is not valid JSON
(Session info: chrome=119.0.6045.105)
Stacktrace:
GetHandleVerifier [0x00007FF7BD8682B2+55298]
(No symbol) [0x00007FF7BD7D5E02]
(No symbol) [0x00007FF7BD6905AB]
(No symbol) [0x00007FF7BD69509C]
(No symbol) [0x00007FF7BD69732A]
(No symbol) [0x00007FF7BD70B12B]
(No symbol) [0x00007FF7BD6F20AA]
(No symbol) [0x00007FF7BD70AAA4]
(No symbol) [0x00007FF7BD6F1E83]
(No symbol) [0x00007FF7BD6C670A]
(No symbol) [0x00007FF7BD6C7964]
GetHandleVerifier [0x00007FF7BDBE0AAB+3694587]
GetHandleVerifier [0x00007FF7BDC3728E+4048862]
GetHandleVerifier [0x00007FF7BDC2F173+4015811]
GetHandleVerifier [0x00007FF7BD9047D6+695590]
(No symbol) [0x00007FF7BD7E0CE8]
(No symbol) [0x00007FF7BD7DCF34]
(No symbol) [0x00007FF7BD7DD062]
(No symbol) [0x00007FF7BD7CD3A3]
BaseThreadInitThunk [0x00007FFE4C28257D+29]
RtlUserThreadStart [0x00007FFE4E34AA58+40]
2023-11-16 16:18:02,866 - DEBUG - StartScraper - Exiting Scraper...
When it crashes like this and I restart it, it starts at 0 again.
Is there a way to restart it from before the crash and determine why it crashed? (any logs?)
This time, my courses.txt contained only this: https://www.educative.io/courses/grokking-coding-interview-patterns-java/course-overview
And it collected 0-26 before crashing.
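As a possible way to see how far a run got, the log itself can be searched for the last "Scraping <index>-<title>: <url>" entry. The sketch below is only an illustration based on the log format pasted earlier in this thread; `last_scraped_topic` is a hypothetical helper and not something the scraper provides.

```python
import re

# Pattern modelled on the "Scraping <index>-<title>: <url>" entries in the logs above.
# Assumption: every topic the scraper starts is logged in this form.
SCRAPING_ENTRY = re.compile(r"ExtensionScraper - Scraping (\d+)-(.*?): (https?://\S+)")


def last_scraped_topic(log_text):
    """Return (index, title, url) of the last topic entry in the log, or None."""
    matches = SCRAPING_ENTRY.findall(log_text)
    if not matches:
        return None
    index, title, url = matches[-1]
    return int(index), title, url


sample_log = (
    "2023-10-18 15:32:44,824 - INFO - ExtensionScraper - API Urls: 398 == 401 :Topic Urls\n"
    "2023-10-18 15:32:44,824 - INFO - ExtensionScraper - Scraping 0-course overview: "
    "https://www.educative.io/courses/grokking-coding-interview-patterns-java?showContent=true\n"
    "2023-10-18 15:32:44,824 - INFO - LoginAccount - Checking if logged in...\n"
)

print(last_scraped_topic(sample_log))
```

To use it on a real run, read the contents of EducativeScraper.log and pass the text to `last_scraped_topic`; the returned URL is the last topic the scraper began before exiting.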
@MarkDHarris can you attach the log file?
Thanks for taking a look @anilabhadatta.
Complete log is attached: EducativeScraper.log
And this is the full content of my Courses.txt: (its just one line for the one course) https://www.educative.io/courses/grokking-coding-interview-patterns-java/course-overview
@MarkDHarris can you re-download Chrome and also use the latest pull? I generally suggest using Chrome 116 instead of the current versions because of Cloudflare issues. Technically, the error should not occur at the point where you hit it; restarting the scraper from that URL with Chrome v116 should work properly. I have already put the Chrome v116 URL in the config, so just click on download chrome binary and chrome driver once again to reset the Chrome versions.
Thanks @anilabhadatta!
First, I noticed that it is downloading chrome 119 not 116 as you requested.
2023-11-27 08:31:22,129 - INFO - DownloadUtility - Downloading Chrome Driver and Extracting..
URL: https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/119.0.6045.105/win64/chromedriver-win64.zip
Output Path: D:\CODE\educative.io_scraper\src\ChromeDrivers\win
OS: win
...
2023-11-27 08:31:23,673 - INFO - DownloadUtility - Downloading Chrome Binary and Extracting..
URL: https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/119.0.6045.105/win64/chrome-win64.zip
Output Path: D:\CODE\educative.io_scraper\src\ChromeBinary\win
OS: win
This happens even after removing download-api = https://googlechromelabs.github.io/chrome-for-testing/last-known-good-versions-with-downloads.json from config.ini.
So I manually downloaded and extracted the archives to the locations shown in the log output, and then noticed in the logs that it's now using v117:
chromedriver-win64 = https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/117.0.5938.149/win64/chromedriver-win64.zip
chrome-win64 = https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/117.0.5938.149/win64/chrome-win64.zip
However, I noticed I still had a Cloudflare issue when I logged in. I was able to resolve that manually after many attempts.
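To double-check which Chrome version a pair of download URLs points at, the version can be read straight out of the URL. This is a small sketch that assumes the chrome-for-testing URL layout shown above; `chrome_version` is a hypothetical helper, not part of the scraper.

```python
import re

# Assumption: the version is the path segment right after "chrome-for-testing/",
# as in the download URLs quoted above.
VERSION_IN_URL = re.compile(r"/chrome-for-testing/(\d+(?:\.\d+){3})/")


def chrome_version(url):
    """Return the four-part version string embedded in the URL, or None."""
    match = VERSION_IN_URL.search(url)
    return match.group(1) if match else None


driver_url = "https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/117.0.5938.149/win64/chromedriver-win64.zip"
binary_url = "https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/117.0.5938.149/win64/chrome-win64.zip"

driver_version = chrome_version(driver_url)
binary_version = chrome_version(binary_url)

# The driver and the binary should come from the same release.
print(driver_version, binary_version, driver_version == binary_version)

# The major version is the first component; here it is compared with the suggested 116.
print(int(binary_version.split(".")[0]) == 116)
```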
Second, then I allowed it to run, again with the url https://www.educative.io/courses/grokking-coding-interview-patterns-java/course-overview
(should I be using a different one?) but it again failed at roughly the same point (#27).
The last few entries in the log are:
2023-11-27 09:08:55,323 - INFO - LoginAccount - Checking if logged in...
2023-11-27 09:08:55,327 - INFO - ApiUtility - Getting Course API Content JSON from URL: https://educative.io/api/collection/10370001/4651429556125696/page/5359826647646208?work_type=collection
2023-11-27 09:08:55,328 - INFO - ApiUtility - Executing JS to get JSON from URL
2023-11-27 09:08:55,449 - ERROR - StartScraper - start: 20: ExtensionScraper:start: 49: ExtensionScraper:scrapeCourse: 86: ApiUtility:getCourseApiContentJson: 48: Message: javascript error: Unexpected token '<', "<!DOCTYPE "... is not valid JSON
(Session info: chrome=117.0.5938.149)
Stacktrace:
GetHandleVerifier [0x00007FF687827D12+55474]
(No symbol) [0x00007FF6877977C2]
(No symbol) [0x00007FF68764E0EB]
(No symbol) [0x00007FF687652B1D]
(No symbol) [0x00007FF68765495A]
(No symbol) [0x00007FF6876C8584]
(No symbol) [0x00007FF6876AF15A]
(No symbol) [0x00007FF6876C7EF2]
(No symbol) [0x00007FF6876AEF33]
(No symbol) [0x00007FF687683D41]
(No symbol) [0x00007FF687684F84]
GetHandleVerifier [0x00007FF687B8B762+3609346]
GetHandleVerifier [0x00007FF687BE1A80+3962400]
GetHandleVerifier [0x00007FF687BD9F0F+3930799]
GetHandleVerifier [0x00007FF6878C3CA6+694342]
(No symbol) [0x00007FF6877A2218]
(No symbol) [0x00007FF68779E484]
(No symbol) [0x00007FF68779E5B2]
(No symbol) [0x00007FF68778EE13]
BaseThreadInitThunk [0x00007FFF2F4B257D+29]
RtlUserThreadStart [0x00007FFF3082AA58+40]
2023-11-27 09:08:55,449 - DEBUG - StartScraper - Exiting Scraper...
Can you confirm if you are able to save this specific course?
Is this "not valid json" actually a legitimate issue when it's attempting to parse JSON, or is there still an issue with the version of Chrome I'm using, so that I should try something older than v117?
Can you also please confirm if this is the correct URL: https://www.educative.io/courses/grokking-coding-interview-patterns-java/course-overview ?
Thanks for your help!
@MarkDHarris I can confirm that when I tested the URL where you previously got the error, I was able to download the file. I got the topic URL from the log file.
What normally happens is that when you open a URL in Chrome through Selenium, the latest versions cause Cloudflare issues, so version 116 should work properly.
I think you are using an old version of the scraper; can you tell me which version you are currently on? Also, please re-clone the current repo to a new location and try again.
I can also suggest a testing method. Requirements: Chrome v116, the current repo, and a logged-in account. In that window, open the API URL in a new tab to see whether you can view any data. Also, if you get any Cloudflare captcha, do report it here.
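The "Unexpected token '<', "<!DOCTYPE "..." message means the body that came back started with HTML rather than JSON. A minimal sketch of how such a body could be told apart before parsing is below; `classify_api_body` is a hypothetical helper for illustration and is not how the scraper itself handles the response.

```python
import json


def classify_api_body(body):
    """Classify a response body as "json", "html", or "invalid".

    Illustrative only: an HTML body (for example a captcha or login page)
    is detected by its leading "<" before any JSON parsing is attempted.
    """
    text = body.lstrip()
    if text.startswith("<"):
        return "html"
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return "invalid"
    return "json"


print(classify_api_body('<!DOCTYPE html><html><body>Just a moment...</body></html>'))
print(classify_api_body('{"components": []}'))
```

If the API URL opened in a new tab shows a page instead of raw data, the body would fall into the "html" case here, which points at a Cloudflare or login problem rather than a parsing bug.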
I cloned to a new machine and did not encounter the need to manually downgrade to chrome 116. Things worked much better. It crashed partway through but I was able to resume from there and complete. Thanks @anilabhadatta !
On Windows 11, using pyenv and Python 3.11.2, I just cloned the latest version and downloaded the Chrome driver and binary.
I experience the following crash on every course I have tried.
ERROR - StartScraper - start: 20: ExtensionScraper:start: 49: ExtensionScraper:scrapeCourse: 87: ExtensionScraper:scrapeTopic: 102: SeleniumBasicUtility:waitWebdriverToLoadTopicPage: 57: Message: Stacktrace: GetHandleVerifier [0x00007FF6535A8EF2+54786] (No symbol) [0x00007FF653515612] (No symbol) [0x00007FF6533CA64B] (No symbol) [0x00007FF65340B79C] (No symbol) [0x00007FF65340B91C] (No symbol) [0x00007FF653446D87] (No symbol) [0x00007FF65342BEAF] (No symbol) [0x00007FF653444D02] (No symbol) [0x00007FF65342BC43] (No symbol) [0x00007FF653400941] (No symbol) [0x00007FF653401B84] GetHandleVerifier [0x00007FF6538F7F52+3524194] GetHandleVerifier [0x00007FF65394D800+3874576] GetHandleVerifier [0x00007FF653945D7F+3843215] GetHandleVerifier [0x00007FF653645086+694166] (No symbol) [0x00007FF653520A88] (No symbol) [0x00007FF65351CA94] (No symbol) [0x00007FF65351CBC2] (No symbol) [0x00007FF65350CC83] BaseThreadInitThunk [0x00007FFDE605257D+29] RtlUserThreadStart [0x00007FFDE82CAA78+40]
2023-10-18 08:48:11,034 - DEBUG - StartScraper - Exiting Scraper...