debakarr / kodekloud-downloader

Simple downloader for https://kodekloud.com/
138 stars 43 forks

Wrong scraped video URLs, extension of #23 #26

Closed Ziad-Tawfik closed 1 year ago

Ziad-Tawfik commented 1 year ago

Hello Debakarr,

Thanks for your reply and the fast update to the code :) I replied on the earlier issue, but it had already been closed, so please find my findings below.

I commented out this part of the logic so that the bot keeps running without failing:

    raise SystemExit(
        "Your cookie might have expired or you don't have access to the course."
        "\nPlease refresh/regenerate the cookie or enroll in the course and try again."
    )

However, the main problem is in the scraping: the two variables below (main_lesson_content and topics) don't contain the correct values for the videos when soup.find and the zip function are used.

    main_lesson_content = soup.find("div", class_="lessons_main__content") or soup.find(
        "div", class_="ld-lesson-list"
    )
    topics = main_lesson_content.find_all("div", class_="w-dyn-item") or main_lesson_content.find_all(
        "div", class_="ld-item-list-items"
    )

I investigated by printing both of them and found that the scraped "Billing and Pricing" topic is zipped with the URLs of the previous topic, "Technology - Part Two", which is what raises the error. The change above only works around it: it creates a folder for the "Billing and Pricing" topic but downloads all the "Technology - Part Two" videos again.

That's why the videos appear to be duplicated. Is there any way to fix this?
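To make the duplication visible before anything is downloaded, a check along these lines could flag any topic whose URL list repeats an earlier topic's. This is only a sketch: the function name find_duplicated_topics, the (topic, urls) tuple shape, and the example.com URLs are made up for illustration and are not taken from the downloader's code.

```python
from typing import Dict, List, Tuple


def find_duplicated_topics(topics: List[Tuple[str, List[str]]]) -> Dict[str, str]:
    """Map each topic whose URL list repeats an earlier topic's to that earlier topic."""
    seen: Dict[Tuple[str, ...], str] = {}
    duplicates: Dict[str, str] = {}
    for name, urls in topics:
        key = tuple(urls)
        if key and key in seen:
            # Same non-empty URL list as a topic we already saw: likely mis-paired.
            duplicates[name] = seen[key]
        else:
            seen.setdefault(key, name)
    return duplicates


# Hypothetical data shaped like the case described in this issue.
scraped = [
    ("Technology - Part Two", ["https://example.com/lesson-a", "https://example.com/lesson-b"]),
    ("Billing and Pricing", ["https://example.com/lesson-a", "https://example.com/lesson-b"]),
]
print(find_duplicated_topics(scraped))
```

A flagged topic could then be skipped with a warning instead of being downloaded a second time under the wrong folder name.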

Thanks in advance, you've done great work though!

debakarr commented 1 year ago

Oh, thanks for the explanation. I am still in the office. I will check once I get back.

Tisona commented 1 year ago

Commenting out this part will not let you download the "Billing and Pricing" videos, because Kodekloud has a bug in this course's table of contents: instead of the "Billing and Pricing" videos, it links to the "Technology - Part Two" ones. You can check it yourself with curl.

Same problem was described here: https://github.com/debakarr/kodekloud-downloader/issues/9

Ziad-Tawfik commented 1 year ago

@Tisona Yes, I mentioned that it downloaded the content of "Technology - Part Two" twice, but the bot continued to work and didn't stop. When I inspect the page in a browser, I can't see any problem with the "Billing and Pricing" part; it looks the same as the previous parts.

The same problem also occurs with "Docker-vs-ContainerD (13:05)" in Core Concepts in the course below: https://kodekloud.com/courses/certified-kubernetes-administrator-cka/

Tisona commented 1 year ago

The downloader does not use a browser; use curl to check what the downloader actually receives. Also check the issue I mentioned above; it will make things clearer.
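The same check can be scripted instead of run through curl. The sketch below uses only the Python standard library: fetch_raw_html requests a page with no cookies, roughly like a plain curl call, and extract_links lists the links found in the returned HTML. Both function names, the LinkCollector class, and the sample markup are illustrative assumptions, not part of the downloader.

```python
from html.parser import HTMLParser
from typing import List
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in the fed HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html: str) -> List[str]:
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def fetch_raw_html(url: str) -> str:
    """Fetch a page without cookies, so no browser session is involved."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


# Made-up markup; the real course page structure may differ.
sample = '<div><a href="/lessons/one">One</a><a href="/lessons/two">Two</a></div>'
print(extract_links(sample))
```

Running extract_links(fetch_raw_html(course_url)) on a course page and comparing the result with what the browser shows should reveal whether the links already differ in the raw response.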

debakarr commented 1 year ago

The issue appears to be that a request for the course page made without logging in does not return the same content as a request made while logged in. Opening https://kodekloud.com/courses/aws-cloud-practitioner/ in Incognito:

Very strange.

image

What I quickly tried was copying the curl command from the network tab (while I was logged in): image

Using curlconverter to convert that into Python code: https://curlconverter.com/python/

and then checking whether the video is present in the response body: image

So it looks like making the requests with the cookie might help. But the response body is a bit different from the one we get without any auth or cookie, so the class names used to parse the topics and lessons need to change.
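A minimal sketch of what a cookie-authenticated request could look like with the standard library is shown here. The helper names and the "session=abc123" cookie value are placeholders; the real cookie string would be whatever the browser sends when logged in, and this is not how the downloader currently builds its requests.

```python
from urllib.request import Request, urlopen


def build_authenticated_request(url: str, cookie: str) -> Request:
    """Attach the browser's Cookie header so the server sees a logged-in session."""
    return Request(url, headers={"Cookie": cookie, "User-Agent": "Mozilla/5.0"})


def fetch_course_page(url: str, cookie: str) -> str:
    """Download the course page as a logged-in user (performs a network call)."""
    with urlopen(build_authenticated_request(url, cookie), timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


# Placeholder cookie value; copy the real one from the browser's network tab.
request = build_authenticated_request(
    "https://kodekloud.com/courses/aws-cloud-practitioner/", "session=abc123"
)
print(request.get_header("Cookie"))
```

The HTML returned this way would still need its own selectors, since the logged-in markup is reported above to differ from the anonymous one.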

debakarr commented 1 year ago

The issue got auto-closed, but you can reopen it if it is not fixed.