Add the ability to generate web pages

killlowkey commented 2 years ago

Hello, thank you for creating an excellent project; it is very helpful to me. I have an idea: could you add the ability to generate web pages with links, similar to the one below? Generated images are not convenient for reading. [image: url-shortening-service-like-tiny-url] Thank you very much.

anilabhadatta commented 2 years ago

@killlowkey I thought of that method, but some images don't show up in the HTML. This is why I had to take a screenshot of each webpage. Also, you can't be sure when educative might change the URLs of certain images, which in turn may affect the offline HTML file. Example: [image]

I would suggest using educative-viewer to view the scraped courses, as it is also designed for mobile view. You can use an OCR extension to copy text from the images.

killlowkey commented 2 years ago

@anilabhadatta I might have a solution to this problem: convert the image to Base64 encoding, as shown below.

[image: Snipaste_2022-06-10_22-05-12]

After conversion to Base64:

[image: Snipaste_2022-06-10_22-05-29]

This should work.
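A minimal sketch of the Base64 idea, written in Python since that is the scraper's language. It assumes the image bytes have already been downloaded; the helper names `to_data_uri` and `inline_image` are hypothetical and not part of the scraper.

```python
import base64
import mimetypes


def to_data_uri(image_bytes: bytes, filename: str) -> str:
    """Return a data: URI that embeds image_bytes directly in the HTML."""
    # Guess the MIME type from the file extension; fall back to a generic type.
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        mime = "application/octet-stream"
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def inline_image(html: str, image_url: str, image_bytes: bytes) -> str:
    """Replace every occurrence of image_url in html with its data URI."""
    return html.replace(image_url, to_data_uri(image_bytes, image_url))
```

Once the `src` holds a data URI, the page no longer depends on the original image URL staying reachable.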

anilabhadatta commented 2 years ago

@killlowkey try to implement it

killlowkey commented 2 years ago

thank you

BoostUpStation commented 2 years ago

Hi Killlowkey, any progress on saving the webpage as HTML or MHTML?

@anilabhadatta Can you please try saving the webpage as .mhtml instead of taking a screenshot? That would be really helpful. Also, please suggest a path and/or a series of courses that would make me a web scraper like you :) [I know Python and have done competitive programming with it, that's all.]

killlowkey commented 2 years ago

@BoostUpStation I haven't found a way yet. Images are not shown when the browser saves the webpage as HTML or MHTML.

BoostUpStation commented 2 years ago

Here's an old repo that saves as HTML/MHTML and PDF, but it is written in TypeScript, which I don't know :( https://github.com/MrAbdulQadeer/educative.io-downloader

Hoping somebody can implement it here in Python :)

killlowkey commented 2 years ago

@BoostUpStation I have never used Python or TypeScript, so I can't help you. I think the key idea for saving a webpage as HTML or MHTML is to convert the image URLs to Base64 encoding. Hope it helps.

anilabhadatta commented 2 years ago

@BoostUpStation @killlowkey Mainly, there are a few svg tags which contain image URLs, so the main option is to find every image URL and convert it to base64, and also to keep track of the image tags inside the SVGs and show them in the MHTML.

https://www.educative.io/courses/operating-systems-virtualization-concurrency-persistence/3jj3lxm03xr

This is a test URL where you can see that the image won't show up in MHTML. If you find a way to show it in MHTML by manually placing it in the right place, then I will look into it.
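As a rough illustration of "find every image URL, including the ones inside SVGs", here is a sketch using Python's standard `html.parser`. The class name `ImageURLCollector` is made up for this example, and it assumes the SVG markup is already part of the HTML being parsed; it cannot see into content loaded through a separate document.

```python
from html.parser import HTMLParser


class ImageURLCollector(HTMLParser):
    """Collect image URLs from <img src> and from <image href> inside <svg>."""

    def __init__(self):
        super().__init__()
        self.svg_depth = 0   # > 0 while the parser is inside an <svg> element
        self.pending = []    # remote URLs that still need converting to base64
        self.embedded = 0    # images that are already data: URIs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "svg":
            self.svg_depth += 1
        url = None
        if tag == "img":
            url = attrs.get("src")
        elif tag == "image" and self.svg_depth > 0:
            # SVG images use href, or xlink:href in older markup.
            url = attrs.get("href") or attrs.get("xlink:href")
        if not url:
            return
        if url.startswith("data:"):
            self.embedded += 1
        else:
            self.pending.append(url)

    def handle_endtag(self, tag):
        if tag == "svg" and self.svg_depth > 0:
            self.svg_depth -= 1
```

After `collector.feed(html)`, `collector.pending` lists the URLs that would still break offline, in document order.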

killlowkey commented 2 years ago

@anilabhadatta I can't test the URL at the moment because I don't have an Educative Pro account. If you can find a URL that is not restricted, I can take a look at the effect.

anilabhadatta commented 2 years ago

@killlowkey I will try to send a free course link that has the same issue. Here, use this: https://www.educative.io/courses/getting-started-braintree-api/qABYKBmxEY0

killlowkey commented 2 years ago

@anilabhadatta This is a tricky problem; I currently have no way to display the SVG in MHTML.

anilabhadatta commented 2 years ago

@killlowkey Yes, that is why I didn't implement it. See if you can find a way to show the SVG image element in MHTML. I also thought of saving HTML, but that won't actually work due to styling issues. PDF is out of the question, since text may be missing or cut off at a page break.

BoostUpStation commented 2 years ago

@anilabhadatta It's very simple, with one sticking point: press Ctrl+S in the Chrome window opened by the webdriver, select the second option, 'save as single file .mhtml', and press Enter.

These steps now need to be scripted in Python/JS/HTML, so please do this rather than converting, decoding and encoding.

anilabhadatta commented 2 years ago

@BoostUpStation Actually, if you Ctrl+S to MHTML, you won't be able to see an image that sits inside an iframe > SVG. You also have to change each image URL to base64; a few of them may already be converted. This is required because if educative changes its domain in the future, or a URL is updated to something new, or your system is offline, the images won't load. Base64 ensures the image is available for offline use.

BoostUpStation commented 2 years ago

@anilabhadatta Yes, you are right. It didn't even work when saving as 'complete html with files included'; none of those options work as expected. It didn't work on Android either.

So base64 is the only way, apart from images. Hope you implement it :)

Please see the old repo link I shared. I think he also took screenshots, with some more implementation (TypeScript was used), and with that, even if we zoom beyond 400%, the quality stays the same and the image doesn't pixelate.

BoostUpStation commented 2 years ago

@anilabhadatta Here's some Python code that converts an image to base64 and vice versa: https://superuser.com/questions/263634/decoding-base64-images-and-saving-to-a-file

And the links to those SVGs can easily be collected via JS, by searching for 'data:' in the document.
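The reverse direction can be sketched in a few lines of Python: search the saved HTML for `data:` image URIs and decode them back to raw bytes. This is only an illustration; the regular expression and the function name `extract_embedded_images` are assumptions of this example, and it only handles Base64-encoded images.

```python
import base64
import re

# Matches e.g. data:image/png;base64,iVBORw0... and captures type and payload.
DATA_URI = re.compile(r"data:(image/[\w.+-]+);base64,([A-Za-z0-9+/=]+)")


def extract_embedded_images(html: str):
    """Return a list of (mime_type, raw_bytes) for each Base64 image in html."""
    return [
        (match.group(1), base64.b64decode(match.group(2)))
        for match in DATA_URI.finditer(html)
    ]
```

Each `raw_bytes` value can then be written to disk with `open(path, "wb")` to get the image file back.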

anilabhadatta commented 2 years ago

@BoostUpStation The issue is not with finding base64 or with the conversion. The issue is how to place an img element with base64 into the MHTML in place of that iframe -> svg.

anilabhadatta commented 2 years ago

@killlowkey @BoostUpStation New update. I did some testing on SVG element images. It seems the images inside the SVG were never the problem; the object tag and its #document were the main issue. I was able to get the content inside #document and then place it outside the object tag, in the object's parent element.

    ifrm = document.querySelectorAll("object[aria-label='svg viewer']")[0]
    svg_element = ifrm.contentDocument.documentElement
    ifrm.parentNode.append(svg_element)
    cls_name = ifrm.className
    svg_element.classList.add(cls_name)

Try this in your system's Chrome console and then save the file using SingleFile. I can iterate over all the possible object tags and change the HTML. After the conversion, I was thinking of using the SingleFile extension to save the HTML page, because it automatically converts all the image URLs to base64 and also keeps the HTML intact. I need some help with this extension: if there is any way to call the SingleFile extension from the Chrome console and get the scraped HTML file, then I can just add the quiz images and the scraping would be complete.

killlowkey commented 2 years ago

@anilabhadatta Calling the Chrome extension via JavaScript in the console and getting the output HTML may be difficult to implement. You can compare the HTML saved by the SingleFile extension with the previously saved HTML to see how it displays the SVG. Hopefully this will help you.

anilabhadatta commented 2 years ago

@killlowkey I will test this in a few hours: https://github.com/gildas-lormeau/SingleFile/issues/820

BoostUpStation commented 2 years ago

@anilabhadatta Yes, after running the script in the console and then using that extension, it saves all the SVGs in the HTML file. It's great. Now we only have to call that extension. I'll also look for something if I can.

BoostUpStation commented 2 years ago

@anilabhadatta I'm getting the error below when running the code in the console for this link.

    ifrm = document.querySelectorAll("object[aria-label='svg viewer']")[0]
    svg_element = ifrm.contentDocument.documentElement
    ifrm.parentNode.append(svg_element)
    cls_name = ifrm.className
    svg_element.classList.add(cls_name)

https://www.educative.io/courses/getting-started-braintree-api/x1BG30wrnol

Uncaught TypeError: Cannot read properties of undefined (reading 'contentDocument') at :2:20

BoostUpStation commented 2 years ago

So here we have to check whether the page has a matching element with a 'contentDocument' before using it. Then it will work fine.

BoostUpStation commented 2 years ago

@anilabhadatta You could do it like this, if it works: add the quizzes and other such elements one under another by modifying the currently open web page.

Then run the 4-5 line script above, and then call the SingleFile extension or implement its code from GitHub.

anilabhadatta commented 2 years ago

@BoostUpStation You get the error because, for testing, I hardcoded ifrm to take the node at index zero, but on that page querySelectorAll returns an empty list. When I implement it, I will loop over the list, so it won't raise any error.

I will have to look at the SingleFile injection part. I was thinking of adding the quiz images after getting the HTML content from SingleFile, because my program is set up to run that way; otherwise I would have to change a lot of code. It may also interfere with the code containers, so it is better to just get the HTML content using SingleFile and then append all the quiz images. It is much safer in many ways. Also, I would ask you to test the topic list URL traversal method, create a pull request and attach 2-3 course zips. After that, I will push my code; otherwise you may need to delete the fork and re-fork it again.
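A looped, guarded version of the console snippet might look like the sketch below, wrapped so it can be run from Python through Selenium's `driver.execute_script`. This is not the code that was pushed to the repository; the constant `INLINE_SVG_JS` and the function `inline_svg_objects` are hypothetical names, and the selector is copied from the snippet earlier in the thread.

```python
# JavaScript run in the page: move each embedded SVG document out of its
# <object> wrapper so that it becomes part of the main DOM.
INLINE_SVG_JS = """
const objects = document.querySelectorAll("object[aria-label='svg viewer']");
let moved = 0;
for (const obj of objects) {
    const doc = obj.contentDocument;
    if (!doc) continue;  // not loaded or not accessible: skip this object
    const svg = doc.documentElement;
    obj.parentNode.append(svg);
    // Copy the wrapper's classes one by one; safe when there are none.
    svg.classList.add(...obj.classList);
    moved += 1;
}
return moved;
"""


def inline_svg_objects(driver) -> int:
    """Run the snippet via a Selenium WebDriver; return how many SVGs moved."""
    return driver.execute_script(INLINE_SVG_JS)
```

On a page with no matching object tags the loop body never runs and the call returns 0, instead of raising the TypeError reported above.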

anilabhadatta commented 2 years ago

@BoostUpStation @killlowkey Implementation successfully completed. [attachment: launching a personal website using r.zip]

killlowkey commented 2 years ago

@anilabhadatta It works perfectly. Nice.

anilabhadatta commented 2 years ago

@killlowkey @BoostUpStation I will do some testing and then push it.

BoostUpStation commented 2 years ago

@anilabhadatta Awesome. No issues, everything is working perfectly.

Once you add the code, I'll re-fork it, because it sometimes exits on some URLs. After you have uploaded the latest SingleFile HTML code, I'll test it and then create a pull request.

Waiting for the code update from your side.

And will the code inside the HTML be scrollable, or must separate code files still be used to view the code?

anilabhadatta commented 2 years ago

@BoostUpStation The code will not be scrollable, because that is done dynamically from educative's servers. I recommend using educative-viewer to open the code window and for easier access to the HTML files as well. I will push it in a few hours; I'm currently testing it.

anilabhadatta commented 2 years ago

@killlowkey @BoostUpStation I have pushed the latest version. Clone it and test it on a few courses.

anilabhadatta commented 2 years ago

@killlowkey @BoostUpStation Refer to v5.2, the latest commit, pushed a few minutes ago.

BoostUpStation commented 2 years ago

@anilabhadatta Yes, I have pulled the latest code and am testing it. Wouldn't it be better to bundle the SingleFile script at a local path with the code, rather than pulling it from git on the fly?

And what about courses we have already scraped: why would we scrape the same course again when using the scraper for paths? Paths contain many (or all) of the same courses that are also offered as separate courses.

anilabhadatta commented 2 years ago

@BoostUpStation I tried local injection but failed, so I am pulling it from git. (If you are able to implement it, you can commit it.) I have built the scraper to work on course URLs, irrespective of whether it is a single course or a path. I have added a single condition on the next-button page to check whether the page is the last page of that path, so that the scraper can exit. In paths, most of the content is generally the same, except for 1-2 pages or more, I guess, but the content is usually organized in paths and there is no need to check manually.

anilabhadatta commented 2 years ago

@BoostUpStation I have updated educative-viewer as well. It will now show content at 100% zoom.

BoostUpStation commented 2 years ago

@anilabhadatta So is it better to scrape individual courses or paths? And if we scrape individual courses, how do we skip them when scraping paths? I don't want to download them again.

anilabhadatta commented 2 years ago

@BoostUpStation Basically, the scraper needs the first topic URL and an index (for resuming). You can just leave the topic URLs of path modules out of the URL list. Go to educative.io/explore, copy the topic URL from each course and paste it into url.txt.

anilabhadatta commented 2 years ago

@BoostUpStation If you want to avoid scraping a course again while scraping paths because it is already downloaded, you will need to remove those URLs manually. Currently there is no way to check for and skip those courses, because both the URL and the name are different.
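The manual removal can at least be made less tedious with a small helper. The sketch below only matches exact URLs, so it does not solve the problem described above: the path-module URLs to skip still have to be listed by hand. The function name `remaining_urls` and the idea of keeping a separate skip list are assumptions of this example, not features of the scraper.

```python
def remaining_urls(all_urls, skip_urls):
    """Return the URLs from all_urls that are not listed in skip_urls."""
    # Normalise whitespace and trailing slashes so trivial variants match.
    done = {u.strip().rstrip("/") for u in skip_urls if u.strip()}
    result = []
    for url in all_urls:
        key = url.strip().rstrip("/")
        if key and key not in done:
            result.append(key)
            done.add(key)  # also drops duplicates within all_urls
    return result
```

The result can be written back to url.txt, one URL per line, before starting the scraper.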

BoostUpStation commented 2 years ago

@anilabhadatta OK, thanks, I'll try that in a few days. Could you please tell me what is included in the "code widget" folders? So far they have been empty, and I have tried more than 10 courses. For example, in this course topic: https://www.educative.io/courses/master-deno-javascript-runtime/3w7RNLk1W7p

anilabhadatta commented 2 years ago

@BoostUpStation The code widget folder may or may not contain code. There you will see that the widget only has an output tab, so there is no code, and that is why the folder is empty. If there were multiple tabs, the folder would contain the code. Test this link: https://www.educative.io/module/lesson/ace-html/g2DpwW50279. I found a bug: the widget type is also present inside code-download type containers. I will fix it tonight.

anilabhadatta commented 2 years ago

@BoostUpStation Fixed, and added a feature to collect data from runjs type containers. A few text/output files may not be saved from widget-type containers, since this is in the beta stage and I am not planning to fix that 🤣 because of the highly complex cases. Most of the content will still be downloaded from widget-type containers. You may also notice that the HTML doesn't show the output images that are present inside widgets, so I have tried to capture the images and add them to their respective widget folders. Test link: https://www.educative.io/module/lesson/ace-html/g2DpwW50279. I won't be able to show the image in the HTML itself, since the iframe doesn't allow me to access it from outside (CORS issue).

BoostUpStation commented 2 years ago

@anilabhadatta The whole HTML is text-selectable except these runjs containers. Is there any possibility of making them text-selectable as well?

BoostUpStation commented 2 years ago

@anilabhadatta I found the solution for that: you just have to remove the 'no-user-select' property from the div with the 'monaco-editor' class, if the property exists; otherwise continue. Try to implement this when saving with SingleFile; if that's not possible, the HTML has to be edited afterwards.
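If the edit has to happen on the saved HTML afterwards, a sketch in Python could look like this. It assumes that 'no-user-select' appears as a class name next to 'monaco-editor' in a double-quoted class attribute, which may not hold for every saved file; `make_monaco_selectable` is a hypothetical helper, not the code from the pull request.

```python
import re

# A double-quoted class attribute (but not e.g. data-class="...").
CLASS_ATTR = re.compile(r'(?<![\w-])class="([^"]*)"')


def make_monaco_selectable(html: str) -> str:
    """Drop 'no-user-select' from elements that carry 'monaco-editor'."""

    def fix(match):
        classes = match.group(1).split()
        if "monaco-editor" in classes and "no-user-select" in classes:
            kept = [c for c in classes if c != "no-user-select"]
            return 'class="' + " ".join(kept) + '"'
        return match.group(0)  # leave every other element untouched

    return CLASS_ATTR.sub(fix, html)
```

Elements without the 'monaco-editor' class are returned unchanged, which mirrors the "if the property exists, else continue" rule.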

anilabhadatta commented 2 years ago

@BoostUpStation The whole code won't be available if the widget has a scrollbar.

BoostUpStation commented 2 years ago

@anilabhadatta I have implemented it, will test and report.

BoostUpStation commented 2 years ago

Yes, but when it doesn't have a scrollbar it is more helpful, and I have implemented it. If you allow it, I can create a pull request for just that.

anilabhadatta commented 2 years ago

@BoostUpStation Create a PR, then.

BoostUpStation commented 2 years ago

@anilabhadatta One issue when saving the single file: the first question of each quiz is repeated, and all quizzes are added at the end of the page rather than at their specific places, one after another. Also, the quiz screenshots aren't zoom-independent and aren't selectable (but the first question of the quiz is selectable, as I think it is captured by the SingleFile script). See if anything can be done about them.

anilabhadatta commented 2 years ago

@BoostUpStation Nothing can be done: I have to take screenshots of the quizzes, which is why they are not selectable. Let the first question repeat; there may be cases where the first question doesn't show up in the single file.

anilabhadatta / educative.io_scraper

Add the ability to generate web pages #6