daijro / CourseHeroUnblur

(⚠️DISCONTINUED⚠️) PoC Page Stitcher Image Manipulation Tool
Apache License 2.0

Possible way to make a continuation #5

Open 7ih opened 1 year ago

7ih commented 1 year ago

I realized that when you search Google for text from a CourseHero document, the problem and answer show up in the page description. That means CourseHero has a text version of the PDF openly available to web crawlers.


I don't know how you would get the full thing, though. There's a JSON in the HTML with some but not all problems/answers. Maybe make a request to the page disguised as a Google web crawler.

(Edit: I put the wrong image.)

daijro commented 1 year ago

Sorry for the late response; I've been very busy with schoolwork. This method is definitely very interesting though.

Here's what I've tried:

| Attempt | Result |
| --- | --- |
| I downloaded the CourseHero document py.pdf from your image, and searched Google's cache for other parts of the document using a Google dork: `site:https://www.coursehero.com/file/85220007/py 8. What country controlled Kosovo after the breakup of Yugoslavia?` | I found that in fact all of the questions in the PDF were cached within Google. |
| Tried other documents to see if the same thing happened | Very rarely did I find a document that had its text cached in Google |
| Sending requests to the page disguised as a google web crawler | Page returns a 403 Forbidden error. From what it seems, they are blocking new requests from search engines. |
| Tried to find a cached version of the page | Cache is not available. CourseHero is preventing Google from publicly displaying its cache. |
I also found that CourseHero's cache has been slowly disappearing from other search engines over the past few months.

My conclusion

I don't think CourseHero allows search engines to access the full content of documents anymore.

Search engine bots are blocked from their website, and cached copies of CourseHero's pages seem to have been disappearing from other search engines over the past few months. This leads me to believe the cache is old, and Google was simply reusing the copy it already had. CourseHero might have once allowed search engines to access the full document, and that is where this cache came from.

The most I can find of CourseHero's documents now is snippets of their public previews. Even if I were to fully implement a method to scrape Google's cache, I don't think it would work consistently across other documents.

I'm open to other methods though and anything else anyone can find to help! Thank you! ❤️

Invisible40 commented 4 months ago

Any new methods?

7ih commented 1 month ago

A pretty easy method I found is to just google math/history/etc. notes, save the web pages as PDFs, and upload those. You can upload basically anything as long as it's somewhat related to school.

bwkam commented 1 month ago

Yeah, that worked for me too. But I generated AI crap, and it always accepted it. Maybe I can automate that.