aditya-raj1 / GFG-scrapper

Scrape GFG

Geeks for Geeks PDFs

Table of Contents of the Dynamic Programming Book.

Download the PDFs from the releases page.

I started in 2015 from @gnijuohz's repo, but now (in 2018) I've re-written pretty much every part of the process.

Dependencies

Running the code

  1. First, find a "topic URL" for what you want to download, e.g.:

    • https://www.geeksforgeeks.org/tag/samsung/
    • https://www.geeksforgeeks.org/category/dynamic-programming/
  2. Create a JSON file containing links to all posts on that topic

    • python3.6 list_links.py https://www.geeksforgeeks.org/tag/samsung/

    • This JSON can now be edited by hand to remove links, re-order them, etc.

  3. Now fetch the actual posts

    • python3.6 download_html.py JSON/Samsung.json
  4. Finally, convert the HTML to a PDF using Pandoc

    • python3.6 html_to_pdf.py HTML/Samsung.html
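
The three commands above can also be chained from a single script. The following is a minimal sketch, not part of the repository: the helper names `build_pipeline` and `run_pipeline` are hypothetical, and it assumes the JSON and HTML paths are passed in explicitly rather than guessed from the topic URL.

```python
import subprocess
from typing import List


def build_pipeline(topic_url: str, json_path: str, html_path: str,
                   python: str = "python3.6") -> List[List[str]]:
    """Return the three commands from the steps above, in order.

    json_path and html_path are supplied by the caller (assumption):
    this sketch does not try to derive them from the topic URL.
    """
    return [
        [python, "list_links.py", topic_url],     # step 2: collect post links
        [python, "download_html.py", json_path],  # step 3: fetch the posts
        [python, "html_to_pdf.py", html_path],    # step 4: convert HTML to PDF
    ]


def run_pipeline(commands: List[List[str]]) -> None:
    """Run each command in turn, stopping at the first one that fails."""
    for cmd in commands:
        # check=True raises CalledProcessError on a non-zero exit status
        subprocess.run(cmd, check=True)
```

For the Samsung example, `run_pipeline(build_pipeline("https://www.geeksforgeeks.org/tag/samsung/", "JSON/Samsung.json", "HTML/Samsung.html"))` would run all three steps back to back. Running them separately is still useful when you want to edit the JSON by hand between steps 2 and 3.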

Things will work only if you're really lucky. This project has taught me how fragile my HTML to PDF pipeline really is. There are just too many things that can go wrong.

What could go wrong

Topic URLs

List of topic URLs that I've fetched. You can download these from the releases page.

Algorithms

Data Structures

Companies
