apify / apify-docs

This project is the home of Apify's documentation.
https://docs.apify.com
Apache License 2.0

Proposal on future structure of courses #1015

Open honzajavorek opened 4 months ago

honzajavorek commented 4 months ago

This is a structure of courses I propose we should gravitate towards. As of now, it is a rough structure which will get more detailed over time - splitting, merging, renaming, etc. is expected as part of the evolution.

flowchart TB
    subgraph start["Getting started"]
        direction LR
        beginner_js(Introduction to scraping<br>for JavaScript developers)
        beginner_py(Introduction to scraping<br>for Python developers)
        beginner_js ~~~ beginner_py
    end

    subgraph advanced["Learning advanced techniques"]
        direction LR
        browsers(Advanced scraping<br>with browsers)
        apis(Advanced scraping<br>with APIs)
        anti(Avoiding<br>anti-scraping protections)
        browsers ~~~ apis
    end

    subgraph simplify["Making life easier"]
        direction LR
        frameworks(Using frameworks<br>to simplify scraping)
        platforms(Using platforms<br>to simplify scraping)
        frameworks ~~~ platforms
    end

    start-->advanced
    start-->simplify
    advanced-->simplify

This issue is an elaboration of what I earlier described internally with the following words:

My hunch is there could be Web Scraping Basics in JS, Web Scraping Basics in Python, Web Scraping with Browsers, Web Scraping of APIs, etc. If we connect these courses into one learning path and call it a Web Scraping Zero to Hero Learning Path™, it can easily also have a landing page and some marketing content, so the actual number of courses doesn't concern me that much.

Include one or two courses on how to start with Apify, but not more. Something like Web Scraping with Apify, to complete the learning path. Maybe even something more inconspicuous, such as Getting Productive with Web Scraping Platforms, where we'd teach people why and how to use e.g. proxies, and only then mention that Apify has awesome proxies, with our platform serving as the hands-on example for the student.

In the Getting Productive with Web Scraping Platforms course, we'd teach people how platforms in general can help them avoid a lot of heavy lifting. In the lessons, we would lay out the problems, explain the solutions, and then show, hands-on, how Apify can be used as a solution. That's the best Apify advertisement: it shows the advantages in contrast to manual solutions.

B4nan commented 4 months ago

Some time ago I found this, not sure if you saw that already:

https://diataxis.fr/

honzajavorek commented 4 months ago

Yes, I'm a fan of diataxis. A single course should consist of lessons, and each lesson can take the diataxis approach, as proposed here https://github.com/evildmp/diataxis-documentation-framework/discussions/130. Also, the current "tutorials" are clearly How-to guides as defined by diataxis, and I want to keep them as such, but that's outside the scope of the course flow above.

mnmkng commented 4 months ago

What's in the "Using frameworks to simplify scraping" part? Do you plan to move all Crawlee related content in there, or is that something even more advanced?

honzajavorek commented 4 months ago

@mnmkng You made me think! The idea when drawing the chart was that we explain basic concepts, and then in a separate course we show people they can use frameworks (Crawlee, Scrapy) to achieve the same and more, but more simply.

But your question brings me to a better approach 💡 All the courses should start with simple tools, but lead people to using frameworks in the end, demonstrating why they're useful on the way. The same could be done with platforms.

Maybe there will be some topics left which could form a separate "Using frameworks/platforms to simplify scraping" course, maybe not. But these two shouldn't be separate courses; they should be layers each course culminates in.

honzajavorek commented 4 months ago

Hierarchy of courses:

flowchart TB
    subgraph start["Getting started"]
        direction LR
        beginner_js(Web scraping basics<br>for JavaScript devs)
        beginner_py(Web scraping basics<br>for Python devs)
        beginner_js ~~~ beginner_py
    end

    subgraph advanced["Learning advanced techniques"]
        direction LR
        browsers(Web scraping with browsers)
        apis(Web scraping with APIs)
        anti(Navigating anti-scraping protections)
        browsers ~~~ apis
    end

    start --> advanced

Structure of a single course

flowchart TB
    subgraph advanced["Course"]
        direction TB
        home(State requirements,<br>promises, motivation)-->basic(Teach basics<br>with basic tools)-->framework(Use framework<br>to simplify code or<br>allow advanced goal)-->platform(Use platform<br>to simplify code or<br>allow advanced goal)
    end

honzajavorek commented 4 months ago

I changed names of the courses in the chart above to

metalwarrior665 commented 4 months ago

Looks good to me. Browser scraping and API scraping can, in a way, be set against each other on a typical pros & cons page.

Historically, I wanted to have some super-pro course, something like "High scale scraping", with things like recursive pagination, reverse engineering JS, etc. Basically a final stage of the journey. Unfortunately, I failed to deploy it in a meaningful form (this crazy PR still exists). I think we can easily add that later if we find someone to write it.

honzajavorek commented 4 months ago

Yup, there should definitely be a page which clearly explains where browsers are the best fit and where APIs are the best fit (which will obviously lean towards recommending anything other than browsers if possible). Ideally a page which can be shared between those two courses.

Regarding super-pro techniques, I wonder if it's a field which allows for the creation of a step-by-step course, or if it's more like scenarios which you only search for once you bump into them, looking for a canned solution to that particular problem. In that case it might make sense to have it as a collection of how-tos.

But that's something we can figure out later. I didn't know about the PR, so it's good you mentioned it. I'll keep it in mind.

mnmkng commented 4 months ago

Btw, note that the most popular topic in web scraping, and the most sought-after guides nowadays, are all about bypassing anti-scraping protections. Primarily Cloudflare, but also CAPTCHAs and other annoying blocks. So we should definitely keep in mind that expanding the anti-blocking section is one of the priorities.