Open honzajavorek opened 6 months ago
Some time ago I found this, not sure if you saw that already:
Yes, I'm a fan of Diátaxis. A single course should consist of lessons, and each lesson can take the Diátaxis approach, as proposed here https://github.com/evildmp/diataxis-documentation-framework/discussions/130. Also, the current "tutorials" are clearly how-to guides as defined by Diátaxis, and I want to keep them as such, but that's outside the scope of the course flow above.
What's in the "Using frameworks to simplify scraping" part? Do you plan to move all Crawlee related content in there, or is that something even more advanced?
@mnmkng You got me thinking! The idea when drawing the chart was that we'd explain basic concepts, and then in a separate course show people that they can use frameworks (Crawlee, Scrapy) to achieve the same and more, but more simply.
But your question brings me to a better approach 💡 All the courses should start with simple tools, but lead people to using frameworks in the end, demonstrating along the way why they're useful. The same could be done with platforms.
Maybe there will be some topics left which could form a separate "Using frameworks/platforms to simplify scraping" course, maybe not. But these two shouldn't be separate courses, they should be layers each course culminates to.
```mermaid
flowchart TB
    subgraph start["Getting started"]
        direction LR
        beginner_js(Web scraping basics<br>for JavaScript devs)
        beginner_py(Web scraping basics<br>for Python devs)
        beginner_js ~~~ beginner_py
    end
    subgraph advanced["Learning advanced techniques"]
        direction LR
        browsers(Web scraping with browsers)
        apis(Web scraping with APIs)
        anti(Navigating anti-scraping protections)
        browsers ~~~ apis
    end
    start --> advanced
```
```mermaid
flowchart TB
    subgraph advanced["Course"]
        direction TB
        home(State requirements,<br>promises, motivation)
        basic(Teach basics<br>with basic tools)
        framework(Use framework<br>to simplify code or<br>allow advanced goal)
        platform(Use platform<br>to simplify code or<br>allow advanced goal)
        home --> basic --> framework --> platform
    end
```
I changed the names of the courses in the chart above to
Looks good to me. Browsers vs. API scraping can, in a way, be set against each other with the typical pros & cons page.
Historically, I wanted to have some super-pro course, something like "High-scale scraping", with things like recursive pagination, reverse engineering JS, etc. Basically the final stage of the journey. Unfortunately, I failed to deploy it in a meaningful form (this crazy PR still exists). I think we can easily add that later if we find someone to write it.
Yup, there should definitely be a page which clearly explains where browsers are the best fit and where APIs are the best fit (which will obviously lean towards recommending anything other than browsers where possible). Ideally a page which can be shared between those two courses.
Regarding super-pro techniques, I wonder if it's a field which allows for the creation of a step-by-step course, or if it's more like scenarios which you search for once you bump into them and then look for a canned solution to that particular problem. In that case it might make sense to have it as a collection of how-tos.
But that's something we can figure out later. I didn't know about the PR, so it's good you mentioned it. I'll keep it in mind.
Btw, note that the most popular topic in web scraping, and the most sought-after guides nowadays, are all about bypassing anti-scraping protections: primarily Cloudflare, but also CAPTCHAs and other annoying blocks. So we should definitely keep in mind that expanding the anti-blocking section is one of the priorities.
This is the structure of courses I propose we gravitate towards. As of now, it is a rough outline which will become more detailed over time; splitting, merging, renaming, etc. is expected as part of the evolution.
This issue is an elaboration of what I earlier described internally with the following words: