carpentries-incubator / lc-webscraping

Introduction to web scraping
https://carpentries-incubator.github.io/lc-webscraping/

Pull request for post-workshop edits for webscraping #27

Closed Denubis closed 1 year ago

Denubis commented 5 years ago

Hi Thomas,

The lesson went well, and I think my edits to the pre-Scrapy material flowed very well. I've made one change after the workshop: moving the Scrapy shell up to the DOM. Running the JavaScript was a distraction, and when talking to the instructors we realised that if we hit the Scrapy shell earlier, we could do the same thing and be alert to problems with learners not having things installed or not having Python experience, and so be able to change course before we got to Scrapy.

I'd be delighted to do a proper debrief with you whenever you wish.

I just dumped all the images I had in. I think... most of them were used?

Cheers, -Brian

weaverbel commented 5 years ago

Hey @Denubis

I am not averse to putting changes through but generally this kind of large scale change would be done incrementally. It is very large for one commit, and it might be that the changes to one bit are acceptable to the maintainer and the changes to another are not. So just PR-wise, it is good to make commits small and many rather than one large file.

Is it possible to turn this into a section by section commit as that would be more easily digestible? I am not sure Thomas is still active in this lesson - we copied his lesson here and he may not want to keep maintaining it. The person who has been active in webscraping is @jnothman who developed a different version of the webscraping lesson - I think that is still sitting in data-lessons - is that right Joel? I can bring it over if need be and we should probably open an issue here to discuss what this lesson really ought to cover and how to do it best. If anyone is keen to be the lesson maintainer that would be good. Apart from Thomas, I am down for it, but really it's not my area, and I have the shell lesson to manage.

Anyway, please let me know what you think, Brian. It is great to have some updates to the lesson and I thank you for putting in the time, but it is a lot for a maintainer to deal with in such enormous chunks of changes.

Denubis commented 5 years ago

Hi @weaverbel, I had a chat with Thomas before I started working on the lesson and embarking on a fairly major rewrite. I didn't get the sense that he was passing on ownership (I suspect that the same will be true of my capstone submission, if you still have a maintainer for that).

Because this was a major rewrite (and having run it successfully at resbaz), I'm not quite sure how to chunk down the commits. (And Thomas didn't mention other rewrites in progress...)

I wouldn't be averse to being maintainer for this and the capstone. I'm fairly happy with my edits and think that they can form a good base going forward.

Looking at https://carpentries.github.io/maintainer-onboarding/01-social/index.html, this is a process I think Macquarie would support me going through.

@jnothman: Perhaps we should have a chat about your vision for this lesson? And then I can make you a new set of commits which match that vision?

jnothman commented 5 years ago

The fact that the lesson has already diverged from what my edition was based on, and that Brian's then diverged again, does make this hard from the perspective of git games.

Managing such conflicts. In terms of general management of these contentions, I think it would be a good idea if LC/SC had a class of lessons that were in a process of development (i.e. it's clear that Thomas's lesson was immature) rather than stable, and wrote a procedure for acquiring the lesson for a period of development, or for obtaining consensus around proposed changes.

Broad vision. More specifically, I have elsewhere suggested to @denubis that our role in supporting data-driven humanities (although this would be relevant to librarians too) might be better served by teaching resources that cover web-based data acquisition more broadly. I've not yet ascertained clear use cases for web scraping as such in research, but I can see that researchers would benefit from literacy around several related data collection methodologies:

Resolving this conflict. I think it is a pity that my revisions to the lesson were not adopted here; I can certainly sympathise with Brian's position. But despite my investment in this last year, I feel no proprietorship over where this goes. I am keen to have this left in a more mature form, but I have little time to spend on it. This is much less fresh in my mind, and it would be helpful if Brian could review the key differences between the editions before we decide where it goes.

I would be keen to meet on this topic and the broader vision of supporting web data collection literacy in research. Perhaps during 13-16 August?

Denubis commented 5 years ago

So, let's split this into a number of discussions, because I think they're all valuable.

1) I'd be delighted to work with you to create a lesson de novo covering all of those topics, since I agree that "scraping" should be the last resort. Perhaps it's worth beta-testing the new modules process as part of that? Maybe we can break those up into five 2-4 hour lessons? I know that SPARQL is its own module, as are scraping (obviously), web API interactions, etc. And I think a collaboration would be much more valuable in the creation of these lessons than either of us working alone, since we can each ground the other. The thing that is sparking this thought is https://carpentries.org/blog/2018/07/curriculum-vision, and I think this mutual desire to build out data-humanities modules could be a useful inspiration.

2) I don't think that lesson is this lesson, though we will almost certainly mine this lesson in all of its parts for content. Over the next while I'll go through the two lessons and do a manual diff on broad topics. We can then figure out a topical structure to work from, and then decide which of our two forks to pull from to achieve that. Perhaps with the goal of this being module 5 in the series outlined above? I can't really commit to a meeting next week (technically I've been ordered to take today off because I've been putting in too much time), but I'll start work on the diffing in https://pad.carpentries.org/webscraping-diff. I'll let you know when I've got basic outlines up.

@jnothman I think I'm having a brainfart. Can you link your fork here?

jnothman commented 5 years ago

Thank you for pointing me to the curriculum vision.

One of the things I'm struggling with in terms of (1) is that a lot of this stuff is hard to develop a separate module on. What I would mostly like researchers/librarians to know about this stuff is that it exists and what kind of data they can/cannot get out of it. So I would not focus on teaching each tool in depth, but emphasise literacy and hence breadth.

I'm happy to meet later, but I have more limited availability in Sept-Oct.

jnothman commented 5 years ago

See https://github.com/data-lessons/library-webscraping-DEPRECATED