Closed jonganc closed 7 years ago
Could you speak a little more to the hierarchical division you are seeing?
I'm still trying to wrap my head around this proposal and how it fits in with current activities, but one question I have is about the "who" that is doing all these things. I think it might be helpful to map who is currently doing them and who we think might do them in the new system, as my sense is that there is an underlying assumption that the "who" will change.
@dcwalk as a starting point, each service is listed with a key contributor in the proposed services list. If it's blank, no one is working on it at the moment.
@jonganc I too would love to know more about your thinking here. As you mentioned, a lot of these categories have equivalents in the current diagram (I'm totally down for name changes!), but it sounds like you're thinking more along the people side.
We definitely need to have a good think about how to break down these tasks & make them easy for people to take on. If you're alluding to having the services themselves map to areas of responsibility, color me interested :)
| Division | Responsibilities | Related current tools & activities | People currently involved | Notes | Analogue in services diagram (<...> indicates all of a group) |
|---|---|---|---|---|---|
| Crawly | web crawling backend | Parts of archivers 2.0 | Brendan | | ½ Patchbay ⅓ Miru Sentry Internet Archive |
| Changy | analyze website changes, including using AI | Page monitoring work | Andrew, Toly, Dan Allan | Organizationally, this is essentially unchanged | |
| Grabby | infrastructure for scraping uncrawlable websites. Writing a scraping cookbook (?) | Harvesting tools repo, Miru | ? Zach. people in Boston and Colorado. Dan Allan | | ⅓ Miru Uncrawlables Spreadsheet Recipes |
| Archy | URL pipeline app. URL seeding. | Archivers, chrome extension | Brendan, Kevin M, Dan Allan, Matt | The app allows people to specify whether URLs are crawlable or uncrawlable. It ultimately feeds into Crawly and Grabby. | Chrome Extension ½ Archivers ½ Patchbay ⅓ Miru |
| Techy (?) | Overall infrastructure (e.g. identity management, stats, health). Connecting with external services / other archiving efforts | Archivers | Brendan, ? | | |
| Notey (?) | Add metadata to data. Part of Archy and/or Grabby? | Archivers | ? domain experts | Does all data need metadata added or just uncrawlable datasets? | ½ Archivers |
| Talky (?) | Coordinate with agencies for direct DB dumps, filing FOI requests, writing primers | ? | ? Maya | Agency Primers | |
| Looky (?) | Making front-ends for interfacing with DBs. | ? | ? | | |
I ended up adding three more categories, though two of them (Talky, Looky) are essentially decoupled and the third (Techy) is sort of for tech back end stuff.
@dcwalk I made a chart showing my proposed divisions, what they involve, and who (as I understand it) is currently involved in the related tasks. I agree that the "who" may shift a bit if the responsibilities change, and it's hard to see exactly where they end up. For example, here in Boston, we have a couple of volunteers who work in dev ops. They might be able to help with back end stuff but, because there aren't currently clear divisions, it's hard for them to use their knowledge.
@b5 I apologize because I don't think I spent enough time before appreciating how much thought you had put into your proposal; I missed some important tasks. In my chart, I try to show how your services fit in with my ideas but I'd like to hear about how you think about things. I think my thoughts are partly about organizing people but, perhaps just as much, about organizing and compartmentalizing responsibility. One person could be in multiple tasks, for example in Techy and Archy, but their responsibilities in the two roles would be slightly different. I think this can improve the collaborativeness, although there are some implications, for example, about how the archiving app would behave (I've previously advocated for having the archiving "app" be primarily responsible for sorting between crawlable and uncrawlable URLs).
Still trying to wrap my head around this, but it looks like this overlaps a lot with the DataRescue workflow – maybe it would be worth connecting the two?
@mhucka My thoughts were definitely motivated by things I saw and learned at DataRescue events. I tend to think that the DataRescue event tasks should be subservient to the overall goals of the project. But having said that, I don't think the event tasks would change substantially in this particular proposal, although the responsibilities of the seeders and researchers would probably involve things like checking if URLs are crawlable and/or connecting URLs with identified datasets.
If you guys think these ideas are interesting, I can make a directory and flesh it out more. I think some of the clarity comes when considering not just the individual divisions but the edges between them. For example, Crawly has to figure out how to provide data to Archy and Changy. Notey has to work with Grabby and Archy.
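As a rough illustration of what "considering the edges" could look like, here is a minimal Python sketch that records the pairings mentioned in this comment and lists, for each division, which other divisions it would need an interface with. The `EDGES` list and the `interfaces` helper are hypothetical names made up for this sketch, and the edges are treated as undirected for simplicity; it is not part of any existing tool.

```python
from collections import defaultdict

# Pairings named in the discussion (assumed undirected for this sketch):
# Crawly provides data to Archy and Changy; Notey works with Grabby and Archy.
EDGES = [
    ("Crawly", "Archy"),
    ("Crawly", "Changy"),
    ("Notey", "Grabby"),
    ("Notey", "Archy"),
]


def interfaces(edges):
    """Map each division to the sorted list of divisions it shares an edge with."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return {name: sorted(peers) for name, peers in graph.items()}


if __name__ == "__main__":
    for division, peers in sorted(interfaces(EDGES).items()):
        print(f"{division}: {', '.join(peers)}")
```

Even a listing this small makes it visible that Archy sits between Crawly and Notey, which is the kind of boundary that would need an agreed data format.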
OK, I see.
One more comment, more of a personal preference I guess, but I'm not terribly fond of the "y" names ("archy", "grabby") :-\. But other people should chime in about that; maybe I'm being too academic and serious...
Lots to untangle here, first & foremost thanks for taking the time to get the ball rolling on these discussions!
I'm inclined to agree on not using the "y" names. Feedback from a variety of areas has been that we need to avoid inventing new terms when possible, as they create confusion for newcomers. The current proposal is guilty of this as well. To mitigate this, I think adding functional descriptions alongside / before service names is a good start (as per @mhucka's suggestion on Saturday's call).
With that being said, I think @jonganc has a very, very valid point in trying to discern a clear strategy for understanding who will ultimately be responsible for what. One of the biggest challenges for this project is the number of bases that we'd need to cover in order to get this done properly.
But before we get to dividing labor, I'd like to back up a bit & spend some time thinking about how to get people to engage in the first place. All of this is built on volunteered time, which means we need to convince people that what we're up to is worthwhile. On top of that, I'm not sure we can depend on being able to tell a volunteer what to work on, so we have to sell them first on the project & then on a task that needs doing. In my head this is the beauty of open source work: we don't assign tasks, we convince capable people of the value in accomplishing a task.
The bigger end of the task spectrum is convincing someone to be a project maintainer. I think this touches many, many different areas that we can improve upon, but if given the choice I'd focus the discussion of responsibilities on convincing people to help instead of on how to divide labour. For insights, I think we could start by putting a few questions to our current project maintainers to understand what got them to sign up in the first place. I'll make a separate issue for these questions, but I think it'd be great to continue the discussion here about the intersection of tasks & responsibilities.
@ebarry is an absolute champion on the distributed, community-driven project frontier, we'll have to see what she thinks upon returning from the South Pole.
Lastly, I can't recommend The Cathedral and the Bazaar highly enough for understanding the human side of the open source model.
I don't really mind changing the names. I mostly made them up because I was bored. Just a note: I'm not suggesting that we tell people what to do at all. The concern I'm trying to address is a bit of the opposite: if the organization is too nebulous and the goals too unclear, I think volunteers may feel that 1) the project is too disorganized to accomplish much, and/or 2) they don't want to work hard on something that ends up not fitting the project goals and thus goes unused. I think having a scaffolding would increase, rather than decrease, volunteerism. It would let people know: if you want to be part of the project, here are like 500 different areas where we need help! Also, this is not a typical open source project, since we're building more of a platform than an application. I think some decisions can't be made ad hoc.
Phew, I'm glad we took the time to work out communication here. Sounds like we may have been describing similar things all along :)
I'm in deep agreement that we need to do everything we can to have a clear set of asks on both the micro & macro scales. To your point about having 500 places where we need help, in my head those are best expressed as open issues on public repositories. To me our project lives & dies on maintained issue queues & any similar discussions (e.g. Slack). On the macro end, well, we have this proposal, and the ensuing discussion. I agree that we have a long way to go on this frontier, but remain faithful that we're headed in a good direction to make these sorts of improvements.
Long story short, I do think we should have a plan/scaffold, but I don't think we should take it too seriously, as we'll inevitably have to adapt to challenges & opportunities as they come. I'm less worried about dividing responsibilities than I am about the very salient points raised here about clear communication of needs, and convincing as many people as possible to join us in whatever way they feel comfortable.
I think it's worth considering a more hierarchical division of tasks. This is a more conceptual and human-organizational approach, and the underlying infrastructure might actually look like Brendan's service diagram.
I think about the division of tasks as follows (I've given the divisions cute names because I'm lonely and bored):

- Crawly: web crawling backend
- Changy: analyze website changes
- Grabby: infrastructure for scraping uncrawlable websites
- Archy: URL coordination. From current archivers app. Plays role in deciding whether websites are crawlable and seeding crawler, as well as sending URLs to be set up for Grabby
- Notey: Add metadata to data. (Only to data in Grabby? any data in Crawly?). Deduplicate data scraping efforts. May be merged somehow with Grabby and Archy.
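
To make the Archy role above a bit more concrete, here is a minimal Python sketch of the routing idea: a URL marked crawlable is queued for Crawly, one marked uncrawlable is queued for Grabby, and anything undecided waits for a person to look at it. The `Crawlability` enum, the `UrlRouter` class, and the queue names are all hypothetical, invented for this sketch; the real archivers app may work quite differently.

```python
from dataclasses import dataclass, field
from enum import Enum


class Crawlability(Enum):
    CRAWLABLE = "crawlable"
    UNCRAWLABLE = "uncrawlable"
    UNKNOWN = "unknown"


@dataclass
class UrlRouter:
    """Hypothetical stand-in for the Archy URL coordination step."""

    crawl_queue: list = field(default_factory=list)   # URLs handed to Crawly
    scrape_queue: list = field(default_factory=list)  # URLs handed to Grabby
    review_queue: list = field(default_factory=list)  # URLs awaiting a human decision

    def submit(self, url: str, status: Crawlability = Crawlability.UNKNOWN) -> str:
        """Queue the URL by its crawlability and return the destination name."""
        if status is Crawlability.CRAWLABLE:
            self.crawl_queue.append(url)
            return "crawly"
        if status is Crawlability.UNCRAWLABLE:
            self.scrape_queue.append(url)
            return "grabby"
        self.review_queue.append(url)
        return "review"


if __name__ == "__main__":
    router = UrlRouter()
    print(router.submit("https://example.org/reports", Crawlability.CRAWLABLE))
    print(router.submit("https://example.org/query-form", Crawlability.UNCRAWLABLE))
    print(router.submit("https://example.org/unsure"))
```

Keeping the undecided URLs in their own queue is one possible design choice; it leaves the crawlable/uncrawlable call to volunteers rather than guessing.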