Closed ashander closed 9 years ago
ping @leeper
Sorry, this ended up being a really long comment
I think it's definitely correct that the task view could be better. Given the number of packages it lists, I'd been wondering whether a simple listing of packages even makes sense. Many of the other task views describe general categories of analysis and then list packages that can be helpful in those specific domains, which I think is something we should consider if we're going to do a larger reform.
On the specific proposal, I think it's right that the task view has multiple sections. Looking at it now, "Tools for Working with the Web from R" is a pretty bad subheading for the first half of the task view. Basically, we have low-level core libraries (RCurl, curl, httr, XML, the json packages, RSelenium), higher-level core libraries (rvest, httr, selectr), and server-side stuff (Shiny, Rook, etc.). I think we could break this down more clearly and be slightly more opinionated. For example, instead of listing RCurl, httr, and curl as separate points, we should say:
Scraping web pages and importing data into R: httr provides a user-friendly interface for executing HTTP methods (GET, POST, PUT, etc.) and provides support for modern web authentication protocols (OAuth 1.0, OAuth 2.0). RCurl is a lower-level package that provides a closer interface between R and the libcurl C library, but is less user-friendly. curl is another libcurl client that provides the `curl()` function as an SSL-compatible replacement for base R's `url()`. For dynamically generated webpages (i.e., those requiring user interaction to display results), RSelenium can be used to automate those interactions and extract page contents. For websites serving insecure HTTP (i.e., using the "http" not "https" prefix), most R functions can extract data directly, including `read.table` and `read.csv`; this also applies to functions in add-on packages such as `jsonlite::fromJSON` and `XML::xmlParse`.
That cuts several paragraphs to one and builds from what you might be trying to do to point the user toward tools for achieving it.
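To make the task-driven framing concrete, here is a minimal sketch of how the tools named in that paragraph might be used together. It is only an illustration, not text proposed for the task view: the helper names `fetch_json` and `fetch_csv` are hypothetical, and any URL passed to them is an assumption on the caller's side.

```r
# Hypothetical helpers for illustration; they assume the httr, jsonlite,
# and curl packages are installed. Nothing runs until they are called.

# Fetch a JSON document over HTTP(S) with httr and parse it with jsonlite.
fetch_json <- function(url) {
  resp <- httr::GET(url)
  httr::stop_for_status(resp)  # raise an R error on 4xx/5xx responses
  txt <- httr::content(resp, as = "text", encoding = "UTF-8")
  jsonlite::fromJSON(txt)
}

# Read a CSV file through curl::curl(), which returns a connection
# that can be used where base R's url() would otherwise be used.
fetch_csv <- function(url) {
  con <- curl::curl(url)
  read.csv(con)  # read.csv opens the connection and closes it when done
}

# Example calls (commented out because the URLs are placeholders):
# dat <- fetch_json("https://example.org/data.json")
# tab <- fetch_csv("https://example.org/data.csv")
```

The split into two helpers mirrors the paragraph: httr for full HTTP requests, and a plain connection from curl when an ordinary reader function is all that is needed.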
So what does this mean? I'd suggest we start by cleaning up the current text a bit to be more task-driven rather than package-driven. This should make the first half of the task view more helpful.
The second half of the task view is more difficult, though, and I'm not sure it's reasonable to try to split it into two subsections or separate task views (one for APIs/services and one for data). The line between these is really blurry. Twitter is a web service (someone might want to create an R package that posts to Twitter using the twitteR package), but it is also a data source: social and computer scientists scrape network graphs all the time and treat twitteR as data. Some web tech is clearly data (a lot of the ecology and biology packages are) and some is clearly a service (Rmonkey, qualtrics), but there's a murky middle ground in between and I'm not sure the split is so easy. It may make sense to separate all of the second half into its own task view, though.
@ashander Agree that this task view needs splitting.
@leeper I like that proposed summary paragraph - I think we have some of that, but could use more for sure.
If it sounds like the listicle of packages part of this repo should be split off, I don't think that needs to be a task view per se - it could just be a github repo and not get sent to cran. Maybe even a gh-pages site.
I agree with @leeper that it's a blurry distinction between services and data sources. Though if we put the 2nd half in a separate repo, we could attempt to split those at least somewhat, or have tags or something like that to support the Venn diagram nature of the packages.
Particularly for users attempting to discover ways of accessing machine-readable data through R
@ashander how can this be made easier? good enough to have a list of them with short description? or something else
Thanks for both your thoughts, and being receptive to the suggestion.
It seems there's consensus to split the first half off, and perhaps expand the prose section and improve the breakdown among packages on the list in the first half. For the second half, the primary goal, in my view, of splitting off the data access packages is discoverability. This would be (primarily) facilitated by a more descriptive taskview name and (secondarily) a more curated list of packages with descriptive subheaders.
So, responding to @sckott's query, there are a few ways that simply having a more descriptively named repo could help:
The same good outcomes might flow from having a separate view for api wrappers for services folks use for data science (@leeper's cloudr repos, other AWS stuff on here, DO wrapper, bigquery, etc) or web stuff (analytics apis), which would make a case for 3. On the other hand -- maybe two views is enough. In that case, I'd actually suggest repos that fall in these categories are a better fit with the first half "Web Tech" view.
One possible breakdown to two views:
This view focuses on open data (sometimes requiring an account) but also for convenience lists paid services
want to chime in @karthik @cboettig - no worries if no opinion on this
Great discussion, I think @ashander makes some very good points about the value of splitting things up. I think the split into data vs tech is reasonable. I like @leeper 's suggestions about better descriptive text as well.
I do agree that some packages, like twitteR, are very much in both camps -- web tech from the perspective of posting information to a particular service, but data from the perspective of scraping a given pool of open data. No problem: I suggest such packages be listed in both -- CRAN taskviews are focused on tasks after all, and this means some packages already appear in multiple taskviews.
Do you need CRAN's blessing to split the taskview?
I can ask, i imagine it's no problem
Is it really worth splitting the task view itself as opposed to organizing it better?
I'm not sure we need to split, but I do like the idea of an open data task view, which might also allow for a single place to keep track of data-only packages that are not web-based.
@leeper
data-only packages that are not web-based.
do you mean pkgs that come with data, and don't make web requests for data?
@sckott Yes, I'm thinking like https://github.com/hadley/nycflights13 and https://github.com/jennybc/gapminder
I outlined the main rationale for splitting above, which comes down to matching the title to the task of data access.
Better organization would help, and makes sense especially for the web tech side of things. For data access, I think the current organization is fine but the name of the task view doesn't describe the task, making it less visible to users
@karthik worth splitting? there's not really any cost to splitting - i guess in a new repo that splits off we'd lose the github stars, but other than that...
@leeper that sounds good, to include non-web data sources - BUT i wonder where that stops though, cause lots of packages have at least a small dataset or two. So would focus be just on data only packages? Primary only data?
Yes, there would have to be a clear set of boundaries. An open data task view probably shouldn't describe every trivial dataset available in every package, but instead focus on:
@leeper Okay, sounds good
@ashander @leeper I think we should go ahead with a split, with
web tech and open data
we should definitely do as leeper said and make more useful narrative summaries of each section
and it's okay if the same package is mentioned in both
open data started https://github.com/ropensci/opendata
Cool. I'm happy to contribute a draft PR here implementing the split outlined above. I could do this next week.
Is there any desire or need to keep git history for those lines that will end up only in open data ?
Of course, I have no issue if someone else wants to draft it either. In any case, I've edited the discussion goals checklist above to reflect decision. Now there's only
@ashander maybe we should copy contents of this into the new repo with all git history, etc, but with the new repo name, and diverge from there for open data (and remove anything from here as needed). @leeper does that make sense? Or another way?
Yes that seems fine. I think in practice split could be made as two PRs
To retain history to new repo, basing the second of these on a clone of this repo makes sense
- Jaime
@ashander sounds good, PR here to remove - but after doing PR to new repo with open data content (or clone)
:+1: Nice to keep the full history in both repos to keep track of the fork.
I just initiated a pull request to the opendata repo that basically contains most of the content from here (except the web frameworks section and some packages that are strictly services not data) and adds a bunch of data packages. It still needs to be cleaned up quite a bit, but it should get us started. Then we start trimming this repo down to just be webtech packages.
I also opened some issues today on this repo that are probably actually just data packages that belong on the other repo, so I can migrate those later.
:+1:
I'm working on the rewrite of this task view now and will send a commit shortly.
Thanks @leeper. Clearly last week's todo list had slipped into this week.
If it's not a duplication of effort, I'll put in some work on cleaning up the new open data task view
- Jaime
@ashander Sounds great!
Okay, I've committed a draft basically trying to follow the outline you suggested, @ashander. I'm not sure I was wholly successful because the first list seems too long to me and the last section is just a list of miscellaneous stuff that's not really organized in any way (and we can probably cut some of the packages b/c they'll be represented on OpenData but wasn't sure in some cases). I cut most of the previous headers because almost all of the sections would have only had a few bullet points (given that they were mostly data packages that are now gone).
@sckott Thoughts?
This looks really good!
@ashander great points! I've made some further revisions - basically adding some headings and moving things around. I also dropped some more packages from the final list and consolidated all of the ones related to scholarly metadata.
The one big task we need to do is clean up the "Web Frameworks" section. Presumably it should be (1) Shiny (and uses thereof), (2) OpenCPU, (3) server tools, (4) miscellaneous?
Awesome. I have some real work to do on opendata to get it up to this standard.
That breakdown seems good in theory, but within current "web and server frameworks" sections 1) shiny would include shiny and one other package, 2) OpenCPU would only include itself. I'm not sure of tools vs misc from a quick look.
Here's a concept that avoids that singleton issue:
rename section to "Server-side tools and frameworks" then have a section for full-featured frameworks (opencpu, shiny), simple interfaces to basic protocols (servr, websockets, httpuv), and misc (everything else)
@ashander, I like your suggested breakdown. We could also have Shiny as a section with discussion of packages to customize it (shinythemes, shinybootstrap2, etc.), and then document a few examples of shiny packages on CRAN (e.g., radiant, ggvis, etc.) or elsewhere.
That still leaves opencpu by itself. I haven't worked with it enough to write a substantial text describing it. @jeroenooms, would you be able to give us a blurb for that?
This taskview provides a fantastic list of resources. From a users' perspective though, I feel it may be hard to 'get' what is included in the view from its title and description. I think it would be more useful and visible to cran users, github searchers, and those googling if it were split. I'm filing this issue as a peg to discuss this idea.
case to split
Currently the view contains two level-2 html headers, "Tools for Working with the Web from R" and "Data Sources on the Web Accessible via R". Further, the latter category includes both R packages designed to work with free or paid services (e.g. Google Analytics, Dropbox, Box, Big Query, Yhat) and packages designed to retrieve free or public data from government, industry or NGO sources.
I'd argue that this taskview is currently trying to do 3 things:
Particularly for users attempting to discover ways of accessing machine-readable data through R, amalgamation of category 3 into a taskview called Web Technologies is not optimal. Thus, although it'd be some effort, I argue splitting the taskview would be good for ropensci's core audience -- scientists.
strategy for splitting
outcomes
contribution
If community agrees a split is desirable, I can draft a PR implementing one.