Split the taskview into Web Data Sources, Web Services, and Web Tools

ashander commented 9 years ago

This taskview provides a fantastic list of resources. From a user's perspective, though, I feel it may be hard to 'get' what is included in the view from its title and description. I think it would be more useful and visible to cran users, github searchers, and those googling if it were split. I'm filing this issue as a peg to discuss this idea.

case to split

Currently the view contains two level-2 html headers, Tools for Working with the Web from R and Data Sources on the Web Accessible via R. Further, the latter category includes both R packages designed to work with free or paid services (e.g. Google Analytics, Dropbox, Box, Big Query, Yhat) and packages designed to retrieve free or public data from government, industry or NGO sources.

I'd argue that this taskview is currently trying to do 3 things:

list packages that provide tools to work with web infrastructure (Web Tools)
list packages that use the tools to work with service provider apis (Web Services)
list packages that use the tools to work with data-provider apis (Web Data Sources)

Particularly for users attempting to discover ways of accessing machine-readable data through R, amalgamation of category 3 into a taskview called Web Technologies is not optimal. Thus, although it'd be some effort, I argue splitting the taskview would be good for ropensci's core audience -- scientists.

strategy for splitting

splitting the tools off is easy -- that's the first header
the second split is harder -- is there an easy heuristic to split these (e.g., is the API wrapped by the package a paid or tiered service?)

outcomes

[x] is a split desirable?
[x] split to 2 or 3 (if yes above) -- split to two web tech and open data
[x] what strategy to split 2 from 3? (if so) -- no need, overall strategy is as outlined below. (Overlap is ok)

contribution

If community agrees a split is desirable, I can draft a PR implementing one.

sckott commented 9 years ago

ping @leeper

leeper commented 9 years ago

Sorry, this ended up being a really long comment

I think it's definitely correct that the task view could be better. Given the number of packages it lists, I'd been wondering whether a simple listing of packages even makes sense. Many of the other task views describe general categories of analysis and then list packages that can be helpful in those specific domains, which I think is something we should consider if we're going to do a larger reform.

On the specific proposal, I think it's right that the task view has multiple sections. Looking at it now, "Tools for Working with the Web from R" is a pretty bad subheading for the first half of the task view. Basically, we have low level core libraries (RCurl, curl, httr, XML, the json packages, RSelenium), higher level core libraries (rvest, httr, selectr), and server-side stuff (Shiny, Rook, etc.). I think we could break this down more clearly and be slightly more opinionated. For example, instead of listing RCurl, httr, and curl as separate points, we should say:

Scraping web pages and importing data into R: httr provides a user-friendly interface for executing HTTP methods (GET, POST, PUT, etc.) and provides support for modern web authentication protocols (OAuth 1.0, OAuth 2.0). RCurl is a lower-level package that provides a closer interface between R and the libcurl C library, but is less user-friendly. curl is another libcurl client that provides the curl() function as an SSL-compatible replacement for base R's url(). For dynamically generated webpages (i.e., those requiring user interaction to display results), RSelenium can be used to automate those interactions and extract page contents. For websites serving insecure HTTP (i.e., using the "http" not "https" prefix), most R functions can extract data directly, including read.table and read.csv; this also applies to functions in add-on packages such as jsonlite::fromJSON and XML::xmlParse.

That cuts several paragraphs to one and builds from what you might be trying to do to point the user toward tools for achieving it.

So what does this mean? I'd suggest we start by cleaning up the current text a bit to be more task-driven rather than package-driven. This should make the first half of the task view more helpful.

The second half of the task view is more difficult, though, and I'm not sure it's reasonable to try to split it into two subsections or separate task views (one for APIs/services and one for data). The line between these is really blurry. Twitter is a web service - someone might want to create an R package that posts to twitter using the twitteR package - but it is also a data source - social and computer scientists want to scrape network graphs all the time and treat twitteR as data. Some web tech is clearly data (a lot of the ecology and biology packages are) and some are clearly services (Rmonkey, qualtrics), but there's a murky middle ground in between and I'm not sure the split is so easy. It may make sense to separate all of the second half into its own task view, though.

sckott commented 9 years ago

@ashander Agree that this task view needs splitting.

@leeper I like that proposed summary paragraph - I think we have some of that, but could use more for sure.

If it sounds like the listicle of packages part of this repo should be split off, I don't think that needs to be a task view per se - it could just be a github repo and not get sent to cran. Maybe even a gh-pages site.

I agree with @leeper that it's blurry distinction between services and data sources. Though if we put the 2nd half in a separate repo, we could attempt to split those at least somewhat, or have tags or something like that to support the venn diagram nature of the packages.

Particularly for users attempting to discover ways of accessing machine-readable data through R

@ashander how can this be made easier? good enough to have a list of them with short description? or something else

ashander commented 9 years ago

Thanks for both your thoughts, and being receptive to the suggestion.

It seems there's consensus to split the first half off, and perhaps expand the prose section and improve the breakdown among packages on the list in the first half. For the second half, the primary goal in my view of splitting off the data access packages is discoverability. This would be (primarily) facilitated by a more descriptive taskview name and (secondarily) a more curated list of packages with descriptive subheaders.

So, responding to @sckott's query, there are a few ways that simply having a more descriptively named repo could help:

a taskview with a title that highlights data access will increase google and other visibility
it makes reaching out to other taskview authors to link back to this view easier, and the links more useful to users of those taskviews -- for example, the view @ucfagis maintains on analyzing ecological data could link to these data sources. If I'm looking for data access, I'm more likely to click through to a link called "open data access" than "web technologies"
similarly, a link from the end of rOpenSci's packages repo (https://ropensci.org/packages/) to the "open data access" task view on CRAN is more likely to appeal if none of the rOpenSci packages fit the bill

The same good outcomes might flow from having a separate view for api wrappers for services folks use for data science (@leeper's cloudr repos, other AWS stuff on here, DO wrapper, bigquery, etc) or web stuff (analytics apis), which would make a case for 3. On the other hand -- maybe two views is enough. In that case, I'd actually suggest repos that fall in these categories are a better fit with the first half "Web Tech" view.

One possible breakdown to two views:

Web Tech task view

low level core libraries (RCurl, curl, httr, XML, the json packages, RSelenium),
higher level core libraries (rvest, httr, selectr)
server-side stuff (Shiny, Rook, etc.).
cloud data / analysis apis (AWS, dropbox, box, google bigquery, ML as a service, DS etc)
website analytics / web marketing apis
geo services (eg lawn, google maps etc -- some overlap with below)
social (some overlap with below)
collab: slack etc

Open data access task view

This view focuses on open data (sometimes requiring an account) but also for convenience lists paid services

agriculture
astro
chem
data depos (maybe broken out?)
data collection: survey monkey etc
earthsci
eeb
econ
finance
genes
geocoding
gov
lit
maps
media
ncbi
news
other
public health
social +google trends
wiki

On Tue, Apr 14, 2015 at 7:36 AM, Scott Chamberlain <notifications@github.com> wrote:

@ashander https://github.com/ashander Agree that this task view needs splitting.

@leeper https://github.com/leeper I like that proposed summary paragraph - I think we have some of that, but could use more for sure.

If it sounds like the listicle of packages part of this repo should be split off, I don't think that needs to be a task view per se - it could just be a github repo and not get sent to cran. Maybe even a gh-pages site.

I agree with @leeper https://github.com/leeper that it's blurry distinction between services and data sources. Though if we put the 2nd half in a separate repo, we could attempt to split those at least somewhat, or have tags or something like that to support the venn diagram nature of the packages.

Particularly for users attempting to discover ways of accessing machine-readable data through R

@ashander https://github.com/ashander how can this be made easier? good enough to have a list of them with short description? or something else

— Reply to this email directly or view it on GitHub https://github.com/ropensci/webservices/issues/205#issuecomment-92880902 .

sckott commented 9 years ago

want to chime in @karthik @cboettig - no worries if no opinion on this

cboettig commented 9 years ago

Great discussion, I think @ashander makes some very good points about the value of splitting things up. I think the split into data vs tech is reasonable. I like @leeper 's suggestions about better descriptive text as well.

I do agree that some packages, like twitteR, are very much in both camps -- web tech from the perspective of posting information to a particular service, but data from the perspective of scraping a given pool of open data. No problem: I suggest such packages be listed in both -- CRAN taskviews are focused on tasks after all, and this means some packages already appear in multiple taskviews.

Do you need CRAN's blessing to split the taskview?

sckott commented 9 years ago

Do you need CRAN's blessing to split the taskview?

I can ask, i imagine it's no problem

karthik commented 9 years ago

Is it really worth splitting the task view itself as opposed to organizing it better?

leeper commented 9 years ago

I'm not sure we need to split, but I do like the idea of an open data task view, which might also allow for a single place to keep track of data-only packages that are not web-based.

sckott commented 9 years ago

@leeper

data-only packages that are not web-based.

do you mean pkgs that come with data, and don't make web requests for data?

leeper commented 9 years ago

@sckott Yes, I'm thinking like https://github.com/hadley/nycflights13 and https://github.com/jennybc/gapminder

ashander commented 9 years ago

I outlined the main rationale for splitting above, which comes down to matching the title to the task of data access.

Better organization would help, and makes sense especially for the web tech side of things. For data access, I think the current organization is fine but the name of the task view doesn't describe the task, making it less visible to users


sckott commented 9 years ago

@karthik worth splitting? there's not really any cost to splitting - i guess in a new repo that splits off we'd lose the github stars, but other than that...

@leeper that sounds good, to include non-web data sources - BUT i wonder where that stops though, cause lots of packages have at least a small dataset or two. So would focus be just on data only packages? Primary only data?

leeper commented 9 years ago

Yes, there would have to be a clear set of boundaries. An open data task view probably shouldn't describe every trivial dataset available in every package, but instead focus on:

data discovery (figshare, dvn, dryad, OAIHarvester, etc.)
data archiving (some of the same tools as the previous point)
data-only packages (like the ones I cited previously)
open (web-based) data available via packages (from, for example, rOpenSci, rOpenGov, rOpenHealth, etc.)
others?

sckott commented 9 years ago

@leeper Okay, sounds good

sckott commented 9 years ago

@ashander @leeper I think we should go ahead with a split, with

web tech and open data

we should definitely do as leeper said and make more useful narrative summaries of each section

and it's okay if the same package is mentioned in both

sckott commented 9 years ago

open data started https://github.com/ropensci/opendata

ashander commented 9 years ago

Cool. I'm happy to contribute a draft PR here implementing the split outlined above. I could do this next week.

Is there any desire or need to keep git history for those lines that will end up only in open data ?


ashander commented 9 years ago

Of course, I have no issue if someone else wants to draft it either. In any case, I've edited the discussion goals checklist above to reflect the decision. Now there's only

[ ] draft the split into web tech and open data along lines discussed above

sckott commented 9 years ago

@ashander maybe we should copy contents of this into the new repo with all git history, etc, but with the new repo name, and diverge from there for open data (and remove anything from here as needed). @leeper does that make sense? Or another way?

ashander commented 9 years ago

Yes that seems fine. I think in practice split could be made as two PRs

to current github repo removing open data stuff
to new github repo adding open data stuff

To retain history in the new repo, basing the second of these on a clone of this repo makes sense
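A minimal sketch of that clone-then-diverge approach, run against throwaway local repos rather than the real ones. The directory names, the file name taskview.md, and the commit messages are all assumptions made up for illustration; nothing here was run as part of this thread.

```shell
#!/bin/sh
set -eu

# Placeholder identity so commits work in a clean environment (hypothetical values).
export GIT_AUTHOR_NAME="Example Editor" GIT_AUTHOR_EMAIL="editor@example.com"
export GIT_COMMITTER_NAME="Example Editor" GIT_COMMITTER_EMAIL="editor@example.com"

work=$(mktemp -d)

# Stand-in for the current repo, holding both kinds of content in one file.
git init -q "$work/webservices"
cd "$work/webservices"
printf 'Web tools\nOpen data packages\n' > taskview.md
git add taskview.md
git commit -q -m "Combined task view"

# Step 1: base the new repo on a clone, so every earlier commit comes along,
# then trim it down to the open data content.
git clone -q "$work/webservices" "$work/opendata"
cd "$work/opendata"
printf 'Open data packages\n' > taskview.md
git commit -q -am "Keep only open data content"

# Step 2: only afterwards, remove the open data content from the current repo.
cd "$work/webservices"
printf 'Web tools\n' > taskview.md
git commit -q -am "Remove open data content"

# The new repo's log still starts from the original combined commit.
git -C "$work/opendata" log --oneline
```

Both repos end up sharing the same root commit, so the point where they diverged stays visible in either history.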

- Jaime


sckott commented 9 years ago

@ashander sounds good, PR here to remove - but after doing PR to new repo with open data content (or clone)

leeper commented 9 years ago

:+1: Nice to keep the full history in both repos to keep track of the fork.

leeper commented 9 years ago

I just initiated a pull request to the opendata repo that basically contains most of the content from here (except the web frameworks section and some packages that are strictly services not data) and adds a bunch of data packages. It still needs to be cleaned up quite a bit, but it should get us started. Then we start trimming this repo down to just be webtech packages.

I also opened some issues today on this repo that are probably actually just data packages that belong on the other repo, so I can migrate those later.

sckott commented 9 years ago

:+1:

leeper commented 9 years ago

I'm working on the rewrite of this task view now and will send a commit shortly.

ashander commented 9 years ago

Thanks @leeper clearly last week's todo list had slipped into this week

If it's not a duplication of effort, I'll put in some work on cleaning up the new open data task view

- Jaime


leeper commented 9 years ago

@ashander Sounds great!

leeper commented 9 years ago

Okay, I've committed a draft basically trying to follow the outline you suggested, @ashander. I'm not sure I was wholly successful because the first list seems too long to me and the last section is just a list of miscellaneous stuff that's not really organized in any way (and we can probably cut some of the packages b/c they'll be represented on OpenData but wasn't sure in some cases). I cut most of the previous headers because almost all of the sections would have only had a few bullet points (given that they were mostly data packages that are now gone).

@sckott Thoughts?

ashander commented 9 years ago

This looks really good!

For the first list, could you structure into subheads for 1) http related, 2) low-level parsing structured data into R (XML, JSON), 3) Other: Auth, Jscript, email, misc?
The Web Services section (common tasks using web services) seems like it could be more structured. Possible additions: 1) cloud compute, 2) cloud storage, 3) document sharing, 4) geospatial
I agree the last list looks like it will slim down if we remove some listed in opendata that don't have a strong hook to web services

leeper commented 9 years ago

@ashander great points! I've made some further revisions - basically adding some headings and moving things around. I also dropped some more packages from the final list and consolidated all of the ones related to scholarly metadata.

The one big task we need to do is clean up the "Web Frameworks" section. Presumably it should be (1) Shiny (and uses thereof), (2) OpenCPU, (3) server tools, (4) miscellaneous?

ashander commented 9 years ago

Awesome. I have some real work to do on opendata to get it up to this standard.

That breakdown seems good in theory, but within current "web and server frameworks" sections 1) shiny would include shiny and one other package, 2) OpenCPU would only include itself. I'm not sure of tools vs misc from a quick look.

Here's a concept that avoids that singleton issue:

rename section to "Server-side tools and frameworks" then have a section for full-featured frameworks (opencpu, shiny), simple interfaces to basic protocols (servr, websockets, httpuv), and misc (everything else)

leeper commented 9 years ago

@ashander, I like your suggested breakdown. We could also have Shiny as a section with discussion of packages to customize it (shinythemes, shinybootstrap2, etc.), and then document a few examples of shiny packages on CRAN (e.g., radiant, ggvis, etc.) or elsewhere.

That still leaves opencpu by itself. I haven't worked with it enough to write a substantial text describing it. @jeroenooms, would you be able to give us a blurb for that?
