timevis can leak (potentially) private information

greg-minshall commented 6 years ago

the entire dataframe provided as the 'data' argument to timevis (same for 'groups' argument) ends up in the resulting HTML file, including columns not actually used by timevis/vis.js's Timeline. if the user has passed an existing dataframe, this can cause information leakage.

daattali commented 6 years ago

A bit more context:

In order to create a timevis widget, the user calls the timevis(data) function in R, where data is a data.frame containing the data to show, and it needs to have some specific columns in order for the visualization to work. For example:

data <- data.frame(content = c("event 1", "event 2"),
                   start = c("2016-01-10", "2016-01-12"),
                   end = c("2016-01-13", "2016-01-14"))
timevis::timevis(data)

results in

Since this package is a binding to a third party javascript library and that javascript library is responsible for building the visualization, the data dataframe that is given to the timevis() R function gets passed directly to the javascript layer. This dataframe is accessible in the javascript console as a json object. This means that if you have additional columns in the dataframe that are not actually needed to build the visualization, they will still be available, hence the potential "information leakage".

I want to hear from others if they think this is a big security issue and if timevis should explicitly remove any non standard columns before passing the data to javascript. I don't love that idea because 1. then the package will have to play catchup with visjs to make sure we always have a full list of all the supported columns, and 2. it's possible that the names of columns is not a finite list, maybe there are some options that allow the user to define custom columns that are used (the documentation for visjs is long with a lot of options and I haven't gone through it in detail recently).

I think simply documenting the data parameter to say that the entire dataframe will be visible to the client is sufficient. Would like to hear some opinions.

I wonder if other htmlwidgets whitelist columns in the data they use?

timelyportfolio commented 6 years ago

This is entirely my opinion, but for what it is worth...

networkD3 whitelists columns but not for security, and I often find myself adding back lost data. As far as I know, no other htmlwidgets constrain passed data. leaflet might (but again unintentionally I believe) depending on the method. While users should be aware that all data might get passed as json, I don't think the htmlwidget author should bear responsibility for determining what columns might or might not be important and what columns might or might not contain sensitive information.

daattali commented 6 years ago

@greg-minshall My thoughts are along the same as what @timelyportfolio said. For now I would feel better about just adding a few words in the parameter documentation rather than making assumptions about and changing the data being passed in

greg-minshall commented 6 years ago

hi. so, i sort of convinced myself other than the two of you. let me see if i can explain, both my proposed solution, and (harder) my rationale. my proposed solution:

timevis() drops unknown columns from the input dataframe. but, it does so "noisily", i.e., by issuing a warning. the warning includes a comment that hitherto unknown (to timevis()) columns will be allowed through if they are named in the (brand new) "passcolumns=" parameter.
the "passcolumns=" parameter, nominally NULL, can contain a list ("passcolumns=c("mightynew", "newest")) naming additional columns to allow through.

the warning in #1 is to (try to) prevent mystification of the user when their "mightynew" column doesn't seem to be having any effect. "passcolumns=" then decouples, to some extent, releases of timevis() from upgrades in vis.js() Timeline.

my reasoning is that security is a hard problem. while we can't fix the problem, we can semi-plug one hole, maybe stopping some very embarrassing leak from happening in the future. (if we do prevent it, we'll never know it; if we don't, we'll possibly hear about it.) on the other hand, every small possibility increases the overall likelihood of security failures.

daattali commented 6 years ago

I do understand your viewpoint eg. this is a security issue that timevis can potentially take some measure against. But in my opinion this is overstepping what timevis is, it's simply a visualization, if someone has sensitive data then it's on the user to not pass that sensitive data around if not needed, to any function. You shouldn't pass it to R functions if it's truly sensitive information that is not needed for the function you're calling, because every function you call is a blackbox inside and you don't know what it does with it (unless you actually look at the source code). To me it doesn't seem like it makes sense for such a simplistic project that's an R binding to a JS library to try to solve this problem.

If others will agree that it's the correct thing to do, I'm open to changing my mind, which is why I specifically asked on Twitter for people to comment here if they have an opinion.

Or, alternatively, if you'll find several other htmlwidgets that also have that line of thinking that variables should be whitelisted for security reasons, then it would also prompt me to reconsider. But the argument of a security hole sounds too weak for me because it seems like the user should just not use a dataframe with information that they really don't want to share.

I'll leave this issue open for a while to see if others chime in!

greg-minshall commented 6 years ago

Dean, @timelyportfolio,

one thing to keep in mind is that, by definition, the user we are protecting is the user who (after the warning is on the man page) is, in fact, doing the wrong thing, making a mistake. the user who reads, understands, and pays attention to the warning is not the person we are trying to protect. rather, it is the person who, for whatever reason, doesn't do all those things. but, you know how we are when we are trying to get something to work, we read this and that, look at google, stackexchange, etc. then, we try this and that, possibly forgetting -- certainly at my age! -- some things we previously read while doing so. the programmer who reads all the documentation, follows all the guidelines, imho, is a bit like homo economicus.

in a scenario, the department clerk is told to put together a presentation for the next City Council meeting of what how the department employees are scheduled for the next six months. remembering that Janet did something similar, he asks and she sends him some code, using timevis. working on that, the clerk notices Janet gets the employee task database using the office API (which maybe involves, behind the curtains, a few joins), and then forms a new datafram with "name", "group", etc. but, the clever clerk says, i already have a dataframe, i'll just rename a few columns ("Name" to "name"; "Section" to "group", etc.), pass it to timevis(), and wow! since the meeting is public, the resulting web page is published online for the council members, and interested members of the public, to see. to see (well, for hackers), including, the non-timevis, non-vis.js columns such as mobile phone numbers, salary, etc.

timevis is a really nice package, and hopefully it will get an expanding user base. but, the likelihood of a leak is, i would guess, proportional to the number of users. (of course, i already did a leak; luckily not that horrible a one, just some e-mail address that, presumably, nobody noticed.)

so, obviously it's you decision, but i still offer up my solution (which i'm happy to code up): new columns can be passed, but they must be explicitly specified as new columns.

greg-minshall commented 6 years ago

i think i did a boo boo. reopening.

daattali commented 6 years ago

You did in fact boo boo, let's keep it open :) I completely understand your argument. However, I still personally think that would be overstepping the boundary of the package, and that if someone has sensitive info the burden of being careful with it and not giving it away to other services is on them, not on the service to try to throw away what's given to it. This is clearly not a black and white matter and involves some personal judgment on what the solution (or no solution) is, so it would help if I would see that other htmlwidgets do take this into consideration. I haven't looked, but I don't think they do, and I imagine it's from similar reasons to mine. But I may be wrong .

daattali commented 5 years ago

As mentioned in PR #42 , instead of keeping this open forever, I'm closing this issue as a non-fix

daattali / timevis

timevis can leak (potentially) private information #44

hi. so, i sort of convinced myself other than the two of you. let me see if i can explain, both my proposed solution, and (harder) my rationale. my proposed solution:

the "passcolumns=" parameter, nominally NULL, can contain a list ("passcolumns=c("mightynew", "newest")) naming additional columns to allow through.