ProjectMOSAIC / mosaic

Project MOSAIC R package
http://mosaic-web.org/
94 stars 25 forks source link

stringsAsFactors = FALSE not documented in read.file() #502

Open dtkaplan opened 8 years ago

dtkaplan commented 8 years ago

I propose to document this so that later changes will not be made without discussion. Reading in strings as strings simplifies tasks such as data cleaning and I rely on it in the Data Computing book.

rpruim commented 8 years ago

See #474. If you are using read.file(), let us know ASAP. It is slated for removal from the package with the idea that we should get out of the data import/export business and leave that to the newer packages dedicated to the task.

I may simply move it to fastR or perhaps to a new package I'm creating with miscellaneous utilities. Currently named rjp (renaming suggestions welcomed), it only contains two functions at the moment:

nicholasjhorton commented 8 years ago

Just FYI, I've been very happy with the qdapRegex package for things like phone number, zip code extraction, etc.

http://trinker.github.io/qdapRegex/

rpruim commented 8 years ago

I took a quick look. I'm not crazy about the API for phone numbers since (a) it returns a list of vectors, (b) it doesn't standardize the phone number format, and (c) it uses a functions called rm_* to do extraction. (Seems to me like extraction would be the more common use case than deletion.)

But my use case is different, I'm not extracting phone numbers from large chunks of text. I'm trying to standardize the phone numbers entered by users in a google form. The approach is pretty simple: remove all the non digits, convert the rest to a numeric, and use some arithmetic to parse into area code, exchange and number. At the moment there is only one output format, but once those three elements are in hand, it would be easy to provide more flexibility there.

For more general scraping, I think the gdapRegex package might be more useful.

dtkaplan commented 8 years ago

Thanks for the reminder about read.file(). I should have remembered that discussion about leaving file reading to packages designed specifically for file reading. I'll remove the mosaic::read.file() from Data Computing.

But what to do?

I propose to do the following, which isn't really a mosaic issue but would benefit from your critical eye. I'll define read_file() in the DataComputing package. It will look very much like read.file(). I may arrange it to use data.table::fread() internally rather than read_csv() and then explicitly convert to a dplyr-friendly format with as.tbl(). In the data scraping chapter I refer to read_csv() and fread(), so advanced users will get a heads up that read_file() is connected to DataComputing.

I hate to add in yet another file-reading function, but neither read_csv() nor fread() do quite what I need.

rpruim commented 8 years ago

I'm with you on this one and am happy to work at improving read.file() or morphing it into read_file(). I like to have something simple that takes care of lots of cases and introduce other tools to handle the challenging cases or to have more control.

By the way, I modified read.file() to use read_csv() in some cases, but I too have found it a bit brittle. I'd be interested in exploring the data.table::fread() + format conversion option. I think I prefer that the result be a straight up data frame. tbl_dfs are confusing and unexpected if you aren't using dplyr and dplyr handles them just fine. I don't find the pretty printing of tbl_dfs to be enough of a reason to default that way. (In fact, the pretty printing often annoys me, especially when I really do want to see all 15 rows of my little data set.)

So the main question is where does the improved read.file() belong? More and more, I'm sold on the value of small packages with tight focus. Do we have anything else that this belongs with? I would suggest NOT putting into the Data Computing package -- its use is largely independent of that course, or even that material. It doesn't really belong in fetch; that's more about pulling from web sources.

I'd be happy enough to start a new package with only this function for now and add too it if/when we have things to add. (Perhaps things like my phone number extractor and other similar tools used to tidy up the data after reading it in could also go here.)

dtkaplan commented 8 years ago

A new, small package would work well for me.

I send the Data Computing book to the printer before Monday, so I have to settle on names before then.

My revised plan: I'll keep using read.file() in the text. Currently, I export it when DataComputing is loaded so that I don't have to preface it with mosaic::. I'll change DataComputing to export the new-and-improved read.file() as soon as that's available, and the text will automagically still be valid.

If we can come up with a name for the file-reading package (teachingFiles maybe, or just pop it in mosaicData), then I'll make the trivial change in the data scraping chapter of Data Computing to switch the description of read.file() to refer to the new package.

rpruim commented 8 years ago

I think we can do better than that. If we are only putting one function into a package, we should be able to do that this weekend. We can put the current read.file() there now and improve upon it from there.

I'll think about a name and create the package in our ProjectMOSAIC org. I'd like to move read.file() in a way that maintains its git commit history, so I'll see if I can figure that out.

nicholasjhorton commented 8 years ago

I just wanted to chime in on my general sense that having a multiplicity of packages is a bad idea if the goal is to foster usage by a broader population. I certainly won't stand in the way, but suggest exploring something like the:

mosaicTeach

package that pulls these various parts and pieces together.

Personally, I think that having the mosaic package provide this umbrella would be preferable to the profusion of packages. However, I acknowledge that this may be a minority opinion and may not even be feasible due to CRAN restrictions.

rpruim commented 8 years ago

Packages can depend on other packages (like mosaic does on mosaicData), so it is in theory always possible to create meta-packages for that. But CRAN checking a large package takes forever, and we have to go through the entire process every time we make the smallest change to repair or improve something.

I don't think mosaicTeach is a good name for a package containing read.file() since it is not primarily a function that serves "teaching". If there were a mosaicTeach package, it should include things targeted at instructors for making it easier to teach (perhaps by pulling in other packages all in one go, as Nick is suggesting).

read.file() belongs in a package dedicated to data ingestion, tidying, etc. The problem is that there already are packages doing this (and using up good names).

Putting read.file() into mosaicData (as Danny suggested) would be another option. The package would then expand to include both data sets and ways to read them from files and clean up the resulting data frame. That's a reasonable scope for a package, but I think I'd rather have the mosaicData package just have data sets. (We could always decide later to have mosaicData import the new package, if we think one should always have the functions when one has the data sets.)

Hmm.. should derivedFactor() move to this new package as well?

nicholasjhorton commented 8 years ago

I hope not.

Just my $0.02,

Nick

On Aug 7, 2015, at 10:26 AM, Randall Pruim notifications@github.com wrote:

Hmm.. should derivedFactor() move to this new package as well?

Nicholas Horton Professor of Statistics Department of Mathematics and Statistics, Amherst College Box 2239, 31 Quadrangle Dr Amherst, MA 01002-5000 https://www.amherst.edu/people/facstaff/nhorton

rpruim commented 8 years ago

We left this conversation sort of dangling. @dtkaplan, what did you end up doing with read_file()? Currently, read.file() remains in mosaic.