benmarwick / JSTORr

Simple text mining of journal articles from JSTOR's Data for Research service

Subscript out of bounds when opening JSTOR DFR set using JSTOR_unpack1grams, JSTOR_unpack2grams #20

Closed gleemie closed 10 years ago

gleemie commented 10 years ago

Hi there -

I just installed JSTORr and loaded my JSTOR dataset to try out your promising package. I wanted to let you know about a bug. Your unpackNgram functions rely on there being a citations.csv file. However, the CSV dataset I just received from JSTOR contains a citations.tsv file, not a citations.csv file. (The rest of the data is fine as CSV.) I think the reason is that the JSTOR abstracts sometimes have commas in them, so a CSV table-reading function chokes, thinking the column has ended. (I figured this out because I converted the TSV to a CSV so JSTORr could read it, but then the function threw an error that there were more columns than headers. I opened the CSV in Excel, replaced all the commas with spaces, saved it, and successfully loaded it into R through JSTOR_unpack1grams.)

The upshot is that JSTORr users will be using your library with tab-separated citations files rather than comma-separated ones.
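A minimal sketch of why the tab-separated file reads cleanly while a naive comma conversion does not. The file contents and column names below are made up for illustration and are not the real DFR schema; only base R functions are used.

```r
# Build a tiny stand-in for citations.tsv.
# Column names and rows are illustrative assumptions, not real DFR data.
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c(
  "id\ttitle\tabstract",
  "10.2307/1\tFirst article\tShort, with a comma, or two",
  "10.2307/2\tSecond article\tNo commas here"
), tsv_path)

# Tab-separated read: commas inside a field are just text.
citations <- read.delim(tsv_path, stringsAsFactors = FALSE, quote = "")

# Naive conversion to CSV (tabs swapped for commas, no quoting) makes the
# first data row look like it has more fields than the header.
naive_csv <- gsub("\t", ",", readLines(tsv_path), fixed = TRUE)
fields_per_line <- lengths(strsplit(naive_csv, ",", fixed = TRUE))

print(dim(citations))
print(fields_per_line)
```

The mismatch in `fields_per_line` between the header and the first data row is the "more columns than headers" situation described above. Reading the file as tab-separated avoids having to strip commas from the abstracts.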

Thank you! Lilly

benmarwick commented 10 years ago

From the error message you've got, it seems the DFR service has recently changed the way it prepares the archive, which might also explain your #21.

Can you share with me the original, unaltered zip file that you got from dfr.jstor.org? Is that what you put on Dropbox? I can't see the citations file anywhere in there.

gleemie commented 10 years ago

I added the unaltered zip to the Dropbox directory I shared with you, here: https://www.dropbox.com/sh/mol6dzx74bvoqbo/AAATnFVSSUfA5ni5p2oYcF1Ta?dl=0 The citations file JSTOR gave me (in the linked Dropbox directory) is citations.tsv. I then created and fixed a citations.csv, which is also in the directory. The zip should be there now as 2014.9.23.tRfErNXY.zip

benmarwick commented 10 years ago

I've now made a few changes (53b2ebfae6426a9468b787445af528e5b18ec1d1) and tested the package on your dataset and another of mine. These functions work for me with your dataset:

JSTOR_1word(unpack1grams, "society")
JSTOR_2words(unpack1grams, "army", "navy")
JSTOR_2wordcor(unpack1grams, "army", "navy")
JSTOR_1bigram(unpack2grams, "world peace")
JSTOR_2bigramscor(unpack2grams, "world peace", "world war")

Let me know how you go.

gleemie commented 10 years ago

Thanks! I'm new to R and GitHub. Is there a command like install_github("benmarwick/JSTORr") that I can use to pull down this particular commit? Or did you push the fix to the main branch, so that running install_github("benmarwick/JSTORr") again will get your new code?

benmarwick commented 10 years ago

Yes, the fix commit went to the master branch (I've only got that one branch), so if you do

devtools::install_github("benmarwick/JSTORr")

then you'll get the most recent version
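For the other half of the question (pulling down one particular commit), a hedged sketch: `devtools::install_github()` accepts a git reference, either through its `ref` argument or as a `user/repo@ref` string. The hash below is the one quoted earlier in this thread; the `run_install` switch is a hypothetical guard added here so the snippet does not touch the network unless asked to.

```r
# Commit hash taken from the maintainer's earlier comment in this thread.
fix_sha <- "53b2ebfae6426a9468b787445af528e5b18ec1d1"
repo <- "benmarwick/JSTORr"

# Build a "user/repo@ref" spec that pins the install to that commit.
pinned <- paste0(repo, "@", fix_sha)
print(pinned)

# Hypothetical switch: set to TRUE to actually install from GitHub.
run_install <- FALSE
if (run_install && requireNamespace("devtools", quietly = TRUE)) {
  devtools::install_github(pinned)
  # An equivalent form passes the reference separately:
  # devtools::install_github(repo, ref = fix_sha)
}
```

Pinning to a commit is mainly useful for reproducing a result later. For simply picking up the fix, the unpinned call above is enough.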

gleemie commented 10 years ago

Hi there! I got a chance to download your new code and try and run it. Now I get through reading the 1-grams and reshaping the 1-grams into a matrix, but the command ends in an error. I've pasted the message below:

reading 1-grams into R...
|==============================================================| 100%
done
reshaping the 1-grams into a document term matrix...
|==============================================================| 100%
done
arranging bibliographic data...
Error in setwd(path) : cannot change working directory

I'm not sure why I would get this error on my machine but you wouldn't get it on yours. I'm running RStudio on Mac OS 10.9.4.

benmarwick commented 10 years ago

Can you copy me the output from traceback() after you run that line?

gleemie commented 10 years ago

unpack1grams <- JSTOR_unpack1grams(path="Dropbox/research-current/Data Sets/2014.9.23.tRfErNXY/")
reading 1-grams into R...
|==============================================================| 100%
done
reshaping the 1-grams into a document term matrix...
|==============================================================| 100%
done
arranging bibliographic data...
Error in setwd(path) : cannot change working directory

traceback()
2: setwd(path)
1: JSTOR_unpack1grams(path = "Dropbox/research-current/Data Sets/2014.9.23.tRfErNXY/")


Lilly Irani University of California, Irvine http://www.ics.uci.edu/~lirani/

benmarwick commented 10 years ago

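Thanks, I made another test, this time one that you might be able to reproduce. I used boot2docker (http://boot2docker.io/) to start a Docker instance (kind of like a slim virtual machine, see http://www.zdnet.com/what-is-docker-and-why-is-it-so-darn-popular-7000032269/), then at the Docker command line ran this: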

docker run -d -p 8787:8787 benmarwick/ropensci

which will download my Docker image and start a container with R and RStudio. I then go to my browser at localhost:8787 and log into RStudio with username: rstudio and password: rstudio. That gives me a self-contained Linux environment for running R code in RStudio. Then in RStudio I ran these lines:

# install package and load library
devtools::install_github("benmarwick/JSTORr")
library(JSTORr)

# get zip file of DFR data
dir.create("~/my_folder")
setwd("~/my_folder")
temp <- tempfile()
# I uploaded your zip file to my faculty page so I could get a direct download of the zip
download.file("http://faculty.washington.edu/bmarwick/2014.9.23.tRfErNXY.zip", temp, mode = "wb")
unzip(temp)
unlink(temp)

# run unpack function 
unpack1grams <- JSTOR_unpack1grams(path = "~/my_folder")

And it worked just fine, so I can't reproduce your problem. It may be that the unpack1grams function is sensitive to the working directory you start from, or to the final / in the path argument. Let me know how you go testing those options (i.e. remove the final / from path = "Dropbox/research-current/Data Sets/2014.9.23.tRfErNXY/"). If that's the problem then I'll edit the function to deal with it, but neither of those makes a difference in my Docker tests, so maybe it's something else: something Mac-specific, or maybe the space in Data Sets?

Have a go with docker and see if you can make it work there.
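One way to check the working-directory hypothesis locally, sketched with base R only. The folder names mimic the reported path but live under a throwaway temporary root, so nothing here touches a real Dropbox folder. The sketch assumes (it is not confirmed from the package source) that the working directory is no longer the original one by the time `setwd(path)` runs.

```r
# Recreate the shape of the reported path under a temporary root.
root <- tempfile("home_")
rel_path <- file.path("Data Sets", "2014.9.23.tRfErNXY")
dir.create(file.path(root, rel_path), recursive = TRUE)

old_wd <- getwd()

# From the directory the relative path was written against, it resolves,
# space in "Data Sets" and all.
setwd(root)
found_from_root <- dir.exists(rel_path)

# From any other working directory the same string points nowhere,
# and setwd() on it raises an error.
setwd(file.path(root, "Data Sets"))
found_from_elsewhere <- dir.exists(rel_path)
err <- tryCatch(
  {
    setwd(rel_path)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)

# Restore the original working directory before reporting.
setwd(old_wd)
print(c(from_root = found_from_root, from_elsewhere = found_from_elsewhere))
print(err)
```

If the second check fails while the first succeeds, the path string itself is fine and the problem is which directory it is being resolved against.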

gleemie commented 10 years ago

Thanks for this.

I figured it out. The tilde needs to be in the path. My path parameter was relative to the working directory I'd already set before running these commands (setwd("~")). If you want the code to be more robust, you could translate the path parameter the user passes in from a relative to an absolute path before setting it as the working directory.
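A minimal sketch of that suggestion, assuming nothing about how JSTORr is actually written: `with_resolved_dir()` is a hypothetical helper, not a JSTORr function. It resolves the user's path to an absolute one once with `normalizePath()` (which also expands a leading tilde), and restores the caller's working directory on exit.

```r
# Hypothetical helper, not part of JSTORr: run `fun` inside `path` after
# resolving `path` to an absolute location.
with_resolved_dir <- function(path, fun) {
  # mustWork = TRUE stops early with a clear error if the folder is missing.
  abs_path <- normalizePath(path, winslash = "/", mustWork = TRUE)
  old_wd <- setwd(abs_path)            # setwd() returns the previous directory
  on.exit(setwd(old_wd), add = TRUE)   # always restore the caller's directory
  fun(abs_path)
}

# Demonstration on a throwaway folder containing one empty file.
demo_dir <- tempfile("dfr_")
dir.create(demo_dir)
invisible(file.create(file.path(demo_dir, "citations.tsv")))

before <- getwd()
files_seen <- with_resolved_dir(demo_dir, function(p) list.files())
after <- getwd()
print(files_seen)

# A folder that does not exist fails up front, before any directory change.
bad <- tryCatch(
  with_resolved_dir(file.path(demo_dir, "no_such_dir"), identity),
  error = function(e) e
)
print(inherits(bad, "error"))
```

Because the path is made absolute before the first `setwd()`, later directory changes inside the function no longer affect where it points.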

Thank you again and sorry to create hassle! I look forward to working with this.


benmarwick commented 10 years ago

Thanks, glad you got it sorted. Don't hesitate to open another issue if you run into anything else.