benmarwick / JSTORr

Simple text mining of journal articles from JSTOR's Data for Research service

JSTOR citations.tsv is'a b0rkedz #24

Closed eMPee584 closed 9 years ago

eMPee584 commented 9 years ago

Uhm... when I tried some manual parsing of the file, I was checking JSTOR_unpack1grams.R for reference, but I always had a shift in the data. It turns out DFR JSTOR is generating a file that I couldn't find any way to read.table() correctly, because it contains TRAILING TABS EVERYWHERE EXCEPT THE HEADER ROW. The easiest fix for me was to remove these trailing tabs outside of R with sed -i 's/\t$//' citations.tsv. Initially I tried sed -i '1 s/$/\t/' citations.tsv, which also works but is moreStupid™.

So the offset you tried to compensate for in the unpack1grams function probably stems from that:

# note that citation type is not in the correct column

From a quick glance at the result, that didn't work out anyway, and it obviously fails on the TSV with trailing tabs removed. So I guess you will come up with better ideas than me about how to deal with this. :+1:
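The shift described above can be reproduced on a toy file. The following is a minimal R sketch, assuming made-up column names and values rather than the real citations.tsv layout. It reads rows that carry a trailing tab, then strips that tab inside R (the same idea as the sed command) and reads again.

```r
# Toy stand-in for citations.tsv: the header has 3 fields, but every data
# row ends in an extra tab. Column names and values are made up.
lines <- c(
  "id\ttitle\ttype",
  "10.2307_1\tFirst title\tfla\t",
  "10.2307_2\tSecond title\tbrv\t"
)

# Read as-is: each data row has one more field than the header.
broken <- read.delim(text = lines, sep = "\t", row.names = NULL,
                     quote = "", stringsAsFactors = FALSE)
print(names(broken))
print(broken)

# Equivalent of sed 's/\t$//': drop one trailing tab per line, then re-read.
cleaned <- sub("\t$", "", lines)
fixed <- read.delim(text = cleaned, sep = "\t", row.names = NULL,
                    quote = "", stringsAsFactors = FALSE)
print(fixed)
```

When the data rows have exactly one field more than the header, read.table() treats the first column as row names, which is consistent with the shifted columns reported here.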

eMPee584 commented 9 years ago

I just sent :email: to them. So they might fix it soon.

benmarwick commented 9 years ago

Yes those citations files are a bit of a moving target... they were previously issued by DFR as CSV files, then they suddenly switched to TSV without warning. And they are pretty dirty, as you say, with trailing tabs and column headers off by one.

Just to be sure I understand, are you having a specific problem with this package, or commenting on the challenges of working with the citations file?

eMPee584 commented 9 years ago

Both. R will not read these files in unaltered with any option set I tried, and neither will the current code in this repo. The workarounds you applied might superficially ~work, but that is most definitely not the right approach. I can read citations.tsv into R correctly if I either append a TAB char to the first line or remove all trailing \t from the others. If I clean the file up in this manner, your workarounds of course stop working.

JSTOR_unpack1grams(path = "/K/dfr-jstor/2015.3.22.AZuKdh9R")
reading 1-grams into R...
  |======================================================================| 100%
done
reshaping the 1-grams into a document term matrix...
  |======================================================================| 100%
done
arranging bibliographic data...
Error in t.default(do.call("c", lapply(list(...), as.TermDocumentMatrix))) : argument is not a matrix

benmarwick commented 9 years ago

Can you share the zip file you got from DFR?

eMPee584 commented 9 years ago

Mailed you a test file I generated a couple of weeks ago. The citations.tsv contains empty fields as well, so some lines contain consecutive tabs. After removing the trailing tab from each line with sed -i 's/\t$//' citations.tsv, I can import the file properly via

data <- read.delim("/K/dfr-jstor/2015.3.18.F4wNU2c6/citations.tsv", sep="\t", row.names = NULL, comment.char = "", header = TRUE, stringsAsFactors = FALSE, colClasses="character", quote = "")

and str(data) shows everything is in the right place.
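To illustrate what those read.delim() options do, here is a small sketch on made-up rows (the column names and values are assumptions, not the real citations.tsv layout). One row has an empty field, which gives two consecutive tabs, and one title contains an unbalanced double quote.

```r
# Made-up rows with trailing tabs already removed. Row 1 has an empty
# "abstract" field (two consecutive tabs); row 2 has a stray double quote.
lines <- c(
  "id\ttitle\tabstract\ttype",
  "10.2307_1\tA review\t\tbrv",
  "10.2307_2\tThe \"Dublin school\tSome abstract text\tfla"
)

# Same options as in the comment above:
#   quote = ""               treats quote characters as ordinary text
#   colClasses = "character" keeps every column as text, so empty fields
#                            stay empty strings
data <- read.delim(text = lines, sep = "\t", row.names = NULL,
                   comment.char = "", header = TRUE,
                   stringsAsFactors = FALSE, colClasses = "character",
                   quote = "")
str(data)
```

With quoting left enabled, an unbalanced quote inside a title could be read as the start of a quoted field that runs across the following tabs or lines, so disabling it seems a reasonable default for this kind of file.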

eMPee584 commented 9 years ago

(Only rows 6 and 8 have abstracts; the others lack one or are book reviews/misc.)

benmarwick commented 9 years ago

Thanks, I've got your zip (I assume this is exactly as you got it from DFR, is that right?) and the JSTOR_unpack1grams function works fine on it for me.

I use the rocker/hadleyverse docker container, which is a Debian Linux OS. If you're using a different OS then I'm afraid I'm not sure how I can help you further (I recommend using docker for reproducibility and isolation)

devtools::install_github("benmarwick/JSTORr")
library(JSTORr)

unzip("2015.3.18.F4wNU2c6.zip")

# unpack
unpack1grams <- JSTOR_unpack1grams()

# inspect
inspect(unpack1grams$wordcounts[,1:5])

<<DocumentTermMatrix (documents: 4, terms: 5)>>
Non-/sparse entries: 8/12
Sparsity           : 60%
Maximal term length: 6
Weighting          : term frequency (tf)

                  Terms
Docs               school dublin taylor said film
  10.2307_29792384     20     16     15   13   12
  10.2307_40688303      6      0      0    6    0
  10.2307_41131014      0      0      0    0    0
  10.2307_41335580      0      0      0    1    0

# find frequent terms in this dataset
sort(colSums(as.matrix(unpack1grams$wordcounts)))

            common          equitable        corporation             rodiny 
                41                 42                 43                 43 
          business            current               case           electric 
                45                 47                 49                 50 
            action               will       stockholders              motor 
                52                 54                 55                 56 
              jako              power             equity          corporate 
                63                 65                 67                 74 
         directors           delaware              court              board 
                77                 84                 87                154 

# have a look at 'court'
term_court <- JSTOR_1word(unpack1grams, "court")

# plot
term_court$plot
# I get a plot

# this term is only in two docs...
term_court$word_by_year

   word_ratio year               V2
1:   1.182732 2009 10.2307_29792384
2:  10.365854 2005 10.2307_40688303
3:   0.000000 1995 10.2307_41131014
4:   0.000000 1901 10.2307_41335580

 sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C         LC_TIME=C            LC_COLLATE=C        
 [5] LC_MONETARY=C        LC_MESSAGES=C        LC_PAPER=C           LC_NAME=C           
 [9] LC_ADDRESS=C         LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C 

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] stringr_0.6.2       tm_0.6              NLP_0.1-6           plyr_1.8.1         
 [5] testthat_0.9.1      scales_0.2.4        ggplot2_1.0.0       reshape2_1.4.1     
 [9] data.table_1.9.4    slam_0.1-32         JSTORr_1.0.20150226

loaded via a namespace (and not attached):
 [1] FactoMineR_1.29      MASS_7.3-37          Matrix_1.1-5         Rcpp_0.11.4         
 [5] SparseM_1.6          XML_3.98-1.1         apcluster_1.4.1      car_2.0-25          
 [9] chron_2.3-45         cluster_2.0.1        colorspace_1.2-4     digest_0.6.8        
[13] flashClust_1.01-2    ggdendro_0.1-15      grid_3.1.2           gridExtra_0.9.1     
[17] gtable_0.1.2         igraph_0.7.1         labeling_0.3         lattice_0.20-29     
[21] lda_1.3.2            leaps_2.9            lme4_1.1-7           mgcv_1.8-4          
[25] minqa_1.2.4          munsell_0.4.2        nlme_3.1-119         nloptr_1.0.4        
[29] nnet_7.3-8           openNLP_0.2-4        openNLPdata_1.5.3-1  parallel_3.1.2      
[33] pbkrtest_0.4-2       proto_0.3-10         quantreg_5.11        rJava_0.9-6         
[37] scatterplot3d_0.3-35 snowfall_1.84-6      splines_3.1.2        tools_3.1.2 

eMPee584 commented 9 years ago

The ZIP file is unaltered.

> the JSTOR_unpack1grams function works fine on it for me.

If you print the data that was read in, you'll see the column names don't match up with the contents. It "works fine" because you manually shifted the indices (the workaround I was referring to). Of course, that's one way to deal with it. Maybe they will fix their file generation, in which case the workaround has to be reverted.
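One way to avoid having to revert anything later would be to apply the fix only when the file actually shows the trailing-tab pattern. The sketch below assumes a hypothetical helper named read_citations(), which is not part of JSTORr, and uses made-up column names. It compares field counts from count.fields() and strips the trailing tab only when every data row has exactly one field more than the header.

```r
# Hypothetical helper (not part of JSTORr): strip trailing tabs only when the
# data rows have one field more than the header, so a file that has already
# been fixed upstream is read unchanged.
read_citations <- function(path) {
  n_fields <- count.fields(path, sep = "\t", quote = "", comment.char = "")
  lines <- readLines(path, warn = FALSE)
  lines <- lines[nzchar(lines)]  # count.fields() skips blank lines too

  # Trailing-tab signature: every data row ends in a tab and has exactly
  # one field more than the header.
  has_extra <- length(n_fields) > 1 &&
    all(n_fields[-1] == n_fields[1] + 1) &&
    all(endsWith(lines[-1], "\t"))
  if (has_extra) lines[-1] <- sub("\t$", "", lines[-1])

  read.delim(text = lines, sep = "\t", row.names = NULL, quote = "",
             comment.char = "", stringsAsFactors = FALSE,
             colClasses = "character")
}

# Usage on two toy files: one with trailing tabs, one without.
header <- "id\ttitle\ttype"
rows <- c("10.2307_1\tFirst title\tfla", "10.2307_2\tSecond title\tbrv")
dirty <- tempfile(fileext = ".tsv")
clean <- tempfile(fileext = ".tsv")
writeLines(c(header, paste0(rows, "\t")), dirty)
writeLines(c(header, rows), clean)

from_dirty <- read_citations(dirty)
from_clean <- read_citations(clean)
print(identical(from_dirty, from_clean))
```

Checking the field counts rather than only the last character matters because a file whose last column is legitimately empty also has rows ending in a tab, but its field count still matches the header.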