eMPee584 closed 9 years ago
I just sent :email: to them. So they might fix it soon.
Yes those citations files are a bit of a moving target... they were previously issued by DFR as CSV files, then they suddenly switched to TSV without warning. And they are pretty dirty, as you say, with trailing tabs and column headers off by one.
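A quick way to confirm the "off by one" layout is to count the fields on each line. The following is only a sketch on made-up lines; `count_tsv_fields` is a hypothetical helper, not part of JSTORr, and it counts tabs rather than using `strsplit()`, which drops a trailing empty field.

```r
# Count tab-separated fields per line without losing trailing empty fields.
# (Hypothetical helper for illustration; not part of JSTORr.)
count_tsv_fields <- function(lines) {
  tabs <- regmatches(lines, gregexpr("\t", lines, fixed = TRUE))
  lengths(tabs) + 1L
}

# Made-up lines imitating the layout described above:
# the header has no trailing tab, the data rows do.
lines <- c(
  "id\ttitle\ttype",
  "10.2307/1\tFirst title\tfla\t",
  "10.2307/2\t\tbrv\t"
)

n <- count_tsv_fields(lines)
header_fields <- n[1]
row_fields <- unique(n[-1])

print(n)
if (length(row_fields) == 1 && row_fields == header_fields + 1L) {
  message("data rows have one more field than the header (trailing tab?)")
}
```

For a real download, `lines` would come from `readLines()` on the citations file instead of the toy vector.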
Just to be sure I understand, are you having a specific problem with this package, or commenting on the challenges of working with the citations file?
Both. R will not read in these files unaltered with any set of options I tried, and neither will the current code in this repo. The workarounds you applied might superficially ~work, but they are most definitely not the right approach.
I can read citations.tsv into R correctly if I either append a TAB char to the first line, or remove all trailing \t from the others. If I clean the file up in this manner, your workarounds of course stop working.
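To make the two fixes concrete, here is a small sketch on a made-up three-column file, not the real DFR data. The column names and values are assumptions for illustration only. It reads the lines as-is, then with a tab appended to the header, then with the trailing tabs stripped from the rows.

```r
# Made-up three-column file: the header has no trailing tab, the rows do.
raw <- c(
  "id\ttitle\ttype",
  "10.2307/1\tFirst title\tfla\t",
  "10.2307/2\tSecond title\tbrv\t"
)

read_tsv <- function(lines) {
  read.delim(text = lines, sep = "\t", header = TRUE, row.names = NULL,
             quote = "", comment.char = "", colClasses = "character",
             stringsAsFactors = FALSE)
}

# Read as-is: each row has one more field than the header, so with
# row.names = NULL the header names end up one column to the right.
shifted <- read_tsv(raw)
str(shifted)

# Fix 1: append a tab to the header line only.
fix_header <- raw
fix_header[1] <- paste0(fix_header[1], "\t")
a <- read_tsv(fix_header)

# Fix 2: strip the trailing tab from every data line.
fix_rows <- raw
fix_rows[-1] <- sub("\t$", "", fix_rows[-1])
b <- read_tsv(fix_rows)

# Both fixes should line the named columns up the same way.
cols <- c("id", "title", "type")
all.equal(a[cols], b[cols])
```

Fix 1 leaves an extra, empty last column behind, which is one reason stripping the trailing tabs is the tidier of the two.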
JSTOR_unpack1grams(path = "/K/dfr-jstor/2015.3.22.AZuKdh9R")
reading 1-grams into R...
|======================================================================| 100%
done
reshaping the 1-grams into a document term matrix...
|======================================================================| 100%
done
arranging bibliographic data...
Error in t.default(do.call("c", lapply(list(...), as.TermDocumentMatrix))) :
  argument is not a matrix
Can you share the zip file you got from DFR?
Mailed you a test file I generated a couple of weeks ago. The citations.tsv contains empty fields as well, so some lines contain consecutive tabs. After removing the trailing tab from each line with
sed -i 's/\t$//' citations.tsv
I can import the file properly via
data <- read.delim("/K/dfr-jstor/2015.3.18.F4wNU2c6/citations.tsv", sep="\t", row.names = NULL, comment.char = "", header = TRUE, stringsAsFactors = FALSE, colClasses="character", quote = "")
and str(data) shows everything is in the right place.
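For anyone without sed at hand, the same cleanup can be done from within R. This is a minimal sketch, assuming the file fits in memory; `strip_trailing_tabs` is a hypothetical helper (not part of JSTORr) and the demo file is made up. It writes a cleaned copy instead of editing in place, so the original download stays untouched.

```r
# Strip one trailing tab from every line of a TSV and write the result to a
# new file. Hypothetical helper for illustration; not part of JSTORr.
strip_trailing_tabs <- function(infile, outfile) {
  lines <- readLines(infile, warn = FALSE)
  writeLines(sub("\t$", "", lines), outfile)
  invisible(outfile)
}

# Demo on a made-up file in a temporary location.
infile  <- tempfile(fileext = ".tsv")
outfile <- tempfile(fileext = ".tsv")
writeLines(c("id\ttitle\ttype",
             "10.2307/1\tFirst title\tfla\t",
             "10.2307/2\t\tbrv\t"), infile)

strip_trailing_tabs(infile, outfile)

# Same read.delim options as in the comment above.
data <- read.delim(outfile, sep = "\t", row.names = NULL, comment.char = "",
                   header = TRUE, stringsAsFactors = FALSE,
                   colClasses = "character", quote = "")
str(data)
```

Only the final tab of each line is removed, so an empty field in the middle of a row (two consecutive tabs) is preserved.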
(Only rows 6 and 8 have abstracts; the others lack one or are book reviews/misc.)
Thanks, I've got your zip (I assume this is exactly as you got it from DFR, is that right?) and the JSTOR_unpack1grams function works fine on it for me.
I use the rocker/hadleyverse docker container, which is a Debian Linux OS. If you're using a different OS then I'm afraid I'm not sure how I can help you further (I recommend using docker for reproducibility and isolation)
devtools::install_github("benmarwick/JSTORr")
library(JSTORr)
unzip("2015.3.18.F4wNU2c6.zip")
# unpack
unpack1grams <- JSTOR_unpack1grams()
# inspect
inspect(unpack1grams$wordcounts[,1:5])
<<DocumentTermMatrix (documents: 4, terms: 5)>>
Non-/sparse entries: 8/12
Sparsity : 60%
Maximal term length: 6
Weighting : term frequency (tf)
Terms
Docs school dublin taylor said film
10.2307_29792384 20 16 15 13 12
10.2307_40688303 6 0 0 6 0
10.2307_41131014 0 0 0 0 0
10.2307_41335580 0 0 0 1 0
# find frequent terms in this dataset
sort(colSums(as.matrix(unpack1grams$wordcounts)))
common equitable corporation rodiny
41 42 43 43
business current case electric
45 47 49 50
action will stockholders motor
52 54 55 56
jako power equity corporate
63 65 67 74
directors delaware court board
77 84 87 154
# have a look at 'court'
term_court <- JSTOR_1word(unpack1grams, "court")
# plot
term_court$plot
# I get a plot
# this term is only in two docs...
term_court$word_by_year
word_ratio year V2
1: 1.182732 2009 10.2307_29792384
2: 10.365854 2005 10.2307_40688303
3: 0.000000 1995 10.2307_41131014
4: 0.000000 1901 10.2307_41335580
sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=C LC_COLLATE=C
[5] LC_MONETARY=C LC_MESSAGES=C LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=C LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] stringr_0.6.2 tm_0.6 NLP_0.1-6 plyr_1.8.1
[5] testthat_0.9.1 scales_0.2.4 ggplot2_1.0.0 reshape2_1.4.1
[9] data.table_1.9.4 slam_0.1-32 JSTORr_1.0.20150226
loaded via a namespace (and not attached):
[1] FactoMineR_1.29 MASS_7.3-37 Matrix_1.1-5 Rcpp_0.11.4
[5] SparseM_1.6 XML_3.98-1.1 apcluster_1.4.1 car_2.0-25
[9] chron_2.3-45 cluster_2.0.1 colorspace_1.2-4 digest_0.6.8
[13] flashClust_1.01-2 ggdendro_0.1-15 grid_3.1.2 gridExtra_0.9.1
[17] gtable_0.1.2 igraph_0.7.1 labeling_0.3 lattice_0.20-29
[21] lda_1.3.2 leaps_2.9 lme4_1.1-7 mgcv_1.8-4
[25] minqa_1.2.4 munsell_0.4.2 nlme_3.1-119 nloptr_1.0.4
[29] nnet_7.3-8 openNLP_0.2-4 openNLPdata_1.5.3-1 parallel_3.1.2
[33] pbkrtest_0.4-2 proto_0.3-10 quantreg_5.11 rJava_0.9-6
[37] scatterplot3d_0.3-35 snowfall_1.84-6 splines_3.1.2 tools_3.1.2
The ZIP file is unaltered.
> the JSTOR_unpack1grams function works fine on it for me.
If you print the data that was read in, you'll see the column names don't match up with the contents. It "works fine" because you manually shifted the indices (the workaround I was referring to). Of course that's one way to deal with it. But maybe they will fix their file generation, and the workaround would have to be reverted in that case.
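One way to avoid having to revert anything later would be to clean the file only when it actually shows the problem. The sketch below is an assumption about how that could look, not what JSTORr does: `read_citations_lines` is a hypothetical helper, the sample lines are made up, and the trailing tab is stripped only if every data row ends in a tab and has exactly one more tab than the header.

```r
# Number of tab characters on each line.
n_tabs <- function(x) nchar(gsub("[^\t]", "", x))

# Hypothetical helper (not part of JSTORr): read citation lines, removing the
# trailing tab from the data rows only when the rows are one field longer
# than the header and all of them end in a tab.
read_citations_lines <- function(lines) {
  body <- lines[-1]
  needs_fix <- length(body) > 0 &&
    all(n_tabs(body) == n_tabs(lines[1]) + 1L) &&
    all(grepl("\t$", body))
  if (needs_fix) {
    lines[-1] <- sub("\t$", "", body)
  }
  read.delim(text = lines, sep = "\t", header = TRUE, row.names = NULL,
             quote = "", comment.char = "", colClasses = "character",
             stringsAsFactors = FALSE)
}

# Made-up input in both shapes: as currently generated, and already clean.
dirty <- c("id\ttitle\ttype",
           "10.2307/1\tFirst title\tfla\t",
           "10.2307/2\t\tbrv\t")
clean <- sub("\t$", "", dirty)

a <- read_citations_lines(dirty)
b <- read_citations_lines(clean)
all.equal(a, b)
```

With a real download this would be called as `read_citations_lines(readLines("citations.tsv"))`, and the later column lookups could then go by name instead of by shifted index.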
uhmm... when I tried some manual parsing on the file, I was checking JSTOR_unpack1grams.R for reference, but always had a shift in the data. Turns out DFR JSTOR is generating a file that I couldn't find any way to read.table() in correctly, because it contains TRAILING TABS EVERYWHERE EXCEPT THE HEADER ROW. The easiest fix for me was to remove these trailing tabs outside of R with
sed -i 's/\t$//' citations.tsv
I tried
sed -i '1 s/$/\t/' citations.tsv
initially, which also works but is more Stupid™. So the offset you tried to compensate for in the unpack1gram function probably stems from that:
# note that citation type is not in the correct column
It didn't work out anyway, from a quick glance at the result. And it obviously fails on the TSV with trailing tabs removed. So I guess you will come up with better ideas than me about how to deal with this. :+1: