HumanitiesDataAnalysis / hathidy

Download and manipulate HathiTrust wordcount data in the tidyverse
MIT License

Problems with missing libraries on non-prebuilt packages. #5

Open standap opened 3 years ago

standap commented 3 years ago

Hello Ben, I followed your vignette at https://humanitiesdataanalysis.github.io/hathidy/articles/Hathidy.html, but when I tried to pull the counts for all of Gibbon's volumes with `gibbon_books = hathi_counts(gibbon, cols = c("page", "token")) %>% inner_join(gibbon_vols)`, I got the error:

`by` must be supplied when `x` and `y` have no common variables.
ℹ Use `by = character()` to perform a cross-join.
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning message:
Unknown or uninitialised column: `htid`.

I was able to work with your script on an individual item, "nyp.33433081597290" but not on the whole set.
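(For anyone triaging a similar error: naming the join key explicitly, instead of relying on `inner_join()`'s common-variable detection, surfaces the real problem immediately. This is a sketch that assumes the vignette's `gibbon` and `gibbon_vols` objects and the 2.0-era `htid` identifier column.)

```r
library(dplyr)
library(hathidy)

# Name the join key explicitly; if the `htid` column is genuinely
# missing from either table, this fails with a clear message about
# that column instead of the ambiguous cross-join error.
gibbon_books <- hathi_counts(gibbon, cols = c("page", "token")) %>%
  inner_join(gibbon_vols, by = "htid")
```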

bmschmidt commented 3 years ago

Oops, my apologies. It looks like I neglected to update the pkgdown pages with the vignettes when the rest of the package was bumped from 1.0 to 2.0. Let me see what I can do about that.

bmschmidt commented 3 years ago

OK, the website is updated to 2.0 with an additional vignette that shows the use of quanteda functions on Hathi wordcounts. But it looks like there was also a missing merge from the dev branch fixing a conflict between the old id name ("id") and the new one ("htid"). So if you reinstall from GitHub, it should work now.

standap commented 3 years ago

Thank you for looking into this; that was super fast. I have reinstalled the package, but I am still not getting the dataframe. The JSON files are downloaded into the local directory as expected, but the `Unknown or uninitialised column: htid.` warning persists.

[Screenshot from 2021-06-07 15-31-07]

[Screenshot from 2021-06-07 15-34-48]

bmschmidt commented 3 years ago

Hmm, weird. What kind of system is this? Those feather files should not be zero bytes; you're right to flag it. Maybe try:

  1. Completely delete the folder at `Desktop/hathiTrust_intro/hathi-features/`, restart R, and try again.
  2. Run `arrow::arrow_info()` to see whether you have a version of arrow > 2.0 and whether zstd compression is enabled.
  3. Try `gibbon_books = hathi_counts(gibbon, cols = c("page", "token"), cache = FALSE) %>% inner_join(gibbon_vols)`, which should run, though substantially more slowly than with the feather caching.

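(A minimal sketch of check 2, assuming `arrow_info()`'s fields are as in current arrow releases: a `version` field plus a named logical `capabilities` vector.)

```r
library(arrow)

# Inspect the build: the version must be recent enough, and the
# compression codecs must have been compiled in for feather caching
# to produce non-empty files.
info <- arrow_info()
info$version                 # package version; should be > 2.0
info$capabilities[["zstd"]]  # TRUE only if zstd support was compiled in
```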
standap commented 3 years ago

Thanks, Ben, for your quick response and the pointers. I am on Ubuntu 21.04; R 4.0.4 (2021-02-15); RStudio 1.4.1106.

It seems it can all be traced back to the arrow package. After I ran `arrow::arrow_info()`, all the compression methods were set to FALSE, so I reinstalled the package with `install_arrow(binary = FALSE, minimal = FALSE)`, following https://stackoverflow.com/questions/63096059/how-to-get-the-arrow-package-for-r-with-lz4-support. Once I reinstalled the arrow package, everything works and the feather files have non-zero sizes.
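(For anyone landing here later, the reinstall-and-verify sequence described above boils down to the following sketch; note that a source build of arrow can take several minutes.)

```r
# Build arrow from source with full features so the compression codecs
# (zstd, lz4, snappy, ...) are compiled in; minimal binary builds on
# some Linux setups omit them, which yields zero-byte feather caches.
arrow::install_arrow(binary = FALSE, minimal = FALSE)

# After restarting R, the codec entries in the capabilities report
# should now be TRUE.
arrow::arrow_info()
```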

        ├── [    3059234 Jun  8 09:19]  nyp.33433081597191.feather
        ├── [     236415 Jun  8 09:19]  nyp.33433081597191.json.bz2
        ├── [    2409754 Jun  8 09:19]  nyp.33433081597290.feather
        └── [     193192 Jun  8 09:19]  nyp.33433081597290.json.bz2