JBGruber / LexisNexisTools

:newspaper: Working with newspaper data from 'LexisNexis'

Author not detected in Nexis Uni file #10

Closed JBGruber closed 5 years ago

JBGruber commented 5 years ago

Hmm... I am getting all NAs for Author. Not sure why.

library(LexisNexisTools)
Warning message:
article <- lnt_read("Files(10).DOCX")
Creating LNToutput from 1 file...
Reading DOCX files from Nexis Uni is experimental. Please report any problems in this issue: https://github.com/JBGruber/LexisNexisTools/issues/7
...files loaded [0.06 secs]
...articles split [0.073 secs]
...lengths extracted [0.073 secs]
...headlines extracted [0.074 secs]
...newspapers extracted [0.074 secs]
...dates extracted [0.077 secs]
...authors extracted [0.078 secs]
...sections extracted [0.078 secs]
...editions extracted [0.078 secs]
...dates converted [0.082 secs]
...metadata extracted [0.084 secs]
...article texts extracted [0.09 secs]
...superfluous whitespace removed from articles [0.10 secs]
...superfluous whitespace removed from paragraphs [0.11 secs]
Elapsed time: 0.11 secs
article@meta$Author
[1] NA NA NA NA NA NA NA NA NA NA

Originally posted by @tommyxie in https://github.com/JBGruber/LexisNexisTools/issues/7#issuecomment-520876120

JBGruber commented 5 years ago

I moved this here since it seems to be a bigger issue. I'm not sure what's happening. Can you post your sessionInfo()? I'm thinking about it in the meantime.

JBGruber commented 5 years ago

I assume packageDate("LexisNexisTools") returns "2019-07-30"? If so, can you try:

library(LexisNexisTools)

data <- lnt_read("~/Files(10).DOCX", author_keyword = "^Byline:", verbose = FALSE) 

data@meta$Author

tommyxie commented 5 years ago

Yes, package date is the same. I will try the code later.

packageDate("LexisNexisTools")
[1] "2019-07-30"

sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] LexisNexisTools_0.2.3.9000

loaded via a namespace (and not attached):
[1] Rcpp_1.0.2 quanteda_1.5.1 pillar_1.4.2 compiler_3.5.3
[5] prettyunits_1.0.2 remotes_2.1.0 tools_3.5.3 stopwords_1.0
[9] pkgbuild_1.0.4 lubridate_1.7.4 tibble_2.1.3 gtable_0.3.0
[13] lattice_0.20-38 pkgconfig_2.0.2 rlang_0.4.0 Matrix_1.2-15
[17] fastmatch_1.1-0 cli_1.1.0 rstudioapi_0.10 curl_3.3
[21] parallel_3.5.3 xml2_1.2.2 withr_2.1.2 dplyr_0.8.0.1
[25] stringr_1.4.0 rprojroot_1.3-2 grid_3.5.3 tidyselect_0.2.5
[29] glue_1.3.1 data.table_1.12.2 R6_2.4.0 processx_3.3.0
[33] pbapply_1.4-1 callr_3.2.0 ggplot2_3.2.1 purrr_0.3.2
[37] spacyr_1.2 magrittr_1.5 backports_1.1.4 ps_1.3.0
[41] scales_1.0.0 stringdist_0.9.5.2 assertthat_0.2.1 colorspace_1.4-1
[45] striprtf_0.5.2 stringi_1.4.3 lazyeval_0.2.2 RcppParallel_4.4.3
[49] munsell_0.5.0 crayon_1.3.4

tommyxie commented 5 years ago

Here's what was returned after running the code.

library(LexisNexisTools)

data <- lnt_read("~/Files(10).DOCX", author_keyword = "^Byline:", verbose = FALSE)
Reading DOCX files from Nexis Uni is experimental. Please report any problems in this issue: https://github.com/JBGruber/LexisNexisTools/issues/7

data@meta$Author
[1] " Christopher Flavelle Highlight: The world's land is being exploited at an “unprecedented” rate, a United Nations report on climate change warns, putting pressure on food production and amplifying the risk of mass migration."
[2] " Rod Schoonover Highlight: Politics intruded on science and intelligence. That’s why I quit my job as an analyst for the State Department."
[3] " By NATHANIEL RICH Nathaniel Rich is a writer at large for The New York Times Magazine, for which he has written about immortal jellyfish, a 47-hour train ride between New Orleans and Los Angeles and a lawyer's campaign to expose DuPont's profligate use of a toxic chemical. He is the author of three novels, including ''King Zeno,'' which was published in January. George Steinmetz is a photographer who specializes in aerial imagery. He has won numerous awards including three prizes from World Press Photo and the Environmental Vision Award for his work on large-scale agriculture. He has published four books of photography, including his latest, ''New York Air: The View From Above.'' With additional reporting by Jaime Lowe, who is a frequent contributor to the magazine and the author of ''Mental: Lithium, Love and Losing My Mind.'' She previously wrote a feature about the incarcerated women who fight California wildfires."
[4] " By ALAN SANO Body"
[5] NA
[6] " By KENDRA PIERRE-LOUIS Body"
[7] " THE LEARNING NETWORK Highlight: A special Earth Day guest lesson, written with NASA’s Goddard Institute for Space Studies, a leader in global climate change research, and the Columbia University Earth Institute. It offers resources for teaching about this issue, while addressing important 21st-century literacy skills."
[8] " Kendra Pierre-Louis Highlight: The average number of heat waves in 50 major American cities has tripled since the 1960s."
[9] " Kendra Pierre-Louis Highlight: The average number of heat waves in 50 major American cities has tripled since the 1960s."
[10] " Henry Fountain Highlight: The four-day hot spell was rare for France and the Netherlands, researchers say, but it used to be a lot rarer."
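
In the output above, the Author field picks up extra text after the name (a "Highlight:" summary or a trailing "Body" marker) and sometimes a leading "By". As a stopgap, the strings could be trimmed after reading. The following is only a minimal sketch: clean_author() is a hypothetical helper, not part of LexisNexisTools, and it assumes that everything from "Highlight:" onwards and a trailing "Body" are never part of the name. It would not shorten an entry like [3], where a biography follows the name without a marker.

```r
# Hypothetical helper (not part of LexisNexisTools): trims the extra text
# that ended up in the Author field in the output above.
clean_author <- function(x) {
  x <- sub("\\s+Highlight:.*$", "", x)  # drop everything from " Highlight:" on
  x <- sub("\\s+Body$", "", x)          # drop a trailing "Body" marker
  x <- sub("^\\s*By\\s+", "", x)        # drop a leading "By "
  trimws(x)                             # remove leftover whitespace; NA stays NA
}

# Shortened sample values modelled on the output above
authors <- c(
  " Christopher Flavelle Highlight: The world's land is being exploited.",
  " By ALAN SANO Body",
  NA
)
cleaned <- clean_author(authors)
print(cleaned)
```

With a real LNToutput object, the same helper could be applied as data@meta$Author <- clean_author(data@meta$Author).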

JBGruber commented 5 years ago

Thanks. It looks to me like the file was read in with a slightly different encoding on your machine. I have no idea why that might happen tbh. I changed the relevant code and made it explicit that UTF-8 should be used while reading the file in. Please install the newest version and let me know if the behaviour changes.
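
For readers who want to see what "making UTF-8 explicit" can look like in base R, here is a small sketch. It is not the code that was changed in LexisNexisTools; the temporary text file and the sample lines are made up for illustration, and the real package reads DOCX files rather than plain text.

```r
# Illustration only: write a small UTF-8 text file, then read it back while
# stating the encoding explicitly instead of relying on the session locale.
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "w", encoding = "UTF-8")
writeLines(c("Byline: Zo\u00eb Example", "Body"), con)
close(con)

# encoding = "UTF-8" marks the strings that were read as UTF-8
lines <- readLines(tmp, encoding = "UTF-8")

# Sanity checks that can help when a keyword unexpectedly fails to match
valid <- validUTF8(lines)            # TRUE where the bytes are valid UTF-8
marks <- Encoding(lines)             # declared encoding of each string
hits  <- grepl("^Byline:", lines)    # does the author keyword match?

print(data.frame(lines, valid, marks, hits))
unlink(tmp)
```

If validUTF8() returns FALSE for lines from a real file, or the keyword matches on one machine but not another, an encoding mismatch is a plausible suspect.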

JBGruber commented 5 years ago

Did you have the chance to test this with the new version? (packageDate("LexisNexisTools") should return "2019-08-16".)