Author not detected in Nexis Uni file #10

Closed: JBGruber closed this issue 5 years ago

JBGruber commented 5 years ago

Hmm... I am getting all NAs for Author. Not sure why.

library(LexisNexisTools)
Warning message:
article <- lnt_read("Files(10).DOCX")
Creating LNToutput from 1 file...
Reading DOCX files from Nexis Uni is experimental. Please report any problems in this issue: https://github.com/JBGruber/LexisNexisTools/issues/7
...files loaded [0.06 secs]
...articles split [0.073 secs]
...lengths extracted [0.073 secs]
...headlines extracted [0.074 secs]
...newspapers extracted [0.074 secs]
...dates extracted [0.077 secs]
...authors extracted [0.078 secs]
...sections extracted [0.078 secs]
...editions extracted [0.078 secs]
...dates converted [0.082 secs]
...metadata extracted [0.084 secs]
...article texts extracted [0.09 secs]
...superfluous whitespace removed from articles [0.10 secs]
...superfluous whitespace removed from paragraphs [0.11 secs]
Elapsed time: 0.11 secs
article@meta$Author
[1] NA NA NA NA NA NA NA NA NA NA

JBGruber commented 5 years ago

I moved this here since it seems to be a bigger issue. I'm not sure what's happening. Can you post your sessionInfo()? I'm thinking about it in the meantime.

JBGruber commented 5 years ago

I assume packageDate("LexisNexisTools") returns "2019-07-30"? If so, can you try:


data <- lnt_read("~/Files(10).DOCX", author_keyword = "^Byline:", verbose = FALSE) 

tommyxie commented 5 years ago

Yes, package date is the same. I will try the code later.

packageDate("LexisNexisTools")
[1] "2019-07-30"

R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

tommyxie commented 5 years ago

Here's what was returned after running the code.


data <- lnt_read("~/Files(10).DOCX", author_keyword = "^Byline:", verbose = FALSE)
Reading DOCX files from Nexis Uni is experimental. Please report any problems in this issue: https://github.com/JBGruber/LexisNexisTools/issues/7

data@meta$Author
[1] " Christopher Flavelle Highlight: The world's land is being exploited at an “unprecedented” rate, a United Nations report on climate change warns, putting pressure on food production and amplifying the risk of mass migration."
[2] " Rod Schoonover Highlight: Politics intruded on science and intelligence. That’s why I quit my job as an analyst for the State Department."
[3] " By NATHANIEL RICH Nathaniel Rich is a writer at large for The New York Times Magazine, for which he has written about immortal jellyfish, a 47-hour train ride between New Orleans and Los Angeles and a lawyer's campaign to expose DuPont's profligate use of a toxic chemical. He is the author of three novels, including ''King Zeno,'' which was published in January. George Steinmetz is a photographer who specializes in aerial imagery. He has won numerous awards including three prizes from World Press Photo and the Environmental Vision Award for his work on large-scale agriculture. He has published four books of photography, including his latest, ''New York Air: The View From Above.'' With additional reporting by Jaime Lowe, who is a frequent contributor to the magazine and the author of ''Mental: Lithium, Love and Losing My Mind.'' She previously wrote a feature about the incarcerated women who fight California wildfires."
[4] " By ALAN SANO Body"
[5] NA
[7] " THE LEARNING NETWORK Highlight: A special Earth Day guest lesson, written with NASA’s Goddard Institute for Space Studies, a leader in global climate change research, and the Columbia University Earth Institute. It offers resources for teaching about this issue, while addressing important 21st-century literacy skills."
[8] " Kendra Pierre-Louis Highlight: The average number of heat waves in 50 major American cities has tripled since the 1960s."
[9] " Kendra Pierre-Louis Highlight: The average number of heat waves in 50 major American cities has tripled since the 1960s."
[10] " Henry Fountain Highlight: The four-day hot spell was rare for France and the Netherlands, researchers say, but it used to be a lot rarer."
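In case it is useful as a stopgap: the author field above picks up the text that follows the byline (the "Highlight:" paragraph or the "Body" marker). A minimal base R sketch for trimming it afterwards is shown here. `clean_author()` is a hypothetical helper, not part of LexisNexisTools, and it assumes the extra text always starts with "Highlight:" or is a bare trailing "Body". It will not help with entries like [3], where a biography follows the name without any marker.

```r
# Hypothetical post-processing helper; NOT part of LexisNexisTools.
clean_author <- function(x) {
  # Remove leading and trailing whitespace (NA values stay NA).
  x <- trimws(x)
  # Drop everything from a "Highlight:" block onwards.
  x <- sub("\\s+Highlight:.*$", "", x)
  # Drop a bare "Body" marker at the end of the string.
  x <- sub("\\s+Body$", "", x)
  # Remove a leading "By " prefix, ignoring case.
  x <- sub("^by\\s+", "", x, ignore.case = TRUE)
  x
}

# Sample values copied from the output above.
authors <- c(
  " Kendra Pierre-Louis Highlight: The average number of heat waves in 50 major American cities has tripled since the 1960s.",
  " By ALAN SANO Body",
  NA
)
cleaned <- clean_author(authors)
print(cleaned)
```

With a real LNToutput object, the same function could be applied as `data@meta$Author <- clean_author(data@meta$Author)`.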

JBGruber commented 5 years ago

Thanks. It looks to me like the file was read in with a slightly different encoding on your machine. I have no idea why that might happen tbh. I changed the relevant code and made it explicit that UTF-8 should be used while reading the file in. Please install the newest version and let me know if the behaviour changes.
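To illustrate what "explicit UTF-8" means in general, here is a small base R sketch. It is not the code used inside LexisNexisTools (which reads DOCX, not plain text); the temporary file and the sample lines are made up for the example. It writes two lines as UTF-8 bytes, reads them back while declaring the encoding, and then checks a keyword pattern like "^Byline:" against the result.

```r
# Illustration only; the package's actual DOCX reading code is different.
tmp <- tempfile(fileext = ".txt")
lines_in <- c(
  "Byline: Kendra Pierre-Louis",
  "Highlight: That\u2019s why I quit my job."
)

# Write the raw UTF-8 bytes, independent of the session locale.
con <- file(tmp, open = "wb")
writeLines(enc2utf8(lines_in), con, useBytes = TRUE)
close(con)

# Declare the encoding when reading instead of relying on the locale default.
lines_out <- readLines(tmp, encoding = "UTF-8")

# Non-ASCII strings are now marked as UTF-8; pure ASCII ones stay "unknown".
enc <- Encoding(lines_out)

# The keyword pattern matches only the byline line.
has_byline <- grepl("^Byline:", lines_out)
unlink(tmp)
```

Without the `encoding` argument, `readLines()` assumes the native encoding of the session, which can differ between machines.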

JBGruber commented 5 years ago

Did you have the chance to test this with the new version (packageDate("LexisNexisTools") should return "2019-08-16")?