JBGruber / LexisNexisTools

:newspaper: Working with newspaper data from 'LexisNexis'

New LexisNexis interface: Nexis Uni #7

Closed: mrmvergeer closed this issue 4 years ago

mrmvergeer commented 5 years ago

First off: your package is great. I used it frequently, without any issues :-). Recently, LexisNexis introduced a new web interface: Lexis Uni. The download formats won't work well with your package. This is a recent change and I am sure you haven't thought about it yet. Still, are you considering adapting your package to the new download formats? One major drawback, though: as far as I can tell, the download options offered in the new interface don't immediately provide the entire articles for download, only summaries.

JBGruber commented 5 years ago

I don't know Lexis Uni. All I could find were mentions of Nexis Uni, which seems to be a replacement for LexisNexis Academic (correct?). I had only ever heard of LexisNexis Academic from friends based at Dutch universities, and as far as I understood it, it is just a different interface with the same technology and data in the background as nexis.com. My university has used the normal (as far as I can tell) nexis.com interface. That's what I used for writing the package.

I would like to provide support for the new interface but I'm not quite sure I get what it is. Can you switch back to the old way of searching and downloading? Are there any advantages of Lexis/Nexis Uni? Can you still download txt files? If you only get summaries, do you just want to extract the metadata?

If you could send me an example, I can look into the file structure and maybe it still works with a few minor adjustments.

mrmvergeer commented 5 years ago

Apologies for the delay. Here's an example retrieved from Lexis Uni. The first file is the result of a search query: ten articles saved as a Word DOCX. The PDF is the first article, opened by clicking the title link. So it seems that the direct export of multiple complete articles has been removed.

Resultatenlijst voor_ aanslag.DOCX first article.pdf first article html.zip

Also, Lexis Uni will replace LexisNexis at my university in 10 days (without any notice :-( ). I wonder whether it is possible to scrape the HTML page (see third file: HTML page of the first article). The challenge would be to follow the link from R while having access to LN. I am not sure whether they have an API of some sort. If you have further questions or thoughts, let me know.

JBGruber commented 5 years ago

This looks like screenshots from the site. Have you checked Nexis' documentation for the new interface? This, for example? In your PDF it looks like you have a download button.

It seems they removed txt as a download option. Could you confirm this? According to the documentation, .pdf, .docx and .rtf are the available options now for full text. If I had one of each, I could check whether I can make the package compatible with one or more of the formats.

JBGruber commented 5 years ago

Hi @mrmvergeer, has there been any progress with trying to read in files from the new interface?

I started to support more file formats (see #8) so if you manage to download full-text articles from Nexis Uni it should be possible to use them with the package now. Just install the development version with:

devtools::install_github("JBGruber/LexisNexisTools")

JBGruber commented 5 years ago

I have access to Nexis Uni while visiting a different institution and have to say: it is worse in every aspect that counts (the UI looks nicer, but I don't care about that at all if it breaks the functionality).

I don't have the time to take care of this at the moment but will try to update the package as soon as possible. Maybe a fix is easier than I think at the moment. But for now this is a bigger problem than I thought.

mrmvergeer commented 5 years ago

Sorry to respond late; I'm swamped with teaching. Yeah, I suspected this would be the case. In a spare moment I tried to get some sample material for you, but navigating the UI is worse than before, IMHO. Until September I'm quite busy, but if you need input, let me know.

erickaakcire commented 5 years ago

(BTW, hi Johannes, nice to meet you at EPSA, and thanks for sharing this!) My university was switched over to the new "Nexis Uni." I was hoping there was some way to get back the previous functionality of downloading text files, but it doesn't look like it. RTF is the closest option, and just 10 at a time... They have a single-file RTF format and an RTF format that provides an individual file per story, so I'm not sure which will end up being easier to parse. I think I will have to make this work, as I don't see any other options for historical news content in different countries. I have a student doing an independent study starting in the fall, so I think we can help contribute to getting this running with the new format. We can coordinate via email so as not to duplicate efforts; I just wanted to post in case others also want to contribute or have other solutions.

JBGruber commented 5 years ago

Thanks, @erickaakcire (it was nice to meet you, too!)!

Surprisingly, after some testing with the two available packages for parsing RTF files (striprtf and unrtf), it looks like DOCX is the best option. Reading files into R is a lot faster, seemingly cleaner, and needs just a few lines of code.
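
For illustration, those few lines can look something like this (a minimal sketch using xml2; the package's actual implementation may differ in the details):

library(xml2)
# a DOCX file is a ZIP archive; the document text lives in word/document.xml
docx <- read_xml(unz("Files(10).DOCX", "word/document.xml"))
# extract one string per paragraph (the <w:p> elements)
lines <- xml_text(xml_find_all(docx, "//w:p"))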

Another advantage is that DOCX was not available before the switch, which means we can use the file format to detect whether a file came from Nexis Uni instead of the old interface and divert it to a set of new parsing rules.

One problem I found is that I'm not sure how to remove the cover page, as there is no "X of X Documents" line to indicate where an article starts. But besides that, the other keywords are relatively easy to find: "End of Document" marks the end of each article and "Body" marks the end of the metadata lines.
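
To make the chopping-up step concrete, here is a minimal sketch of the splitting logic just described (not the package's actual code; it assumes the text is already a character vector with one element per paragraph and that every article contains a "Body" line):

# split the lines of a Nexis Uni file into articles, using the keywords above
split_articles <- function(lines) {
  ends <- grep("^End of Document$", lines)  # last line of each article
  starts <- c(1L, head(ends, -1L) + 1L)     # line after the previous "End of Document"
  mapply(function(s, e) {
    article <- lines[s:e]
    body <- grep("^Body$", article)[1]      # metadata ends at "Body"
    list(meta = article[seq_len(body - 1L)],
         text = article[(body + 1L):(length(article) - 1L)])
  }, starts, ends, SIMPLIFY = FALSE)
}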

Once the articles are chopped up properly, I can start adapting the parsing rules. I could really use some help testing the robustness of the new parsing rules (especially if you are downloading many files anyway). But if you want to help with anything else, let me know!

JBGruber commented 5 years ago

After finding out how to read in files properly, writing the parsing rules was easier than I thought. It would be great if a few people could test if everything is working. Just install the branch with:

remotes::install_github("JBGruber/LexisNexisTools")

Reading in files works just as before:

articles <- lnt_read("Files(10).DOCX")

The files I used for testing did not have different editions, so edition will be NA until I have time and appropriate testing files to implement the feature.
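
As a quick sanity check after reading, the slots of the returned object can be inspected directly (the same slots appear in the examples further down this thread):

articles@meta          # one row of metadata per article
articles@articles      # the article texts
articles@meta$Author   # e.g., check that bylines were picked up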

profkenm commented 5 years ago

Thank you for your work on this. Let me mention that I was able to change the settings on Nexis Uni to allow for slightly faster downloading. Specifically, I was able to change the settings so that one can load/click on 50 documents at a time rather than the default 10. And one can download (or email) up to 100 documents at a time. Here is how I did it:

Set up an account in Nexis Uni (even if your institution has a subscription) and log in. Next, click on your username in the top right of the screen. In the drop-down menu, click "Settings." Then, under "Number of results to display per page," change it from 10 to 50 (50 is the maximum). Then, at the bottom of the page, click "Save settings and close."

Search Nexis Uni. Results should now display 50 documents at a time (rather than 10). Click the box at the top of the results to select all 50 documents. Then go to page 2 of the results and click the box at the top of that page. You have now selected 100 documents. Download or email those 100 documents. If you attempt to download or email more than 100, you will get an error message saying that 100 is the limit. Repeat this process until all the documents in your search are downloaded.

When saving them to a folder, the choices are PDF and Word. When emailing them, they are RTF, PDF, and Word.

We are currently working with CNN transcripts downloaded from the CNN website (which are more tractable, on the whole, than those obtained from LexisNexis). But we will come back to Nexis Uni soon and try your fix. Thanks again for your work on it.

Ken Mulligan, Political Science, Southern Illinois University Carbondale

JBGruber commented 5 years ago

I don't see an issue with the file. Can you elaborate on what the problem is?

library(LexisNexisTools)
#> LexisNexisTools Version 0.2.3.9000

data <- lnt_read("~/Files(10).DOCX", verbose = FALSE) 
#> Reading DOCX files from Nexis Uni is experimental. Please report any problems in this issue: https://github.com/JBGruber/LexisNexisTools/issues/7

data@articles
#> # A tibble: 10 x 2
#>       ID Article                                                           
#>    <int> <chr>                                                             
#>  1     1 Want climate news in your inbox? Sign up here for Climate Fwd:, o~
#>  2     2 Politics intruded on science and intelligence. That’s why I quit ~
#>  3     3 Thirty years ago, we had a chance to save the planet. The science~
#>  4     4 FIREBAUGH, Calif. -- Many farmers probably haven't read the new r~
#>  5     5 Do you think it's possible for the next president to stop climate~
#>  6     6 "Two thirds of the United States is expected to bake under what c~
#>  7     7 "Slide show from the 2014 article \"Warming Temperatures Threaten~
#>  8     8 "Two-thirds of the United States is expected to bake under what c~
#>  9     9 "Two-thirds of the United States is expected to bake under what c~
#> 10    10 Climate change made the stifling heat that enveloped parts of Eur~

Created on 2019-08-13 by the reprex package (v0.3.0)

Article 9 is a duplicate of 8. But they are both in the source file.
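
By the way, duplicates like these can be flagged with lnt_similarity(); a sketch, assuming the workflow from the package README (the threshold value and the ID_duplicate column are taken from there):

# compare articles published on the same date and flag near-duplicates
duplicates <- lnt_similarity(LNToutput = data, threshold = 0.97)
# drop the flagged duplicates from the LNToutput object
data <- data[!data@meta$ID %in% duplicates$ID_duplicate]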

tommyxie commented 5 years ago

JBGruber, thanks for looking into it! I realized I didn't inspect the articles the right way. While I have you: in meta, the bylines in the document didn't get read into the Author column. Can you confirm?

JBGruber commented 5 years ago

I don't see a problem there either:

library(LexisNexisTools)
#> LexisNexisTools Version 0.2.3.9000

data <- lnt_read("~/Files(10).DOCX", verbose = FALSE) 
#> Reading DOCX files from Nexis Uni is experimental. Please report any problems in this issue: https://github.com/JBGruber/LexisNexisTools/issues/7

data@meta$Author
#>  [1] "Christopher Flavelle"   "Rod Schoonover"        
#>  [3] "By NATHANIEL RICH"      "By ALAN SANO"          
#>  [5] NA                       "By KENDRA PIERRE-LOUIS"
#>  [7] "THE LEARNING NETWORK"   "Kendra Pierre-Louis"   
#>  [9] "Kendra Pierre-Louis"    "Henry Fountain"

Created on 2019-08-13 by the reprex package (v0.3.0)

The 5th one is NA, but there is no author in the source file either. The only thing not working is editions (as mentioned above), but I should be able to get to that soon.

tommyxie commented 5 years ago

Hmm... I am getting all NAs on Author. Not sure why.

library(LexisNexisTools)
Warning message:
article <- lnt_read("Files(10).DOCX")
Creating LNToutput from 1 file...
Reading DOCX files from Nexis Uni is experimental. Please report any problems in this issue: https://github.com/JBGruber/LexisNexisTools/issues/7
 ...files loaded [0.06 secs]
 ...articles split [0.073 secs]
 ...lengths extracted [0.073 secs]
 ...headlines extracted [0.074 secs]
 ...newspapers extracted [0.074 secs]
 ...dates extracted [0.077 secs]
 ...authors extracted [0.078 secs]
 ...sections extracted [0.078 secs]
 ...editions extracted [0.078 secs]
 ...dates converted [0.082 secs]
 ...metadata extracted [0.084 secs]
 ...article texts extracted [0.09 secs]
 ...superfluous whitespace removed from articles [0.10 secs]
 ...superfluous whitespace removed from paragraphs [0.11 secs]
Elapsed time: 0.11 secs
article@meta$Author
 [1] NA NA NA NA NA NA NA NA NA NA

JBGruber commented 5 years ago

Since commit c495578, the issues with the parsed articles (they were returned as factors in some cases) and with the meta information for the first article in a file should be fixed.

The only thing left to do now is fixing how editions are parsed. I do not have test files that include editions so far, though.

JBGruber commented 5 years ago

I'm moving this along soon, as I would like to release a new version to CRAN with the new feature. It would be great if everyone who participated in this thread could let me know whether they have encountered any more problems since support for Nexis Uni was introduced.

mrmvergeer commented 5 years ago

Dear Johannes, I tested your updated package. It certainly does the job again. I downloaded 100 news articles and they were correctly imported and converted. One minor issue: in the Section column, sometimes a semicolon and a reference to a page number are also listed, e.g. "BUITENLAND; Blz.6", which means "foreign; page 6". Probably this is how the newspapers themselves supply their data to LN. Nothing that can't be solved with some good ol' fashioned data cleaning. Anyhow, as far as I can tell, it works perfectly for my needs. Maybe in the next days I'll do some more testing.

Best, Maurice

harrypickard commented 5 years ago

Hi,

I am having some problems loading .DOCX files since the move to Nexis Uni.

Running this code:

LNToutput <- lnt_read(x = getwd())

I get the following output and error:

Creating LNToutput from 1 file...
Reading DOCX files from Nexis Uni is experimental. Please report any problems in this issue: https://github.com/JBGruber/LexisNexisTools/issues/7
 ...files loaded [0.26 secs]
Error in articles.l[[1]] : subscript out of bounds

Any ideas?

JBGruber commented 5 years ago

Hi @harrypickard,

The error means that in your case not a single article was found in the DOCX (I check whether the first article is actually a cover page, and this test fails). I'm adding a more descriptive error message now.
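
For context, the error surfaces because an empty list is being indexed; a hypothetical reconstruction, not the actual package code:

articles.l <- list()   # no articles were detected in the file
articles.l[[1]]        # the cover-page check then fails with:
# Error in articles.l[[1]] : subscript out of bounds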

Would you mind sending me a small file which produces this error, so I can have a closer look?

Thanks

harrypickard commented 5 years ago

Hi @JBGruber,

Thanks for the reply. Attached is a test document that produces this error for me, which does have a cover page.

test.DOCX

Update: it seems that I have solved my own problem. Moving away from the .txt system threw me quite a bit. Previously, I could download all the article text without having to manually select 10 (or 50) articles at a time... I have successfully managed to replicate what I used to get with the .txt method, using the file attached below.

Files(10).DOCX

JBGruber commented 5 years ago

Yes, I was very confused by this at first too... Glad you could solve it!

juliettemchardy commented 5 years ago

Dear Johannes,

I am not sure whether the issue I'm having is pertinent to this thread, but I have been noticing it with the .docx Lexis files I'm trying to work with, so I decided to post it here.

Am getting this message:

Creating LNToutput from 3 files...
Error in open.connection(x, "rb") : cannot open the connection
In addition: Warning message:
In open.connection(x, "rb") : cannot open zip file 'Files(1-100).DOCX'

Is there something wrong with the files I'm working with or is it an issue with my use of R?

Files(1-100).DOCX Files(101-200).DOCX Files(201-236).DOCX

Any help would be very much appreciated!

Thanks,

Juliette

JBGruber commented 5 years ago

I just tested your files. There is an issue with some of the dates, which are not parsed (I'll look into that when I have time), but otherwise it seems to work fine. Are you calling the function like this?

library(LexisNexisTools)
lnt <- lnt_read("F:/Dropbox/LexisNexis_sample_files/Files.1-100.DOCX")

The only way I can reproduce your error message is if I try to read the file from a remote connection instead of a local one:

library(LexisNexisTools)
#> LexisNexisTools Version 0.2.3.9000
lnt <- lnt_read("https://github.com/JBGruber/LexisNexisTools/files/3835602/Files.1-100.DOCX")
#> Creating LNToutput from 1 file...
#> Warning in open.connection(x, "rb"): cannot open zip file 'https://
#> github.com/JBGruber/LexisNexisTools/files/3835602/Files.1-100.DOCX'
#> Error in open.connection(x, "rb"): cannot open the connection
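
If you ever do need to read a remote file, downloading it to disk first and reading the local copy should work (an untested sketch):

tmp <- file.path(tempdir(), "Files.1-100.DOCX")
download.file("https://github.com/JBGruber/LexisNexisTools/files/3835602/Files.1-100.DOCX",
              tmp, mode = "wb")  # "wb" keeps the ZIP container intact on Windows
lnt <- lnt_read(tmp)
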
juliettemchardy commented 5 years ago

Thanks, Johannes, for such a rapid and helpful response! There must be something incorrect with my setup -- I'll focus on this.

Johgerw commented 4 years ago

Hello! I am glad I found this thread, as I had some serious problems getting the LexisNexisTools package to work at all. I wasn't aware that there were such severe differences between the old LexisNexis platform and Nexis Uni (which is the only one I know). Anyway, with the update I managed to read in the files, but what is still not working for me is the command View(LNToutput@articles$Article[LNToutput@meta$Graphic]), which you posted in your tool description. Basically, I need to create a table/Excel spreadsheet from the articles I have got. (Apparently there is a function to download Excel sheets with metadata from Nexis Uni directly, but this option is not available to me...) Also, I wanted to ask how it is possible to read in several files from a folder, as it is now possible to download all articles as individual files in a ZIP folder. Sorry if these questions are basic... not an R pro, me. Thanks for your help!

JBGruber commented 4 years ago

convert to Excel/table

With View(LNToutput@articles$Article[LNToutput@meta$Graphic]), I would assume what you get is an empty view, as the problem with empty articles does not exist in the new interface (one of the few upsides).

If you want to look at the meta table/spreadsheet, use View(LNToutput@meta) or convert the data to a normal data.frame with:

df <- lnt_convert(LNToutput, to = "data.frame", what = "articles") # or what = "paragraphs"
View(df)
## or export it to Excel
rio::export(df, "LN.xlsx")

The function to export metadata directly from LexisNexis as an Excel file doesn't give you the article text itself, so it is only useful for the most basic frequency analysis (which I would discourage, as LN has so many duplicates in its database and frequencies are usually biased).

read multiple files at once

This is a core feature of LexisNexisTools. lnt_read accepts a vector of file names or a whole directory (also check out the recursive argument). So you can simply use:

LNToutput <- LexisNexisTools::lnt_read("C:/path/to/downloaded/files", file_type = c("docx", "zip"))

to search a directory for DOCX or ZIP files and read them in. I recommend downloading DOCX files.
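
If the files sit in subfolders (for example after unzipping a download with one file per article), recursive = TRUE should pick them up too; a sketch with a placeholder path:

LNToutput <- LexisNexisTools::lnt_read(
  "C:/path/to/downloaded/files",
  file_type = c("docx", "zip"),
  recursive = TRUE   # also search subdirectories
)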

Johgerw commented 4 years ago

Thanks so much for your quick reply! For some reason I can only create LNToutput from a single DOCX file (containing all the downloaded articles) and not from a folder of several DOCX files. The function above gives me the following error message:

LNToutput <- lnt_read("C:/Users/johan/Desktop/MLEJaf_word", file_type = c("docx")) Creating LNToutput from 272 files... (so R does find the relevant folder...) ...files loaded [1.10 secs] Fehler in articles.l[[length(articles.l)]] <- NULL : attempt to select less than one element in integerOneIndex

Thanks for your help and your time!

JBGruber commented 4 years ago

Never seen this error but maybe try:

LNToutput <- lnt_read("C:/Users/johan/Desktop/MLEJaf_word/", file_type = c("docx"))
mikhailacalice commented 3 years ago

Hi Johannes, this thread is very helpful and shows that others have had an issue similar to the one I am having (see error below).

Creating LNToutput from 10476 files...
Error in open.connection(x, "rb") : cannot open the connection
In addition: Warning message:
In open.connection(x, "rb") :
  cannot open zip file 'C:/Users/calic/Box/2021_PA866 paper/data/866_paper/AUS_fire_combined/~$rewth! (10).docx'

First I thought that there was an issue with the files being "zipped," so I took all restrictions off of them. Then, after reading other people's issues, I thought the problem might be that I'm trying to access the files through Box (like Dropbox), so I copied the files from Box into a local folder. When I ran it, I still got this issue:

> AUSfire_LNT <- lnt_read(x = "C:/Users/calic/Desktop/AUS_fire_combined")
Creating LNToutput from 10476 files...
Error in open.connection(x, "rb") : cannot open the connection
In addition: Warning messages:
1: In for (i in seq_len(n)) { :
  closing unused connection 3 (C:/Users/calic/Box/2021_PA866 paper/data/866_paper/AUS_fire_combined/~$rewth! (10).docx:word/document.xml)
2: In open.connection(x, "rb") :
  cannot open zip file 'C:/Users/calic/Desktop/AUS_fire_combined/~$rewth! (10).docx'

The strange thing is, I was able to do this with another folder of documents pulled from Nexis Uni that had fewer documents in it. So I tested it with a sample of the AUS_fire_combined folder, and it worked. I'm wondering if perhaps there is some capacity limit that I have hit somehow? Is there a way to run multiple batches and then combine them into a corpus? I'm just not sure what to do... let me know your thoughts!

Cheers, Mikhaila

JBGruber commented 3 years ago

Hi,

the problem is the file you are trying to open. ~$rewth! (10).docx is a Word lock file. LexisNexisTools can't open it because it is empty. Usually it should disappear after closing Word, but sometimes these files remain when, e.g., Word crashes. Let me see if I can write a quick update to ignore these files. But you should be good to go if you simply delete it (it's probably hidden as well).
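
In the meantime, a workaround would be to list the files yourself, drop anything whose name starts with ~$, and pass the vector to lnt_read (a sketch using the local folder from your error message):

# list all DOCX files, then drop Word lock files (names starting with "~$")
files <- list.files("C:/Users/calic/Desktop/AUS_fire_combined",
                    pattern = "\\.docx$", ignore.case = TRUE, full.names = TRUE)
files <- files[!startsWith(basename(files), "~$")]
LNToutput <- lnt_read(files)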

JBGruber commented 3 years ago

This should be resolved with 80a319fff2acf20e091c58212b10a7a4a323528e. Please install the new version with remotes::install_github("JBGruber/LexisNexisTools") and let me know if it works.

mikhailacalice commented 3 years ago

Ah, thank you for explaining that to me! I couldn't quite figure out how to delete that file -- it does indeed seem to be hidden. I tried installing the new version. In doing so, I was told that some packages have more recent versions available, and it was recommended that I update all of them, so I did; this is the error I received (I also tried it without updating the packages and still get the message):

package ‘quanteda’ successfully unpacked and MD5 sums checked
Error: Failed to install 'LexisNexisTools' from GitHub:
  (converted from warning) cannot remove prior installation of package ‘quanteda’

I tried uninstalling quanteda and running again and still got this message.

louislegum commented 2 years ago

Hi Johannes,

Thanks for the helpful thread! I am also trying to import downloaded DOCX files from a ZIP file using this code:

LNToutput <- LexisNexisTools::lnt_read("C:/Users/Louis Legum/OneDrive/Documents/4th_year_Lynx_reintroduction/Data_analysis/Prelim_analysis/text_dir", file_type = c("docx", "zip"))

but I am getting this error message (see below) and was wondering if you knew a solution?

Error in lnt_parse_uni(lines = lines$uni, extract_paragraphs = extract_paragraphs, :
  No articles found to parse

Thanks for the help! Louis