geanders / noaastormevents

explore noaa storm database

Error with find_events #16

Closed theresekon closed 4 years ago

theresekon commented 4 years ago

When I try to run the following code, I get an error message that reads: "Error in find_file_name(year = year, file_type = file_type) : No file found for that year and / or file type."

#install.packages("noaastormevents")

library(drat)
addRepo("geanders")
#install.packages("hurricaneexposuredata")

library(noaastormevents)
find_events(date_range = c("1999-09-14", "1999-09-18"))

geanders commented 4 years ago

I just tested on my computer and get the same error. It looks like the error is cropping up in the find_file_name function, so that'll be the next place for us to investigate.

geanders commented 4 years ago

Here's the code for how we currently put together the file names to query:

file_name <- find_file_name(year = year, file_type = file_type)
  path_name <- paste0("https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/",
                      "csvfiles/",file_name)

It would be worth looking at the NOAA storm events page to check whether there's been a change in how they name their files. @theresekon, if you get the chance, could you look at https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/ and see if you can outline the patterns they're using to name their files? In particular, it would be helpful to know the names of the files for events in 1999, since that applies directly to the example code we're trying to run here.

geanders commented 4 years ago

Here's our current code for finding the file name (in https://github.com/geanders/noaastormevents/blob/master/R/input.R):

find_file_name <- function(year = NULL, file_type = "details") {
  url <- paste0("http://www1.ncdc.noaa.gov/pub/data/swdi/",
                "stormevents/csvfiles/")
  page <- htmltab::htmltab(doc = url, which = 1, rm_nodata_cols = FALSE)
  all_file_names <- page$Name
  file_year <- paste0("_d",year,"_")
  file_name <- grep(file_type, grep(file_year, all_file_names, value = TRUE),
                    value = TRUE)
  if(length(file_name) == 0){
    stop("No file found for that year and / or file type.")
  }
  return(file_name)
}

We are searching for the pattern "_d[year]_" in a file name (e.g., "_d1999_" for a file of events in 1999). If they've changed the rules they use to name their files, that might have broken our code here. Another possibility is that there's a higher-level change in the webpage that's preventing us from reading in the file names in the first place, but let's start by checking how they name their files and see if that's the issue. If so, it might be an easy fix.
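As a quick illustration of that pattern search (the file names below are made up, following the naming convention described in this thread):

```r
# Toy demonstration of the nested grep in find_file_name.
# These file names are invented, mirroring the NOAA convention.
all_file_names <- c(
  "StormEvents_details-ftp_v1.0_d1998_c20170717.csv.gz",
  "StormEvents_details-ftp_v1.0_d1999_c20200518.csv.gz",
  "StormEvents_fatalities-ftp_v1.0_d1999_c20200518.csv.gz"
)

file_year <- paste0("_d", 1999, "_")           # "_d1999_"

# First grep: all files for that year (both 1999 files here)
grep(file_year, all_file_names, value = TRUE)

# Second grep: narrow to the requested file type
grep("details", grep(file_year, all_file_names, value = TRUE),
     value = TRUE)
# -> "StormEvents_details-ftp_v1.0_d1999_c20200518.csv.gz"
```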

theresekon commented 4 years ago

It appears that they are still naming their files with the "_d[year]_" pattern. This is the name of the 1999 csv file from that webpage:

StormEvents_details-ftp_v1.0_d1999_c20200518.csv.gz

geanders commented 4 years ago

Okay, great. I think it's something, then, with:

page <- htmltab::htmltab(doc = url, which = 1, rm_nodata_cols = FALSE)
geanders commented 4 years ago

Try this next:

# install.packages("htmltab")
library(htmltab)

url <- paste0("http://www1.ncdc.noaa.gov/pub/data/swdi/",
                "stormevents/csvfiles/")
page <- htmltab::htmltab(doc = url, which = 1, rm_nodata_cols = FALSE)
# ?htmltab

In particular, what's in the page object that we end up with? Does it include an element named "Name"?

Here's some code for checking:

page # Prints out the whole object
str(page) # Lists all the elements in the object and the first few values of each
class(page) # Tells us what type of object we've got
theresekon commented 4 years ago

The page object does include an element named "Name" that the file names are under. Running class(page) tells us that the object is a "data.frame". The rm_nodata_cols argument asks whether columns that have no alphanumeric data should be removed; the default is TRUE.

geanders commented 4 years ago

What do each of the objects look like if you then run:

year <- 1999
all_file_names <- page$Name
file_year <- paste0("_d",year,"_")
file_name <- grep(file_type, grep(file_year, all_file_names, value = TRUE),
                    value = TRUE)
theresekon commented 4 years ago

When I run

file_name <- grep(file_type, grep(file_year, all_file_names, value = TRUE),
                  value = TRUE)

I get an error message saying "Error in grep(file_type, grep(file_year, all_file_names, value = TRUE), : object 'file_type' not found".

geanders commented 4 years ago

Ohp, right! Add this line first:

file_type <- "details"
theresekon commented 4 years ago

Thank you! The year object produces [1] 1999, all_file_names produces a list of the names of all the csv files, file_year produces "_d1999_", and file_name produces character(0).

Does this answer the question you were asking?

geanders commented 4 years ago

Yes! This is exactly what I was looking for.

Hmmm. Okay, so it looks like we're grabbing all the file names without a problem, but then it's not finding the file with "_d1999_" in it when we search for that pattern, which is why the file_name object ends up being empty at the end (character(0)).

Could you paste in all the csv file names that are in all_file_names here? We should make sure that there really is a file with "_d1999_" in that list (there should be, since you confirmed that they're still using that naming convention and we seem to be getting all the file names into R, but it would help to confirm). If so, then the issue comes down to this line of the code:

file_name <- grep(file_type, grep(file_year, all_file_names, value = TRUE),
                    value = TRUE)

This is the step where we ask R to look through all the file names for that pattern of "_d1999_" and then return the file name that has it. We've actually got one grep inside of another here, so we're looking first for the year pattern and then for the file type ("details" in this case). We could try to see which step of this is the problem by separating the two steps and looking at the output with each part:

file_name1 <- grep(file_year, all_file_names, value = TRUE)
file_name2 <- grep(file_type, file_name1, value = TRUE)

Could you try that and see what file_name1 and file_name2 look like?
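As a side note, one failure mode consistent with ending up with character(0) would be the file names coming in truncated somewhere upstream. A toy run (made-up values) shows how that would play out:

```r
# If we somehow got truncated display text instead of full file names,
# the year pattern could never match:
all_file_names <- c("Parent Directory",
                    "StormEvents_details-..>",
                    "StormEvents_fataliti..>")
file_year <- "_d1999_"
file_type <- "details"

file_name1 <- grep(file_year, all_file_names, value = TRUE)
file_name1  # character(0): "_d1999_" isn't in the truncated text
file_name2 <- grep(file_type, file_name1, value = TRUE)
file_name2  # character(0) as well, since file_name1 is already empty
```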

theresekon commented 4 years ago

Here is a list of all of the csv file names in all_file_names. I'm not sure how to get it to show the full file names, as they are somewhat long. Printing file_year just shows "_d1999_".

Also file_name1 and file_name2 both produce character(0).

  [1] "Parent Directory"        "Storm-Data-Export-Fo..>" "Storm-Data-Export-Fo..>"
  [4] "StormEvents_details-..>" "StormEvents_details-..>" "StormEvents_details-..>"
  [7] "StormEvents_details-..>" "StormEvents_details-..>" "StormEvents_details-..>"
 [10] "StormEvents_details-..>" "StormEvents_details-..>" "StormEvents_details-..>"
 [13] "StormEvents_details-..>" "StormEvents_details-..>" "StormEvents_details-..>"
 [16] "StormEvents_details-..>" "StormEvents_details-..>" "StormEvents_details-..>"
 [19] "StormEvents_details-..>" "StormEvents_details-..>" "StormEvents_details-..>"
 [22] "StormEvents_details-..>" "StormEvents_details-..>" "StormEvents_details-..>"
 [25] "StormEvents_details-..>" "StormEvents_details-..>" "StormEvents_details-..>"
 [28] "StormEvents_details-..>" "StormEvents_details-..>" "StormEvents_details-..>"
 [31] "StormEvents_details-..>" "StormEvents_details-..>" "StormEvents_details-..>"
 [34] "StormEvents_details-..>" "StormEvents_details-..>" "StormEvents_details-..>"
 [37] "StormEvents_details-..>" "StormEvents_details-..>" "StormEvents_details-..>"
 [40] "StormEvents_details-..>" "StormEvents_details-..>" "StormEvents_details-..>"
 [43] "StormEvents_details-..>" "StormEvents_details-..>" "StormEvents_details-..>"
 [46] "StormEvents_details-..>" "StormEvents_details-..>" "StormEvents_details-..>"
 [49] "StormEvents_details-..>" "StormEvents_details-..>" "StormEvents_details-..>"
 [52] "StormEvents_details-..>" "StormEvents_details-..>" "StormEvents_details-..>"
 [55] "StormEvents_details-..>" "StormEvents_details-..>" "StormEvents_details-..>"
 [58] "StormEvents_details-..>" "StormEvents_details-..>" "StormEvents_details-..>"
 [61] "StormEvents_details-..>" "StormEvents_details-..>" "StormEvents_details-..>"
 [64] "StormEvents_details-..>" "StormEvents_details-..>" "StormEvents_details-..>"
 [67] "StormEvents_details-..>" "StormEvents_details-..>" "StormEvents_details-..>"
 [70] "StormEvents_details-..>" "StormEvents_details-..>" "StormEvents_details-..>"
 [73] "StormEvents_details-..>" "StormEvents_details-..>" "StormEvents_fataliti..>"
 [76] "StormEvents_fataliti..>" "StormEvents_fataliti..>" "StormEvents_fataliti..>"
 [79] "StormEvents_fataliti..>" "StormEvents_fataliti..>" "StormEvents_fataliti..>"
 [82] "StormEvents_fataliti..>" "StormEvents_fataliti..>" "StormEvents_fataliti..>"
 [85] "StormEvents_fataliti..>" "StormEvents_fataliti..>" "StormEvents_fataliti..>"
 [88] "StormEvents_fataliti..>" "StormEvents_fataliti..>" "StormEvents_fataliti..>"
 [91] "StormEvents_fataliti..>" "StormEvents_fataliti..>" "StormEvents_fataliti..>"
 [94] "StormEvents_fataliti..>" "StormEvents_fataliti..>" "StormEvents_fataliti..>"
 [97] "StormEvents_fataliti..>" "StormEvents_fataliti..>" "StormEvents_fataliti..>"
[100] "StormEvents_fataliti..>" "StormEvents_fataliti..>" "StormEvents_fataliti..>"
[103] "StormEvents_fataliti..>" "StormEvents_fataliti..>" "StormEvents_fataliti..>"
[106] "StormEvents_fataliti..>" "StormEvents_fataliti..>" "StormEvents_fataliti..>"
[109] "StormEvents_fataliti..>" "StormEvents_fataliti..>" "StormEvents_fataliti..>"
[112] "StormEvents_fataliti..>" "StormEvents_fataliti..>" "StormEvents_fataliti..>"
[115] "StormEvents_fataliti..>" "StormEvents_fataliti..>" "StormEvents_fataliti..>"
[118] "StormEvents_fataliti..>" "StormEvents_fataliti..>" "StormEvents_fataliti..>"
[121] "StormEvents_fataliti..>" "StormEvents_fataliti..>" "StormEvents_fataliti..>"
[124] "StormEvents_fataliti..>" "StormEvents_fataliti..>" "StormEvents_fataliti..>"
[127] "StormEvents_fataliti..>" "StormEvents_fataliti..>" "StormEvents_fataliti..>"
[130] "StormEvents_fataliti..>" "StormEvents_fataliti..>" "StormEvents_fataliti..>"
[133] "StormEvents_fataliti..>" "StormEvents_fataliti..>" "StormEvents_fataliti..>"
[136] "StormEvents_fataliti..>" "StormEvents_fataliti..>" "StormEvents_fataliti..>"
[139] "StormEvents_fataliti..>" "StormEvents_fataliti..>" "StormEvents_fataliti..>"
[142] "StormEvents_fataliti..>" "StormEvents_fataliti..>" "StormEvents_fataliti..>"
[145] "StormEvents_fataliti..>" "StormEvents_location..>" "StormEvents_location..>"
[148] "StormEvents_location..>" "StormEvents_location..>" "StormEvents_location..>"
[151] "StormEvents_location..>" "StormEvents_location..>" "StormEvents_location..>"
[154] "StormEvents_location..>" "StormEvents_location..>" "StormEvents_location..>"
[157] "StormEvents_location..>" "StormEvents_location..>" "StormEvents_location..>"
[160] "StormEvents_location..>" "StormEvents_location..>" "StormEvents_location..>"
[163] "StormEvents_location..>" "StormEvents_location..>" "StormEvents_location..>"
[166] "StormEvents_location..>" "StormEvents_location..>" "StormEvents_location..>"
[169] "StormEvents_location..>" "StormEvents_location..>" "StormEvents_location..>"
[172] "StormEvents_location..>" "StormEvents_location..>" "StormEvents_location..>"
[175] "StormEvents_location..>" "StormEvents_location..>" "StormEvents_location..>"
[178] "StormEvents_location..>" "StormEvents_location..>" "StormEvents_location..>"
[181] "StormEvents_location..>" "StormEvents_location..>" "StormEvents_location..>"
[184] "StormEvents_location..>" "StormEvents_location..>" "StormEvents_location..>"
[187] "StormEvents_location..>" "StormEvents_location..>" "StormEvents_location..>"
[190] "StormEvents_location..>" "StormEvents_location..>" "StormEvents_location..>"
[193] "StormEvents_location..>" "StormEvents_location..>" "StormEvents_location..>"
[196] "StormEvents_location..>" "StormEvents_location..>" "StormEvents_location..>"
[199] "StormEvents_location..>" "StormEvents_location..>" "StormEvents_location..>"
[202] "StormEvents_location..>" "StormEvents_location..>" "StormEvents_location..>"
[205] "StormEvents_location..>" "StormEvents_location..>" "StormEvents_location..>"
[208] "StormEvents_location..>" "StormEvents_location..>" "StormEvents_location..>"
[211] "StormEvents_location..>" "StormEvents_location..>" "StormEvents_location..>"
[214] "StormEvents_location..>" "StormEvents_location..>" "StormEvents_location..>"
[217] "legacy/"                 "ugc_areas.csv"

geanders commented 4 years ago

Ok, this is helpful.

I think we want to check whether this abbreviated form of the file names ("StormEvents_location..>") is just how R is printing them out for us, or really all that we're getting from the website when we pull the file names.

What do we get with the call:

all_file_names[2]

Does it look like there's a longer value than "StormEvents_location..>" when we print out just one value?

Also, let's take a closer look at the full dataframe. Could you put in the output from:

head(page)
theresekon commented 4 years ago

all_file_names[2] produces just [1] "Storm-Data-Export-Fo..>" so it looks like that might be all we are getting from the website.

head(page) produces

  V1                    Name    Last modified Size Description
3  \        Parent Directory                \    -           \
4  \ Storm-Data-Export-Fo..> 2014-05-06 21:49  23K           \
5  \ Storm-Data-Export-Fo..> 2019-02-25 13:16 131K           \
6  \ StormEvents_details-..> 2017-01-20 11:01  10K           \
7  \ StormEvents_details-..> 2016-02-24 10:07  12K           \
8  \ StormEvents_details-..> 2017-06-19 09:20  12K           \

geanders commented 4 years ago

Okay, yes, I definitely think that could be causing the problem.

It looks like, if you look at the page's source code, they are including the filename in a link that goes with the table entry, for example:

<a href="StormEvents_details-ftp_v1.0_d1988_c20170717.csv.gz">

The <a href=...> part is making the table entry a clickable link, and you can see that the full filename we want to query is in that link target, rather than in the text that gets displayed (and that we're reading in).

It looks like that all goes inside that cell of the table, for example:

<td><a href="StormEvents_details-ftp_v1.0_d1988_c20170717.csv.gz">StormEvents_details-..&gt;</a></td>

Here, the <td> ... </td> are the markers for the start and end of the contents of a cell in the table. So, if we can grab this info when we read in the HTML table, we can pull this part out.

If we need to, we can go the route of using something called regular expressions to pull this out from the original source code of the webpage. However, that might be a bit of a pain. It would be easier if we can pull this link in as part of the table, because then R puts it in a nice dataframe for us.
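For reference, the regular-expression route might look something like this. It's a rough sketch: the pattern assumes the file links look like the href example above, and html_source here is just a stand-in string rather than the real page source we'd fetch:

```r
# Rough sketch: pull href targets out of raw HTML with a regular expression.
# html_source stands in for the page source we'd fetch from the NOAA site.
html_source <- paste0(
  '<td><a href="StormEvents_details-ftp_v1.0_d1988_c20170717.csv.gz">',
  'StormEvents_details-..&gt;</a></td>')

# Find every href="..." chunk, then strip the surrounding syntax
matches <- regmatches(html_source,
                      gregexpr('href="([^"]+)"', html_source))[[1]]
file_links <- sub('^href="', '', sub('"$', '', matches))
file_links  # "StormEvents_details-ftp_v1.0_d1988_c20170717.csv.gz"
```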

Could you check the helpfile for htmltab (once you've loaded the package with library(htmltab), you can open the helpfile for this function with ?htmltab) and see if you see any promising leads for pulling in link information (i.e., in that <a href ... > tag in HTML) when we read in a table? Maybe they'll have an option we can use to do this pretty easily.

theresekon commented 4 years ago

I found these options in the helpfile that seem like they might help.

headerSep a character vector that is used as a separator in the construction of the table's variable names (default ' >> ')

body a vector that specifies which table rows should be used as body information. A numeric vector can be specified where each element corresponds to a table row. A character vector may be specified that describes an XPath for the body rows. If left unspecified, htmltab tries to use semantic information from the HTML code

rm_escape a character vector that, if specified, is used to replace escape sequences in header and body cells (default ' ')

geanders commented 4 years ago

These are good leads, but I'm not sure any will get us what we want. It sounds like headerSep might just be for the column names of the table, so I don't think that would get us there. It looks like body lets you specify which rows to get, so you could get a subset, but here we want all of the rows, just to process the information in them differently.

I just took a look at the helpfile, too, and it looks like the bodyFun option might let us change how the table cells are processed. I'm copying in the example from the helpfile:

doc <- "http://en.wikipedia.org/wiki/Usage_share_of_web_browsers"
xp3 <- "//table[7]"
bFun <- function(node) {
  x <- XML::xmlValue(node)
  gsub('%$', '', x)
}

htmltab(doc = doc, which = xp3, bodyFun = bFun)
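In the same spirit, a bodyFun might be able to return the link target instead of the cell text. This is an untested sketch: it assumes each cell holds at most one <a> node and that htmltab hands the cell node to bodyFun, which we'd want to verify against the helpfile before relying on it:

```r
# Speculative sketch: a bodyFun that prefers a cell's link target
# over its display text (assumes one <a> node per cell at most).
bFun_href <- function(node) {
  links <- XML::getNodeSet(node, ".//a")
  if (length(links) > 0) {
    XML::xmlGetAttr(links[[1]], "href")  # the full filename lives here
  } else {
    XML::xmlValue(node)                  # fall back to the cell text
  }
}

# Hypothetical usage, if htmltab passes cell nodes through bodyFun:
# htmltab(doc = url, which = 1, bodyFun = bFun_href, rm_nodata_cols = FALSE)
```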

We also might want to check some other packages / functions for reading in HTML tables, especially in terms of looking for solutions for grabbing the information from links in the table entries. This post on Stackoverflow looked promising, although I just skimmed it so far.

If we need to go the route of reading the source and then cleaning it up with regular expressions, the RCurl package will be helpful:

# install.packages("RCurl")
library(RCurl)
storm_events_source <- getURL(fileURL)
geanders commented 4 years ago

Bingo! I think I've got a lead.

Try out:

# install.packages("RCurl")
# install.packages("XML")

library(RCurl)
library(XML)

storm_events_source <- getURL(fileURL)
all_file_names <- getHTMLLinks(storm_events_source)

Then you can try the next part of the function, with this list of file names:

file_year <- paste0("_d",year,"_")
file_name <- grep(file_type, grep(file_year, all_file_names, value = TRUE),
                    value = TRUE)

It looked like this will work on my computer. Make sure it works on yours, too, and I can show you how to update the function code for the package during our meeting today and push it to GitHub.
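Putting the pieces together, the patched helper might split the fetch from the filtering so the filtering step is easy to check on any character vector. This is a sketch of one way to organize it, not the final package code:

```r
# install.packages("RCurl"); install.packages("XML")
library(RCurl)
library(XML)

# Pure filtering step, pulled out so it can be tested without a download
filter_file_names <- function(all_file_names, year, file_type = "details") {
  file_year <- paste0("_d", year, "_")
  file_name <- grep(file_type,
                    grep(file_year, all_file_names, value = TRUE),
                    value = TRUE)
  if (length(file_name) == 0) {
    stop("No file found for that year and / or file type.")
  }
  file_name
}

# Fetch step: read link targets (the full filenames) rather than cell text
find_file_name <- function(year = NULL, file_type = "details") {
  file_url <- paste0("https://www1.ncdc.noaa.gov/pub/data/swdi/",
                     "stormevents/csvfiles/")
  storm_events_source <- getURL(file_url)
  all_file_names <- getHTMLLinks(storm_events_source)
  filter_file_names(all_file_names, year = year, file_type = file_type)
}

# The filtering step on the 1999 file name reported earlier in the thread:
filter_file_names("StormEvents_details-ftp_v1.0_d1999_c20200518.csv.gz",
                  year = 1999)
# -> "StormEvents_details-ftp_v1.0_d1999_c20200518.csv.gz"
```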

geanders commented 4 years ago

To prep for that, if you get a chance, try to read through these two chapters of a really nice book on R Packages:

Don't worry if it's not all clear yet. At this point, this will just help give you a general idea of how package code is put together and how to load updated code into a function while you're working on a package on your own computer.

theresekon commented 4 years ago

I think something is not working when I try to run this new code. Did you name something new with fileURL?

Also, yesterday I installed Xcode and updated/restarted my Mac, and now my Git tab in RStudio is missing. Do you know how to troubleshoot this?

geanders commented 4 years ago

Sorry, yes. I put the web address for the NOAA Storm Events file list in fileURL. Add this line before running the others:

fileURL <- "https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/"

For the Git tab question, I would recommend just closing and reopening RStudio to start with. If that doesn't work, we can take a look during our call.

theresekon commented 4 years ago

Okay, thank you! The new code seems to be working on my computer too.

geanders commented 4 years ago

Awesome! I'll walk you through making the fix on your computer during our meeting.