Nonprofit-Open-Data-Collective / irs990efile

R package for building a research database from IRS 990 efiler tax returns.
https://nonprofit-open-data-collective.github.io/irs990efile/

Warning message with build_index #1

Closed: wgmmaas closed this issue 7 months ago

wgmmaas commented 7 months ago

Hi @lecy et al.,

Thanks for your work on this package. I get a warning message that I did not get before:

> index <- build_index(tax.years = 2019)

Warning message:
One or more parsing issues, call problems() on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat)

What could be the reason for the warning? And is it safe to ignore it, as I end up with the index of 523,999 observations (only two observations short of the 524,001 it should find for 2019 according to the README)? Thanks, Wim

lecy commented 7 months ago

I am not familiar with the warning, but I suspect it is from the readr package and probably related to data types.

See: https://github.com/tidyverse/readr/issues/1477

Or potentially dplyr when the disaggregated data frames are being stacked.

I suspect it's harmless: for example, integers and doubles mixing, which affects how values are represented in memory in R but would not change how the data appear once written to a CSV file.

But please let me know if you discover otherwise.

wgmmaas commented 7 months ago

Thanks Jesse, you are correct. It is a parsing problem in readr, which guesses the wrong type for the "LegalDomicileCountry" column (column 13): it expects a logical value, so country codes such as CA fail to parse (see below). As this does not affect the rest of my application, I will ignore it. Thanks.

URL <- paste0("https://nccs-efile.s3.us-east-1.amazonaws.com/index/data-commons-efile-index-", 2019, ".csv")
d <- readr::read_csv(URL, show_col_types = FALSE)
parsing_problems <- readr::problems(d)  # namespaced, since readr is not attached here
if (nrow(parsing_problems) > 0) {
  print(parsing_problems)
}

> print(parsing_problems)
# A tibble: 181 x 5
     row   col expected           actual file 
   <int> <int> <chr>              <chr>  <chr>
 1  1819    13 1/0/T/F/TRUE/FALSE CA     ""   
 2  3225    13 1/0/T/F/TRUE/FALSE NI     ""   
 3  5076    13 1/0/T/F/TRUE/FALSE CA     ""   
 4  5078    13 1/0/T/F/TRUE/FALSE CJ     ""   
 5  5502    13 1/0/T/F/TRUE/FALSE CA     ""   
 6  7666    13 1/0/T/F/TRUE/FALSE HO     ""   
 7  8408    13 1/0/T/F/TRUE/FALSE CA     ""   
 8  9305    13 1/0/T/F/TRUE/FALSE UK     ""   
 9 14025    13 1/0/T/F/TRUE/FALSE AU     ""   
10 21681    13 1/0/T/F/TRUE/FALSE BD     ""   
# i 171 more rows
# i Use `print(n = ...)` to see more rows
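
A minimal sketch of one way to avoid the warning while staying with readr: declare the column as character so that no type guessing happens for it. The toy CSV below is made up for illustration and stands in for the real index file; only the column name "LegalDomicileCountry" comes from the output above.

```r
# Toy stand-in for the index file (hypothetical values, not real data).
csv_text <- paste(
  c("EIN,LegalDomicileCountry",
    "010203040,",
    "020304050,",
    "030405060,CA"),
  collapse = "\n"
)

# Declare the country column as character; other columns are still guessed.
d <- readr::read_csv(
  I(csv_text),
  col_types = readr::cols(LegalDomicileCountry = readr::col_character()),
  show_col_types = FALSE
)

# With the type fixed up front, the country codes parse as plain strings.
parsing_problems <- readr::problems(d)
print(nrow(parsing_problems))
print(class(d$LegalDomicileCountry))
```

For the real file, the same col_types argument could be passed to the read_csv(URL, ...) call shown earlier.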

Edit: I patched to the newest version that uses data.table and I do not get the error anymore, thanks!

lecy commented 7 months ago

Ok, great. And yes, I updated the build_index() function so that all columns are loaded as strings (character vectors). Glad it worked!
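
For readers who want to see the idea in isolation, here is a rough sketch of loading every column as a string with data.table. It is an assumption that build_index() does something similar internally; the thread only states that the new version uses data.table and loads all columns as character vectors. The toy CSV is made up for illustration.

```r
# Toy stand-in for the index file (hypothetical values, not real data).
csv_text <- paste(
  c("EIN,LegalDomicileCountry",
    "010203040,",
    "020304050,CA"),
  collapse = "\n"
)

# colClasses = "character" is recycled across all columns, so nothing is guessed.
idx <- data.table::fread(text = csv_text, colClasses = "character")

# Every column is a character vector, and leading zeros in EIN are kept.
print(vapply(idx, class, character(1)))
print(idx$EIN)
```

Reading everything as strings sidesteps type guessing entirely; numeric columns then need an explicit conversion (for example with as.numeric()) before any arithmetic.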