Open mstrimas opened 5 years ago
A related issue just happened to me when trying to read the entire Basic Dataset for Arizona via data.table. I wanted to add another example of a non-standard character breaking the fread() call:
Warning message:
In fread(file, showProgress = TRUE, skip = 5148250, nrows = 10, :
Stopped early on line 5148258. Expected 47 fields but found 46. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<URN:CornellLabOfOrnithology:EBIRD:OBS545330372 2018-08-06 16:15:54.0 19232 species Red-eyed Vireo Vireo olivaceus 1 United States US Arizona US-AZ Maricopa US-AZ-013 US-AZ_3068 33 Riparian Preserve at Gilbert Water Ranch L144858 H 33.3614502 -111.7339478 2017-10-29 12:32:00 obsr218773 S40195602 Stationary P21 EBIRD 45 1 0 0 1 1 Targeted: REVI —>>
Here's a read_lines() call to show the raw string. You can see the last tab character is missing:
> read_lines(file, skip = 5148257, n_max = 1)
[1] "URN:CornellLabOfOrnithology:EBIRD:OBS545330372\t2018-08-06 16:15:54.0\t19232\tspecies\tRed-eyed Vireo\tVireo olivaceus\t\t\t1\t\t\t\tUnited States\tUS\tArizona\tUS-AZ\tMaricopa\tUS-AZ-013\tUS-AZ_3068\t33\t\t\tRiparian Preserve at Gilbert Water Ranch\tL144858\tH\t33.3614502\t-111.7339478\t2017-10-29\t12:32:00\tobsr218773\tS40195602\tStationary\tP21\tEBIRD\t45\t\t\t1\t0\t\t0\t1\t1\t\t\tTargeted: REVI —"
Checking out the checklist record on the eBird website, the last field is supposed to continue on:
https://ebird.org/view/checklist/S40195602
Something about that long-dash is cutting off the rest of the data in the field, including the last tab character. Thankfully this was the only record with an issue in 12 million+ records, and I just fixed the line manually. I was able to read the entire data file after this.
I'm not sure if these errors need to be fixed in the database itself, since it's cutting off the data before it gets to the end user. Hopefully this issue is an appropriate venue for my problem since it's probably not auk-specific.
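For anyone wanting to locate lines like this without waiting for fread() to stop, here is a minimal sketch in base R that counts tab-separated fields per line. The helper name `count_tab_fields` and the toy data are made up for illustration. It counts tab characters rather than using strsplit(), because strsplit() drops a trailing empty field and would undercount rows that end in a tab.

```r
# Hypothetical helper: number of tab-separated fields on each line,
# computed as (number of tab characters) + 1.
count_tab_fields <- function(lines) {
  tabs <- gregexpr("\t", lines, fixed = TRUE)
  # gregexpr() returns -1 when there is no match, so count positive positions only
  n_tabs <- vapply(tabs, function(m) sum(m > 0), integer(1))
  n_tabs + 1L
}

# Toy data: a 3-field header, a good row ending in an empty field,
# and a row that has lost its final tab
lines <- c("a\tb\tc", "1\t2\t", "1\t2")
n_fields <- count_tab_fields(lines)

# Rows whose field count differs from the header row
bad <- which(n_fields != n_fields[1])
```

On a real EBD extract the lines would come from something like readLines() on a chunk of the file around the reported line number, and `bad` would then index into that chunk.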
@matthewpaulking this is valuable info, thanks! Can you let me know which version of the EBD you're using, and whether it's the full (200 GB) file or just the Arizona subset?
Sure thing! I'm using just the Arizona subset (4.4 GB), and it's from October 2018: "ebd_US-AZ_relOct-2018". Thanks for your quick reply and all your work on this package!
Looks like it's not the "–" character, but a strange character that comes after it that's causing the problem. Seems it's an "embedded nul", which is discussed in this StackOverflow question. I don't think it's something that can easily be dealt with in R, but we may be able to process the text file prior to download to avoid these problems in future. I'll need to think about this a little more. Thanks for bringing it to my attention!
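As a way to confirm the embedded nul diagnosis on a small extract, the sketch below reads a file as raw bytes and reports where any NUL (0x00) bytes sit. The demo file is fabricated to mimic the truncated comment field; nothing here touches the real EBD. It also shows that base R's readLines() has a `skipNul` argument, which may be enough for inspecting individual lines, though whether that scales to a multi-gigabyte file is untested here.

```r
# Build a small demo file with an embedded NUL byte after "REVI "
demo_file <- tempfile(fileext = ".txt")
writeBin(
  c(charToRaw("Targeted: REVI "), as.raw(0x00), charToRaw(" more text\t\n")),
  demo_file
)

# Read the whole file as raw bytes and find the positions of NUL bytes
bytes <- readBin(demo_file, what = "raw", n = file.size(demo_file))
nul_positions <- which(bytes == as.raw(0x00))

# readLines() can skip NULs while reading, so the rest of the line survives
clean <- readLines(demo_file, skipNul = TRUE, warn = FALSE)
```

Reading all bytes at once only makes sense for a small slice of the file; for the full download a chunked approach would be needed.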
Wow, I had no idea that "embedded nul" existed! This is good to know for future reference. So is this an encoding thing coming from the eBird app? I just noticed this particular record was coming from "eBird for iOS, version 1.5.149".
I sometimes run into weird encodings (not in eBird in particular, but other data) that are fixed by stringi::stri_trans_general(<string>, 'latin-ascii'). But this seems a different issue.
I am having the same problem when filtering the Dec 2018 EBD file and then reading the resulting TXT file with read_ebd(). Filtering works fine, but reading into R breaks. I get this fread warning:
Warning message: In data.table::fread(x, sep = sep, quote = "", na.strings = "", : Stopped early on line 1348203. Expected 47 fields but found 46. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<URN:CornellLabOfOrnithology:EBIRD:OBS545330372 2018-08-06 16:15:54 19232 species Red-eyed Vireo Vireo olivaceus 1 United States US Arizona US-AZ Maricopa US-AZ-013 US-AZ_3068 33 Riparian Preserve at Gilbert Water Ranch L144858 H 33.3614502 -111.7339478 2017-10-29 12:32:00 obsr218773 S40195602 Stationary P21 EBIRD 45 1 0 0 1 1 Targeted: REVI —>>
Definitely an encoding issue. The warning message suggests tweaking some fread arguments. So I tried:
dat1 <- read_ebd(f_out_ebd, reader="fread", unique=F, rollup=F, fill=T)
and got the error:
Error in read_ebd(f_out_ebd, reader = "fread", unique = F, rollup = F, : unused argument (fill = T)
Is it possible to pass fread arguments through the read_ebd() function, so we can work around the issue? Thanks for the great package!
Here is the reproducible example code, run on a Windows 10 machine with recent R and auk installs:
library(tidyverse)
library(auk)

f_ebd <- "ebd_relDec-2018.txt"
target_bbox <- c(-180, -90, 0, 90)
target_date <- c("1980-01-01", "2018-12-31")
target_species <- c("Red-eyed Vireo")

ebd_filters <- auk_ebd(f_ebd) %>%
  auk_bbox(bbox=target_bbox) %>%
  auk_date(date=target_date) %>%
  auk_species(species=target_species)

f_out_ebd <- "ebd_test.txt"
ebd_filtered <- auk_filter(ebd_filters, file=f_out_ebd, overwrite=T)

dat1 <- read_ebd(f_out_ebd, unique=F, rollup=F)
dat1 <- read_ebd(f_out_ebd, reader="fread", unique=F, rollup=F, fill=T)
I don't think it's something that can easily be dealt with in R, but we may be able to process the text file prior to download to avoid these problems in future.
Someone I work with found a way to do this with the Unix program tr. Would it be helpful for me to pass on that code? Doing this in advance sounds like a great service to eBird users.
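The tr code itself isn't shown in this thread. Assuming the goal is simply to delete NUL bytes (the usual tr idiom for that is `tr -d '\000'`), a rough base R counterpart might look like the sketch below. The function name `strip_nuls` and the chunk size are illustrative choices, not part of auk. It works in fixed-size raw chunks so the whole file never has to fit in memory.

```r
# Hypothetical helper: copy infile to outfile, dropping every NUL (0x00) byte.
# Returns (invisibly) the number of bytes removed.
strip_nuls <- function(infile, outfile, chunk_bytes = 1e6) {
  con_in <- file(infile, open = "rb")
  on.exit(close(con_in), add = TRUE)
  con_out <- file(outfile, open = "wb")
  on.exit(close(con_out), add = TRUE)

  removed <- 0
  repeat {
    chunk <- readBin(con_in, what = "raw", n = chunk_bytes)
    if (length(chunk) == 0) break          # end of file
    is_nul <- chunk == as.raw(0x00)
    removed <- removed + sum(is_nul)
    writeBin(chunk[!is_nul], con_out)      # write everything except NULs
  }
  invisible(removed)
}

# Toy usage on a fabricated file with one embedded NUL
infile <- tempfile()
outfile <- tempfile()
writeBin(c(charToRaw("REVI "), as.raw(0x00), charToRaw("x\t\n")), infile)
n_removed <- strip_nuls(infile, outfile)
```

Because it only deletes single bytes, chunk boundaries can't split anything that matters, and the cleaned file could then be passed to auk_filter() or read_ebd() as usual.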
A user found a bug in auk_filter() resulting from Turkish characters (e.g. the "İ" in İbrahim) in the EBD path.