CornellLabofOrnithology / auk

Working with eBird data in R
https://CornellLabofOrnithology.github.io/auk/
GNU General Public License v3.0

Non-English characters in paths break the AWK call #26

Open mstrimas opened 5 years ago

mstrimas commented 5 years ago

A user found a bug in auk_filter() resulting from Turkish characters (e.g. the "İ" in İbrahim) in the EBD path.
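To check whether a given path might be affected, the following is a minimal base R sketch. `has_non_ascii()` is a hypothetical helper, not part of auk, and it only flags paths containing characters outside the ASCII range; it does not fix the AWK call itself.

```r
# Hypothetical helper (not part of auk): TRUE if a path contains
# any character outside the ASCII range.
has_non_ascii <- function(path) {
  # enc2utf8() normalises the encoding so utf8ToInt() receives valid UTF-8
  any(utf8ToInt(enc2utf8(path)) > 127L)
}

# "\u0130" is the Turkish dotted capital I from the report above
has_non_ascii("C:/Users/\u0130brahim/ebd.txt")  # TRUE
has_non_ascii("C:/Users/ibrahim/ebd.txt")       # FALSE
```

If this returns TRUE, moving the EBD file to an ASCII-only path may be worth trying as a workaround, though that has not been confirmed in this thread.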

matthewpaulking commented 5 years ago

A related issue just happened to me when trying to read the entire eBird Basic Dataset for Arizona with data.table. I wanted to add another example of a non-standard character breaking the fread() call:

Warning message:
In fread(file, showProgress = TRUE, skip = 5148250, nrows = 10,  :
  Stopped early on line 5148258. Expected 47 fields but found 46. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<URN:CornellLabOfOrnithology:EBIRD:OBS545330372    2018-08-06 16:15:54.0   19232   species Red-eyed Vireo  Vireo olivaceus         1               United States   US  Arizona US-AZ   Maricopa    US-AZ-013   US-AZ_3068  33          Riparian Preserve at Gilbert Water Ranch    L144858 H   33.3614502  -111.7339478    2017-10-29  12:32:00    obsr218773  S40195602   Stationary  P21 EBIRD   45          1   0       0   1   1           Targeted: REVI —>>

Here's a read_lines() call to show the raw string. You can see the last tab character is missing:

> read_lines(file, skip = 5148257, n_max = 1)
[1] "URN:CornellLabOfOrnithology:EBIRD:OBS545330372\t2018-08-06 16:15:54.0\t19232\tspecies\tRed-eyed Vireo\tVireo olivaceus\t\t\t1\t\t\t\tUnited States\tUS\tArizona\tUS-AZ\tMaricopa\tUS-AZ-013\tUS-AZ_3068\t33\t\t\tRiparian Preserve at Gilbert Water Ranch\tL144858\tH\t33.3614502\t-111.7339478\t2017-10-29\t12:32:00\tobsr218773\tS40195602\tStationary\tP21\tEBIRD\t45\t\t\t1\t0\t\t0\t1\t1\t\t\tTargeted: REVI —"

Checking out the checklist record on the eBird website, the last field is supposed to continue on:

https://ebird.org/view/checklist/S40195602


Something about that long-dash is cutting off the rest of the data in the field, including the last tab character. Thankfully this was the only record with an issue in 12 million+ records, and I just fixed the line manually. I was able to read the entire data file after this.

I'm not sure if these errors need be fixed in the database itself since it's cutting off the data before it gets to the end user. Hopefully this issue is an appropriate venue for my problem since it's probably not auk specific.

mstrimas commented 5 years ago

@matthewpaulking this is valuable info, thanks! Can you let me know which version of the EBD you're using, and whether it's the full (200 GB) file or just the Arizona subset?

matthewpaulking commented 5 years ago

Sure thing! I'm using just the Arizona subset (4.4 GB), and it's from October 2018: "ebd_US-AZ_relOct-2018". Thanks for your quick reply and all your work on this package!

mstrimas commented 5 years ago

Looks like it's not the "—" character itself, but a strange character that comes after it that's causing the problem. It seems to be an "embedded nul", which is discussed in this StackOverflow question. I don't think it's something that can easily be dealt with in R, but we may be able to process the text file prior to download to avoid these problems in future. I'll need to think about this a little more. Thanks for bringing it to my attention!
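For anyone who wants to confirm whether their own file contains embedded nuls, here is a minimal base R sketch. `count_nul_bytes()` is a hypothetical helper, not an auk function; it reads the file as raw bytes in chunks, so it should also work on files too large to hold in memory.

```r
# Hypothetical helper: count NUL (0x00) bytes in a file, reading in chunks
count_nul_bytes <- function(path, chunk_size = 1e6) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  n_nul <- 0
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0) break  # end of file
    n_nul <- n_nul + sum(chunk == as.raw(0))
  }
  n_nul
}

# Small demonstration on a temporary file with one embedded NUL
tmp <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("a\tb\tREVI "), as.raw(0), charToRaw(" rest\n")), tmp)
count_nul_bytes(tmp)  # 1
```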

matthewpaulking commented 5 years ago

Wow, I had no idea that "embedded nul" existed! This is good to know for future reference. So is this an encoding thing coming from the eBird app? I just noticed this particular record was coming from "eBird for iOS, version 1.5.149".

I sometimes run into weird encodings (not in eBird in particular, but other data) that are fixed by stringi::stri_trans_general(<string>, 'latin-ascii'). But this seems a different issue.

tmeeha commented 5 years ago

I am having the same problem when filtering the Dec 2018 EBD file and then reading the resulting TXT file with read_ebd(). Filtering works fine, but reading into R breaks. I get an fread warning:

Warning message: In data.table::fread(x, sep = sep, quote = "", na.strings = "", : Stopped early on line 1348203. Expected 47 fields but found 46. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<URN:CornellLabOfOrnithology:EBIRD:OBS545330372 2018-08-06 16:15:54 19232 species Red-eyed Vireo Vireo olivaceus 1 United States US Arizona US-AZ Maricopa US-AZ-013 US-AZ_3068 33 Riparian Preserve at Gilbert Water Ranch L144858 H 33.3614502 -111.7339478 2017-10-29 12:32:00 obsr218773 S40195602 Stationary P21 EBIRD 45 1 0 0 1 1 Targeted: REVI —>>

Definitely an encoding issue. The warning message suggests tweaking some fread arguments. So I tried:

dat1 <- read_ebd(f_out_ebd, reader="fread", unique=F, rollup=F, fill=T)

and got the error:

Error in read_ebd(f_out_ebd, reader = "fread", unique = F, rollup = F, : unused argument (fill = T)

Is it possible to pass fread arguments through the read_ebd() function, so we can work around the issue? Thanks for the great package!
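In the meantime, one possible workaround is to bypass read_ebd() and read the filtered file with base R, whose read.delim() accepts a skipNul argument. This is only a sketch under assumptions: the demonstration file below is made up (three columns rather than the EBD's 47), and reading this way skips the cleaning that read_ebd() normally applies (the unique and rollup steps, for example).

```r
# Build a tiny tab-separated file with an embedded NUL in the first record
tmp <- tempfile(fileext = ".txt")
writeBin(
  c(charToRaw("id\tcomments\tlast\n"),
    charToRaw("OBS1\tTargeted: REVI "), as.raw(0), charToRaw(" more text\tx\n"),
    charToRaw("OBS2\tfine\ty\n")),
  tmp
)

# skipNul = TRUE tells the reader to drop NUL bytes instead of stopping at them;
# quote = "" and comment.char = "" keep free-text comment fields intact
dat <- read.delim(tmp, sep = "\t", quote = "", na.strings = "",
                  comment.char = "", stringsAsFactors = FALSE,
                  skipNul = TRUE)
nrow(dat)
dat$comments[1]
```

Base R readers are much slower than fread() on multi-gigabyte files, so this is probably better suited to a filtered extract than to the full EBD.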

tmeeha commented 5 years ago

Here is the reproducible example code, run on a Windows 10 machine with recent R and auk installs:

library(tidyverse)
library(auk)

f_ebd <- "ebd_relDec-2018.txt"
target_bbox <- c(-180, -90, 0, 90)
target_date <- c("1980-01-01", "2018-12-31")
target_species <- c("Red-eyed Vireo")

ebd_filters <- auk_ebd(f_ebd) %>%
  auk_bbox(bbox=target_bbox) %>%
  auk_date(date=target_date) %>%
  auk_species(species=target_species)

f_out_ebd <- "ebd_test.txt"
ebd_filtered <- auk_filter(ebd_filters, file=f_out_ebd, overwrite=T)

dat1 <- read_ebd(f_out_ebd, unique=F, rollup=F)
dat1 <- read_ebd(f_out_ebd, reader="fread", unique=F, rollup=F, fill=T)

tmeeha commented 5 years ago

"I don't think it's something that can easily be dealt with in R, but we may be able to process the text file prior to download to avoid these problems in future."

Someone I work with found a way to do this with the Unix program tr. Would it be helpful for me to pass on that code? Doing this in advance sounds like a great service to eBird users.
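The tr code itself is not shown in this thread. As a rough stand-in for the same idea (delete the nul bytes before reading), here is a base R sketch. `strip_nul_bytes()` is a hypothetical helper, not part of auk, and it assumes that simply dropping the nul bytes leaves each record intact.

```r
# Hypothetical helper: copy a file, dropping every NUL (0x00) byte
strip_nul_bytes <- function(path_in, path_out, chunk_size = 1e6) {
  con_in <- file(path_in, open = "rb")
  on.exit(close(con_in), add = TRUE)
  con_out <- file(path_out, open = "wb")
  on.exit(close(con_out), add = TRUE)
  repeat {
    chunk <- readBin(con_in, what = "raw", n = chunk_size)
    if (length(chunk) == 0) break  # end of file
    writeBin(chunk[chunk != as.raw(0)], con_out)
  }
  invisible(path_out)
}

# Demonstration on a temporary file with one embedded NUL
dirty <- tempfile(fileext = ".txt")
clean <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("REVI "), as.raw(0), charToRaw("tail\tend\n")), dirty)
strip_nul_bytes(dirty, clean)
readLines(clean)
```

The cleaned copy can then be passed to the usual reader. Writing to a separate output file leaves the original download untouched.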