CornellLabofOrnithology / ebird-best-practices

Best Practices for Using eBird Data
https://CornellLabOfOrnithology.github.io/ebird-best-practices/

Error in zero filling #26

Open Niharika-M opened 1 year ago

Niharika-M commented 1 year ago

Hi

I am using the June 2023 eBird data for India. I am following all the steps in the best practices guide and get the following error on R version 4.3.1:

# zero-filling
ebd_zf <- auk_zerofill(f_ebd, f_sampling, collapse = TRUE)
Error: cannot allocate vector of size 1.2 Gb

I also tried individual state data, such as CT, which has a smaller file size of about 270 MB, yet the error persists. Is there a way to avoid this error?

I have a Windows laptop with 16 GB of RAM.

Thanks

mstrimas commented 1 year ago

I would try zero-filling each species individually in a for loop
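
As a rough illustration of that suggestion, here is a minimal sketch of a per-species loop. The input paths, the output folder `zf`, the two example species, and the helper names `species_paths()` and `zerofill_one_species()` are assumptions made for illustration and do not come from this thread. Each species is filtered and zero-filled on its own, and the result is written to disk so that only one species is held in memory at a time.

```r
# Build per-species output paths (base R only; `out_dir` is a hypothetical folder)
species_paths <- function(sp, out_dir = "zf") {
  tag <- gsub("[^A-Za-z0-9]+", "_", sp)
  list(
    ebd      = file.path(out_dir, paste0("ebd_", tag, ".txt")),
    sampling = file.path(out_dir, paste0("sampling_", tag, ".txt")),
    rds      = file.path(out_dir, paste0("zf_", tag, ".rds"))
  )
}

# Filter and zero-fill a single species, then save the result to disk.
# Hypothetical helper; requires the auk package and a working AWK install.
zerofill_one_species <- function(sp, f_ebd_in, f_smp_in, out_dir = "zf") {
  p <- species_paths(sp, out_dir)
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  filters <- auk::auk_ebd(f_ebd_in, file_sampling = f_smp_in)
  filters <- auk::auk_species(filters, sp)
  filters <- auk::auk_complete(filters)
  auk::auk_filter(filters, file = p$ebd, file_sampling = p$sampling,
                  overwrite = TRUE)
  zf <- auk::auk_zerofill(p$ebd, p$sampling, collapse = TRUE)
  saveRDS(zf, p$rds)
  invisible(p$rds)
}

# Assumed file names and species: replace with your own
f_ebd_in <- "ebd_IN_smp_relJun-2023.txt"
f_smp_in <- "ebd_IN_smp_relJun-2023_sampling.txt"
species  <- c("Pavo cristatus", "Corvus splendens")

# Only run when the input files are actually present
if (file.exists(f_ebd_in) && file.exists(f_smp_in)) {
  for (sp in species) {
    zerofill_one_species(sp, f_ebd_in, f_smp_in)
    gc()  # release memory before moving on to the next species
  }
}
```

This version re-filters the sampling file for every species, which is simple but slow on a large download; it trades run time for a smaller memory footprint.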

Niharika-M commented 1 year ago

Hi

To create a for loop, I first need to get the list of unique scientific names from the file. I am using the following code and it's giving me an error:

ebd_all = read.table("D:/eBird/ebd_IN_smp_relJun-2023.txt")
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 2 did not have 97 elements

I then thought of using the ebird_taxonomy list to get the species list, which resulted in another error:

ebd_species = unique(ebird_taxonomy$scientific_name)
ebd_filters <- ebd %>%
  # (select individual species from the list)
  auk_species(ebd_species)
Error in auk_species.auk_ebd(., ebd_species) : Cannot extract taxa identified below species. Remove the following taxa or replace with species: Rhea pennata tarapacensis/garleppi, Rhea pennata pennata, Nothocercus bonapartei [bonapartei Group], Nothocercus bonapartei frantzii,
(this list goes on covering all the species)

Is there a way to create a loop without accessing the original file?

Thanks

mstrimas commented 1 year ago

How large is the file you're working with? If it's possible to read the whole thing into R, that's probably the easiest route. You could use read_ebd("D:/eBird/ebd_IN_smp_relJun-2023.txt", unique = FALSE, rollup = FALSE). I use unique = FALSE, rollup = FALSE because these operations are quite slow on large datasets; you will need to apply them later, with auk_unique() and auk_rollup(), when you're looping through species.

If the file is large and you want to only read in one column, you can do so with:

library(auk)
library(readr)
library(dplyr)

species <- read_tsv("D:/eBird/ebd_IN_smp_relJun-2023.txt", 
                    col_select = "SCIENTIFIC NAME") %>% 
  distinct(scientific_name = `SCIENTIFIC NAME`) %>% 
  inner_join(ebird_taxonomy, by = "scientific_name") %>% 
  filter(category == "species") %>% 
  select(species_code, scientific_name, common_name)

Let me know if either of those solutions works.

Niharika-M commented 1 year ago

Hi

Thanks for the solutions. I made a few modifications to suit my requirements and it worked to get the species list. However, my initial issue remains unresolved. I am now facing the following error:

if (!file.exists(f_ebd)) {
  auk_filter(ebd_filters, file = f_ebd, file_sampling = f_sampling)
}
Error in auk_filter.auk_ebd(ebd_filters, file = f_ebd, file_sampling = f_sampling) : Error running AWK command.

# importing and zero-filling
ebd_zf <- auk_zerofill(f_ebd, f_sampling, collapse = TRUE)
Error: length(readLines(x, 2)) not greater than 1

A friend of mine ran the code on a Mac attached to a server and it worked, but he got two files instead of the one mentioned on the website. My Windows system has Cygwin installed, yet it shows the above error. It would be convenient to use the output (the larger file) from my friend, but I am hoping to do it on my own device.

I am also trying the steps at https://ebird.github.io/ebird-best-practices/ebird.html and this too results in a memory error, saying cannot allocate vector of size n Mb.

(Initially tried https://cornelllabofornithology.github.io/ebird-best-practices/ebird.html)

Another attempt was to use state-wise data. This resulted in an error saying that the EBD and sampling data were different. The data used were the original downloads from the website, with no modifications.

Cheers!

mstrimas commented 1 year ago

It's tricky for me to troubleshoot these issues since I don't have a Windows computer and don't have access to your data and code. I can say auk_filter() is supposed to produce 2 files, one for the EBD (observations) and one for the sampling event data (checklists). To troubleshoot the other issues I'd need your exact code and data uploaded somewhere I could access it. I suggest trying to reproduce the same problem with a smaller example dataset.
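
One way to build such a small reproducible example is sketched below, assuming the example files `zerofill-ex_ebd.txt` and `zerofill-ex_sampling.txt` bundled in auk's `extdata` folder and the species "Collared Kingfisher" from the package's own documentation. The function name `run_small_example()` is hypothetical. It first checks whether auk can find AWK, then filters the example data and confirms that two files are written before zero-filling.

```r
# Hypothetical helper: a small, shareable test of the filter + zero-fill steps.
# Returns NULL when auk, AWK, or the bundled example files are unavailable.
run_small_example <- function() {
  if (!requireNamespace("auk", quietly = TRUE)) return(NULL)

  # Path to AWK as detected by auk; NA means AWK was not found
  awk <- auk::auk_get_awk_path()
  if (is.na(awk)) return(NULL)

  # Example files assumed to ship with auk
  f_ebd <- system.file("extdata/zerofill-ex_ebd.txt", package = "auk")
  f_smp <- system.file("extdata/zerofill-ex_sampling.txt", package = "auk")
  if (!nzchar(f_ebd) || !nzchar(f_smp)) return(NULL)

  out_ebd <- tempfile(fileext = ".txt")
  out_smp <- tempfile(fileext = ".txt")

  # Filtering writes two files: observations and checklists
  filters <- auk::auk_ebd(f_ebd, file_sampling = f_smp) |>
    auk::auk_species("Collared Kingfisher") |>
    auk::auk_complete()
  auk::auk_filter(filters, file = out_ebd, file_sampling = out_smp)

  zf <- auk::auk_zerofill(out_ebd, out_smp, collapse = TRUE)
  list(
    awk           = awk,
    files_written = file.exists(c(out_ebd, out_smp)),
    n_rows        = nrow(zf)
  )
}

res <- run_small_example()
print(res)
```

If this runs cleanly but the full dataset fails, the problem is more likely related to file size or memory than to the AWK setup; if it fails at `auk_filter()`, the AWK installation is the first thing to check.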

Niharika-M commented 1 year ago

Hi

Thanks for pointing out that auk_filter() produces 2 files, unlike what's mentioned on the website. This makes me confident in using the output generated on my friend's machine, which solves my current issue.

Cheers!