CornellLabofOrnithology / auk

Working with eBird data in R
https://CornellLabofOrnithology.github.io/auk/
GNU General Public License v3.0
auk_split doesn't work #81

Open auman-chan opened 3 months ago

auman-chan commented 3 months ago

I ran into a problem where auk_split doesn't work: the function only exports files without any rows. It only works if I split by species before filtering.

Is there any solution or suggestion? Thanks!

mstrimas commented 3 months ago

Can you provide the output of sessionInfo() and the code that generated the above output?

auman-chan commented 3 months ago

Here is the code I used. I ran it on the sample EBD file from the EBD download website.

library(auk)
library(dplyr)

# list.files("test")
ebd_file <- "test/ebd_US-AL-101_202204_202204_relApr-2022.txt"
ebd_out <- "test/output.txt"
prefix_spe <- "test/spe_"
prefix_spe2 <- "test/spe2_"
ebd_in <- auk_ebd(file = ebd_file)

data <- ebd_in %>%
  auk_complete() %>%
  auk_year(c(2012, 2022)) %>%
  auk_duration(duration = c(0, 300)) %>%
  auk_distance(distance = c(0, 5)) %>%
  auk_protocol(protocol = c("Stationary", "Area"))

df <- auk_filter(data, file = ebd_out, overwrite = TRUE) %>%
  read_ebd()

splist <- unique(df$scientific_name)[1:5]
splist
[1] "Cardinalis cardinalis"    "Mimus polyglottos"        "Poecile carolinensis"
[4] "Sitta pusilla"            "Thryothorus ludovicianus"

spe_split <- auk_split(file = ebd_out,
                       species = splist,
                       prefix = prefix_spe,
                       overwrite = TRUE)

list.files("test")
 [1] "ebd_US-AL-101_202204_202204_relApr-2022.txt" "output.txt"
 [3] "spe_Cardinalis_cardinalis.txt"               "spe_Mimus_polyglottos.txt"
 [5] "spe_Poecile_carolinensis.txt"                "spe_Sitta_pusilla.txt"
 [7] "spe_Thryothorus_ludovicianus.txt"            "spe2_Cardinalis_cardinalis.txt"
 [9] "spe2_Mimus_polyglottos.txt"                  "spe2_Poecile_carolinensis.txt"
[11] "spe2_Sitta_pusilla.txt"                      "spe2_Thryothorus_ludovicianus.txt"
[13] "test.R"
file.size("test/spe_Mimus_polyglottos.txt")
[1] 693

A split file of only 693 bytes means it contains nothing but the column names.
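
File size is only an indirect signal, so a line count may be a more direct way to confirm that a split file holds just the header. The following is a minimal sketch; is_header_only() is a hypothetical helper (not part of auk), and the demo uses small temporary files rather than real EBD output.

```r
# Hypothetical helper: TRUE when a text file holds at most a header line,
# i.e. no data rows at all.
is_header_only <- function(path) {
  # Reading at most two lines is enough to tell whether any data row exists
  n_lines <- length(readLines(path, n = 2, warn = FALSE))
  n_lines <= 1
}

# Self-contained demo with temporary files standing in for split output
empty_split <- tempfile(fileext = ".txt")
writeLines("SCIENTIFIC NAME\tOBSERVATION COUNT", empty_split)

full_split <- tempfile(fileext = ".txt")
writeLines(c("SCIENTIFIC NAME\tOBSERVATION COUNT",
             "Mimus polyglottos\t2"), full_split)

is_header_only(empty_split)
is_header_only(full_split)
```

Applied to the files above, something like sapply(list.files("test", pattern = "^spe_", full.names = TRUE), is_header_only) would flag every empty split at once.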

Here is my session information:

R version 4.3.3 (2024-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=zh_CN.UTF-8       LC_NUMERIC=C               LC_TIME=zh_CN.UTF-8       
 [4] LC_COLLATE=zh_CN.UTF-8     LC_MONETARY=zh_CN.UTF-8    LC_MESSAGES=zh_CN.UTF-8   
 [7] LC_PAPER=zh_CN.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=zh_CN.UTF-8 LC_IDENTIFICATION=C       

time zone: Asia/Shanghai
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.1.4 auk_0.7.1  

loaded via a namespace (and not attached):
 [1] crayon_1.5.2      vctrs_0.6.5       cli_3.6.2         rlang_1.1.4       stringi_1.8.4    
 [6] generics_0.1.3    assertthat_0.2.1  glue_1.7.0        bit_4.0.5         hms_1.1.3        
[11] readxl_1.4.3      writexl_1.4.2     fansi_1.0.6       cellranger_1.1.0  tibble_3.2.1     
[16] tzdb_0.4.0        lifecycle_1.0.4   stringr_1.5.1     compiler_4.3.3    pkgconfig_2.0.3  
[21] rstudioapi_0.15.0 R6_2.5.1          readr_2.1.5       tidyselect_1.2.1  utf8_1.2.4       
[26] parallel_4.3.3    vroom_1.6.5       pillar_1.9.0      magrittr_2.0.3    withr_3.0.0      
[31] tools_4.3.3       bit64_4.0.5

auman-chan commented 3 months ago

As a workaround, I selected the species with the function read_delim_chunked, then removed duplicate group checklists and rolled up the taxonomy with distinct() and filter(). But I hope the auk_split pipeline can be fixed.
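
The chunked selection step of a workaround like this could be sketched roughly as follows. This is not the code used in this thread: it swaps readr's read_delim_chunked for base R connections so that it runs without extra packages, and filter_species_chunked() is a hypothetical helper. It assumes a tab-delimited file whose species column is named "SCIENTIFIC NAME", and it does not handle group checklists or the taxonomic rollup.

```r
# Hypothetical helper: copy only the rows for the requested species from a
# tab-delimited file to a new file, reading a fixed number of lines at a time.
filter_species_chunked <- function(infile, outfile, species,
                                   name_col = "SCIENTIFIC NAME",  # assumed name
                                   chunk_size = 10000) {
  con_in <- file(infile, open = "r")
  on.exit(close(con_in), add = TRUE)
  con_out <- file(outfile, open = "w")
  on.exit(close(con_out), add = TRUE)

  # The first line holds the column names; locate the species column
  header <- readLines(con_in, n = 1)
  col_idx <- match(name_col, strsplit(header, "\t", fixed = TRUE)[[1]])
  if (is.na(col_idx)) stop("Column not found: ", name_col)
  writeLines(header, con_out)

  n_kept <- 0
  repeat {
    lines <- readLines(con_in, n = chunk_size)
    if (length(lines) == 0) break
    # Extract the species field from each line and keep the matching rows
    fields <- strsplit(lines, "\t", fixed = TRUE)
    sp <- vapply(fields, function(f) f[col_idx], character(1))
    keep <- sp %in% species
    writeLines(lines[keep], con_out)
    n_kept <- n_kept + sum(keep)
  }
  n_kept
}

# Self-contained demo on a tiny made-up file
demo_in <- tempfile(fileext = ".txt")
writeLines(c("GLOBAL UNIQUE IDENTIFIER\tSCIENTIFIC NAME\tOBSERVATION COUNT",
             "obs1\tCardinalis cardinalis\t3",
             "obs2\tMimus polyglottos\t1",
             "obs3\tSitta pusilla\t2"), demo_in)
demo_out <- tempfile(fileext = ".txt")
n <- filter_species_chunked(demo_in, demo_out,
                            species = c("Mimus polyglottos", "Sitta pusilla"),
                            chunk_size = 2)
kept <- read.delim(demo_out, check.names = FALSE, stringsAsFactors = FALSE)
```

Reading in chunks keeps memory use bounded by chunk_size, which matters for a full EBD download much more than for the sample file.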

mstrimas commented 3 months ago

I ran your exact code on the sample EBD file and it appears to be working fine. I have a Mac, so I also tried running it in an Ubuntu Docker container to emulate your environment, and I'm not having any issues there either. For example, this is the file I'm getting for Cardinal: spe_Cardinalis_cardinalis.txt

It's hard to troubleshoot since I can't replicate the issue... You might try running in a Docker container as well to test.

auman-chan commented 3 months ago

OK, I will try it on Windows. For now I split the species with read_delim_chunked and write_delim, and then import the resulting files with read_ebd.

auman-chan commented 3 months ago

Additionally, I have another question. read_ebd() and auk_unique() keep only distinct observations based on the value of group_identifier, even when that value is missing. It seems that all rows with NA in group_identifier are treated as duplicates of each other, and read_ebd() keeps only one of them.

However, I don't think observations with a missing group_identifier are duplicates, as they were usually recorded at different locations and at different times.

Is there a reason behind this behavior?
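
To make the concern concrete, here is a toy illustration in base R of how a naive deduplication on group_identifier would collapse NA rows, next to an NA-aware variant. It is only a sketch on made-up data and says nothing about how auk_unique() is actually implemented; the column names simply mirror those returned by read_ebd().

```r
# Toy checklist table with made-up values
obs <- data.frame(
  checklist_id     = c("S1", "S2", "S3", "S4", "S5"),
  group_identifier = c("G1", "G1", NA, NA, "G2"),
  stringsAsFactors = FALSE
)

# Naive approach: duplicated() treats all NA values as equal to each other,
# so only the first ungrouped checklist survives
naive <- obs[!duplicated(obs$group_identifier), ]

# NA-aware approach: always keep ungrouped (NA) rows, deduplicate the rest
na_aware <- obs[is.na(obs$group_identifier) |
                  !duplicated(obs$group_identifier), ]

naive$checklist_id     # "S1" "S3" "S5"
na_aware$checklist_id  # "S1" "S3" "S4" "S5"
```

Comparing row counts before and after deduplication for the rows where group_identifier is NA would show which of the two behaviors a given pipeline has.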

mstrimas commented 3 months ago

I’m on vacation until June 25 so won’t be able to look into this in detail until then. However, auk_unique() shouldn’t impact rows that have NA for group_identifier. If that is happening, I’ll fix it when I return.

auman-chan commented 2 months ago

It seems that the bug is related to the path of the EBD file.

When I read the file from another disk with a path like "/media/username/disk_name/ebird_data/ebd.txt", the split function works. But when the file is on the system disk and referenced as "/home/username/R//imp/ebd.txt", the function returns empty files.

mstrimas commented 2 months ago

That's strange, I'm not sure why that would be happening and I'm not able to reproduce the issue. If you figure out how to fix it, let me know and I can make the change.

auman-chan commented 2 months ago

I figured out what was wrong. My folder name included whitespace. R could find the path (file.exists() returned TRUE), but awk could not. It would be good to add a check for this situation, since the error from the shell was not passed on to the R console.

Separately, I will look further into the auk_unique() question.
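
A pre-flight check along these lines could look like the sketch below. check_path_whitespace() is a hypothetical helper, not an existing auk function, and the two paths in the demo are made up. It only tests the path string, so it works whether or not the file exists.

```r
# Hypothetical helper: fail early, with a visible R error, when a path
# contains whitespace that could trip up an external AWK call.
check_path_whitespace <- function(path) {
  # Expand to an absolute path so whitespace in parent folders is caught too
  full <- normalizePath(path, mustWork = FALSE)
  if (grepl("[[:space:]]", full)) {
    stop("Path contains whitespace, which may break the external AWK call: ",
         full, call. = FALSE)
  }
  invisible(full)
}

# Demo on two made-up paths; the files do not need to exist
ok <- tryCatch({
  check_path_whitespace("/data/ebird_data/ebd.txt")
  TRUE
}, error = function(e) FALSE)

bad <- tryCatch({
  check_path_whitespace("/data/ebird data/ebd.txt")
  TRUE
}, error = function(e) FALSE)
```

An alternative to rejecting such paths would be to quote them with base R's shQuote() before they reach the shell. Whether that is enough for the way auk builds its AWK command is not established in this thread.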