Open auman-chan opened 3 months ago
Can you provide the output of sessionInfo() and the code that generated the above output?
Here is the code I used. I ran it on the sample EBD file from the EBD download website.
library(auk)
library(dplyr)
#list.files("test")
ebd_file <- "test/ebd_US-AL-101_202204_202204_relApr-2022.txt"
ebd_out <- "test/output.txt"
prefix_spe <- "test/spe_"
prefix_spe2 <- "test/spe2_"
ebd_in <- auk_ebd(file = ebd_file)
data <- ebd_in %>%
  auk_complete() %>%
  auk_year(c(2012, 2022)) %>%
  auk_duration(duration = c(0, 300)) %>%
  auk_distance(distance = c(0, 5)) %>%
  auk_protocol(protocol = c("Stationary", "Area"))
df <- auk_filter(data, file = ebd_out, overwrite = TRUE) %>%
  read_ebd()
splist <- unique(df$scientific_name)[1:5]
splist
[1] "Cardinalis cardinalis" "Mimus polyglottos" "Poecile carolinensis"
[4] "Sitta pusilla" "Thryothorus ludovicianus"
spe_split <- auk_split(file = ebd_out,
                       species = splist,
                       prefix = prefix_spe,
                       overwrite = TRUE)
list.files("test")
[1] "ebd_US-AL-101_202204_202204_relApr-2022.txt" "output.txt"
[3] "spe_Cardinalis_cardinalis.txt" "spe_Mimus_polyglottos.txt"
[5] "spe_Poecile_carolinensis.txt" "spe_Sitta_pusilla.txt"
[7] "spe_Thryothorus_ludovicianus.txt" "spe2_Cardinalis_cardinalis.txt"
[9] "spe2_Mimus_polyglottos.txt" "spe2_Poecile_carolinensis.txt"
[11] "spe2_Sitta_pusilla.txt" "spe2_Thryothorus_ludovicianus.txt"
[13] "test.R"
file.size("test/spe_Mimus_polyglottos.txt")
[1] 693
A split file of only 693 bytes means it contains nothing but the column names.
Here is my session information:
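Rather than inferring emptiness from the file size, the data rows in each split file can be counted directly. The following is a minimal base R sketch; `count_data_rows()` is a hypothetical helper, not part of auk, and the two temporary files only stand in for real split output.

```r
# Hypothetical helper: count data rows (lines after the header) per file.
count_data_rows <- function(paths) {
  vapply(paths, function(p) {
    n_lines <- length(readLines(p, warn = FALSE))
    max(n_lines - 1L, 0L)
  }, integer(1))
}

# Demo with two temporary files: one header-only, one with a data row.
tmp_dir <- tempfile("split_check_")
dir.create(tmp_dir)
empty_file <- file.path(tmp_dir, "spe_empty.txt")
full_file <- file.path(tmp_dir, "spe_full.txt")
header <- "SCIENTIFIC NAME\tOBSERVATION COUNT"
writeLines(header, empty_file)
writeLines(c(header, "Mimus polyglottos\t2"), full_file)

row_counts <- count_data_rows(c(empty_file, full_file))
names(row_counts) <- basename(names(row_counts))
print(row_counts)
```

On real output this could be called as `count_data_rows(list.files("test", pattern = "^spe_", full.names = TRUE))`; a count of zero for every species would confirm that only headers were written.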
R version 4.3.3 (2024-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
locale:
[1] LC_CTYPE=zh_CN.UTF-8 LC_NUMERIC=C LC_TIME=zh_CN.UTF-8
[4] LC_COLLATE=zh_CN.UTF-8 LC_MONETARY=zh_CN.UTF-8 LC_MESSAGES=zh_CN.UTF-8
[7] LC_PAPER=zh_CN.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=zh_CN.UTF-8 LC_IDENTIFICATION=C
time zone: Asia/Shanghai
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_1.1.4 auk_0.7.1
loaded via a namespace (and not attached):
[1] crayon_1.5.2 vctrs_0.6.5 cli_3.6.2 rlang_1.1.4 stringi_1.8.4
[6] generics_0.1.3 assertthat_0.2.1 glue_1.7.0 bit_4.0.5 hms_1.1.3
[11] readxl_1.4.3 writexl_1.4.2 fansi_1.0.6 cellranger_1.1.0 tibble_3.2.1
[16] tzdb_0.4.0 lifecycle_1.0.4 stringr_1.5.1 compiler_4.3.3 pkgconfig_2.0.3
[21] rstudioapi_0.15.0 R6_2.5.1 readr_2.1.5 tidyselect_1.2.1 utf8_1.2.4
[26] parallel_4.3.3 vroom_1.6.5 pillar_1.9.0 magrittr_2.0.3 withr_3.0.0
[31] tools_4.3.3 bit64_4.0.5
As a workaround, I selected species with read_delim_chunked(), then removed duplicate group checklists and rolled up taxonomy with distinct() and filter(). Still, I hope this pipeline can be fixed.
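The chunked workaround described above might look roughly like the sketch below. It assumes the readr package is installed and that the raw file has a tab-separated `SCIENTIFIC NAME` column; `filter_species_chunked()` is a hypothetical helper, and the tiny temporary file only stands in for a real EBD extract. A real EBD file may also need `quote = ""` passed to `read_delim_chunked()`.

```r
library(readr)

# Hypothetical helper: keep only rows for the requested species, chunk by chunk,
# so the full file never has to fit in memory at once.
filter_species_chunked <- function(file, species, chunk_size = 10000) {
  keep_species <- function(chunk, pos) {
    chunk[chunk[["SCIENTIFIC NAME"]] %in% species, , drop = FALSE]
  }
  read_delim_chunked(
    file,
    callback = DataFrameCallback$new(keep_species),
    delim = "\t",
    chunk_size = chunk_size,
    col_types = cols(.default = col_character())
  )
}

# Demo on a tiny tab-separated file standing in for an EBD extract.
demo_file <- tempfile(fileext = ".txt")
writeLines(c(
  "SCIENTIFIC NAME\tOBSERVATION COUNT",
  "Mimus polyglottos\t2",
  "Sitta pusilla\t1",
  "Mimus polyglottos\t5"
), demo_file)

mimus <- filter_species_chunked(demo_file, "Mimus polyglottos", chunk_size = 2)
print(nrow(mimus))
```

The filtered rows could then be written out with `write_delim(mimus, path, delim = "\t")` before further processing.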
I ran your exact code on the sample EBD file and it appears to be working fine. I have a Mac, so I also tried running it in an Ubuntu Docker container to emulate your environment, and I'm not having any issues there either. For example, this is the file for Cardinal that I'm getting: spe_Cardinalis_cardinalis.txt
It's hard to troubleshoot since I can't replicate the issue... You might try running in a Docker container as well to test.
OK, I will try it on Windows. For now I split species with read_delim_chunked() and write_delim(), and then import the results with read_ebd().
Additionally, I have another question. read_ebd() or auk_unique() seems to keep only distinct observations based on the value of group_identifier, even when that value is missing. It looks as if all observations with NA in group_identifier are treated as duplicates, and read_ebd() keeps only one of them.
However, I don't think observations with a missing group_identifier are duplicates, since they usually come from different locations and were recorded at different times.
Is there a reason for this behavior?
I’m on vacation until June 25 so won’t be able to look into this in detail until then. However, auk_unique() shouldn’t impact rows that have NA for group_identifier. If that is happening, I’ll fix it when I return.
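The behavior described here, where rows with NA for group_identifier are left alone, can be illustrated with a small base R sketch. `dedupe_groups()` is a hypothetical helper written for this illustration and is not auk's actual implementation; it treats a (group_identifier, scientific_name) pair as the duplicate key, which is an assumption.

```r
# Hypothetical helper, not auk's actual code: drop duplicate group checklists
# while leaving rows with a missing group_identifier untouched.
dedupe_groups <- function(df) {
  is_solo <- is.na(df$group_identifier)
  is_dup <- duplicated(df[c("group_identifier", "scientific_name")])
  df[is_solo | !is_dup, , drop = FALSE]
}

# Two copies of one group checklist plus two unrelated non-group checklists.
obs <- data.frame(
  group_identifier = c("G1", "G1", NA, NA),
  scientific_name = rep("Sitta pusilla", 4),
  locality = c("A", "A", "B", "C"),
  stringsAsFactors = FALSE
)

deduped <- dedupe_groups(obs)
print(deduped)
```

Comparing `nrow()` before and after de-duplication on the subset of rows where `is.na(group_identifier)` is one way to check whether non-group checklists are being dropped in practice.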
It seems the bug is related to the path of the EBD file.
When I read the file from another disk with a path like "/media/username/disk_name/ebird_data/ebd.txt", the split function works. But when the file is on the system disk and is accessed via "/home/username/R//imp/ebd.txt", the function returns empty files.
That's strange. I'm not sure why that would be happening, and I'm not able to reproduce the issue. If you figure out how to fix it, let me know and I can make the change.
I know what's wrong. My folder name included whitespace. R could find the file (file.exists() returned TRUE), but awk could not. It would be better for the package to check for this situation, because the error from the shell was not passed back to the R console.
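A check of the kind suggested here could be sketched in base R as follows. `check_path_whitespace()` is a hypothetical helper, not an existing auk function, and how auk actually builds its awk command is not shown in this thread, so this only illustrates failing early with a visible R error.

```r
# Hypothetical pre-flight check: stop early if a path contains whitespace,
# since an unquoted path can break a shell/awk call without R noticing.
check_path_whitespace <- function(path) {
  full_path <- normalizePath(path, mustWork = FALSE)
  if (grepl("[[:space:]]", full_path)) {
    stop("Path contains whitespace: ", shQuote(full_path), call. = FALSE)
  }
  invisible(full_path)
}

# Demo: a directory name with a space exists for R but fails the check.
bad_dir <- file.path(tempdir(), "my data")
dir.create(bad_dir, showWarnings = FALSE)
bad_file <- file.path(bad_dir, "ebd.txt")
writeLines("header", bad_file)

result <- tryCatch(
  check_path_whitespace(bad_file),
  error = function(e) conditionMessage(e)
)
print(file.exists(bad_file))
print(result)
```

An alternative design would be to quote the path with `shQuote()` before handing it to the shell instead of rejecting it, which would let paths with spaces work rather than fail.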
Separately, I will look further into the auk_unique() question.
I ran into a problem where auk_split doesn't work. The function only exports files without any rows. It only works if I split species before filtering. [image]
Is there any solution or suggestion? Thanks!