LimaRAF / plantR

An R Package for Managing Species Records from Biological Collections
GNU General Public License v3.0

Error in formatLoc function #100

Closed jaum20 closed 4 months ago

jaum20 commented 1 year ago

plantR version: 0.1.6

formatLoc(occs.all.2)

Error in FUN(X[[i]], ...) : 'pattern' é inválido em UTF-8 (in English: 'pattern' is invalid UTF-8)

occs.all.2.gz

OS: Ubuntu 22

ggrittz commented 1 year ago

I'm having the same issue. I just came here to create a thread.

Here is an example with a minimal data set of only one family:

dados <- readData(file = "0026229-230810091245214.zip",
                  path = "https://api.gbif.org/v1/occurrence/download/request/")
dados <- dados$occurrence

occs <- formatDwc(gbif_data = dados, drop = TRUE)
occs <- formatOcc(occs) #all good here
occs <- formatLoc(occs) #the same error mentioned by @jaum20

I tried to manually change the encoding but it got me nowhere.

version and sessionInfo() below:

> version
               _                                
platform       x86_64-w64-mingw32               
arch           x86_64                           
os             mingw32                          
crt            ucrt                             
system         x86_64, mingw32                  
status                                          
major          4                                
minor          3.1                              
year           2023                             
month          06                               
day            16                               
svn rev        84548                            
language       R                                
version.string R version 4.3.1 (2023-06-16 ucrt)
nickname       Beagle Scouts
> sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

Matrix products: default

locale:
[1] LC_COLLATE=Portuguese_Brazil.utf8  LC_CTYPE=Portuguese_Brazil.utf8    LC_MONETARY=Portuguese_Brazil.utf8
[4] LC_NUMERIC=C                       LC_TIME=Portuguese_Brazil.utf8    

time zone: America/Sao_Paulo
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.2 forcats_1.0.0   stringr_1.5.0   dplyr_1.1.2     purrr_1.0.2     readr_2.1.4    
 [7] tidyr_1.3.0     tibble_3.2.1    ggplot2_3.4.3   tidyverse_2.0.0 plantR_0.1.6

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.0   viridisLite_0.4.2  viridis_0.6.4      fastmap_1.1.1      lazyeval_0.2.2    
 [6] leaflet_2.1.2      spatialrisk_0.7.0  XML_3.99-0.14      digest_0.6.33      timechange_0.2.0  
[11] lifecycle_1.0.3    sf_1.0-14          terra_1.7-39       magrittr_2.0.3     compiler_4.3.1    
[16] rlang_1.1.1        tools_4.3.1        igraph_1.5.1       utf8_1.2.3         data.table_1.14.8 
[21] knitr_1.43         htmlwidgets_1.6.2  bit_4.0.5          sp_2.0-0           classInt_0.4-9    
[26] plyr_1.8.8         xml2_1.3.5         RColorBrewer_1.1-3 abind_1.4-5        KernSmooth_2.23-21
[31] withr_2.5.0        leafsync_0.1.0     grid_4.3.1         fansi_1.0.4        e1071_1.7-13      
[36] leafem_0.2.0       colorspace_2.1-0   scales_1.2.1       dichromat_2.0-0.1  cli_3.6.1         
[41] generics_0.1.3     stringdist_0.9.10  rstudioapi_0.15.0  robustbase_0.99-0  tzdb_0.4.0        
[46] rgbif_3.7.7        httr_1.4.7         tmaptools_3.1-1    DBI_1.1.3          pbapply_1.7-2     
[51] proxy_0.4-27       stars_0.6-3        RcppProgress_0.4.2 parallel_4.3.1     base64enc_0.1-3   
[56] vctrs_0.6.3        jsonlite_1.8.7     flora_0.3.7        hms_1.1.3          bit64_4.0.5       
[61] GenSA_1.1.9        crosstalk_1.2.0    units_0.8-3        leafgl_0.1.1       glue_1.6.2        
[66] lwgeom_0.2-13      DEoptimR_1.1-1     codetools_0.2-19   stringi_1.7.12     countrycode_1.5.0 
[71] gtable_0.3.3       raster_3.6-23      Taxonstand_2.4     munsell_0.5.0      pillar_1.9.0      
[76] htmltools_0.5.6    R6_2.5.1           oai_0.4.0          lattice_0.21-8     png_0.1-8         
[81] tmap_3.3-3         geohashTools_0.3.2 colourvalues_0.3.9 class_7.3-22       Rcpp_1.0.11       
[86] gridExtra_2.3      whisker_0.4.1      xfun_0.40          fs_1.6.3           pkgconfig_2.0.3 

jaum20 commented 1 year ago

The error does not occur on Windows 10, only on Linux (at least on my machine). Maybe related to this

wevertonbio commented 1 year ago

It appears there is an issue related to the unwantedEncoding object imported within the fixLoc function. On my Windows 11 system, it is displayed as follows:

plantR:::unwantedEncoding
\xe3\xa1 \xe3\xa2 \xe3\xa3   &#225; \xe3\xa7 \xe3\xa9 \xe3\xaa \xe3\xb4 \xe3\x8d \xe3\xba
     "a"      "a"      "a"      "a"      "c"      "e"      "e"      "o"      "i"      "u"

The error occurs here.

While the problem is not solved in the package, I modified the formatLoc function (now formatLoc2) and it's working.

formatLoc2.txt
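
To illustrate why names like these can fail as patterns in a UTF-8 session, here is a minimal base R sketch. It assumes the names are Latin-1 byte sequences, as the printout suggests; `bad` and `converted` are illustrative names, and this is not necessarily how the package itself was later fixed.

```r
# Assumption: the first name printed above, written as raw Latin-1 bytes.
bad <- "\xe3\xa1"

# 0xE3 opens a three-byte UTF-8 sequence, but only one byte follows,
# so this string is not valid UTF-8.
validUTF8(bad)

# Declare the bytes as Latin-1 and convert them to UTF-8.
converted <- iconv(bad, from = "latin1", to = "UTF-8")
validUTF8(converted)

# The converted pattern can now be matched against UTF-8 text.
gsub(converted, "a", "s\u00e3\u00a1o", fixed = TRUE)
```

Checking every pattern with `validUTF8()` before passing it to `gsub()` is a quick way to find which entries trigger the error.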

jaum20 commented 1 year ago

Another workaround is to change the R locale from UTF-8 to latin1 before running formatLoc().

KPHendriks commented 6 months ago

Dear Jaum20,

I am facing the same error message with this function on my Mac running macOS 14. I've been looking into your suggested workaround, but I am not quite sure how to apply it. Should I change the encoding for the specific object, for the columns used by the function, or rather for the R environment?

Hope you have further suggestions.

Many thanks,

Kasper

jaum20 commented 6 months ago

for the R environment?

This. For instance, you can run the following (I use Brazilian Portuguese):

Sys.setlocale("LC_COLLATE", "Portuguese_Brazil.1252")
Sys.setlocale("LC_CTYPE", "Portuguese_Brazil.1252")
Sys.setlocale("LC_MONETARY", "Portuguese_Brazil.1252")
Sys.setlocale("LC_TIME", "Portuguese_Brazil.1252")

After the formatLoc line in your script has run successfully, you can change it back:

Sys.setlocale("LC_COLLATE", "Portuguese_Brazil.utf8")
Sys.setlocale("LC_CTYPE", "Portuguese_Brazil.utf8")
Sys.setlocale("LC_MONETARY", "Portuguese_Brazil.utf8")
Sys.setlocale("LC_TIME", "Portuguese_Brazil.utf8")
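
The switch-and-restore steps above can also be wrapped in a small helper so the original locale is restored even if the call fails. This is only a sketch in base R: `with_ctype` is a hypothetical helper name, it changes only `LC_CTYPE`, and the locale name you pass must exist on your system.

```r
# Hypothetical helper: evaluate `expr` under a temporary LC_CTYPE locale.
with_ctype <- function(locale, expr) {
  old <- Sys.getlocale("LC_CTYPE")
  # Restore the previous locale when the function exits, even on error.
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  res <- suppressWarnings(Sys.setlocale("LC_CTYPE", locale))
  # Sys.setlocale() returns "" when the OS cannot honor the request.
  if (identical(res, "")) stop("locale not available: ", locale)
  # `expr` is a lazy argument, so it is only evaluated here,
  # after the locale has been switched.
  force(expr)
}

# The "C" locale exists on every platform, so it is used for the demo.
with_ctype("C", Sys.getlocale("LC_CTYPE"))

# Intended use with plantR (not run here; the locale name is the one
# from this thread and is specific to Windows):
# occs <- with_ctype("Portuguese_Brazil.1252", formatLoc(occs))
```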

KPHendriks commented 5 months ago

Dear Jaum20,

Thanks for the suggestion.

Unfortunately, I seem unable to change the locale settings on my MacBook running macOS 14. For example:

> Sys.setlocale("LC_COLLATE", "Portuguese_Brazil.1252")
[1] ""
Warning message:
In Sys.setlocale("LC_COLLATE", "Portuguese_Brazil.1252") :
  OS reports request to set locale to "Portuguese_Brazil.1252" cannot be honored

I found a possible solution on StackOverflow (https://stackoverflow.com/questions/16347731/how-to-change-the-locale-of-r), which suggests starting R with updated language settings, like this:

LANGUAGE=Portuguese_Brazil.1252 R

This also did not change the locale settings in my R environment.

I am not sure how to continue. I used plantR before on my previous MacBook without any problems and would like to continue using it (with my very same script).

Any further suggestions are much welcome. :-)

Best wishes,

Kasper

jaum20 commented 5 months ago

The error you faced probably occurs because you do not have the Portuguese language installed on your system. Try using your default language and changing only the encoding. You can use the function Sys.getlocale() to see your current settings and then change just the encoding part from utf8 to 1252 (latin1).

KPHendriks commented 5 months ago

Thanks for the suggestions, again. :-) I tried your suggestion (below) and also added Portuguese (Brazil) to the preferred languages of my MacBook (in the general settings; not sure if that is 'enough'). Neither gave the result I was hoping for. ;-)

If you have yet any further options, that would be much appreciated. Sorry for all the trouble.

All the best,

Kasper

> Sys.getlocale()
[1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"
> Sys.setlocale("LC_COLLATE", "en_US.1252")
[1] ""
Warning message:
In Sys.setlocale("LC_COLLATE", "en_US.1252") :
  OS reports request to set locale to "en_US.1252" cannot be honored
> Sys.setlocale("LC_CTYPE", "en_US.1252")
[1] ""
Warning message:
In Sys.setlocale("LC_CTYPE", "en_US.1252") :
  OS reports request to set locale to "en_US.1252" cannot be honored
> Sys.setlocale("LC_MONETARY", "en_US.1252")
[1] ""
Warning message:
In Sys.setlocale("LC_MONETARY", "en_US.1252") :
  OS reports request to set locale to "en_US.1252" cannot be honored
> Sys.setlocale("LC_TIME", "en_US.1252")
[1] ""
Warning message:
In Sys.setlocale("LC_TIME", "en_US.1252") :
  OS reports request to set locale to "en_US.1252" cannot be honored
> Sys.getlocale()
[1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"

jaum20 commented 5 months ago

Try this:

Sys.setlocale(category = "LC_ALL", locale = "English_United States.1252")

From here: https://stackoverflow.com/questions/20577764/set-locale-to-system-default-utf-8
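
Locale names differ between operating systems, so a name that works on Windows may be rejected on macOS or Linux. As a hedged sketch, the helper below (`first_available_locale`, a hypothetical name) tries a list of candidates and reports the first one the OS accepts. The candidate names are assumptions: `en_US.ISO8859-1` is the usual Latin-1 spelling on macOS and many Linux systems, and `locale -a` in a terminal lists what is actually installed.

```r
# Hypothetical helper: return the first locale name the OS will honor.
first_available_locale <- function(candidates, category = "LC_CTYPE") {
  old <- Sys.getlocale(category)
  # Put the original setting back once the search is done.
  # (Kept to a single category such as LC_CTYPE; restoring "LC_ALL"
  # from its composite string is not portable.)
  on.exit(Sys.setlocale(category, old), add = TRUE)
  for (candidate in candidates) {
    res <- suppressWarnings(Sys.setlocale(category, candidate))
    # Sys.setlocale() returns "" when the request cannot be honored.
    if (!identical(res, "")) return(candidate)
  }
  NA_character_
}

# Assumed candidate names: Unix-style first, then Windows-style,
# then "C" as a fallback that always exists.
candidates <- c("en_US.ISO8859-1", "English_United States.1252", "C")
first_available_locale(candidates)
```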

LimaRAF commented 5 months ago

Dear all,

Thanks for the useful issue and for the helpful comments and workarounds. I am sorry that I have had very little time to keep the package up to date.

R has changed the way it deals with the encoding of special characters, and thus some of the functions need fixing.

I will look at it and try to fix the problem at its root asap.

eliane-anunciacao commented 5 months ago

I'm also in great need of using this package, especially because it allows data extraction from both GBIF and SpeciesLink simultaneously. I've been struggling with this for a few months now, and I would be immensely grateful if you could provide some updates to the package in general.

LimaRAF commented 4 months ago

There was indeed an encoding issue in plantR:::unwantedEncoding, as mentioned by @wevertonbio. I think this is solved now (at least I cannot reproduce the error any more on my machine or in R CMD check).

I kindly ask you to install the package from the development branch in which I pushed the new version 0.1.7 of the package. Use remotes::install_github("LimaRAF/plantR", ref = "dev") to get this new version.

Please let me know if everything is ok before I can close this issue and merge the dev branch into the master.
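
To check whether an installation predates the fix, a version comparison along these lines may help. It is a minimal sketch: `needs_update` is a hypothetical helper, and the 0.1.7 threshold is taken from the comment above.

```r
# Hypothetical helper: is the installed version older than the fixed one?
needs_update <- function(installed, fixed_in = "0.1.7") {
  package_version(installed) < package_version(fixed_in)
}

needs_update("0.1.6")  # the version reported at the top of this issue
needs_update("0.1.7")

# Intended use (not run here, requires plantR and remotes):
# if (needs_update(as.character(utils::packageVersion("plantR")))) {
#   remotes::install_github("LimaRAF/plantR", ref = "dev")
# }
```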

jaum20 commented 4 months ago

Working fine now! Thanks!