LimaRAF / plantR

An R Package for Managing Species Records from Biological Collections
GNU General Public License v3.0

formatDwc() — "can't combine" #84

Closed ggrittz closed 2 years ago

ggrittz commented 2 years ago

Hi, I usually email Renato directly, but this is probably a better place for it. (I am writing in pt-br; if you prefer English for accessibility reasons, I can rewrite this later and use English from now on.)

I'm having a problem with formatDwc() when I combine data from rspeciesLink() with data from readData() (a .zip that I downloaded directly from GBIF).

Here is the code:

library(plantR)

#GBIF points for SC (Tracheophyta; GBIF)
plants_gbif <- readData(file = 'gbif_plants.zip', path = 'C:/Users/Master/OneDrive - FURB/Mestrado')
#Keeping only the occurrence table from the returned list
plants_gbif <- plants_gbif$occurrence

#INCT points for all SC plants
plants_inct <- rspeciesLink(Scope = "plants",
                            basisOfRecord = "PreservedSpecimen",
                            Synonyms = "flora2020",
                            stateProvince = "Santa Catarina")

plants_inct2 <- plants_inct

#Preparing input data in the correct format
occs <- formatDwc(gbif_data = plants_gbif,
                  splink_data = plants_inct2,
                  drop = TRUE)

This returns the following error: Error: Can't combine `gbif$eventDate` <datetime<UTC>> and `speciesLink$eventDate` <character>.

Below is the output of rlang::last_trace():

> rlang::last_trace()
<error/vctrs_error_incompatible_type>
Can't combine `gbif$eventDate` <datetime<UTC>> and `speciesLink$eventDate` <character>.
Backtrace:
    x
 1. \-plantR::formatDwc(...)
 2.   \-dplyr::bind_rows(res_list, .id = "data_source")
 3.     \-vctrs::vec_rbind(!!!dots, .names_to = .id)
 4.       \-(function () ...
 5.         \-vctrs::vec_default_ptype2(...)
 6.           \-vctrs::stop_incompatible_type(...)
 7.             \-vctrs:::stop_incompatible(...)
 8.               \-vctrs:::stop_vctrs(...)
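
For readers who want to see the failure without the full datasets, here is a minimal sketch that appears to trigger the same kind of error. The two data frames are toy stand-ins with made-up values, not the real GBIF or speciesLink records, and the sketch assumes only that dplyr is installed.

```r
library(dplyr)

# Toy stand-ins (hypothetical values): one date-time column, one character column
gbif <- data.frame(eventDate = as.POSIXct("2001-05-17", tz = "UTC"))
speciesLink <- data.frame(eventDate = "2001-05-17", stringsAsFactors = FALSE)

# Same call shape as in the backtrace; the error is captured instead of stopping
res <- tryCatch(
  bind_rows(list(gbif = gbif, speciesLink = speciesLink), .id = "data_source"),
  error = function(e) e
)

inherits(res, "error")          # whether the bind failed
cat(conditionMessage(res), "\n") # the message reported by dplyr/vctrs
```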

If I use rgbif2() directly instead of readData(), the error does not occur for a smaller example (I haven't tested it with my full dataset, because downloading all the species would take a while right now, but I can set it running if needed). However, since the GBIF data came straight from the .zip through readData(), they should in theory be the same as what rgbif2() returns, right? One other small difference between the GBIF and speciesLink data is that I took only vascular plants from GBIF, but any kind of plant from speciesLink. I don't know whether that is relevant, but it might help.

As for the columns obtained from each database with the functions used above:

Columns of the GBIF data obtained with readData()

  [1] "gbifID"                              "abstract"                            "accessRights"                       
  [4] "accrualMethod"                       "accrualPeriodicity"                  "accrualPolicy"                      
  [7] "alternative"                         "audience"                            "available"                          
 [10] "bibliographicCitation"               "conformsTo"                          "contributor"                        
 [13] "coverage"                            "created"                             "creator"                            
 [16] "date"                                "dateAccepted"                        "dateCopyrighted"                    
 [19] "dateSubmitted"                       "description"                         "educationLevel"                     
 [22] "extent"                              "format"                              "hasFormat"                          
 [25] "hasPart"                             "hasVersion"                          "identifier"                         
 [28] "instructionalMethod"                 "isFormatOf"                          "isPartOf"                           
 [31] "isReferencedBy"                      "isReplacedBy"                        "isRequiredBy"                       
 [34] "isVersionOf"                         "issued"                              "language"                           
 [37] "license"                             "mediator"                            "medium"                             
 [40] "modified"                            "provenance"                          "publisher"                          
 [43] "references"                          "relation"                            "replaces"                           
 [46] "requires"                            "rights"                              "rightsHolder"                       
 [49] "source"                              "spatial"                             "subject"                            
 [52] "tableOfContents"                     "temporal"                            "title"                              
 [55] "type"                                "valid"                               "institutionID"                      
 [58] "collectionID"                        "datasetID"                           "institutionCode"                    
 [61] "collectionCode"                      "datasetName"                         "ownerInstitutionCode"               
 [64] "basisOfRecord"                       "informationWithheld"                 "dataGeneralizations"                
 [67] "dynamicProperties"                   "occurrenceID"                        "catalogNumber"                      
 [70] "recordNumber"                        "recordedBy"                          "individualCount"                    
 [73] "organismQuantity"                    "organismQuantityType"                "sex"                                
 [76] "lifeStage"                           "reproductiveCondition"               "behavior"                           
 [79] "establishmentMeans"                  "occurrenceStatus"                    "preparations"                       
 [82] "disposition"                         "associatedReferences"                "associatedSequences"                
 [85] "associatedTaxa"                      "otherCatalogNumbers"                 "occurrenceRemarks"                  
 [88] "organismID"                          "organismName"                        "organismScope"                      
 [91] "associatedOccurrences"               "associatedOrganisms"                 "previousIdentifications"            
 [94] "organismRemarks"                     "materialSampleID"                    "eventID"                            
 [97] "parentEventID"                       "fieldNumber"                         "eventDate"                          
[100] "eventTime"                           "startDayOfYear"                      "endDayOfYear"                       
[103] "year"                                "month"                               "day"                                
[106] "verbatimEventDate"                   "habitat"                             "samplingProtocol"                   
[109] "samplingEffort"                      "sampleSizeValue"                     "sampleSizeUnit"                     
[112] "fieldNotes"                          "eventRemarks"                        "locationID"                         
[115] "higherGeographyID"                   "higherGeography"                     "continent"                          
[118] "waterBody"                           "islandGroup"                         "island"                             
[121] "countryCode"                         "stateProvince"                       "county"                             
[124] "municipality"                        "locality"                            "verbatimLocality"                   
[127] "verbatimElevation"                   "verbatimDepth"                       "minimumDistanceAboveSurfaceInMeters"
[130] "maximumDistanceAboveSurfaceInMeters" "locationAccordingTo"                 "locationRemarks"                    
[133] "decimalLatitude"                     "decimalLongitude"                    "coordinateUncertaintyInMeters"      
[136] "coordinatePrecision"                 "pointRadiusSpatialFit"               "verbatimCoordinateSystem"           
[139] "verbatimSRS"                         "footprintWKT"                        "footprintSRS"                       
[142] "footprintSpatialFit"                 "georeferencedBy"                     "georeferencedDate"                  
[145] "georeferenceProtocol"                "georeferenceSources"                 "georeferenceVerificationStatus"     
[148] "georeferenceRemarks"                 "geologicalContextID"                 "earliestEonOrLowestEonothem"        
[151] "latestEonOrHighestEonothem"          "earliestEraOrLowestErathem"          "latestEraOrHighestErathem"          
[154] "earliestPeriodOrLowestSystem"        "latestPeriodOrHighestSystem"         "earliestEpochOrLowestSeries"        
[157] "latestEpochOrHighestSeries"          "earliestAgeOrLowestStage"            "latestAgeOrHighestStage"            
[160] "lowestBiostratigraphicZone"          "highestBiostratigraphicZone"         "lithostratigraphicTerms"            
[163] "group"                               "formation"                           "member"                             
[166] "bed"                                 "identificationID"                    "identificationQualifier"            
[169] "typeStatus"                          "identifiedBy"                        "dateIdentified"                     
[172] "identificationReferences"            "identificationVerificationStatus"    "identificationRemarks"              
[175] "taxonID"                             "scientificNameID"                    "acceptedNameUsageID"                
[178] "parentNameUsageID"                   "originalNameUsageID"                 "nameAccordingToID"                  
[181] "namePublishedInID"                   "taxonConceptID"                      "scientificName"                     
[184] "acceptedNameUsage"                   "parentNameUsage"                     "originalNameUsage"                  
[187] "nameAccordingTo"                     "namePublishedIn"                     "namePublishedInYear"                
[190] "higherClassification"                "kingdom"                             "phylum"                             
[193] "class"                               "order"                               "family"                             
[196] "genus"                               "subgenus"                            "specificEpithet"                    
[199] "infraspecificEpithet"                "taxonRank"                           "verbatimTaxonRank"                  
[202] "vernacularName"                      "nomenclaturalCode"                   "taxonomicStatus"                    
[205] "nomenclaturalStatus"                 "taxonRemarks"                        "datasetKey"                         
[208] "publishingCountry"                   "lastInterpreted"                     "elevation"                          
[211] "elevationAccuracy"                   "depth"                               "depthAccuracy"                      
[214] "distanceAboveSurface"                "distanceAboveSurfaceAccuracy"        "issue"                              
[217] "mediaType"                           "hasCoordinate"                       "hasGeospatialIssues"                
[220] "taxonKey"                            "acceptedTaxonKey"                    "kingdomKey"                         
[223] "phylumKey"                           "classKey"                            "orderKey"                           
[226] "familyKey"                           "genusKey"                            "subgenusKey"                        
[229] "speciesKey"                          "species"                             "genericName"                        
[232] "acceptedScientificName"              "verbatimScientificName"              "typifiedName"                       
[235] "protocol"                            "lastParsed"                          "lastCrawled"                        
[238] "repatriated"                         "relativeOrganismQuantity"            "recordedByID"                       
[241] "identifiedByID"                      "level0Gid"                           "level0Name"                         
[244] "level1Gid"                           "level1Name"                          "level2Gid"                          
[247] "level2Name"                          "level3Gid"                           "level3Name"                         
[250] "iucnRedListCategory"                 "associatedMedia"                     "country"                            
[253] "minimumElevationInMeters"            "maximumElevationInMeters"            "minimumDepthInMeters"               
[256] "maximumDepthInMeters"                "geodeticDatum"                       "verbatimCoordinates"                
[259] "verbatimLatitude"                    "verbatimLongitude"                   "scientificNameAuthorship"

Columns of the INCT data obtained with rspeciesLink()

 [1] "record_id"                "modified"                 "institutionCode"          "collectionCode"          
 [5] "catalogNumber"            "basisOfRecord"            "kingdom"                  "family"                  
 [9] "genus"                    "specificEpithet"          "scientificName"           "scientificNameAuthorship"
[13] "identifiedBy"             "recordedBy"               "year"                     "month"                   
[17] "day"                      "country"                  "stateProvince"            "county"                  
[21] "locality"                 "decimalLongitude"         "decimalLatitude"          "verbatimLongitude"       
[25] "verbatimLatitude"         "minimumElevationInMeters" "occurrenceRemarks"        "barcode"                 
[29] "imagecode"                "recordNumber"             "maximumElevationInMeters" "infraspecificEpithet"    
[33] "typeStatus"               "coordinatePrecision"      "geoFlag"                  "phylum"                  
[37] "order"                    "yearIdentified"           "monthIdentified"          "individualCount"         
[41] "class"                    "dayIdentified"            "continentOcean"           "preparationType"         
[45] "previousCatalogNumber"    "relatedCatalogItem"       "fieldNumber"              "minimumDepthInMeters"    
[49] "maximumDepthInMeters"     "sex"
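
Rather than comparing the two column lists by eye, a small helper can report which shared columns are stored under different classes. The sketch below is base R only; `class_mismatches()` is a hypothetical name, not a plantR function, and the inputs are toy data frames. Note that the speciesLink list above has no eventDate column, so a check on the raw inputs may not reveal a column that is built later, inside formatDwc().

```r
# Hypothetical helper (not part of plantR): shared columns whose first class differs
class_mismatches <- function(x, y) {
  shared <- intersect(names(x), names(y))
  cls_x <- vapply(x[shared], function(col) class(col)[1], character(1))
  cls_y <- vapply(y[shared], function(col) class(col)[1], character(1))
  differ <- cls_x != cls_y
  data.frame(column = shared[differ],
             x = unname(cls_x[differ]),
             y = unname(cls_y[differ]),
             stringsAsFactors = FALSE)
}

# Toy inputs (made-up values) with one mismatching and one matching column
a <- data.frame(eventDate = as.POSIXct("2001-05-17", tz = "UTC"), year = 2001L)
b <- data.frame(eventDate = "2001-05-17", year = 2001L, stringsAsFactors = FALSE)

mm <- class_mismatches(a, b)
print(mm)
```

With the objects from the code above, the call would presumably be `class_mismatches(plants_gbif, plants_inct)`.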

Thanks in advance.

LimaRAF commented 2 years ago

Hi @ggrittz

Thanks for the issue. I think I know where the problem comes from: dplyr::bind_rows() does not combine data frames when columns with the same name have different classes.

I thought I had already fixed this. So please install the latest version of the package, which is now on the master branch. If the problem persists, I'll run it here and try to fix it.

ggrittz commented 2 years ago

All good, @LimaRAF?

The error here is still the same: Error: Can't combine `gbif$eventDate` <datetime<UTC>> and `speciesLink$eventDate` <character>.

LimaRAF commented 2 years ago

@ggrittz OK Guilherme, thanks for the quick reply. I'll look into this tomorrow or the day after. Could you share the direct download link from GBIF? That way I can test with exactly your dataset.

ggrittz commented 2 years ago

Good morning, Renato @LimaRAF

I already have the zip uploaded, so I'm sending it to you directly. Just let me know if you need anything else.

Cheers,

I edited the link; I don't think you could access it before. https://furb-my.sharepoint.com/:u:/g/personal/ggrittz_furb_br/EU3Ef7IjfAtJrdcStUsgPgMBqoHVZlM4uCV8RzoRuvQwUw?e=Y4QZ82

LimaRAF commented 2 years ago

Hi @ggrittz,

I finally managed to take a look at this issue, sorry for the delay.

Indeed, the error was related to dplyr::bind_rows(), which does not bind columns from different data frames if the data are stored under different classes (character vs. date-time in this case). This class was recently changed from what the GBIF API returns.
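
One possible user-side workaround for this kind of class clash is to coerce the date-time column to character before binding. The sketch below uses toy data with made-up values and assumes dplyr is installed; it is not necessarily what the change in the package does.

```r
library(dplyr)

# Toy stand-ins (hypothetical values) for the two sources
gbif <- data.frame(eventDate = as.POSIXct("2001-05-17 00:00:00", tz = "UTC"))
speciesLink <- data.frame(eventDate = "2001-05-17", stringsAsFactors = FALSE)

# Coerce the date-time column to character so both sources share one class
gbif$eventDate <- format(gbif$eventDate, "%Y-%m-%d")

# With matching classes the bind goes through
occs <- bind_rows(list(gbif = gbif, speciesLink = speciesLink),
                  .id = "data_source")
str(occs)
```

Formatting to `%Y-%m-%d` drops the time of day; keeping it would need a different format string.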

I just committed the changes to the 'dev' branch. Please re-install the package from there and let me know if it works for you as well, so I can close the issue and merge the changes into the master branch.
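
For anyone following along, installing from a specific branch could look like the sketch below. It assumes the remotes package is available and takes the repository and branch names from this thread; the install call is guarded so that it only runs in an interactive session.

```r
# Repository and branch as named in this thread
repo <- "LimaRAF/plantR"
branch <- "dev"

# Assumption: 'remotes' is installed; skip the install when run non-interactively
if (interactive() && requireNamespace("remotes", quietly = TRUE)) {
  remotes::install_github(repo, ref = branch)
}
```

Restarting the R session after installing is usually a good idea so the new version is the one that gets loaded.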

Just for the record, you can download the DwC-A zip file directly from GBIF using the function readData(). You just need to provide the URL GBIF sent to you (that was actually what I asked you for earlier). But I downloaded the zip from the drive you sent without problems.

Cheers!

ggrittz commented 2 years ago

All good, @LimaRAF?

Everything works now, problem solved!

(true, I could have just sent the URL...)

LimaRAF commented 2 years ago

Tks!