gbif / data-mobilization

For capturing and discussing potential datasets suitable for publishing to GBIF
Apache License 2.0

Aegypti Albopictus Mosquito Data #37

Open gbif-portal opened 7 years ago

gbif-portal commented 7 years ago

Aegypti Albopictus Mosquito Data

Dataset link: https://data.world/zika-virus-data/mosquito-data

Region: Global?

Taxon: Aedes

Type: occurrence

Why is this important: Prime candidate for dataset rescue covering an important gap: 22,137 occurrences of Aedes albopictus, 19,929 Aedes aegypti. Open and unpublished. 34,581 of them are noted as 'unpublished', 31,271 of them are given as point. Probably not the first thing to test the workflow with, but it's a good one for testing at scale.

Priority: medium

License: CC0 1.0

Users contact info: kcopas@gbif.org

rdmpage commented 7 years ago

Data.World is a pretty slick interface to data (except for reasons I don't understand it doesn't do maps). Imagine something similar for exploring GBIF occurrence data...

kcopas commented 7 years ago

Fwiw, the Data.World UI looks like an enhanced and customized version of box.com—which, if true, might explain the absence of a mapping interface.

ahahn-gbif commented 3 years ago

https://www.gbif.org/occurrence/search?taxon_key=1651430&taxon_key=1651891 showed about 103k occurrences at the time of checking
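For anyone wanting to repeat this count without the web interface, here is a minimal sketch against the public GBIF occurrence search API. It assumes that the `taxonKey` parameter may be repeated and that `limit=0` returns only the total in the `count` field of the JSON response. The helper names `build_count_url` and `count_occurrences` are illustrative, not part of any GBIF client.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the GBIF occurrence search endpoint.
GBIF_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def build_count_url(taxon_keys):
    """Build a search URL that asks only for the total, not the records."""
    # Repeating taxonKey is assumed to mean "any of these taxa".
    params = [("taxonKey", key) for key in taxon_keys]
    # limit=0 is assumed to suppress the record list and keep the count.
    params.append(("limit", 0))
    return f"{GBIF_SEARCH}?{urlencode(params)}"


def count_occurrences(taxon_keys):
    """Fetch the number of matching occurrences (requires network access)."""
    with urlopen(build_count_url(taxon_keys), timeout=30) as response:
        return json.load(response)["count"]
```

Calling `count_occurrences([1651430, 1651891])` would issue the request for the two taxon keys in the link above; the number returned will change as new data are indexed.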

CaroleSinou commented 2 years ago

From https://data.world/zika-virus-data/mosquito-data : "Data from: The global compendium of Aedes aegypti and Ae. albopictus occurrence from 1960-2014."

On gbif.org (Feb 2022), one dataset is available for Aedes aegypti (https://www.gbif.org/dataset/d4eb19bc-fdce-415f-9a61-49b036009840), comprising 19,929 occurrences, and one for Aedes albopictus (https://www.gbif.org/dataset/33614778-513a-4ec0-814d-125021cca5fe), containing 22,137 occurrences.

Both have been published by "Institute of Biodiversity, Animal Health and Comparative Medicine, College of Medical, Veterinary and Life Sciences, University of Glasgow".

rdmpage commented 2 years ago

The data @CaroleSinou mentions is data that I uploaded from Kraemer et al. (see sources given for those two data sets in GBIF). There's clearly a lot of overlap with the data.world data, which is also based (mostly?) on Kraemer et al. Hence I suspect a good chunk of this data is already in GBIF.

The data.world data also includes records for Florida counties. I'm not sure what the provenance of that data is, but if it is going to be added to GBIF, it would be nice to figure that out.

dschigel commented 2 years ago

Yes, this data via Glasgow is likely via @rdmpage noticing GBIF is missing a lot of mosquito data, and acting on it. In fact, the ongoing helpdesk and task group activities are rooted in this useful exchange. @CaroleSinou @DimEvil please check if the GBIF representation of the originally reported data is adequate (in which case let's close this issue), or if there is a discrepancy, let's close it.

dschigel commented 2 years ago

You type faster, @rdmpage :)

CaroleSinou commented 2 years ago

Quick checks between the source files available on data.world and the datasets on gbif reveal issues with occurrenceID.

Example: https://www.gbif.org/occurrence/1264880816, occurrenceID is 9276, country is Taiwan, year is 2011, but the same occurrenceID in the source file is linked to an occurrence in Brazil collected in 2013.

@rdmpage can you double-check the source file used to publish the dataset on GBIF?

In the source files, "mosquito_surveillance_51116" and "alachua_county_fla_mosquito_surveillance_51116" seem to be identical. "alachua_county_fla_mosquito_surveillance_51116" and "alachua_county_fla_mosquito_surveillance_50616" are not published on GBIF (at least, I have not found them).
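The occurrenceID check described above can be sketched roughly as follows. This is only an illustration: the dictionary keys (`occurrenceID`, `country`, `year`) and the function name `find_id_mismatches` are assumptions, and the column headers in the real files may differ. The single sample record uses the values reported in this comment.

```python
def find_id_mismatches(source_rows, gbif_rows, fields=("country", "year")):
    """Return, per shared occurrenceID, the fields whose values differ."""
    # Index the source rows by occurrenceID for direct lookup.
    source_by_id = {row["occurrenceID"]: row for row in source_rows}
    mismatches = {}
    for row in gbif_rows:
        source = source_by_id.get(row["occurrenceID"])
        if source is None:
            continue  # ID absent from the source file, nothing to compare
        # Keep (source value, GBIF value) for every field that disagrees.
        differing = {
            field: (source[field], row[field])
            for field in fields
            if source[field] != row[field]
        }
        if differing:
            mismatches[row["occurrenceID"]] = differing
    return mismatches


# Values taken from the example above (occurrenceID 9276).
source_rows = [{"occurrenceID": "9276", "country": "Brazil", "year": 2013}]
gbif_rows = [{"occurrenceID": "9276", "country": "Taiwan", "year": 2011}]
print(find_id_mismatches(source_rows, gbif_rows))
```

A mismatch reported this way only says that the two records sharing an ID disagree; it does not say which file is wrong, or whether the IDs were ever meant to refer to the same record.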

rdmpage commented 2 years ago

@CaroleSinou The data and code I used to publish the data to GBIF is available at https://github.com/rdmpage/global-distribution-arbovirus-vectors The data I used came from Dryad https://doi.org/10.5061/dryad.47v3c.

Looking at the original data, there are three files: one for each of the two species, and one with the species combined. The "occurrenceID" is simply the line number in the data file, so there is an occurrenceID of 9276 in the file for Aedes aegypti, aegypti.csv (which I used for that species), and there is an occurrenceID of 9276 in the combined file aegypti_albopictus.csv (which I didn't use but which data.world did), and these bear no relation to each other.

So, just to be clear, the data.world data and the GBIF data come from the same source (https://doi.org/10.5061/dryad.47v3c) but I used the two individual files for each species whereas data.world used the combined file, but it's the same data.
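To illustrate why line-number IDs collide across files, and one possible way to avoid it, here is a small sketch. It is purely hypothetical: the `namespaced_id` helper and the `file:line` scheme are assumptions for illustration, not what was used to publish the GBIF datasets.

```python
from pathlib import Path


def namespaced_id(file_name, line_number):
    """Prefix a per-file line number with the file's stem (hypothetical scheme)."""
    return f"{Path(file_name).stem}:{line_number}"


# The same line number in two different files refers to unrelated records,
# so the bare IDs collide while the prefixed ones stay distinct.
single_species = namespaced_id("aegypti.csv", 9276)
combined = namespaced_id("aegypti_albopictus.csv", 9276)
print(single_species, combined)
```

With a prefix like this, a record with ID 9276 from the combined file could not be mistaken for line 9276 of a single-species file.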

rdmpage commented 2 years ago

@CaroleSinou The Florida data looks like it came from https://alachua.floridahealth.gov/programs-and-services/environmental-health/mosquito-prevention/mosquito-surveillance-and-data.html, see the files https://alachua.floridahealth.gov/programs-and-services/environmental-health/mosquito-prevention/_documents/mosquito-051116.pdf and https://alachua.floridahealth.gov/programs-and-services/environmental-health/mosquito-prevention/_documents/mosquito-050616.pdf. The numbering system for the files is based on dates (e.g., 051116 is 11 May 2016). To me it would make more sense to attribute these data to the original source (the Florida Department of Health) if they are going to be added to GBIF.
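The date-based numbering can be decoded with a few lines of Python. This sketch assumes the suffix is always month, day, two-digit year (inferred from the single example above), and that the five-digit suffixes in the data.world file names (51116, 50616) are the same dates with the leading zero dropped. The function name `parse_report_date` is illustrative.

```python
from datetime import date, datetime


def parse_report_date(suffix):
    """Turn a file-name suffix such as '051116' into a date (MMDDYY assumed)."""
    # Restore a leading zero that may have been dropped (e.g. '51116').
    padded = suffix.zfill(6)
    return datetime.strptime(padded, "%m%d%y").date()


print(parse_report_date("051116"))  # 2016-05-11
print(parse_report_date("50616"))   # 2016-05-06
```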

CaroleSinou commented 2 years ago

@rdmpage got it! A classic case of multiplication of occurrenceIDs. Thanks for the heads up about the source of the Florida data. It indeed makes more sense to attribute these datasets to that institution. I will get in touch with them.

CaroleSinou commented 2 years ago

The Florida Health Department has been contacted. It's a message in a bottle, but it's worth a try.