Rachel853 / PlantPredictionML

Project which aims to predict plant species in a given location and time using various possible predictors using machine learning

Add datasets for species composition to repo #1

Open Rachel853 opened 5 months ago

ollyroberts commented 5 months ago

Data is pulled from https://www.kaggle.com/competitions/geolifeclef-2024/data, with the relevant

We will use the PO dataset to begin with; more info below.

Presence-Absence (PA) surveys: around 90 thousand surveys covering roughly 10,000 species of the European flora. The PA data is provided to compensate for the false-absence problem of the PO data and to calibrate models so that they avoid the associated biases.

Presence-Only (PO) occurrences: around five million observations combined from numerous datasets gathered from the Global Biodiversity Information Facility (GBIF, www.gbif.org). This data makes up the larger part of the training data and covers all countries of the study area, but it has been sampled opportunistically (without a standardized sampling protocol), leading to various sampling biases. The local absence of a species in the PO data doesn't mean it is truly absent. An observer might not have reported it because it was difficult to "see" at that time of the year, difficult to identify, not a monitoring target, or just unattractive.

Two CSVs with species occurrence data are available on Seafile for training. A detailed description of each is provided on Seafile in separate README files in the relevant folders.

The PO metadata are available in PresenceOnlyOccurences/GLC24_PO_metadata_train.csv.
The PA metadata are available in PresenceAbsenceSurveys/GLC24_PA_metadata_train.csv.
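
For a first look at the two files, something like the sketch below might help. It is only a rough starting point using the Python standard library: it assumes the CSVs have been downloaded so that the two relative paths above resolve from the working directory, and the column name `speciesId` is an assumption that should be checked against the printed header (or the README files) before relying on it. The helper names `read_header` and `count_by_column` are made up for this example.

```python
import csv
from collections import Counter
from pathlib import Path

# Paths as given in this issue, relative to wherever the data was downloaded.
PO_CSV = Path("PresenceOnlyOccurences/GLC24_PO_metadata_train.csv")
PA_CSV = Path("PresenceAbsenceSurveys/GLC24_PA_metadata_train.csv")


def read_header(csv_path):
    """Return the column names of a CSV file without reading the data rows."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle), [])


def count_by_column(csv_path, column):
    """Count rows per distinct value of `column`, streaming the file row by row."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if column not in (reader.fieldnames or []):
            raise KeyError(f"{column!r} not in columns: {reader.fieldnames}")
        for row in reader:
            counts[row[column]] += 1
    return counts


if __name__ == "__main__":
    for path in (PO_CSV, PA_CSV):
        if not path.exists():
            print(f"{path} not found; download the data first")
            continue
        print(path.name, read_header(path))
        # "speciesId" is an assumed column name; check the header printed above.
        print(count_by_column(path, "speciesId").most_common(5))
```

Streaming with `csv.DictReader` keeps memory use low, which matters for the PO file given its roughly five million rows. Once the columns are confirmed, loading into a dataframe library would be a reasonable next step.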