VertNet / bels

Biodiversity Enhanced Location Services
Apache License 2.0

download BELS for offline use #38

Open jhpoelen opened 2 years ago

jhpoelen commented 2 years ago

Hi!

Julie Allen mentioned BELS at the Entomology Collections Management Workshop on 22 June 2022.

As you probably expect, many interaction records indexed by GloBI have locality data, but no specific coordinates or standardized locality entries.

Is there a way to set up BELS for offline use, so I can run millions of localities without having to worry about network delays and/or resource throttling?

Also, how do you version the BELS corpus?

Looking forward to trying to use BELS as a way to enrich location information in GloBI.

Also see https://github.com/globalbioticinteractions/globalbioticinteractions/issues/801 .

tucotuco commented 2 years ago

Hi @jhpoelen. Great that BELS could be useful for GloBI.

> Is there a way to set up BELS for offline use, so I can run millions of localities without having to worry about network delays and/or resource throttling?

BELS requires rather serious infrastructure and is built on Google BigQuery, App Engine, and Cloud Functions. An offline infrastructure could be built to do the matching, but it would require a lot of development and a very robust database. The DigiLeap Project does not currently have a plan to build and manage an offline version.

However, it is possible to run a very big job manually to avoid the 32 MB file upload limitation of the web app. If you are interested in doing that, I believe it can be done as a collaboration between GloBI and the Terrestrial Parasite Tracker Project. If that is of interest, please contact me directly via email.
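To give a flavor of what such a manual run might involve, here is a minimal sketch of driving a batch of gazetteer lookups from Python with the google-cloud-bigquery client. This is an illustration only: the project, dataset, table, and column names (`my-project.bels.gazetteer`, `matchme`, `decimallatitude`, `decimallongitude`) are placeholders, not the actual BELS schema.

```python
# Hypothetical sketch: look up a best-match georeference for a locality
# string in a BELS-style gazetteer table on BigQuery. All dataset, table,
# and column names below are placeholders, not the real BELS schema.
from google.cloud import bigquery

client = bigquery.Client()  # authenticates via GOOGLE_APPLICATION_CREDENTIALS

QUERY = """
    SELECT matchme, decimallatitude, decimallongitude
    FROM `my-project.bels.gazetteer`
    WHERE matchme = @loc
    LIMIT 1
"""

def lookup(locality: str):
    """Return matching gazetteer rows for one normalized locality string."""
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("loc", "STRING", locality)]
    )
    return list(client.query(QUERY, job_config=job_config).result())

for row in lookup("5 mi n of reno nevada"):
    print(row["decimallatitude"], row["decimallongitude"])
```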

> Also, how do you version the BELS corpus?

The gazetteer is built on periodic snapshots from GBIF and iDigBio, plus the static gazetteer coming out of the VertNet predecessor collaborative georeferencing projects MaNIS, HerpNET, and ORNIS. When a new version is built, previous versions of the gazetteer are not kept. It is debatable whether there is any value in doing so: the best-matching georeferences are a point of departure for verification or improvement (getting you "on the map" to work with the answers visually), not an end in themselves.
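To illustrate the kind of matching involved, here is a toy normalization of verbatim locality strings into a shared matching key. BELS's actual normalization (in this repo) is considerably more involved, so treat this only as a sketch of the general idea:

```python
# Toy sketch of locality-string normalization for gazetteer matching.
# BELS's real matching logic is more sophisticated than this.
import re

def simple_match_key(locality: str) -> str:
    """Collapse a verbatim locality string into a crude matching key."""
    key = locality.lower()
    key = re.sub(r"[^a-z0-9]+", " ", key)  # replace punctuation with spaces
    return " ".join(key.split())           # squeeze repeated whitespace

# Two differently written localities collapse to the same key:
assert simple_match_key("5 mi. N of Reno,  Nevada") == \
       simple_match_key("5 mi N of RENO Nevada")
```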

jhpoelen commented 2 years ago

@tucotuco thanks for taking the time to respond.

> BELS requires rather serious infrastructure and is built on Google BigQuery, App Engine, and Cloud Functions.

Serious infrastructure sounds like fun!

I imagine your workflow looks something like:

```
external data sources -> (some script/program) -> dataset of location data
training dataset -> (some training algorithm) -> some model
```

and then,

```
some model -> (some import step) -> live services
```

> It is debatable whether there is any value in doing so: the best-matching georeferences are a point of departure for verification or improvement (getting you "on the map" to work with the answers visually), not an end in themselves.

Good point - not all things have to be kept, especially when experimenting. However, I would imagine that it is important to keep track of versions of input data and of the associated models that run in production. How else can folks set specific baselines, cite BELS, or ask specific questions? And, as a developer, I am keen to keep track of versions because it helps me trace the origin (or provenance) of data errors or suspicious data.
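As a concrete example of the lightweight versioning I have in mind, even a content hash of each gazetteer dump would give a stable, citable identifier to record alongside query results. The `corpus_version` helper below is hypothetical, not something in BELS:

```python
# Hypothetical helper: derive a stable version identifier for a gazetteer
# dump by hashing its bytes. Not part of BELS; sketch only.
import hashlib

def corpus_version(path: str, chunk_size: int = 1 << 20) -> str:
    """Return a sha256-based version identifier for a corpus file."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()

# e.g. record corpus_version("gazetteer-2022-06.tsv") with each batch of results
```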

Just curious:

- What is the volume of your input data?
- What is the volume of your model?
- How much would I have to pay to support the infrastructure needed to run BELS?