support for offline processing

jhpoelen commented 2 years ago

Hi!

Is it possible to use https://www.checklistbank.org/tools/name-match offline?

My use case is to match millions of names in automated workflows, and I was wondering how I could use your name-match tool without relying on a web API.

fyi @myrmoteras

mdoering commented 2 years ago

No, you cannot use ChecklistBank offline. But you could use its API in your workflows if you have access to the internet.

It would be possible to bundle a Docker image of the backend together with the ChecklistBank database and filesystem, but the data is the main problem there. It might be a good idea to provide that for the annual COL releases in the future.
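For readers wondering what "use its API in your workflows" could look like, here is a minimal sketch. The endpoint URL and the `q` parameter name are placeholders and assumptions, not the actual ChecklistBank API; the real paths and parameters should be taken from the API documentation. The fetch function is injectable so the workflow can be tested without internet access.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint: replace with the real matching URL from the API docs.
MATCH_URL = "https://api.example.org/match"


def build_match_url(name, base_url=MATCH_URL):
    """Build a GET URL for one name; the `q` parameter name is an assumption."""
    return base_url + "?" + urlencode({"q": name})


def fetch_json(url):
    """Fetch a URL and decode the JSON body (requires internet access)."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def match_names(names, fetch=fetch_json, delay_seconds=0.0):
    """Match names one by one, pausing between requests to respect rate limits."""
    results = {}
    for name in names:
        results[name] = fetch(build_match_url(name))
        if delay_seconds:
            time.sleep(delay_seconds)
    return results
```

Passing a stub as `fetch` (for example `lambda url: {"url": url}`) lets an integration test exercise the loop offline; only the default `fetch_json` touches the network.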

jhpoelen commented 2 years ago

@mdoering Thanks for taking the time to respond to my question.

> No, you cannot use ChecklistBank offline. But you could use its API in your workflows if you have access to the internet.

Thanks for confirming. As I mentioned before, I try to design and implement workflows that can work offline, because they are cheaper for me to maintain: I can reproduce workflows and implement automated integration tests. Also, I can scale to match hundreds or thousands of names per second and distribute the tools as part of automated integration tests for checklists (e.g., see https://github.com/globalbioticinteractions/name-alignment-template).

In the past, I used web APIs and found that they change outside my control (by design) and are rate limited. This worked for aligning thousands of names in a point-and-click workflow, but for automated schemes I found that I spent a lot of time keeping track of performance logs and error logs, and waiting for results to appear.

> But the data is the problem there mostly.

I like your idea to somehow package the checklist bank for independent re-use. Curious to hear more about why the data is a problem.

mdoering commented 2 years ago

> > But the data is the problem there mostly.
>
> I like your idea to somehow package the checklist bank for independent re-use. Curious to hear more about why the data is a problem.

Well, bundling a Docker image for the software and an empty database could be done, but snapshotting all of ChecklistBank would be really large and hard to do consistently, as it is constantly changing. Exporting only the annual COL checklist releases makes sense, but it is difficult to extract just those from the database because they are internally related in the Postgres tables. Also, some things are stored in the filesystem, e.g. metrics and names trees, which I have explained here: https://github.com/CatalogueOfLife/backend/issues/1163

Bringing this all together is not an easy task. Maybe just offering ColDP archives that one could then import into a clean CLB setup would be something?
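As a rough illustration of what working from an exported archive could look like on the consumer side, the sketch below loads a tab-separated name table into memory and does exact lookups fully offline. The column names `ID` and `scientificName` are assumptions about the export layout and may need adjusting, and this is plain exact matching on normalized strings, not the matching logic of the ChecklistBank name-match tool.

```python
import csv
import io


def normalize(name):
    """Collapse whitespace and lowercase so trivial variants still match."""
    return " ".join(name.split()).lower()


def load_index(tsv_file, id_column="ID", name_column="scientificName"):
    """Read a tab-separated name table into a dict: normalized name -> list of IDs."""
    index = {}
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        index.setdefault(normalize(row[name_column]), []).append(row[id_column])
    return index


def match(names, index):
    """Yield (input name, matching IDs); an empty list means no exact match."""
    for name in names:
        yield name, index.get(normalize(name), [])


# Tiny in-memory stand-in for a name table extracted from an archive.
sample = io.StringIO("ID\tscientificName\n1\tHomo sapiens\n2\tApis mellifera\n")
index = load_index(sample)
for name, ids in match(["homo  sapiens", "Unknown name"], index):
    print(name, ids)
```

A dictionary lookup like this has no rate limit and is reproducible as long as the archive file is pinned, which is the property the offline use case depends on; fuzzy matching and authorship handling would need to be layered on top.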

CatalogueOfLife / checklistbank

support for offline processing #1097