Task - Plazi diff

mjy commented 3 years ago

As a curator I want to understand how my curated data intersect with data processed by PLAZI.

Potential diffs:

Project level (summary comparisons)

Serials I reference that do/not have templates in PLAZI

Instance level (individual record, and their immediate related data comparisons)

Sources used in Citations in TW that do/not have corresponding processing in PLAZI
Names in TW from a specific source that do not have corresponding entries in resources derived from PLAZI
Sources used in citations that do not have a document linked to them but the document is available in a PLAZI resource
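
As a rough illustration of the instance-level source comparison listed above, the diff can be thought of as set operations over a shared identifier. This is only a sketch: it assumes both sides can be reduced to lists of DOIs, which the issue does not state, and the function names `normalize_doi` and `diff_sources` are hypothetical rather than part of TaxonWorks or any Plazi API.

```python
def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip common prefixes so equal DOIs compare equal."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def diff_sources(tw_dois, plazi_dois):
    """Compare two collections of DOIs and report the overlap and both remainders.

    Assumption: DOIs are the matching key. Real records may need to be
    matched on other article metadata instead.
    """
    tw = {normalize_doi(d) for d in tw_dois}
    plazi = {normalize_doi(d) for d in plazi_dois}
    return {
        "in_both": sorted(tw & plazi),         # TW sources with Plazi processing
        "only_in_tw": sorted(tw - plazi),      # TW sources without Plazi processing
        "only_in_plazi": sorted(plazi - tw),   # Plazi documents not cited in TW
    }


if __name__ == "__main__":
    # Made-up DOIs, for illustration only.
    report = diff_sources(
        ["https://doi.org/10.1234/A", "10.1234/b"],
        ["doi:10.1234/a", "10.1234/c"],
    )
    for bucket, dois in report.items():
        print(bucket, dois)
```

The same three-bucket shape would apply to the name-level comparison, with taxon names (suitably normalized) in place of DOIs.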

proceps commented 3 years ago

I would like to have a subclass of data in the DB "from external source", from Plazi in this case. And there could be many instances. For example, in 3i, I keep track of all nomenclatural changes, but I do not keep track of all citations of a particular name. But I would like to see all citations for this name from all available PDF sources, and PLAZI can help with it. From those sources, I would like to see all publications that have illustrations, and I want to see distribution records, etc. I don't necessarily want to append this automatically to the DB, but this could be an option.

mjy commented 3 years ago

@mguidoti do you by chance have a summary of all the PLAZI APIs compiled somewhere? One that would also include 1 step removed things like Zenodo, GBIF?

mguidoti commented 3 years ago

Hi @mjy, I'm not sure I understand your question. '1 step removed things'?

However, reading this issue, I think I have some comments that might be helpful/interesting for you guys.

You could check whether a source exists in Plazi TreatmentBank by querying the Article API (dioStats) using the article metadata, and that would cover your first and third use case scenarios. The second one would have to hit a different endpoint, the Treatment API (srsStats), with the taxon name, so you could retrieve all treatments (and associated data) therein for that specific taxon name. But there is something to highlight here.

@proceps mentioned two use cases himself, one regarding hitting a source with a full-text query to find any instance of a taxon name, and the second about retrieving specific information associated with a taxon name. For the latter, the Treatment API will do it, because distribution records, illustrations, etc., when associated with a taxon name, are almost always included in or cited by a taxonomic treatment. As for the first, although we do annotate taxonomic names anywhere in the document, we do not have an endpoint to query them like that. The TreatmentBank search would be the closest to it at the moment (because it's a full-text search). It returns treatments that mention the queried taxon name, but if, say, the article only mentions the name in its Introduction, it won't show up as a result. It's a TreatmentBank, after all.
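
The endpoint choices described above could be sketched as a small routing table. Only the endpoint names `dioStats` and `srsStats` come from this thread. The base URL, the use-case keys, and the query parameter names below are placeholders invented for illustration and should be replaced with whatever the Plazi API notes actually specify.

```python
from urllib.parse import urlencode

# Endpoint names as given in the thread; the keys are hypothetical labels
# for the three instance-level comparisons in the issue.
ENDPOINT_FOR_USE_CASE = {
    "source_processed": "dioStats",    # Article API: has this source been processed?
    "names_from_source": "srsStats",   # Treatment API: treatments for a taxon name
    "document_available": "dioStats",  # Article API: is the document held by Plazi?
}


def build_query_url(base_url: str, use_case: str, params: dict) -> str:
    """Build a query URL for the endpoint that serves the given use case.

    Assumption: the endpoint name is a path segment under `base_url` and
    filters go in the query string. Neither is confirmed by the thread.
    """
    endpoint = ENDPOINT_FOR_USE_CASE[use_case]
    return f"{base_url.rstrip('/')}/{endpoint}?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder host and parameter name; not a real Plazi URL.
    print(build_query_url(
        "https://plazi.example/api/",
        "names_from_source",
        {"taxonName": "Aus bus"},
    ))
```

Keeping the use-case-to-endpoint mapping in one table makes it easy to adjust once the real documentation for both endpoints is available.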

Plus, I have to say that finding which serials do or do not have templates on Plazi servers is not relevant, I suppose, because the lack of a template is not an impediment to processing a document. We can process things without templates. The output might require more manual curation, but, well, I think this is one of the ideas on the table, right? To have this feedback loop in order to not only display Plazi data, but also harvest the annotations made by your curators and include them in TreatmentBank as well, right?

Regarding API documentation, we're working on it. It will be released as a simple static webpage, but you can access the initial notes for these two endpoints here and here. Note that these are md files with the initial documentation effort that will feed the static pages later on. Needless to say, I would be more than happy to walk you through them and help build the queries you need for your initial tests and exploration.

debpaul commented 3 years ago

@mguidoti thanks for your list of APIs. @mjy @dimus @gdower will do the same on our end.

SpeciesFileGroup / taxonworks

Task - Plazi diff #2260
