CatalogueOfLife / backend

Complete backend of COL ChecklistBank
Apache License 2.0

Change test data file format for `matching-ws` #1317

Open djtfmartin opened 2 months ago

djtfmartin commented 2 months ago

Getting the integration tests from gbif/checklistbank to work in the ported API involved assembling all the test data, which is JSON in the format of the v1 species API, into a single CSV for generating a test index.

Example of current format

{
    "usageKey": 1011638,
    "scientificName": "Abacion tesselatum Rafinesque, 1820",
    "canonicalName": "Abacion tesselatum",
    "rank": "SPECIES",
    "status": "ACCEPTED",
    "confidence": 100,
    "note": "Individual confidence: name=120; classification=-2; rank=0; status=1; singleMatch=10",
    "matchType": "EXACT",
    "kingdom": "Animalia",
    "phylum": "Arthropoda",
    "order": "Callipodida",
    "family": "Abacionidae",
    "genus": "Abacion",
    "species": "Abacion tesselatum",
    "kingdomKey": 1,
    "phylumKey": 54,
    "classKey": 361,
    "orderKey": 501,
    "familyKey": 7228,
    "genusKey": 1011637,
    "speciesKey": 1011638,
    "synonym": false,
    "class": "Diplopoda"
}

It may make sense to replace all test data in the nub.json files with a single CSV, or replace the nub.json files with small CSVs (which might be easier to maintain).
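
As a rough illustration of the CSV option, the sketch below flattens v1 match responses like the one above into CSV rows. The column selection, the column order and the function name `responses_to_csv` are assumptions made for this example only; they are not a format defined by the project.

```python
import csv
import io
import json

# Hypothetical column set: a subset of the v1 response fields shown above.
# The real test index may need different or additional columns.
COLUMNS = [
    "usageKey", "scientificName", "canonicalName", "rank", "status",
    "kingdom", "phylum", "class", "order", "family", "genus", "species",
]


def responses_to_csv(responses):
    """Serialise a list of v1 match responses (dicts) to a CSV string.

    Fields not listed in COLUMNS (confidence, note, matchType, ...) are
    dropped; fields missing from a response are written as empty cells.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for response in responses:
        writer.writerow(response)
    return buffer.getvalue()


# Shortened copy of the response above, used only to exercise the function.
example = json.loads("""
{
    "usageKey": 1011638,
    "scientificName": "Abacion tesselatum Rafinesque, 1820",
    "canonicalName": "Abacion tesselatum",
    "rank": "SPECIES",
    "status": "ACCEPTED",
    "confidence": 100,
    "kingdom": "Animalia",
    "genus": "Abacion"
}
""")

if __name__ == "__main__":
    print(responses_to_csv([example]))
```

Note that the scientific name contains a comma, so the CSV writer quotes that cell; any hand-maintained CSV would need the same care.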

Another option would be to switch to the texttree format.

mdoering commented 2 months ago

It was very convenient to use the matching response format to build the index. You could simply try out name matches that were problematic on the gbif API, store them locally as a seed to the index and create tests for it. Then work to fix the matching to respond as desired.

I fear that with the data in CSV or another format, quite some time will be spent preparing the test data. Could we not just continue with the old format? Or keep the old data as a single CSV, but then allow new names to be added via the new v2 matching response?

djtfmartin commented 2 months ago

My main concern was leaving a bunch of files that won't make much sense to future developers maintaining the API, as I assume v1 responses will be deprecated and eventually become unavailable. Also, the content of v2 responses won't fit into v1 (additional ranks, string keys, etc.).

For now I'll keep the existing files as they are and try to make it clear in the code that it's a format for v1 responses. For the new API, we can follow the same model of using the matching response format in v2 to build the index, and keep the v2 responses in a separate directory.
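
A minimal sketch of what keeping the two response formats apart could look like when loading seed data. The directory names `v1` and `v2`, the `*.json` file pattern and the function name `load_seed_responses` are all hypothetical; the actual layout in the repository may differ.

```python
import json
from pathlib import Path


def load_seed_responses(root, version):
    """Return the parsed JSON responses stored under <root>/<version>.

    `version` is a hypothetical directory name such as "v1" or "v2", so the
    two response formats never end up in the same list.
    """
    directory = Path(root) / version
    responses = []
    # Sort for a stable order, so index builds are reproducible across runs.
    for path in sorted(directory.glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            responses.append(json.load(handle))
    return responses
```

Loading by version keeps v1-specific assumptions (integer keys, a fixed set of ranks) out of the code that builds an index from v2 responses.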