AtlasOfLivingAustralia / specieslist-webapp

Species lists and traits tool
https://lists.ala.org.au
Mozilla Public License 2.0

Difficult names match to higher-order taxa #235

Open charvolant opened 1 year ago

charvolant commented 1 year ago

For example,

https://lists.ala.org.au/speciesListItem/list/dr884?q=Caladenia+dilatata

where Caladenia dilatata is matched to the genus Caladenia. The taxonomy of C. dilatata is complex and messy, with many misapplications, which results in the higher-order match.

Suggested fix: allow exact-match parameters to be passed to the namematching service, and have it choose accepted taxa over synonyms when there are multiple possibilities.
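A minimal sketch of the "accepted over synonyms" preference, not the namematching service's actual code. The `Candidate` class, the status strings, and `choose_match` are all hypothetical names made up for illustration; the scores are the rounded values from this issue.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """Hypothetical search hit: a name, its taxonomic status, and its search score."""
    name: str
    status: str   # assumed values: "accepted", "synonym", "misapplied"
    score: float


def choose_match(candidates: List[Candidate], query: str) -> Optional[Candidate]:
    """Pick an exact-name match, preferring accepted taxa over anything else."""
    # Keep only candidates whose name matches the query exactly (case-insensitive).
    exact = [c for c in candidates if c.name.lower() == query.lower()]
    if not exact:
        return None
    # If any exact match is accepted, restrict the choice to accepted ones.
    accepted = [c for c in exact if c.status == "accepted"]
    pool = accepted or exact
    # Within the chosen pool, fall back to the search score.
    return max(pool, key=lambda c: c.score)


hits = [
    Candidate("Caladenia dilatata", "misapplied", 7.7),
    Candidate("Caladenia dilatata", "accepted", 6.9),
]
best = choose_match(hits, "Caladenia dilatata")
print(best.status, best.score)
```

With a rule like this the accepted taxon would be chosen despite its lower score, provided it is in the candidate list in the first place.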

charvolant commented 1 year ago

All SA examples have a supplied name of "Arachnorchis dilatata (R.Br.) D.L.Jones & M.A.Clem." Check that the SDS is correctly identifying these names.

charvolant commented 1 year ago

The search in the namematching library returns only 10 results. There are 17 matching results, only one of which is accepted, and that one gets left off.

The accepted value has a lower score than the misapplications (6.9 vs 7.7). Explain, please.

Misapplied score:

7.71838 sum of:
  4.915918 weight(name:caladenia dilatata in 597110) [BM25Similarity], result of:
    4.915918 score(freq=1.0), product of:
      10.559289 idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
        17 n, number of documents containing term
        674339 N, total number of documents with field
      0.46555385 tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
        1.0 freq, occurrences of term within document
        1.2 k1, term saturation parameter
        0.75 b, length normalization parameter
        2.0 dl, length of field
        2.1226935 avgdl, average length of field
  2.8024619 weight(genus:caladenia in 597110) [BM25Similarity], result of:
    2.8024619 score(freq=1.0), product of:
      6.1654162 idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
        1299 n, number of documents containing term
        618560 N, total number of documents with field
      0.45454544 tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
        1.0 freq, occurrences of term within document
        1.2 k1, term saturation parameter
        0.75 b, length normalization parameter
        1.0 dl, length of field
        1.0 avgdl, average length of field

Accepted score:

6.9079895 sum of:
  4.1055274 weight(name:caladenia dilatata in 329202) [BM25Similarity], result of:
    4.1055274 score(freq=1.0), product of:
      10.559289 idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
        17 n, number of documents containing term
        674339 N, total number of documents with field
      0.38880718 tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
        1.0 freq, occurrences of term within document
        1.2 k1, term saturation parameter
        0.75 b, length normalization parameter
        3.0 dl, length of field
        2.1226935 avgdl, average length of field
  2.8024619 weight(genus:caladenia in 329202) [BM25Similarity], result of:
    2.8024619 score(freq=1.0), product of:
      6.1654162 idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
        1299 n, number of documents containing term
        618560 N, total number of documents with field
      0.45454544 tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
        1.0 freq, occurrences of term within document
        1.2 k1, term saturation parameter
        0.75 b, length normalization parameter
        1.0 dl, length of field
        1.0 avgdl, average length of field

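The key element here is dl / avgdl in the name field, where dl is the document length (really the field length for a specific field) and avgdl is the average document length in the corpus. The accepted document has two entries, "Caladenia dilatata" and "Caladenia dilatata Caladenia dilatata R.Br." (what?!), and a dl of 3.0. The synonym document has just "Caladenia dilatata" and a dl of 2.0. The longer field lowers tf for the accepted document (0.389 vs 0.466), and that accounts for the whole score difference, since the idf and the genus component are identical.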
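To check the arithmetic, here is a small sketch that recomputes both scores from the formulas and numbers printed in the explain output above. The function and variable names are mine, not Lucene's, and it uses double precision where Lucene uses floats, so the last digits may differ slightly.

```python
import math

K1 = 1.2   # term saturation parameter, from the explain output
B = 0.75   # length normalization parameter, from the explain output


def idf(n: float, N: float) -> float:
    """idf = log(1 + (N - n + 0.5) / (n + 0.5))"""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def tf(freq: float, dl: float, avgdl: float, k1: float = K1, b: float = B) -> float:
    """tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))"""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


def term_score(n: float, N: float, freq: float, dl: float, avgdl: float) -> float:
    """Score contribution of one term: idf * tf."""
    return idf(n, N) * tf(freq, dl, avgdl)


# genus:caladenia is identical for both documents.
genus = term_score(n=1299, N=618560, freq=1.0, dl=1.0, avgdl=1.0)

# name:"caladenia dilatata" differs only in dl (2.0 vs 3.0).
name_misapplied = term_score(n=17, N=674339, freq=1.0, dl=2.0, avgdl=2.1226935)
name_accepted = term_score(n=17, N=674339, freq=1.0, dl=3.0, avgdl=2.1226935)

misapplied_total = name_misapplied + genus
accepted_total = name_accepted + genus

print(f"misapplied: {misapplied_total:.4f}")
print(f"accepted:   {accepted_total:.4f}")
```

Both totals should land close to the 7.72 and 6.91 reported above. Setting dl to 2.0 for the accepted document makes the two scores equal, which is what a correctly built name field would be expected to give.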

charvolant commented 1 year ago

The complete name Caladenia dilatata R.Br. is correctly supplied in combined-20210811-4, which means that the erroneous name is being constructed during index creation.