AtlasOfLivingAustralia / biocache-service

Occurrence & mapping webservices
https://biocache-ws.ala.org.au/ws/

Full-text `text` field should be documented in `index/fields` #709

Closed nickdos closed 9 months ago

nickdos commented 2 years ago

https://biocache.ala.org.au/ws/index/fields does not include SOLR fields that are copy-field destinations, such as text. Also, there is no way for a user to know which fields are copied into text.

brucehyslop commented 2 years ago

Merging in the fields defined in the SOLR schema would have the additional advantage of also listing fields that have no values for any record within the shard. Currently, only fields that exist in the interrogated shard are listed.

I suggest adding a new sourceFields property to IndexFieldDTO which is a list of the source fields copied.

The solr schema returns this data:

http://nci3-solr-1.ala:8983/solr/biocache/admin/luke?show=schema
...
"text": {
  "type": "textgen",
  "flags": "IT--UM------------",
  "positionIncrementGap": 100,
  "copyDests": [ ],
  "copySources": [
    "country",
    "eventID",
    "speciesSubgroup",
    "raw_typeStatus",
    "scientificName",
    "raw_countryConservation",
    "raw_stateConservation",
    "collectionName",
    "catalogNumber",
    "basisOfRecord",
    "vernacularName",
    "raw_scientificName",
    "institutionCode",
    "raw_vernacularName",
    "id",
    "class",
    "countryConservation",
    "order",
    "dataResourceName",
    "raw_basisOfRecord",
    "typeStatus",
    "stateProvince",
    "collectionCode",
    "occurrenceID",
    "kingdom",
    "recordedBy",
    "phylum",
    "dataResourceUid",
    "genus",
    "species",
    "dataProviderName",
    "locationID",
    "institutionName",
    "originalNameUsage",
    "speciesGroup",
    "stateConservation",
    "family"
  ]
}

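As a rough illustration of how the copied fields could be read out of that response, here is a minimal Python sketch. The function name `copy_sources` is hypothetical, and the nesting under `schema` and `fields` is an assumption about the part of the Luke output elided above; only `copySources` and the field names come from the quoted data.

```python
def copy_sources(luke_response, field_name):
    """Return the fields copied into field_name, or [] if there are none."""
    # Assumed layout: {"schema": {"fields": {<name>: {"copySources": [...]}}}}
    fields = luke_response.get("schema", {}).get("fields", {})
    return list(fields.get(field_name, {}).get("copySources", []))


# Trimmed sample based on the output quoted above (not the full list).
sample = {
    "schema": {
        "fields": {
            "text": {
                "type": "textgen",
                "copyDests": [],
                "copySources": ["country", "eventID", "scientificName"],
            },
            "country": {"type": "string"},
        }
    }
}

text_sources = copy_sources(sample, "text")
```

A list like this is what a sourceFields property on IndexFieldDTO would carry for each copy destination.
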
brucehyslop commented 2 years ago

@nickdos, the text field is not in the index field list because it is part of the default index.fields.tohide config.

The default value is _version_,text_recordedBy,defaultValuesUsed,generalisationToApplyInMetres,occurrenceDetails,text,quad, so these fields are filtered out by default. We could remove text from the default or override it with a config change.
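To make the effect of that setting concrete, here is a small Python sketch of the filtering, not the actual biocache-service code. The function `visible_fields` and the constant name are hypothetical; the comma-separated value is the default quoted above.

```python
# Default index.fields.tohide value as quoted in this thread.
DEFAULT_TO_HIDE = (
    "_version_,text_recordedBy,defaultValuesUsed,"
    "generalisationToApplyInMetres,occurrenceDetails,text,quad"
)


def visible_fields(index_fields, to_hide=DEFAULT_TO_HIDE):
    """Drop every field named in the comma-separated to_hide setting."""
    hidden = {name.strip() for name in to_hide.split(",") if name.strip()}
    return [name for name in index_fields if name not in hidden]


with_default = visible_fields(["country", "text", "quad", "genus"])
with_override = visible_fields(["country", "text", "quad", "genus"], "_version_")
```

With the default value, text never reaches the index/fields listing; with an override that hides only _version_, it does.
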

nickdos commented 2 years ago

Wasn't aware of that. I think all fields should be there, so users know they can be used. We should only exclude fields that are never exposed to users (internally used only), is my feeling.

brucehyslop commented 2 years ago

> Wasn't aware of that. I think all fields should be there, so users know they can be used. We should only exclude fields that are never exposed to users (internally used only), is my feeling.

These fields (and others) are not exposed to the user when returning an occurrence response; however, they are searchable and you can retrieve facets on them.

I'll remove everything except _version_ from the default index.fields.tohide. They can always be added back in biocache-config.properties if needed.

Note: removing the hidden fields will add a small overhead to the field list requested from SOLR and to the response it returns, since the cached index field list is used internally by biocache-service to define the fl parameter sent to SOLR if fields is not defined in the biocache-service request. Since the SOLR result objects are processed into a biocache DTO, which is then rendered as the result, there should be no change to the biocache-service API results.
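The overhead described here can be sketched in a few lines of Python. This is an illustration under assumptions, not the service's implementation: `build_fl` is a hypothetical helper, and joining the field names with commas is an assumed way of forming the fl value.

```python
def build_fl(requested_fields, cached_index_fields):
    """Pick the fl value: the caller's fields if given, else every cached field."""
    if requested_fields:
        return requested_fields
    # No fields in the request: fall back to the cached index field list,
    # so every field that is no longer hidden lengthens fl.
    return ",".join(cached_index_fields)


fl_before = build_fl(None, ["id", "country"])
fl_after = build_fl(None, ["id", "country", "text", "quad"])
fl_explicit = build_fl("id,genus", ["id", "country", "text", "quad"])
```

Requests that set fields explicitly are unaffected; only the fallback list grows.
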

brucehyslop commented 2 years ago

All fields have been removed from the default index.fields.tohide and will be returned by index/fields. A new optional property sourceFields containing an array of field names that are copied to a SOLR field has been added to the index field object.

See https://biocache-ws-test.ala.org.au/ws/index/fields?dataType=textgen for an example: all textgen fields are copied from other fields.
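A client could collect the new property along these lines. This Python sketch assumes the endpoint returns a JSON array of objects with a `name` key and an optional `sourceFields` array; the helper name and the sample records are made up for illustration, and no request is sent.

```python
def source_fields_by_name(index_fields):
    """Map each field that has sourceFields to the list of fields copied into it."""
    return {
        field["name"]: list(field["sourceFields"])
        for field in index_fields
        if field.get("sourceFields")
    }


# Hypothetical excerpt of an index/fields response (already parsed from JSON).
sample_fields = [
    {"name": "text", "dataType": "textgen", "sourceFields": ["country", "genus"]},
    {"name": "country", "dataType": "string"},
]

copied = source_fields_by_name(sample_fields)
```

Fields without the optional property are skipped, so the result lists only copy destinations.
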

nickdos commented 2 years ago

Thanks @brucehyslop that looks really good.

adam-collins commented 9 months ago

appears done