cioos-siooc / ckan

CKAN is an open-source DMS (data management system) for powering data hubs and data portals. CKAN makes it easy to publish, share and use data. It powers datahub.io, catalog.data.gov and europeandataportal.eu/data/en/dataset among many other sites.
http://ckan.org/

Indexing of French vs English content #201

Closed fostermh closed 9 months ago

fostermh commented 11 months ago

The current Solr indexing treats most fields as either the string or the text type. When a field is treated as general text, Solr strips out special characters and tokenizes on spaces. As a result, "n'est" is indexed as "nest", so CKAN searches return several datasets that should not match. There are likely many other examples of this. If we implement tokenizing and cleaning separately for English and French fields, "n'est" would be dropped from the French index, much as "and", "or", and "the" are removed from English indexed fields. That would be the preferable outcome.
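To make the "n'est" problem concrete, here is a rough Python sketch that imitates the two behaviours. It is not Solr code: the function names and the tiny stopword and elision lists are made up for illustration, and a real setup would more likely rely on Solr's own elision and stopword filters in a French-specific field type.

```python
import re

# Illustrative lists only (assumption); real stopword lists are much longer.
FRENCH_STOPWORDS = {"est", "et", "le", "la", "les", "de", "des", "un", "une"}
# Elided forms that precede an apostrophe in French (n', l', d', ...).
FRENCH_ELISIONS = {"l", "d", "n", "j", "m", "t", "s", "c", "qu"}


def generic_tokens(text):
    """Imitate the generic behaviour: strip special characters, split on whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", text.lower())
    return cleaned.split()


def french_tokens(text):
    """Strip elided prefixes, then drop French stopwords."""
    tokens = []
    for word in text.lower().split():
        word = word.replace("\u2019", "'")  # normalize curly apostrophes
        prefix, sep, rest = word.partition("'")
        if sep and prefix in FRENCH_ELISIONS:
            word = rest  # "n'est" becomes "est"
        word = re.sub(r"[^\w]", "", word)
        if word and word not in FRENCH_STOPWORDS:
            tokens.append(word)
    return tokens


sentence = "Ce n'est pas un nid"
generic = generic_tokens(sentence)  # contains the spurious token "nest"
french = french_tokens(sentence)    # "n'est" is removed entirely
```

With the generic path, a search for "nest" would hit this French sentence; with the French-aware path, neither "nest" nor "est" ends up in the index.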

Alternatively, we can use the string field type, which indexes the whole contents of a field without tokenizing. This works for things like keywords, EOVs, or organization names, but it is not appropriate for large text fields such as descriptions.
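The trade-off can be sketched in a few lines of Python. This is only a simplified model of the matching semantics, with hypothetical function names; it does not call Solr.

```python
def string_field_matches(stored, query):
    """A string-type field matches only when the whole value is identical."""
    return stored == query


def text_field_matches(stored, query):
    """A tokenized field matches when every query token is among the stored tokens."""
    stored_tokens = set(stored.lower().split())
    return all(tok in stored_tokens for tok in query.lower().split())


keyword = "Sea Surface Temperature"
description = "Hourly sea surface temperature measured at the buoy"

exact_hit = string_field_matches(keyword, "Sea Surface Temperature")
partial_miss = string_field_matches(description, "temperature")
tokenized_hit = text_field_matches(description, "temperature")
```

Exact matching is fine for a short keyword, but a description stored as a string would only be found by a query that reproduces it in full, which is why long free-text fields still need a tokenized type.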