Open louisecpage opened 4 years ago
@markmacgillivray when you've discussed this on other projects, did the discussions include people whose native languages use special characters, and what was the outcome?
Would fuzzy matching operate within the autocomplete suggestions, and would the results returned account for mistypes and for special characters being used (or not)?
@hjh33 I've provided similar technical points but haven't been involved in the decisions of those projects - one of them was DOAJ though, I think, so @richard-jones may know what they chose for their user base, which does have some users with native languages that contain special characters.
About fuzzy matching, there's already an issue that relates to it, #36, which is in future requirements. I'd need to check whether fuzzy matches would incorporate special characters, but I believe they would.
Note, there will always be some limit to what "special characters" we can match. If we are talking about things like umlauts in human languages, that sort of thing should be fine. But if we are talking about things like special mathematical characters that certain users may use in the titles of things, we probably can't match those, because they usually get mangled during the ingest process before reaching us, e.g. crossref etc don't get them accurately from the publishers that submit them. However, this is something I've seen at article title level rather than at journal title level, but I'd guess there could be some obscure mathematical journal in the world somewhere that uses special characters that we will never be able to match.
I think the best solution for now is that when we know a journal title contains special characters (because we receive it that way from crossref for example) then we will store it that way. If a user then searches with the special characters, we will match it. Also, if the UI supports autocompletion then while the user is typing they will see possible titles appear anyway, and this may aid in them being able to select the correct one (and this also aids in minimising typing/spelling mistakes). Later, in future requirements, we can also add fuzzy matching if feedback from the first release indicates that users are too often having trouble finding the correct journal. Does that sound suitable to you?
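To make the proposal above concrete, here is a minimal sketch of "store the title as received, match it as typed, and let autocomplete help". It is not the project's implementation: the sample titles, the `autocomplete` function, and the in-memory list are all assumptions for illustration, and a real deployment would presumably query a search index instead.

```python
import unicodedata

# Hypothetical sample data: titles stored exactly as received from the source.
STORED_TITLES = [
    "Zeitschrift für Physik",
    "Journal of Physics",
    "Acta Mathematica",
]


def _key(text):
    """Normalise to NFC and casefold so equivalent encodings and letter case compare equal."""
    return unicodedata.normalize("NFC", text).casefold()


def autocomplete(prefix, titles=STORED_TITLES, limit=10):
    """Return stored titles that start with the typed prefix.

    Special characters are NOT folded here: 'ü' only matches 'ü'.
    """
    needle = _key(prefix)
    return [t for t in titles if _key(t).startswith(needle)][:limit]


print(autocomplete("Zeitschrift fü"))  # typed with the umlaut: title is suggested
print(autocomplete("Zeitschrift fu"))  # typed without the umlaut: no suggestion
print(autocomplete("Zeit"))            # short prefix: title is suggested either way
```

The last call shows why autocompletion helps even without any folding: a user who has only typed the start of the title sees the correctly spelled title offered before they reach the special character.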
@markmacgillivray that seems like a logical suggestion. Would be good to know @richard-jones what DOAJ went with.
@paulwalk @richard-jones I've discussed this on other projects before. There are some things to consider, as it is not a straightforward decision. Supporting search results for special characters is not hard in itself, and neither is supporting search without special characters, but supporting either OR both requires a decision about what experience is intended for the end user, and what the service wants to deliver. See below:
- In the default case, we store the journal name that we find in the world, e.g. whatever it is on crossref, or other sources we pull from. Whatever that journal name is, it is the one a user could find by typing it in. If it contains an umlaut, for example, and the user types an umlaut, they would get back that journal name in an autocomplete. For users who do use such languages, this would be the best match for what the user actually wants. But users who do not use that language and who type the name without an umlaut would not get the result back (because, really, they are misspelling the title).
- A user who does use special characters, for example in a language that uses umlauts in journal titles, and who uses a keyboard that supports those characters (or inserts them manually), would not necessarily expect to see results that do not include those special characters - e.g. if the journal name contains a special character and the user types that character, do they really want to be shown journal names that are NOT what they asked for?
- If we do want to support users who might expect search results to return journals with an umlaut in the title even if they did not type an umlaut, is supporting these users more important than avoiding a lower quality search experience for users who are typing the journal name correctly, as in the case described above?
- If we do want to support matching on characters with umlauts (for example) and also on the same characters without the umlaut, then yes, we can do that, once a decision is made in relation to the caveats above. However, we do need to know in advance, as it requires extra pre-processing of the data. It is not a lot of extra work to do, but it would need to be considered in scheduling.
- Keep in mind that even if we did do asciifold or similar to allow these sorts of matches, it is still possible that ANY user, whether typing languages with special characters or not, may mistype or misspell a journal title. So the result we really want for the user experience may be better supported by fuzzy matching regardless of character types, which is a query-time optimisation rather than a data pre-processing task.
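The difference between the last two points can be sketched roughly as follows. This is only an illustration under stated assumptions: the sample titles, `fold`, `folded_match`, and `fuzzy_match` are hypothetical names, the folding is approximated with Python's `unicodedata` rather than whatever the search backend provides, and the fuzzy matching uses the standard library's `difflib` as a stand-in for a real fuzzy query.

```python
import difflib
import unicodedata

# Hypothetical sample data, stored as received from the source.
TITLES = ["Zeitschrift für Physik", "Journal of Physics", "Acta Mathematica"]


def fold(text):
    """Approximate ASCII folding: decompose, drop combining marks, casefold.

    Only handles characters that decompose (e.g. 'ü' -> 'u'); letters such
    as 'ø' have no decomposition and pass through unchanged.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# Data pre-processing: the folded index is built once, ahead of any query.
FOLDED_INDEX = {}
for title in TITLES:
    FOLDED_INDEX.setdefault(fold(title), []).append(title)


def folded_match(query):
    """Exact match after folding both sides: with or without the umlaut."""
    return FOLDED_INDEX.get(fold(query), [])


def fuzzy_match(query, titles=TITLES, cutoff=0.8, limit=5):
    """Query-time fuzzy match against the unfolded titles (no pre-processing)."""
    by_key = {t.casefold(): t for t in titles}
    keys = difflib.get_close_matches(
        query.casefold(), list(by_key), n=limit, cutoff=cutoff
    )
    return [by_key[k] for k in keys]


print(folded_match("Zeitschrift fur Physik"))  # missing umlaut is tolerated
print(folded_match("Zietschrift fur Physik"))  # a typo is not
print(fuzzy_match("Zietschrift fur Physik"))   # fuzzy tolerates both
```

In this sketch folding alone fixes the missing umlaut but not the transposed letters, while the fuzzy match tolerates both without any change to the stored data. The `cutoff` of 0.8 is an arbitrary choice for the example and would need tuning against real titles.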