MHRA / products

Products Information Portal and Microservices

Active substance names are excluded when they contain non-ASCII characters #551

Open rjkerrison opened 4 years ago

rjkerrison commented 4 years ago

Verified that active substance names are parsed as expected except for when they contain a non-ASCII character.

For example, in the example posted above, both "SUBSTANCE A" and "SUBSTANCE B" were attached to the product, but "ZOË" and "CAFÉ" were not.

Is there a chance that active substances will contain characters outside of the standard ASCII character set?

Originally posted by @craiga in https://github.com/MHRA/products/issues/426#issuecomment-602747805

Expected Behaviour

An active substance with an accented character in its name should be allowed.

rjkerrison commented 4 years ago

@m-doughty @StuartHarris The current behaviour of excluding non-ASCII substances is taken from the manual importer. What was the reason behind this? Should we continue to exclude non-ASCII substances?

StuartHarris commented 4 years ago

If I remember correctly, non-ASCII characters are not allowed in the metadata fields, but you should confirm this.

craiga commented 4 years ago

Have confirmed that something doesn't like non-ASCII characters. Attempting non-ASCII characters causes the following error:

thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', /Users/craiga/.cargo/git/checkouts/azuresdkforrust-0886d4f3ee80caf0/ce612fd/azure_sdk_storage_core/src/rest_client.rs:415:17
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Not sure if this has to do with the Azure crate or Azure itself.

craiga commented 4 years ago

Looks like HTTP headers aren't supposed to contain non-ASCII characters, and Hyper enforces this.
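To make the constraint concrete, here is a minimal sketch of checking names before they are sent as metadata. It assumes "header-safe" means printable ASCII only (space through `~`); the function names `is_header_safe` and `partition_substances` are hypothetical and are not taken from the importer or from Hyper.

```rust
/// Returns true if `value` contains only printable ASCII (space through `~`).
/// Assumption: this approximates what an HTTP header value may carry.
fn is_header_safe(value: &str) -> bool {
    value.bytes().all(|b| (0x20..=0x7E).contains(&b))
}

/// Splits names into those that pass the check and those that would need
/// encoding or manual review, instead of dropping the latter silently.
fn partition_substances<'a>(names: &[&'a str]) -> (Vec<&'a str>, Vec<&'a str>) {
    names.iter().copied().partition(|name| is_header_safe(name))
}
```

With the names from the original report, `partition_substances(&["SUBSTANCE A", "SUBSTANCE B", "ZOË", "CAFÉ"])` would put the first two in the accepted list and the last two in the rejected list, which matches the behaviour observed above.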

Stack Overflow seems to suggest that such strings should be URL encoded for Azure, but I wasn't able to find any definitive documentation about this in my brief search.
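For reference, a rough sketch of what URL (percent) encoding would look like using only the standard library. It assumes the value is encoded as UTF-8 bytes and that only RFC 3986 unreserved characters are left as-is; whether Azure expects exactly this form is unconfirmed, and `percent_encode` is a hypothetical helper, not part of the Azure crate.

```rust
/// Percent-encodes every byte of `value` except RFC 3986 unreserved
/// characters (A-Z, a-z, 0-9, '-', '.', '_', '~').
/// Non-ASCII characters are encoded byte by byte from their UTF-8 form.
fn percent_encode(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for byte in value.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            // Everything else, including spaces, becomes %XX in upper-case hex.
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}
```

Under these assumptions `percent_encode("ZOË")` gives `ZO%C3%8B` and `percent_encode("CAFÉ")` gives `CAF%C3%89`, both of which are plain ASCII. Anything reading the metadata back would need to decode it again.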

craiga commented 4 years ago

@StuartHarris found this which may be very helpful here https://docs.microsoft.com/en-us/azure/search/search-indexer-field-mappings#base64DecodeFunction

KaylieGreen1607 commented 4 years ago

Note from Aisling: No active substance should have any accents, as only the English language should be used for active substances in the U.K. If you have a list of the affected active substances that would be super, as this is something that we would need to remedy rapidly in affected case folders to remove this as an issue. As far as I know, we haven't come across any real-world examples of any of these characters being used, so effectively this isn't needed.