Open rjkerrison opened 4 years ago
@m-doughty @StuartHarris The current behaviour of excluding non-ASCII substances is taken from the manual importer. What was the reason behind this? Should we continue to exclude non-ASCII substances?
If I remember correctly, non-ASCII characters are not allowed in the metadata fields, but you should confirm this.
Have confirmed that something doesn't like non-ASCII characters. Attempting non-ASCII characters causes the following error:
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', /Users/craiga/.cargo/git/checkouts/azuresdkforrust-0886d4f3ee80caf0/ce612fd/azure_sdk_storage_core/src/rest_client.rs:415:17
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Not sure if this has to do with the Azure crate or Azure itself.
Looks like HTTP headers aren't supposed to contain non-ASCII characters, and Hyper enforces this.
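For reference, a minimal sketch of the kind of check that would catch this before the request is built. It is not Hyper's or the Azure crate's actual code: `is_header_safe` is a hypothetical helper, and the rule it applies (visible ASCII plus horizontal tab) is an approximation of what header values are assumed to allow.

```rust
/// Hypothetical helper: returns true when every byte of `value` is either
/// a horizontal tab or in the visible ASCII range 0x20..=0x7E.
/// Assumption: this approximates the rule enforced on HTTP header values;
/// it is not copied from Hyper or the Azure SDK.
fn is_header_safe(value: &str) -> bool {
    value
        .bytes()
        .all(|b| b == b'\t' || (0x20..=0x7e).contains(&b))
}
```

With this, `is_header_safe("SUBSTANCE A")` is true, while a name such as "ZOË" fails the check because "Ë" is encoded as two bytes above 0x7E in UTF-8.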
Stack Overflow seems to suggest that such strings should be URL encoded for Azure, but I wasn't able to find any definitive documentation about this in my brief search.
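If URL encoding does turn out to be the right approach, a rough sketch of it using only the standard library could look like the following. `percent_encode` is a hypothetical name, and whether Azure (or whatever reads the metadata back) decodes the value is an assumption that still needs confirming.

```rust
/// Hypothetical helper: percent-encodes every byte of the UTF-8 input
/// except the RFC 3986 unreserved characters (A-Z, a-z, 0-9, '-', '.', '_', '~').
/// The result is always plain ASCII.
fn percent_encode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            // Everything else, including each byte of a multi-byte
            // character, becomes %XX in upper-case hex.
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}
```

Note that spaces are encoded too (as `%20`), so anything reading the metadata would have to decode every value, not only the ones with accents.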
@StuartHarris found this which may be very helpful here https://docs.microsoft.com/en-us/azure/search/search-indexer-field-mappings#base64DecodeFunction
Note from Aisling: No active substance should have any accents, as only the English language should be used for active substances in the U.K. If you have a list of the affected active substances, that would be super, as this is something that we would need to remedy rapidly in the affected case folders to remove this as an issue.
As far as I know, we haven't come across any real-world examples of these characters being used, so effectively this isn't needed.
Verified that active substance names are parsed as expected except for when they contain a non-ASCII character.
For example, in the example posted above, both "SUBSTANCE A" and "SUBSTANCE B" were attached to the product, but "ZOË" and "CAFÉ" were not.
Is there a chance that active substances will contain characters outside of the standard ASCII character set?
Originally posted by @craiga in https://github.com/MHRA/products/issues/426#issuecomment-602747805
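The behaviour described in the quoted comment can be sketched as a simple partition, which would also give a list of the names being dropped. This is only an illustration, not the importer's actual code, and `split_by_ascii` is a hypothetical name.

```rust
/// Hypothetical helper: splits substance names into (kept, skipped),
/// where `kept` holds the pure-ASCII names and `skipped` holds any name
/// with at least one non-ASCII character. Input order is preserved.
fn split_by_ascii(names: &[&str]) -> (Vec<String>, Vec<String>) {
    names
        .iter()
        .map(|name| name.to_string())
        .partition(|name| name.is_ascii())
}
```

Given the four names from the example, "SUBSTANCE A" and "SUBSTANCE B" end up in the first list and "ZOË" and "CAFÉ" in the second.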
Expected Behaviour
An active substance with an accented character in its name should be allowed.