Mapper Annotated Text - Escape string that should not be considered as annotation

ls-pluriels commented 4 years ago

I am indexing data coming from several sources. We use the Mapper Annotated Text plugin to enrich our output with our NLP tags.

In this example, the bold text contains a string that meet markdown syntax but should not be used as annotation.

In 2010, 57% of ERCOTs generation capacity and 38% of [electricity production](INDICATOR_571659&INDUSTRY6349) came from [natural gas]_(INDUSTRY_522125)[49](The Electric Reliability Council of Texas covers 75% of the state geographically and 80% of the population).

I discovered this because the % caused an exception : { "error": { "root_cause": [ { "type": "illegal_argument_exception", "reason": "URLDecoder: Illegal hex characters in escape (%) pattern - Error at index 0 in: \" o\"" } ], "type": "illegal_argument_exception", "reason": "URLDecoder: Illegal hex characters in escape (%) pattern - Error at index 0 in: \" o\"" }, "status": 400 }

It appears that without causing an exception, I could easily index wrong annotations

elasticmachine commented 4 years ago

Pinging @elastic/es-search (:Search/Analysis)

nik9000 commented 4 years ago

@markharwood, is this something you know about? I'm not 100% sure what we should do about this.

markharwood commented 4 years ago

We have a formal syntax for the format of this field that is validated - when content is presented that is illegal I can think of two options. Either we: A) Relax any validation - a lenient flag in the mapping could silently ignore illegal annotation syntax or B) Allow clients to escape any original content that looks like an annotation but isn't e.g. we might ignore a [ character if the client put a \ in front of the offending [

Neither of these are great. A) means turning off all validation so things that are meant to be annotations will fail silently if there are formatting issues B) complicates the syntax, is probably some data preparation users are unwilling to do and gives us a backwards compatibility issue with any existing content that happens to contain our new choice for escape characters.

I think the current system is the best one - prior to annotating, clients must cleanse the text of any characters that look like illegal markdown URLs. This could be as unobtrusive as inserting a space between any ]( characters or replacing ] characters with the equivalent html character reference ] or whatever makes sense to your application. At the end of the day something has to render the document and make sense of our annotations and any original text that might conflict with that markup - I don't think we should prescribe how this separation should be encoded.

Do you have any thoughts on this? Otherwise I will close.

ls-pluriels commented 4 years ago

HI Mark, thank you for handling this request. I did implement the cleansing by inserting a space between any ]( that was the most obvious solution to preserve positions

From my perspective, it was difficult to pinpoint the problem. I also noticed that there was an error because of the % char that was in the original text. Silently, non desired annotations were indexed without causing any error.

Maybe I could at least add some warnings in the documentation ?

I like your explanation : prior to annotating, clients must cleanse the text of any characters that look like illegal markdown URLs.

markharwood commented 4 years ago

Maybe I could at least add some warnings in the documentation ?

The "prior to annotating" part of the advice makes it feel like it's more of an upstream concern e.g. something for an NLP pipeline to consider.

ls-pluriels commented 4 years ago

After reading the document once more, it is really straight forward. I will perhaps write a blog post on our use case and highlight what may happen.

Thank you

markharwood commented 4 years ago

I will perhaps write a blog post on our use case and highlight what may happen.

Sounds great. I'd love to see that. Thank you.

elastic / elasticsearch

Mapper Annotated Text - Escape string that should not be considered as annotation #57347