Add Support for Escaped Characters

bripkens / lucene

Node.js lib to transform: lucene query → syntax tree → lucene query

MIT License

72 stars 33 forks source link

Add Support for Escaped Characters #16

Closed Ironykins closed 6 years ago

Ironykins commented 6 years ago

This involves a small tweak to the grammar such that certain backslash-prepended characters get added as-is into strings in the AST, rather than parsed and turned into properties.

Since this changes how the AST internally stores its strings, it's possible that users who are parsing strings directly out of the AST may experience breaking changes, but it is very very unlikely. This should be backwards compatible in all common use cases, so I've only bumped the minor version number.

I've added some new unit tests, and old unit tests are almost entirely intact, except in one case where I had to compare against an escaped character.

coveralls commented 6 years ago

Coverage increased (+3.5%) to 83.878% when pulling 9480db2d41cc4b7a9078c05c3817861543e26995 on Ironykins:master into be5c450a3ec9d9d8d4bb96a20c4e974202676d73 on bripkens:master.

lucecc commented 6 years ago

I have a question: I execute the following query "field: value \: *" through the library I get error does this fix (Candidate 1.4) solve this kind of problem that I have described to you?

bripkens commented 6 years ago

@lucecc: No, this will not permit parsing of field: value \: *, as this is not a valid lucene query (validated with the official Java based lucene query parser).

@Ironykins: While I like the idea of keeping track of the exact user input, I don't really like that this is a breaking change. Would it be possible for you to change this PR such that.

The user escaped string goes into a separate field into the AST, e.g. rawTerm?
That the term field is unchanged?

This way it wouldn't be a breaking change and you would still be able to access the raw input of the user.

bripkens commented 6 years ago

On second thought, I now understand more clearly where you are coming from. Let me revisit this again.

bripkens commented 6 years ago

Ignore my ignorant comment (https://github.com/bripkens/lucene/pull/16#issuecomment-375739422). I am taking a deeper look at your changes right now and will manually merge them to master with some smaller teaks. As mentioned before, I didn't understand at first why you were doing this, but it completely makes sense now :).

bripkens commented 6 years ago

I picked your changes to master in https://github.com/bripkens/lucene/commit/e39a772ff6267515f828200d0f4bc5bacf9ae78d. Also, I made some additional changes such that we clearly indicate and handle the breaking change (and to provide helpers to account for these breaking changes). Once the build is done, I'll release it as 2.0.0.

Thank you very much for this PR!

bripkens commented 6 years ago

Released as 2.0.0: https://github.com/bripkens/lucene/blob/master/CHANGELOG.md#200

Ironykins commented 6 years ago

Thanks very much! I love what you're doing with this project.

raicast commented 6 years ago

@bripkens I do not understand if the release 2.0.0 supports the Apache Lucene specification: https://lucene.apache.org/core/2_9_4/queryparsersyntax.html

In particular, the Escaping Special Characters section: Lucene supports escaping special characters that are part of the query syntax. The current list special characters are

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \

To escape these character use the \ before the character. For example to search for (1+1):2 use the query:

\(1\+1\)\:2

bripkens commented 6 years ago

@raicast: If this is a bug report, then please open a separate ticket.

This being said:

> require('lucene').parse('\\(1\\+1\\)\\:2')
{ left:
   { field: '<implicit>',
     fieldLocation: null,
     term: '\\(1\\+1\\)\\:2',
     quoted: false,
     termLocation: { start: [Object], end: [Object] },
     similarity: null,
     boost: null,
     prefix: null } }