apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.69k stars 1.04k forks source link

Clean up obsolete and commented-out cruft from StandardSyntaxParser.jj [LUCENE-9528] #10568

Closed asfimport closed 4 years ago

asfimport commented 4 years ago

The indentation in that file is crazy. So are micro-optimizations which make reading the syntax parser much more difficult than it actually is.


Migrated from LUCENE-9528 by Dawid Weiss (@dweiss), resolved Sep 18 2020 Parent: #10566 Linked issues:

Pull requests: https://github.com/apache/lucene-solr/pull/1879

asfimport commented 4 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I do have one question about syntax. We currently allow boost and fuzzy operators to come in any order:

term^3\~2

The order above doesn't make any sense from parser grammar point of view (and is in fact coded incorrectly allowing two conflicting fuzzy operators). The boost applies to a sub-clause (be it regexp, term query, phrase query) and grammar-wise it shouldn't occur before the fuzzy operator.

I can implement this as it was but it makes parser grammar more difficult to read and is just plain unnatural (in my opinion). If there are no objections, I'd like to restrict query parsing to disallow the above ordering.

https://github.com/dweiss/lucene-solr/blob/LUCENE-9528/lucene/queryparser/src/java/org/apache/lucene/queryparser/flexible/standard/parser/StandardSyntaxParser.jj#L398-L426

asfimport commented 4 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

This'll be an interesting test of whether the Gradle tasks that process the jj file works, let me know if it doesn't...

I don't know how often the javacc task(s) are run. I do know that we've seen multiple instances of people hand-editing the results of that task, it's very easy to overlook the comments at the top of the generated files that say DO NOT HAND EDIT. Particularly when you're trying to remove warnings, fix deprecations and the like.

Now I'm wondering if it makes sense to run this target as part of check. If people had hand-edited the output, it'd at least recreate the output with differences, then fail check because they weren't "git add"ed yet. It takes about 5 seconds to run that task.

It took me some time to make all of the hand-edits be done automatically, but the rule should be that if you hand-edit the results of that task, then something happens that gives you a clue you've done something wrong.

asfimport commented 4 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I cleaned up the gradle build file too (parts of it). It works fine.

I am hesitating whether we need to version these generated files at all. I see pros and cons. If people have strong feelings about versioning those files then we could at least have a dedicated separate src/generated folder to keep them (so that it's clear they're separate from main sources.

asfimport commented 4 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Theoretically I suppose some kind of versioning makes sense. OTOH, how many problems have we really had because it's not versioned? My guess is that it's not really worth the effort at this point, but I could be persuaded either way.

asfimport commented 4 years ago

ASF subversion and git services (migrated from JIRA)

Commit 3a92e1b93ebce73ab148eecc31776dde07c3d56d in lucene-solr's branch refs/heads/master from Dawid Weiss https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=3a92e1b

LUCENE-9528: cleanup of flexible query parser's grammar (#1879)

asfimport commented 4 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Just curious; are you (Dawid) improving the flexible query parser because you are using it, or ... ? I've been procrastinating proposing it's removal from Lucene. I've been on a couple big search projects where we took a look at it and chose not to use it, and I don't recommend to most people who need a query parser either.

asfimport commented 4 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I think this query parser is great, really! It does have some things that could be cleaned up but it is the only query parser that is externally extensible and usable for me. It supports multifield expansion for unqualified terms and it is very flexible - the parse tree resulting from query parsing can be adjusted and enhanced in multiple ways without touching the parser code (you only touch the post processing tree pipeline and/or the builders).

We have been using it with success in both our in-house projects (Lingo4G) and in combination with Solr. I don't see the reason to remove it. In fact, I think the approach this query parser takes (intermediate representation) is much superior to the "classic" query parser which creates intermediate Query instances right away. And the span query parser... eh. This isn't usable at all - the syntax is too strange - it was difficult to use even for me as a developer.

asfimport commented 4 years ago

David Smiley (@dsmiley) (migrated from JIRA)

What attracted both search projects I was a part of to this design, apart from me showing it to my clients/employers :-), is the flexible design and separation of query syntax specification from Lucene Queries. Indeed, that part is excellent! Both projects ultimately wrote new code for themselves that borrowed that same winning philosophy, and the choice not to use this flexible query parser was in neither case mine. In both cases, what turned my peers off from it is its complexity (I recall too many classes, and other reasons) and age (didn't use new Java language features/techniques and choice of JavaCC was dubious relative to Antlr), and I sympathize a lot. I haven't looked at it in years now but I recall having a constant itch to either simplify it and/or clean it up when looking at the old code, which isn't a good sign (to me). I was imagining a totally revamped flexible query parser built on Antlr4.

@rjernst maybe you might have some opinions on this given I've seen you work on 2-3 parsers in Lucene.

asfimport commented 4 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I don't think that grammar in javacc is too bad, to be honest - not after a minor cleanup (and me getting used to some javacc concepts). The problem with Antlr is that it adds a dependency on its runtime, which kind of sucks.

I honestly don't care much what it's written in as long as it works. :) Parboiled is kind of fun too [1], for example.

[1] https://en.wikipedia.org/wiki/Parboiled_(Java)

asfimport commented 4 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

We already have antlr used by expressions (and in Elasticsearch there's also Painless, @rmuir and my masterpiece of highly performant invokedynamic usage and native lambda support still not statically typed). So I have no problem with it.

Parboiled is funny, I used it in earlier times with pegdown, the precedessor of flexmark. The problem with pegdown was some design problems leading to slowness.

asfimport commented 4 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Which makes me think that I don't think we currently have antlr sources regeneration in gradle...

asfimport commented 4 years ago

Robert Muir (@rmuir) (migrated from JIRA)

My only comment on antlr is that newer versions are a bit trappy: they happily (but very slowly) parse stuff when there are crazy ambiguities, which older versions would not allow and detect statically.

With painless the only way I found to prevent such performance traps was to write unit tests for everything, and setup some special antlr magic to fail tests rather than do something slow. This way you know the grammar needs to be fixed instead of having performance issues:

https://github.com/elastic/elasticsearch/blob/9497f66aacef7193d343259ff3c1273b9c07456a/modules/lang-painless/src/main/java/org/elasticsearch/painless/antlr/Walker.java#L211-L227

asfimport commented 4 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Their explanation: https://github.com/antlr/antlr4/blob/master/doc/faq/general.md#what-is-the-difference-between-antlr-3-and-4

asfimport commented 2 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Closing after the 9.0.0 release