apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Massive Code Duplication in Contrib Analyzers - unify the analyzer ctors [LUCENE-2034] #3109

Closed asfimport closed 14 years ago

asfimport commented 14 years ago

Due to the various tokenStream APIs we have in Lucene, analyzer subclasses need to implement at least one of the methods returning a TokenStream. When you look at the code, the two implementations appear to be almost identical whenever both are present in the same analyzer. Each analyzer defines the same inner class (SavedStreams), which is unnecessary. In contrib, almost every analyzer uses stopwords, and each of them either invents its own way of loading them or defines a large number of ctors to load stopwords from a file, set, array, etc. Those ctors should be deprecated and eventually removed.
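
To make the duplication concrete, here is a hedged sketch of the kind of per-analyzer boilerplate being described - the analyzer name and filter chain are purely illustrative, and the real contrib analyzers vary; the reuse plumbing (SavedStreams plus getPreviousTokenStream/setPreviousTokenStream) is the part every analyzer repeats:

import java.io.IOException;
import java.io.Reader;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// illustrative analyzer, not an actual contrib class
public final class ExampleStopwordAnalyzer extends Analyzer {
  private final Set<?> stopwords;

  public ExampleStopwordAnalyzer(Set<?> stopwords) {
    this.stopwords = stopwords;
  }

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // the filter chain, spelled out once...
    TokenStream result = new WhitespaceTokenizer(reader);
    result = new LowerCaseFilter(result);
    return new StopFilter(true, result, stopwords);
  }

  // the per-analyzer inner class the issue calls out as unnecessary duplication
  private static final class SavedStreams {
    Tokenizer source;
    TokenStream result;
  }

  @Override
  public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
    SavedStreams streams = (SavedStreams) getPreviousTokenStream();
    if (streams == null) {
      // ...and spelled out a second time for the reusable variant
      streams = new SavedStreams();
      streams.source = new WhitespaceTokenizer(reader);
      streams.result = new StopFilter(true, new LowerCaseFilter(streams.source), stopwords);
      setPreviousTokenStream(streams);
    } else {
      streams.source.reset(reader);
    }
    return streams.result;
  }
}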


Migrated from LUCENE-2034 by Simon Willnauer (@s1monw), resolved Jan 03 2010 Attachments: LUCENE-2034,patch (versions: 2), LUCENE-2034.patch (versions: 9), LUCENE-2034.txt Linked issues:

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Attached is a patch with a base class that solves the code duplication issue and simplifies stopword loading for contrib analyzers. The patch contains three analyzers which use this new base class.
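
For readers without the attachment, a minimal sketch of the general shape such a base class can take - the class and method names here are assumptions for illustration, and the actual patch may differ. The idea is that subclasses describe the tokenizer/filter chain exactly once, while reuse is handled in a single place:

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;

public abstract class BaseAnalyzer extends Analyzer {

  /** pairs the Tokenizer with the outer end of the filter chain built on top of it */
  protected static class TokenComponents {
    final Tokenizer source;
    final TokenStream result;

    protected TokenComponents(Tokenizer source, TokenStream result) {
      this.source = source;
      this.result = result;
    }
  }

  /** the only method a concrete analyzer still has to implement */
  protected abstract TokenComponents createComponents(String fieldName, Reader reader);

  @Override
  public final TokenStream tokenStream(String fieldName, Reader reader) {
    return createComponents(fieldName, reader).result;
  }

  @Override
  public final TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
    TokenComponents streams = (TokenComponents) getPreviousTokenStream();
    if (streams == null) {
      streams = createComponents(fieldName, reader);
      setPreviousTokenStream(streams);
    } else {
      // reuse the cached chain, just point the tokenizer at the new reader
      streams.source.reset(reader);
    }
    return streams.result;
  }
}

With something of this shape, the SavedStreams inner classes and the duplicated tokenStream/reusableTokenStream bodies collapse into a single createComponents implementation per analyzer.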

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

BTW, if somebody comes up with a better name for the analyzer, speak up! @robert: no super, fast or smart, please :)

simon

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I have removed a sys.out I had accidentally left in there. The new patch contains more converted analyzers, but I guess we should get this in first before we start converting all analyzers in contrib. Pretty neat though - stuff like ChineseAnalyzer only has one method left in it after inheriting from BaseAnalyzer.

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I changed the name from BaseAnalyzer to AbstractContribAnalyzer, which might do a better job than BaseAnalyzer.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

And then we also have an AbstractCoreAnalyzer? Weird...

I want to bring this to core, too.

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

And then we also have an AbstractCoreAnalyzer? Weird... I want to bring this to core, too.

I understand! Let's get this into contrib under either name. Once we move this up into core it will be called Analyzer anyway, so we can refactor it in contrib easily. The name AbstractContribAnalyzer would then again be OK, as it would only contain the stopword convenience.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Even in core it will be a separate class, because it makes tokenStream() and reusableTokenStream() final, so users who want to create an old-style Analyzer cannot do so. So we need a good name even for core.

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I would be happy to get a better name for it - any suggestions? I'm having a hard time finding one.

It's your turn, Uwe :)

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I externalized the stopword-related code into another base analyzer; the two classes are now named AbstractAnalyzer and StopawareAnalyzer.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Simon, I started looking at this. The testStemExclusionTable() test for BrazilianAnalyzer is actually not related to stopwords and should not be changed.

BrazilianAnalyzer has a .setStemExclusionTable() method that allows you to supply a set of words that should not be stemmed.

This test ensures that if you change the stem exclusion table with this method, reusableTokenStream() will force the creation of a new BrazilianStemFilter with the modified exclusion table, so that the change takes effect immediately - the way it did with .tokenStream() before this analyzer supported reusableTokenStream().

(edit, addition) Also, I think this setStemExclusionTable stuff is really unrelated to your patch, but it is a reuse challenge in at least this analyzer. One way to solve it would be to...

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

The testStemExclusionTable() test for BrazilianAnalyzer is actually not related to stopwords and should not be changed.

I agree - I should have extended the test case, but instead I changed it to test the constructor only. I will extend it instead. This test case is actually a duplicate of testExclusionTableReuse(); it should test tokenStream() instead of reusableTokenStream() - I will fix this too.

This test ensures that if you change the stem exclusion table with this method, reusableTokenStream() will force the creation of a new BrazilianStemFilter with the modified exclusion table, so that the change takes effect immediately - the way it did with .tokenStream() before this analyzer supported reusableTokenStream().

That is actually what testExclusionTableReuse() does.

Also, I think this setStemExclusionTable stuff is really unrelated to your patch, but it is a reuse challenge in at least this analyzer. One way to solve it would be to...

I agree with your first point that this is kind of unrelated. I guess we should do that in a different issue, though I don't think it is that big a deal, as it does not change any functionality. I disagree about the reuse challenge: in my opinion analyzers should be immutable, which is why I deprecated those methods and added the set to the constructor. The problem with those setters is that you have to be in the same thread to change your set, as calling them only invalidates the cached version of the token stream held in a ThreadLocal. The implementation is ambiguous and should go away. The analyzer itself can be shared, but the behaviour is somewhat unpredictable if you reset the set: if there is an instance of this analyzer around and you call the setter, you would expect the analyzer to use the new set from the moment you call the setter, which is not always true.

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Updated the patch to the current trunk (the massive @Override patch broke the previous one).

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Simon, good solution. I agree we should deprecate these analyzer 'setter' methods, which just make things complicated for no good reason.

I wonder if we should consider a different name for this AbstractAnalyzer, since it exists to support/encourage TokenStream reuse. I think when Shai Erera brought the idea up before, he proposed ReusableAnalyzer or something like that?

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I wonder if we should consider a different name for this AbstractAnalyzer, since it exists to support/encourage TokenStream reuse. I think when Shai Erera brought the idea up before, he proposed ReusableAnalyzer or something like that?

I agree that this is not a very good name, as we all discussed during ApacheCon. With ReusableAnalyzer I would guess people would expect this analyzer to be reusable, which isn't the case, or rather is not what this analyzer is doing. What if we call it ComponentAnalyzer or NewStyleAnalyzer or SmartAnalyzer (ok, just kidding)?

simon

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Pushed this to 3.1. We might want to change existing analyzers in core to use the new analyzer too and make them final, so we are better off pushing it to 3.1.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Simon, here:

source.reset(reader);
if (sink != source) {
  sink.reset(); // only reset if the sink reference is different from the source
}

we had a discussion on the mailing list about this: http://www.lucidimagination.com/search/document/cd8a94ebc8a4ea99/bug_in_standardanalyzer_stopanalyzer

I think we should consider removing the if and unconditionally calling sink.reset(). A bad consumer might not follow the rules, although the TokenStream javadoc says that consumers should call reset().
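
A sketch of the unconditional variant being suggested, assuming the same source/sink locals as in the snippet quoted above:

source.reset(reader);
sink.reset(); // reset unconditionally, even when sink == source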

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Updated the patch to the current trunk. I have not removed all the deprecated methods in contrib/analyzers yet - we should open another issue for that, IMO. This patch still breaks backwards compatibility, though, as some of the non-final contrib analyzers now extend StopawareAnalyzer, which makes the old tokenStream / reusableTokenStream methods final. IMO this should not block this issue, for the following reasons:

  1. It's in contrib - a different story from core.
  2. It is super easy to port them.
  3. It makes the API cleaner and requires less code.
  4. Those analyzers might have to change anyway due to the deprecated methods.

simon

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Set the svn EOL property to native - I missed that in the last patch.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Set the svn EOL property to native - I missed that in the last patch.

You can configure your SVN client to do it automatically and also add the $Id$ props.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Simon, in my opinion it is OK to make tokenStream/reusableTokenStream final for those non-final contrib analyzers.

I think you should make those non-final analyzers final, too.

Then we can get rid of the complexity for sure.

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I think you should make those non-final analyzers final, too.

+1

I think the analyzers should always be final. Maybe there are special cases, but for most of them nobody should subclass. It's the same amount of work to make your own anyway.

simon

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I think you should make those non-final analyzers final, too.

I would prefer to open a separate issue for making those final, removing the deprecated methods, making the public String[] fields private, etc.

Once this is in, we can refactor all the other analyzers and fix them case by case.

simon

asfimport commented 14 years ago

DM Smith (migrated from JIRA)

I was trying to lurk, but I'm not able to apply the latest patch against trunk. I'm not sure whether it's me (using Eclipse) or the patch.

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I was trying to lurk, but I'm not able to apply the latest patch against trunk. I'm not sure whether it's me (using Eclipse) or the patch.

It's most likely the patch. There is so much going on around the analyzers right now. We are trying to get #3170 in and will get this one ready once it is in. I will update this patch soon.

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I updated this patch to the latest trunk. The patch doesn't remove any deprecated methods from contrib/analysis, nor does it mark the other analyzers final. I think we should do all that in a different issue. I haven't added a note to contrib/CHANGES.TXT yet, although the patch already breaks backwards compatibility for all non-final analyzers subclassing AbstractAnalyzer / StopawareAnalyzer. Once we have consensus on this patch I will add it.

asfimport commented 14 years ago

DM Smith (migrated from JIRA)

Patch looks good. I like how this simplifies the classes.

Some comments based on my use case, which allows a user creating an index to decide whether to use Lucene's default stop words or no stop words at all. No stop words is the default. (I'm also allowing stemming to be optional, but on by default.) These two requirements force me to duplicate each of the contrib analyzers while reusing their parts. (If you're interested: each Lucene index is a whole book, where each paragraph is a document. Every word is potentially meaningful, so stop words are not used by default.)

Regarding stop words:

Regarding 3.1: There are some TODOs in the code to make this or that private or final. If this is going to wait for 3.1, shouldn't they change?

On a separate note: In WordListLoader the return types are not Set or Map, but HashSet and HashMap. What's up with that? Should anyone care what the particular implementation is?

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Hi DM, in response to your comments, I would prefer all stoplists to actually be in the resources/* folder as text files.

The reasoning is to encourage use of the different parts of the analyzer, i.e. a Solr user could use a Russian stopword list embedded in a Russian analyzer without using the analyzer itself (maybe they want to use the stemmer, but after WordDelimiterFilter and things like that).

Somewhat related: I also want to add the stoplists that the Snowball project creates to the snowball package in contrib (see #3131). This would allow us to remove duplicated functionality - analyzers we have coded in Java in Lucene that are essentially the same as what Snowball already provides.

asfimport commented 14 years ago

DM Smith (migrated from JIRA)

Robert, I'd like them to be in files as well. But when it really comes down to it, a uniform interface to the default stop word list is what really matters to me.

Like your use case, I don't see the provided analyzers as much more than a suggestion and a default implementation. Currently, and in this patch, I have to use them to get to the stop words.

I'm trying to figure out a way to specify a tokenizer/filter chain. (I've been trying to figure it out for a while, but without much effort or success.) Something like:

TokenStream construct(Version v, String fieldName, Reader r, StreamSpec... specs) {
  // the first spec creates the Tokenizer from the Reader
  TokenStream result = specs[0].create(v, fieldName, r);
  // each remaining spec wraps the stream with the next TokenFilter
  for (int i = 1; i < specs.length; i++) {
    result = specs[i].create(v, fieldName, result);
  }
  return result;
}

The purpose of the StreamSpec is to allow a late binding of tokenizers/filters into a chain.
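
One possible shape for such a spec - purely hypothetical and not part of any patch - where the first spec produces the Tokenizer from the Reader and each subsequent spec wraps the previous stream; a real design would probably split tokenizer specs from filter specs:

import java.io.Reader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.util.Version;

public interface StreamSpec {
  /** used for the first spec in the chain: create the Tokenizer from the Reader */
  TokenStream create(Version matchVersion, String fieldName, Reader reader);

  /** used for every subsequent spec: wrap the previous stream with the next TokenFilter */
  TokenStream create(Version matchVersion, String fieldName, TokenStream previous);
}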

The other part would be to generate a Manifest with version info for Lucene, Java and each component that could be stored in (or with) the index. That way one could compare the manifest to see if the index needs to be rebuilt. This manifest could also be used to reconstruct the TokenStream.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

On a separate note: In WordListLoader the return types are not Set or Map, but HashSet and HashMap. What's up with that? Should anyone care what the particular implementation is?

That's historical. For 2.9 it was not possible to make the methods covariant with a different return type because of backwards compatibility, so the old ones could not be deprecated. With 3.0 they stayed alive, and now there they are.

With Java 1.5, it should be possible to provide a covariant overload and deprecate the specializations. I will try this out in a separate issue!

Ideally the new methods should return Set<?> but implement it with a CharArraySet (which would then be possible). At the moment the sets are always copied to a CharArraySet in each analyzer.
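
A hypothetical sketch of that direction (these are not WordListLoader's actual methods): keep the old HashSet-returning signature deprecated and add a method that only promises Set<String>, so the backing implementation can later become a CharArraySet without breaking callers:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashSet;
import java.util.Set;

public final class WordlistSketch {
  private WordlistSketch() {}

  /** @deprecated ties callers to the concrete HashSet type */
  @Deprecated
  public static HashSet<String> getWordSetLegacy(Reader reader) throws IOException {
    return new HashSet<String>(getWordSet(reader));
  }

  /** callers only see the interface; the backing set could become a CharArraySet later */
  public static Set<String> getWordSet(Reader reader) throws IOException {
    final Set<String> result = new HashSet<String>(); // would be a CharArraySet in the real change
    final BufferedReader br = new BufferedReader(reader);
    try {
      for (String line = br.readLine(); line != null; line = br.readLine()) {
        result.add(line.trim());
      }
    } finally {
      br.close();
    }
    return result;
  }
}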

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Robert, I'd like them to be in files as well. But when it really comes down to it, a uniform interface to the default stop word list is what really matters to me.

DM, I think we can have both? A method to get the default stopword list, but then they also happen to be in text files too?

asfimport commented 14 years ago

DM Smith (migrated from JIRA)

Robert:

DM, I think we can have both? A method to get the default stopword list, but then they also happen to be in text files too?

Yes.

Uwe:

Ideally the new methods should return Set<?> but implement it with a CharArraySet (which would then be possible). At the moment the sets are always copied to a CharArraySet in each analyzer.

I agree. That could also simplify some of what Simon is doing. However, one distinctive feature of CharArraySet is that it can take input that is not lowercase and ignore the case. This is what Simon's StopawareAnalyzer.loadStopwordSet(...) allows.

BTW, in some of the analyzers the set is sometimes a CharArraySet and other times it is not (when it comes via this class). This would make the treatment uniform.

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Some of the analyzers allow for null to be specified for the stop word list. Others require an empty set/file/reader. Those deriving from StopawareAnalyzer allow null.

That is true - StopawareAnalyzer uses an empty set if you pass null.

I'd like to see the ability to use null to follow through the rest of the analyzers.

Some of the analyzers are cluttered with stopword list processing.

The analyzers in this patch are rather a PoC than a complete list. Eventually all analyzers with stopwords will extend StopawareAnalyzer - that is also the reason why we have this class. This and some other issues aim to eventually give us a consistent way of processing everything related to stopwords. We will also remove all the setters and have Set<?>-only ctors for consistency.

If not how about adding public static Set<?> getDefaultStopSet() to StopawareAnalyzer?

The problem is that it is static - and it should be static. That's why we define it in each analyzer that uses stopwords. I would like to have it generalized, but this seems to be the ideal solution. We could have something like getDefaultStopSet(Class<? extends StopawareAnalyzer>), but I like the expressiveness of a plain getDefaultStopSet() way better.

How about splitting out the stop words to their own class?

What do you mean by that? Can you elaborate?

There are some TODOs in the code to make this or that private or final. If this is going to wait for 3.1, shouldn't they change?

They should actually go away, but I kept them in there because they are somewhat unrelated to this particular issue. Once this is in, we will work on removing the deprecated stuff and making analyzers final (at least in contrib).

In WordListLoader the return types are not Set or Map, but HashSet and HashMap. What's up with that? Should anyone care what the particular implementation is?

That is one thing I hate about WordListLoader. +1 for Uwe working on them!

I'm trying to figure out a way to specify a tokenizer/filter chain. (I've been trying to figure it out for a while, but without much effort or success.)

This has been discussed already, and we haven't had much success with it. I cannot remember the issue (Robert, can you remember the factory issue?), but it was basically based on a factory pattern. That would also be my approach to it. That way we could get rid of almost every analyzer. I use such a pattern myself and it works quite well.

DM, I think we can have both? A method to get the default stopword list, but then they also happen to be in text files too?

+1 for having those words in files. Nevertheless, we will still have a default stopword list.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

This has been discussed already, and we haven't had much success with it. I cannot remember the issue (Robert, can you remember the factory issue?), but it was basically based on a factory pattern. That would also be my approach to it. That way we could get rid of almost every analyzer. I use such a pattern myself and it works quite well.

I think the only issue is that if I were to design such a thing, it would look just like how the analysis factories work in Solr... (already a solved problem)... maybe I am missing something?

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I think the only issue is that if I were to design such a thing, it would look just like how the analysis factories work in Solr... (already a solved problem)... maybe I am missing something?

No, you don't. I just did the same thing back when there was no Solr around; it works pretty much the same way.

asfimport commented 14 years ago

DM Smith (migrated from JIRA)

How about splitting out the stop words to their own class?

What do you mean by that? Can you elaborate?

There are several parts to this.

Most of the analyzers with stop words allow overriding with any of these, and some throw something else into the mix (such as non-UTF-8 encoded files).

The code to handle these cases is somewhat repetitious.

My thought is for a class, say StopWords, that knows how to read stopwords.txt as a resource loaded by a specified class loader. Something like:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Set;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.util.Version;

public class StopWords {

  protected static final String DEFAULT_STOPFILE = "stopwords.txt";
  protected static final String DEFAULT_COMMENT  = "#";

  private final Version matchVersion;
  private final String  stopFile;
  private final String  comment;
  private final boolean ignoreCase;
  private       Set<?>  defaultStopWords;

  public StopWords(Version matchVersion, String stopFile, String comment, boolean ignoreCase) {
    this.matchVersion = matchVersion;
    this.ignoreCase   = ignoreCase;
    this.stopFile     = stopFile != null ? stopFile : DEFAULT_STOPFILE;
    this.comment      = comment  != null ? comment  : DEFAULT_COMMENT;
  }

  public synchronized Set<?> getDefaultStopWords() {
    // lazy loading
    if (defaultStopWords == null) {
      defaultStopWords = load();
    }
    return defaultStopWords;
  }

  protected Set<?> load() {
    try {
      // getClass() returns the runtime class, so the resource is looked up
      // relative to the package of the concrete subclass
      final BufferedReader reader = new BufferedReader(
          new InputStreamReader(getClass().getResourceAsStream(stopFile), "UTF-8"));
      final CharArraySet result = new CharArraySet(matchVersion, 0, ignoreCase);
      try {
        for (String word = reader.readLine(); word != null; word = reader.readLine()) {
          if (!word.startsWith(comment)) {
            result.add(word.trim());
          }
        }
        return CharArraySet.unmodifiableSet(result);
      } finally {
        reader.close();
      }
    } catch (IOException e) {
      throw new RuntimeException("could not load " + stopFile, e);
    }
  }
}

Because getClass() resolves to the class of the actual object and not the class in which the method is defined, the resource is looked up relative to the concrete subclass's package.

Then in o.a.l.analysis.ar have:

public class ArabicStopWords extends StopWords {
  public ArabicStopWords(Version matchVersion) {
      super(matchVersion, null, null, false);
  }
}

Note that the arguments to super depend on the nature of the provided stop word list.

Additional code could be added to StopWords to handle the resource as a Reader or as a String[], but if we follow Robert's suggestion to externalize the lists into files it is not needed.

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

DM, thanks for the extensive example, but I do not see the benefit compared to the current solution. To access a default stopword set you have to create an instance of a specific analyzer, which is IMO not a very natural way to do it. If you make it available statically in the analyzer, it is simply equivalent to the StopawareAnalyzer solution, which provides the loading code. We will always have to add a public static Set<?> getDefaultStopwords() method to each analyzer, and this analyzer has to load the stopwords somehow.

I personally prefer the holder pattern, as it is guaranteed to be lazy by the JVM. It is a simple, declarative solution which requires developers to be consistent, but that consistency is already required with the static getDefaultStopwords() method - not really a win. Please correct me if I am missing something.
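
For reference, a minimal sketch of the lazy holder idiom being referred to - the class name and hard-coded word list are illustrative only; real code would load the set from a resource file:

import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public final class ExampleDefaultStopwords {
  private ExampleDefaultStopwords() {}

  // the JVM initializes this nested class only on first access, exactly once,
  // so the default set is loaded lazily without any explicit synchronization
  private static final class DefaultSetHolder {
    static final Set<String> DEFAULT_STOP_SET = Collections.unmodifiableSet(
        new HashSet<String>(Arrays.asList("a", "an", "the"))); // illustrative; load from a file in practice
  }

  public static Set<String> getDefaultStopSet() {
    return DefaultSetHolder.DEFAULT_STOP_SET;
  }
}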

asfimport commented 14 years ago

DM Smith (migrated from JIRA)

But I do not see the benefit compared to the current solution.

In an earlier post we discussed that it'd be possible, like Solr, to eliminate analyzers in favour of a factory pattern. The benefit of this variation (you are right, it is equivalent) is that it moves in that direction.

To access a default stopword set you have to create an instance of a specific analyzer, which is IMO not a very natural way to do it.

It could be made into a singleton (which would have been better in the first place), or static, or both. I just tossed together one example, though an extensive one, as an answer. Also, the matchVersion is not needed in the derived classes. So here is an alternative:

public class ArabicStopWords extends StopWords {
  private static final StopWords INSTANCE = new ArabicStopWords();

  private ArabicStopWords() {
    super(Version.LUCENE_30, null, null, false);
  }

  // a static accessor cannot share the name of the inherited instance method,
  // so it gets a slightly different one
  public static Set<?> getDefaultSet() {
    return INSTANCE.getDefaultStopWords();
  }
}

I personally prefer the holder pattern as it is guaranteed to be lazy by the JVM.

I'm not sure about this. I think this is a partially true statement. I know I could look it up to be sure. I thought that the JLS required all static initializers to be run at first access to the class. So if one does not want the list of default stopwords, but wants something else in the class or is supplying an alternate set of stopwords, the default stopwords are initialized anyway.

So the other benefit is that it is fully lazy. Though this is a small benefit.

On another note, still regarding code placement: StopFilter has a bunch of makeStopSet methods, WordListLoader has a few more, StopawareAnalyzer has another, and my example has yet another. I think this creates confusion for end users and casual contributors, as it is not clear how to proceed without looking at the code for examples. I'd like to see some kind of clarity/consolidation.

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I'm not sure about this. I think this is a partially true statement. I know I could look it up to be sure. I thought that the JLS required all static initializers to be run at first access to the class. So if one does not want the list of default stopwords, but wants something else in the class or is supplying an alternate set of stopwords, the default stopwords are initialized anyway.

DM, what you say is true, but the holder is a static inner class and its static initializers run on first access to that holder class. That is exactly when it needs to happen, as the holder is only accessed once you request the default stopwords. It does not require any synchronization, as this is guaranteed by the JVM. What I like about it is that you can't introduce any synchronization problems - simple and declarative.

So the other benefit is that it is fully lazy. Though this is a small benefit.

see above

It could be made into a singleton (which would have been better in the first place), or static or both. I just tossed together one example, though extensive, to answer. Also, the matchVersion is not needed in the derived classes.

It already is a singleton - the holder makes it a lazily loaded, static final singleton. matchVersion will only be needed in the derived classes if the tokenStreamComponents...

I personally don't like the various different ways you can load stopwords either; my approach is just a different one. Stopwords are mainly used in analyzers / filters, and we have a standard way to load them in StopawareAnalyzer if you implement your own analyzer. If you just use an analyzer, you should use WordlistLoader. If we fix WordlistLoader to return Set<?>, we are good to go with a single way for the user and a standard way for making a stopword-aware analyzer. If you wrap this up in a StopWords class, people will not know what to do with it once they want to load a stem-exclusion table. Maybe I am missing one important thing, but I do not see the benefit of wrapping a Set<?> into another class. If so, please explain. :)

Thanks

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Updated the patch to the latest trunk.

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I renamed AbstractAnalyzer to ReusableAnalyzerBase, which reflects pretty much what it does. The Base postfix is also fairly common throughout the JDK, similar to the Abstract prefix. I added a little more JavaDoc clarifying when to subclass ReusableAnalyzerBase instead of Analyzer.

I guess this is ready to go in though.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Simon, the patch looks good to me.

I added a few things:

If no one objects, I will commit in a few days.

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Robert, should we hold off one more time and move StopawareAnalyzer into core? As you suggested, StopwordAnalyzerBase would be a better name for it and way more consistent. That way we could implement StopAnalyzer with it too.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

OK, let's wait and discuss this in this issue instead.

I don't like that I can't use it for smartcn, etc. - but is this just because of our build system / analyzers organization? Or can we refactor things out of StopAnalyzer, too?

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Robert, I updated your patch and moved StopawareAnalyzer into core as StopwordAnalyzerBase. I also updated CHANGES.TXT. This will enable us to use it in smartcn too. StopAnalyzer now directly subclasses StopwordAnalyzerBase as well, which seems way more consistent.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Simon, thanks for the update, I like it.

I am going on vacation in a few weeks... so I can say, I will commit next year if no one objects :)

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

With #3245 this patch becomes even more valuable. The creation of analyzers extending StopwordAnalyzerBase will be way faster than in previous versions. With this in mind, we should add a pointer to StopwordAnalyzerBase recommending CharArraySet for stopwords, and as a next step we should refactor WordlistLoader and deprecate all public HashSet-returning methods in favour of Set<?> with CharArraySet as the internal implementation. Unifying the way stopwords are created / loaded outside of a StopwordAnalyzerBase subclass goes the same way, I would guess. We need one way to do it, though. I will create the corresponding issues within the next few days.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I am back on a real computer and (as mentioned on December 18th) I would like to commit this soon.

Simon, I only have one question: do you think it would be possible in the future to add an additional feature (under another issue) whereby:

My reasoning is that we would then be able to improve stopword lists without breaking backwards compatibility. I am aware that many people feel stopword lists are not that important, but for quite a few non-English languages they are very important, no matter how advanced the scoring mechanism is (see Persian for a great example of this). I also think that in the future we might consider merging in the commongrams functionality that is currently duplicated in Nutch and Solr, so that these stoplists can be (ab)used with that method as well - so this kind of thing might become more important in the future.

I realize this is a new feature, so it shouldn't be under this issue, but if it means this design isn't viable, let me know. Otherwise I would like to commit this one first to make progress. I broke backwards compat when fixing the Arabic stopwords before, and I would like not to do that sort of thing again.

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Robert, I see what you are alluding to. Still, I agree this is a new issue and should be handled separately. It would require some changes to the API, I guess, or rather additions. We should commit this regardless! I would be happy to make additions to StopwordAnalyzerBase in another issue; as long as we haven't released this code we can still change the API, though I don't think we will have to. #getStopwordSet will always return the set in use, while setting the stopword set depending on the version is internal to the class.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Ah, I see, you are right. This getStopWordSet() is not static, so it is consistent with the current way we are doing things in Lucene. In the future, if this stopword set were created depending on the Version, this would still work.

Yeah, this is something completely new; I just wanted to verify with you that it's potentially doable...

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I am going to look at this one last time and commit shortly (finally), unless anyone speaks up.