
UAX29URLEmailTokenizer is not detecting some tokens as URL type [LUCENE-8278] #9325

asfimport closed this issue 6 years ago

asfimport commented 6 years ago

We are using the UAX29URLEmailTokenizer so we can use the token types in our plugins.

However, I noticed that the tokenizer does not detect certain URLs as <URL>; it types them as <ALPHANUM> instead.

Examples that are not working:

But:

Examples that work:

I have searched JIRA and could not find an existing issue for this. I have tested this on Lucene (Solr) 6.4.1 and 7.3.

Could someone confirm my findings and advise what I could do to (help) resolve this issue?
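
A minimal standalone sketch for checking what type the tokenizer assigns (not part of the original report; the class name TokenTypeCheck is made up, and it assumes lucene-analyzers-common on the classpath):

import java.io.StringReader;

import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class TokenTypeCheck {
  public static void main(String[] args) throws Exception {
    // Tokenize a single string and print each token with its type,
    // e.g. "example.com -> <URL>" or "example.com -> <ALPHANUM>".
    UAX29URLEmailTokenizer tokenizer = new UAX29URLEmailTokenizer();
    tokenizer.setReader(new StringReader(args.length > 0 ? args[0] : "example.com"));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString() + " -> " + type.type());
    }
    tokenizer.end();
    tokenizer.close();
  }
}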


Migrated from LUCENE-8278 by Junte Zhang, resolved Jun 12 2018.
Attachments: LUCENE-8278.patch, patched.png, unpatched.png
Linked issues:

asfimport commented 6 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Confirming there is an issue, but I don't think the spellings of "example.com" and "example.net" are the problem; more likely this is related to an end-of-input issue. This test added to TestUAX29URLEmailTokenizer fails for me:

  public void testExampleURLs() throws Exception {
    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        return new TokenStreamComponents(new UAX29URLEmailTokenizer(newAttributeFactory()));
      }};

    // A trailing space allows these to succeed
    BaseTokenStreamTestCase.assertAnalyzesTo(analyzer, "example.com ", new String[]{"example.com"}, new String[]{"<URL>"});
    BaseTokenStreamTestCase.assertAnalyzesTo(analyzer, "example.net ", new String[]{"example.net"}, new String[]{"<URL>"});

    // These fail
    BaseTokenStreamTestCase.assertAnalyzesTo(analyzer, "example.com", new String[]{"example.com"}, new String[]{"<URL>"});
    BaseTokenStreamTestCase.assertAnalyzesTo(analyzer, "example.net", new String[]{"example.net"}, new String[]{"<URL>"});
  }

So there is an issue here with no-scheme end-of-input URLs not being recognized as type <URL>.

asfimport commented 6 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Hmm, "example.co" and "example.info" in the above test succeed, so the problem here is somehow related to TLD spelling.

asfimport commented 6 years ago

Junte Zhang (migrated from JIRA)

Thank you for confirming this issue, Steve. We run Lucene/Solr 6.6 on our production servers, and on that version we also found that appending a whitespace to the token works around the problem. However, this workaround no longer works on Lucene 7.3.0. I'll see if I can fix this...

asfimport commented 6 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

I ran a test to check all TLDs appended to "example.", and 169 out of 1543 possible TLDs have this problem:

"accountants", "ads", "aeg", "afl", "aig", "aol", "art", "audio", "autos", "aws", "axa", "bar", "bbc", "bet",
"bid", "bingo", "bms", "bnl", "bom", "boo", "bot", "box", "bzh", "cab", "cal", "cam", "camp", "car", "care", 
"careers", "cat", "cfa", "citic", "com", "coupons", "crs", "cruises", "deals", "dev", "dog", "dot", "eco", 
"esq", "eus", "fans", "fit", "foo", "fox", "frl", "fund", "gal", "games", "gdn", "gea", "gifts", "gle", 
"gmo", "goog", "hkt", "htc", "ing", "int", "ist", "itv", "jmp", "jot", "kia", "kpn", "krd", "lat", "law", 
"loans", "ltd", "man", "map", "markets", "med", "men", "mlb", "mma", "moe", "mov", "msd", "mtn", "nab", 
"nec", "new", "news", "nfl", "ngo", "now", "nra", "pay", "pet", "phd", "photos", "ping", "pnc", "pro", 
"prof", "pru", "pwc", "red", "reisen", "ren", "reviews", "run", "rwe", "sap", "sas", "sbi", "sca", "ses", 
"sew", "ski", "soy", "srl", "stc", "taxi", "tci", "tdk", "thd", "tjx", "top", "trv", "tvs", "vet", "vig", 
"vin", "wine", "works", "aco", "aigo", "arte", "bbt", "bio", "biz", "bmw", "book", "call", "cars", "cfd", 
"food", "gap", "gmx", "ink", "joy", "kim", "ltda", "menu", "meo", "mls", "moi", "mom", "mtr", "net", "nrw", 
"pink", "prod", "rent", "sapo", "sbs", "scb", "sex", "sexy", "skin", "sky", "srt", "vip"

In each of the above cases I've looked at, there is another TLD that is a one-letter-shorter prefix (see the branch_7x TLD regex). Not sure whether all such TLDs have this problem; I'll look.

Also, on branch_7x at least (from which 7.3.0 was cut a few months ago), appending a space to the input still works around the problem for me, so I can't reproduce the broken workaround you're reporting on 7.3.0, Junte Zhang.

asfimport commented 6 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Not sure if all such TLDs have this problem; I'll look.

Yes, the problematic TLDs are exactly the set of those for which there exists a one-letter-shorter prefix.
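
Roughly, that check amounts to the following sketch (illustrative only, not the actual test code; the TLD subset here is a tiny made-up sample, not the real IANA list):

import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

public class TldPrefixCheck {
  // Returns the TLDs whose one-letter-shorter prefix is itself a TLD in the set.
  static Set<String> withOneLetterShorterPrefix(Set<String> tlds) {
    return tlds.stream()
        .filter(tld -> tld.length() > 1 && tlds.contains(tld.substring(0, tld.length() - 1)))
        .collect(Collectors.toCollection(TreeSet::new));
  }

  public static void main(String[] args) {
    // "co" and "ne" are themselves TLDs, so "com" and "net" have
    // one-letter-shorter TLD prefixes.
    Set<String> tlds = new TreeSet<>(Arrays.asList("co", "com", "info", "ne", "net", "org"));
    System.out.println(withOneLetterShorterPrefix(tlds)); // [com, net]
  }
}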

asfimport commented 6 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

I suspect this behavior is a side-effect of the fix for #6454.

asfimport commented 6 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

I've attached a fully-regenerated patch (which is why it's so big...) against the master branch for a fix I cooked up. In this change, the TLD macro generator partitions TLDs by whether they are prefixes of other TLDs, and by suffix length, and then the grammar tries the longest TLDs first, falling back one suffix char at a time. Currently there are only 3 buckets:

  1. None of the TLDs is a 1-character-shorter prefix of another TLD
  2. Each TLD is a prefix of another TLD by 1 character
  3. Each TLD is a prefix of another TLD by 2 characters

The TLD macro generator does not hard code the number of buckets, so it should be able to handle future TLD prefixes with suffixes of more than 2 characters.
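
A rough sketch of that partitioning idea (illustrative only, not the actual GenerateJflexTLDMacros logic; the TLD subset is made up): group each TLD by the longest suffix by which it is a proper prefix of another TLD in the set.

import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class TldBuckets {
  // Key each TLD by the longest suffix length by which it is a proper prefix
  // of another TLD in the set (0 = it is not a prefix of any other TLD).
  static Map<Integer, Set<String>> bucketBySuffixLength(Set<String> tlds) {
    Map<Integer, Set<String>> buckets = new TreeMap<>();
    for (String tld : tlds) {
      int longest = 0;
      for (String other : tlds) {
        if (!other.equals(tld) && other.startsWith(tld)) {
          longest = Math.max(longest, other.length() - tld.length());
        }
      }
      buckets.computeIfAbsent(longest, k -> new TreeSet<>()).add(tld);
    }
    return buckets;
  }

  public static void main(String[] args) {
    Set<String> tlds = new TreeSet<>(Arrays.asList("co", "com", "info", "ne", "net", "org"));
    System.out.println(bucketBySuffixLength(tlds));
    // {0=[com, info, net, org], 1=[co, ne]}
  }
}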

I've added a test for example.TLD URLs at end-of-input for all TLDs, and it passes, as do all other tests in the analyzers-common module.

FYI, the fix here was complicated by the fact that JFlex doesn't support an end-of-input assertion (like Java's \z) as part of a lexical rule: the <<EOF>> rule can't be combined with a regex, and lookahead expressions must match at least one character, so a zero-length end-of-input lookahead isn't possible.
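
For context, a small demo of the end-of-input assertion that Java regexes have but JFlex lexical rules lack (illustrative only):

import java.util.regex.Pattern;

public class EndOfInputDemo {
  public static void main(String[] args) {
    // \z matches only at the very end of the input.
    Pattern p = Pattern.compile("example\\.com\\z");
    System.out.println(p.matcher("example.com").find());  // true
    System.out.println(p.matcher("example.com ").find()); // false: trailing space
    // JFlex has no equivalent zero-length end-of-input assertion inside a
    // lexical rule, hence the bucket-based fallback described above.
  }
}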

Junte Zhang, can you test this in your context?

asfimport commented 6 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Here's the output from the TLD macro generator (ant gen-tlds) with the patch:

gen-tlds:
     [java] Found 1541 TLDs in IANA Root Zone Database at http://www.internic.net/zones/root.zone
     [java] Wrote TLD macros to '/Users/sarowe/git/lucene-solr-3/lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/ASCIITLD.jflex-macro':
     [java]                       ASCIITLD: 1420 TLDs
     [java]     ASCIITLDprefix_1CharSuffix:  109 TLDs
     [java]     ASCIITLDprefix_2CharSuffix:   12 TLDs
     [java]                          Total: 1541 TLDs

asfimport commented 6 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

I plan on committing this tomorrow if I don't get any feedback before then.

asfimport commented 6 years ago

Junte Zhang (migrated from JIRA)

Hi Steve, sorry for the late response. I will check this tomorrow. Thanks for picking up this bug report!

asfimport commented 6 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

I will check this tomorrow.

Any luck, Junte Zhang?

asfimport commented 6 years ago

Junte Zhang (migrated from JIRA)

I think I have tested the patch:

patch -p1 -i LUCENE-8278.patch 
patching file lucene/analysis/common/build.xml
patching file lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/ASCIITLD.jflex-macro
patching file lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/UAX29URLEmailTokenizerImpl.java
patching file lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/UAX29URLEmailTokenizerImpl.jflex
patching file lucene/analysis/common/src/test/org/apache/lucene/analysis/standard/TestUAX29URLEmailTokenizer.java
patching file lucene/analysis/common/src/tools/java/org/apache/lucene/analysis/standard/GenerateJflexTLDMacros.java

then ran ant compile.

Started Solr and created a core with a fieldType:

<fieldType name="urlEmail" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
  </analyzer>
</fieldType>

Then I tested in the Solr Admin UI but didn't see a difference; perhaps I missed something.

asfimport commented 6 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Started Solr [...] Then tested in the Solr Admin but didn't see a difference

How did you start Solr? You must first run ant server to put the modified lucene-analyzers-common jar into solr/server/, which will then be used when you run bin/solr start.

asfimport commented 6 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

I made a Solr configset with an analyzer containing only a solr.UAX29URLEmailTokenizerFactory tokenizer, then ran Solr on master both without the patch (unpatched.png) and with the patch (patched.png).

Junte Zhang: I think you're just having issues running the modified code.

Committing shortly.

asfimport commented 6 years ago

ASF subversion and git services (migrated from JIRA)

Commit 6140d8be05d1b7aad565f53dd2ee66b984b9a379 in lucene-solr's branch refs/heads/branch_7x from @sarowe https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6140d8b

LUCENE-8278: Some end-of-input no-scheme domain-only URL tokens are typed as <ALPHANUM> rather than <URL>

asfimport commented 6 years ago

ASF subversion and git services (migrated from JIRA)

Commit ead05a10b1eff181ef24f64cf7feee91ed5a5155 in lucene-solr's branch refs/heads/master from @sarowe https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ead05a1

LUCENE-8278: Some end-of-input no-scheme domain-only URL tokens are typed as <ALPHANUM> rather than <URL>

asfimport commented 6 years ago

ASF subversion and git services (migrated from JIRA)

Commit 3ed77ebd8dc85ebda2817033ec20df372735c650 in lucene-solr's branch refs/heads/branch_7_4 from @sarowe https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3ed77eb

LUCENE-8278: Some end-of-input no-scheme domain-only URL tokens are typed as <ALPHANUM> rather than <URL>

asfimport commented 6 years ago

ASF subversion and git services (migrated from JIRA)

Commit cb30a2634c0579d5279ec148aa427745fb7f55ad in lucene-solr's branch refs/heads/branch_7x from @sarowe https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=cb30a26

LUCENE-8278: move CHANGES entry to 7.4 section

asfimport commented 6 years ago

ASF subversion and git services (migrated from JIRA)

Commit 90e4eca9dbf622d6a9d053bdca4aaaca7add1558 in lucene-solr's branch refs/heads/master from @sarowe https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=90e4eca

LUCENE-8278: move CHANGES entry to 7.4 section