apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Add a phone number normalization TokenFilter [LUCENE-3663] #4737

Open asfimport opened 12 years ago

asfimport commented 12 years ago

Phone numbers can be found in the wild in an infinite variety of formats (e.g. with spaces, parentheses, dashes, with or without a country code, with letters substituted for digits). Some Lucene applications could therefore benefit from phone number normalization: a TokenFilter that takes a phone number in any format and outputs it in a standard format, using a default country to guess the country code if it is not present.


Migrated from LUCENE-3663 by Santiago M. Mola, 2 votes, updated Jan 24 2012 Attachments: PhoneFilter.java (versions: 2)

asfimport commented 12 years ago

Santiago M. Mola (migrated from JIRA)

This is a proof-of-concept TokenFilter that does the job using Google's libphonenumber (https://code.google.com/p/libphonenumber/).

Each token is converted to a phone number in international format, using a default country for guessing country code if needed. If the token is not a valid phone number, it's filtered out.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

This looks strange and creates useless objects:

final char[] buffer = termAtt.buffer();
final int length = termAtt.length();
CharBuffer cb = CharBuffer.wrap(buffer, 0, length);
try {
    PhoneNumber pn = pnu.parse(cb.toString(), defaultCountry);

should be:

try {
    PhoneNumber pn = pnu.parse(termAtt.toString(), defaultCountry);

Ideally, PhoneNumberUtil would take CharSequence (so you could directly pass termAtt without toString()), but unfortunately Google's lib is too stupid to use a more generic Java type.

Otherwise the patch looks fine, but it adds another external library. You should also make all fields final; they will never change!
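
To illustrate the point about the extra object, here is a minimal stdlib-only sketch. It assumes that termAtt.toString() builds its String from the first length() chars of the term buffer, which is my understanding of CharTermAttribute but is not shown in this thread. The class and method names (TermTextSketch, viaCharBuffer, direct) are hypothetical and exist only for this comparison.

```java
import java.nio.CharBuffer;

/** Compares two ways of turning a term buffer into a String. Not Lucene code. */
final class TermTextSketch {

    /** The pattern from the patch: wrap the buffer, then copy it into a String. */
    static String viaCharBuffer(char[] buffer, int length) {
        // Allocates a CharBuffer view only to call toString() on it right away.
        return CharBuffer.wrap(buffer, 0, length).toString();
    }

    /** The direct route: one String allocation, no intermediate wrapper. */
    static String direct(char[] buffer, int length) {
        return new String(buffer, 0, length);
    }

    public static void main(String[] args) {
        // Term buffers are typically larger than the term itself, so only
        // the first `length` chars are meaningful.
        char[] buffer = new char[16];
        String term = "555-1234";
        term.getChars(0, term.length(), buffer, 0);

        System.out.println(viaCharBuffer(buffer, term.length()));
        System.out.println(direct(buffer, term.length()));
    }
}
```

Both methods return the same text; the first one just creates a short-lived CharBuffer per token on the way there.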

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

One more thing: as you want to filter out tokens, you should not subclass TokenFilter directly but instead subclass org.apache.lucene.analysis.util.FilteringTokenFilter and do the work in the accept() method. You are free to modify the token there, too. This base class would correctly handle position increments, which your comments note as a TODO.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

+1 I think this would be a useful addition.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I actually think we should not remove tokens that aren't phone numbers. Sometimes there might just be other things instead of phone numbers, or the phone number detection/normalization might be imperfect, so it's better not to throw tokens away; instead, no normalization happens, as with a stemmer.

In general we can also assume the text is unstructured and might contain other stuff (this implies someone has a super-cool tokenizer that doesn't split up any dirty phone numbers, but we just leave the possibility open).

Then I think the while loop could be removed: if the phone number normalization succeeds, mark the type as phone. Otherwise, in the exception case, output the token unchanged.

Then non-phone-numbers or whatever can easily be filtered out separately with a subclass of FilteringTokenFilter.
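
The "normalize or pass through" behaviour described above could look roughly like the following. This is a toy model, not Lucene or libphonenumber code: Token is a hypothetical stand-in for the term and type attributes, and normalize() is a deliberately naive placeholder for a real parser (it only strips separators and prepends an assumed default country calling code). The only thing it is meant to show is the control flow: on success, replace the term and set the type; on failure, emit the token unchanged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of a normalizing filter that never drops tokens. */
final class PhoneTaggingSketch {

    static final String PHONE_TYPE = "<PHONE>";

    /** Hypothetical stand-in for a token's term text and type. */
    static final class Token {
        final String term;
        final String type;

        Token(String term, String type) {
            this.term = term;
            this.type = type;
        }
    }

    /**
     * Naive placeholder for a real phone number parser (an assumption made
     * for this sketch). Throws if the input does not look like a number.
     */
    static String normalize(String raw, String defaultCallingCode) {
        String stripped = raw.replaceAll("[\\s().-]", "");
        boolean international = stripped.startsWith("+");
        String digits = international ? stripped.substring(1) : stripped;
        if (!digits.matches("\\d{7,15}")) {
            throw new IllegalArgumentException("not a phone number: " + raw);
        }
        return international ? "+" + digits : "+" + defaultCallingCode + digits;
    }

    /** Normalizes what it can and passes everything else through untouched. */
    static List<Token> tag(List<Token> input, String defaultCallingCode) {
        List<Token> output = new ArrayList<>();
        for (Token token : input) {
            try {
                String normalized = normalize(token.term, defaultCallingCode);
                output.add(new Token(normalized, PHONE_TYPE));
            } catch (IllegalArgumentException e) {
                // Parsing failed: keep the original term and type.
                output.add(token);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
            new Token("call", "word"),
            new Token("(555) 123-4567", "word"));
        for (Token t : tag(tokens, "1")) {
            System.out.println(t.term + " " + t.type);
        }
    }
}
```

Because no token is removed at this stage, there is no loop over rejected tokens and no position bookkeeping; dropping the non-phone tokens becomes a separate, optional step.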

asfimport commented 12 years ago

Santiago M. Mola (migrated from JIRA)

@Uwe: Thanks for the comments.

@Robert: Then this filter would mark phone tokens as <PHONE> type and I could filter non-<PHONE> tokens with a subsequent filter? In my specific use case, I need to throw away any token that could not be normalized, so I have to, at least, mark phone tokens for removal in further steps. If tokens are not marked, then we would have to check twice whether the token is a valid phone number.

asfimport commented 12 years ago

Santiago M. Mola (migrated from JIRA)

Bug report for libphonenumber in order to get it to support CharSequence: https://code.google.com/p/libphonenumber/issues/detail?id=84

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Santiago, yeah, I think if normalization is successful, you would change the type to <PHONE>, as it was recognized as one. Otherwise, when you get the exception, just 'return true' and leave all attributes unchanged.

In the successful case, besides setting the type, you could even keep the PhoneNumber instead of throwing it away and put it in an attribute. That way, if someone wanted to do more complicated stuff, the attributes are at least available; it's also useful for things like Solr's analysis.jsp, just for debugging how the analysis worked.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Then this filter would mark phone tokens as <PHONE> type and I could filter non-<PHONE> tokens with a subsequent filter?

YES!

The FilteringTokenFilter subclass you would then add after this filter would simply have this accept() method:

@Override
protected boolean accept() {
  // TypeAttribute exposes the token type through type()
  return "<PHONE>".equals(typeAtt.type());
}

FilteringTokenFilter would then also handle position increments correctly, which your filter does not.
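
For readers unfamiliar with the position increment issue, the following stdlib-only sketch models what a type-based filter has to do when it drops tokens. It is not FilteringTokenFilter itself: Token and keepType are hypothetical names, and the bookkeeping shown (adding the increments of dropped tokens to the next kept token) is my understanding of the behaviour being referred to here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of dropping tokens by type while preserving positions. */
final class TypeFilterSketch {

    /** Hypothetical stand-in for term, type and position increment. */
    static final class Token {
        final String term;
        final String type;
        final int positionIncrement;

        Token(String term, String type, int positionIncrement) {
            this.term = term;
            this.type = type;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Keeps only tokens of wantedType, carrying skipped positions forward. */
    static List<Token> keepType(List<Token> input, String wantedType) {
        List<Token> output = new ArrayList<>();
        int skippedPositions = 0;
        for (Token token : input) {
            if (wantedType.equals(token.type)) {
                // Add the gap left by dropped tokens to the kept token.
                output.add(new Token(token.term, token.type,
                    token.positionIncrement + skippedPositions));
                skippedPositions = 0;
            } else {
                skippedPositions += token.positionIncrement;
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
            new Token("call", "word", 1),
            new Token("+15551234567", "<PHONE>", 1),
            new Token("or", "word", 1),
            new Token("now", "word", 1),
            new Token("+34912345678", "<PHONE>", 1));
        for (Token t : keepType(tokens, "<PHONE>")) {
            System.out.println(t.term + " +" + t.positionIncrement);
        }
    }
}
```

With this input, the first phone token ends up with an increment of 2 and the second with 3, so phrase and proximity queries still see the original distances between the surviving tokens.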

asfimport commented 12 years ago

Santiago M. Mola (migrated from JIRA)

Modified considering your comments.