apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

regenerate TLD list from IANA TLD db, rather than root zone db [LUCENE-9924] #10963

Closed by asfimport 3 years ago

asfimport commented 3 years ago

Currently the TLD list comes from the root zone database (DNS records), which is parsed with regular expressions. Instead, we can use https://data.iana.org/TLD/tlds-alpha-by-domain.txt, which is a simple list.
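To illustrate why a plain list needs no DNS-parsing regexes, here is a minimal Python sketch of reading a file in that format. It is not the generator from the pull request: the function name `parse_tld_list` and the `SAMPLE` text are hypothetical, and it assumes the file consists of `#` comment lines followed by one TLD per line.

```python
from typing import List


def parse_tld_list(text: str) -> List[str]:
    """Return lowercased TLDs from text holding one entry per line.

    Assumes the format is '#' comment lines plus one TLD per line.
    """
    tlds = []
    for line in text.splitlines():
        entry = line.strip()
        # Skip blank lines and '#' comment lines (such as a version header).
        if not entry or entry.startswith("#"):
            continue
        tlds.append(entry.lower())
    return tlds


# Hand-written sample in the assumed shape of the file, not downloaded data.
SAMPLE = """\
# sample header comment
COM
ORG
XN--P1AI
"""

if __name__ == "__main__":
    print(parse_tld_list(SAMPLE))
```

Punycode entries such as `XN--P1AI` pass through unchanged apart from lowercasing, so the output stays ASCII-only.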


Migrated from LUCENE-9924 by Robert Muir (@rmuir), resolved Apr 11 2021. Pull requests: https://github.com/apache/lucene/pull/77

asfimport commented 3 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I just noticed: there are some Punycode domains. Should they be converted to readable form?

asfimport commented 3 years ago

Robert Muir (@rmuir) (migrated from JIRA)

No, this generator script explicitly wants only ASCII and Punycode domains.

asfimport commented 3 years ago

Robert Muir (@rmuir) (migrated from JIRA)

In other words, this change creates the same output as what we are doing today :) You can see it from the diff. I only fixed the generated comments to refer to "IANA TLD db" rather than "root zone DB".

asfimport commented 3 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi, yes that's a separate issue.

I just noticed that there are more and more Punycode root domains. So a further improvement would be to allow the tokenizer to match both variants, because nobody would use the Punycode form in full text.

So in short: I would open an issue to add both variants: the ASCII variant (Punycode) and the decoded Unicode version.

A Punycode decoder is part of ICU.
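As a rough illustration of what "both variants" means, the Python sketch below derives the decoded Unicode form from an ASCII TLD. It uses the `punycode` codec from Python's standard library rather than ICU, and the helper name `tld_variants` is hypothetical; it is not part of Lucene or of this change.

```python
from typing import Tuple

ACE_PREFIX = "xn--"  # prefix that marks a Punycode-encoded label


def tld_variants(ascii_tld: str) -> Tuple[str, str]:
    """Return (ascii_form, unicode_form) for a single TLD label.

    Labels without the 'xn--' prefix are returned unchanged in both slots.
    """
    ascii_form = ascii_tld.lower()
    if ascii_form.startswith(ACE_PREFIX):
        # Strip the prefix, then decode the rest with the stdlib codec.
        payload = ascii_form[len(ACE_PREFIX):].encode("ascii")
        unicode_form = payload.decode("punycode")
    else:
        unicode_form = ascii_form
    return ascii_form, unicode_form


if __name__ == "__main__":
    for tld in ("COM", "XN--P1AI"):
        print(tld_variants(tld))
```

This only decodes the label. A real tokenizer change would probably also need to consider normalization and case folding of the Unicode form, which this sketch does not attempt.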

asfimport commented 3 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Matching the Punycode-decoded form is a different matter; I'm not even sure we should do it. Here we are just getting the exact same data in an easier way.

asfimport commented 3 years ago

ASF subversion and git services (migrated from JIRA)

Commit b0bd64c62020383fe8c45be2035d346f7ce6174f in lucene's branch refs/heads/main from Robert Muir https://gitbox.apache.org/repos/asf?p=lucene.git;h=b0bd64c

LUCENE-9924: generate TLD list from IANA TLD db, rather than root zone db (#77)

This adds a bit of simplicity as the file is a simple domain list, rather than a DNS zone. So the regexes parsing DNS can be removed.

Also the file may change less often as it contains JUST the list of TLDs, and not any additional DNS metadata.

asfimport commented 2 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Closing after the 9.0.0 release