Uwe Schindler (@uschindler) (migrated from JIRA)
I just noticed: there are some punycode domains. Should they be converted to readable form?
Robert Muir (@rmuir) (migrated from JIRA)
No, this generator script explicitly wants only ASCII and punycode domains.
Robert Muir (@rmuir) (migrated from JIRA)
In other words, this change creates the same output as what we are doing today :) You can see it from the diff. I only fixed the generated comments to refer to "IANA TLD db" rather than "root zone DB"
Uwe Schindler (@uschindler) (migrated from JIRA)
Hi, yes that's a separate issue.
I just noticed that there are more and more punycode root domains. So a further improvement would be to allow the tokenizer to match both variants, because nobody would use the punycode form in full text.
So in short: I would open an issue to add both variants: the ASCII variant (punycode) and the decoded Unicode version.
A punycode decoder is part of ICU.
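To make the two variants concrete, here is a minimal sketch of deriving the decoded Unicode form from a punycode TLD. It uses Python's built-in `punycode` codec rather than the ICU decoder mentioned above, and `tld_variants` is a hypothetical helper name, not part of the Lucene generator.

```python
def tld_variants(tld):
    """Return the ASCII form of a TLD, plus its Unicode form if it is punycode.

    Hypothetical helper for illustration only; uses Python's stdlib
    "punycode" codec instead of ICU.
    """
    ascii_form = tld.lower()
    if ascii_form.startswith("xn--"):
        # Strip the "xn--" ACE prefix, then decode the remaining punycode label.
        unicode_form = ascii_form[4:].encode("ascii").decode("punycode")
        return (ascii_form, unicode_form)
    # Plain ASCII TLDs have only one form.
    return (ascii_form,)


print(tld_variants("XN--P1AI"))  # ('xn--p1ai', 'рф')
print(tld_variants("COM"))       # ('com',)
```

Whether the tokenizer should match the decoded form at all is a separate question; the sketch only shows what the second variant would be.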
Robert Muir (@rmuir) (migrated from JIRA)
Matching the punycode-decoded form is a different matter; I'm not even sure we should do it. Here we are just getting exactly the same data in an easier way.
ASF subversion and git services (migrated from JIRA)
Commit b0bd64c62020383fe8c45be2035d346f7ce6174f in lucene's branch refs/heads/main from Robert Muir https://gitbox.apache.org/repos/asf?p=lucene.git;h=b0bd64c
LUCENE-9924: generate TLD list from IANA TLD db, rather than root zone db (#77)
This adds a bit of simplicity as the file is a simple domain list, rather than a DNS zone. So the regexes parsing DNS can be removed.
Also the file may change less often as it contains JUST the list of TLDs, and not any additional DNS metadata.
Adrien Grand (@jpountz) (migrated from JIRA)
Closing after the 9.0.0 release
Currently the TLD list comes from the root zone database (DNS records), which is parsed with regular expressions. Instead, we can use https://data.iana.org/TLD/tlds-alpha-by-domain.txt, which is a simple list.
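As a rough illustration of why the plain list is easier to consume, the sketch below parses text in the layout of `tlds-alpha-by-domain.txt` (a leading `#` comment line followed by one uppercase TLD per line) without any DNS-specific regexes. The function name `parse_tld_list` and the sample contents are assumptions for illustration; this is not the actual Lucene generator code.

```python
def parse_tld_list(text):
    """Return lowercased TLDs from text laid out like tlds-alpha-by-domain.txt.

    Hypothetical helper: skips blank lines and '#' comment lines and keeps
    every other line as one TLD (ASCII or punycode).
    """
    tlds = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the leading "# Version ..." comment line.
        if not line or line.startswith("#"):
            continue
        tlds.append(line.lower())
    return tlds


# Illustrative sample only; the real file is much longer.
SAMPLE = """\
# Version 2021041100, Last Updated Sun Apr 11 07:07:01 2021 UTC
AAA
COM
XN--P1AI
"""

print(parse_tld_list(SAMPLE))  # ['aaa', 'com', 'xn--p1ai']
```

Punycode entries such as `XN--P1AI` pass through unchanged apart from lowercasing, so the output stays ASCII-only.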
Migrated from LUCENE-9924 by Robert Muir (@rmuir), resolved Apr 11 2021. Pull requests: https://github.com/apache/lucene/pull/77