Open paksa92 opened 2 weeks ago
It still needs docs for the website
Thank you so much! š I am focusing on our v1 release first before reviewing this PR. In the meantime, I recommend using the regex with our regex
action as a workaround.
Too bad, I wanted to do it right away :( But isn't the following regex better suited for this purpose?
/^(?=.{1,253}$)(?![-_])(?!.+[-_]\.)([\w-]{1,63}\.)+[a-z]{2,63}$/iu
It better aligns with RFC 1035, doesn't it? This way, it also includes the restrictions that apply to a domain and its subdomains.
I haven't checked the regex yet. Almost every regex of Valibot is handmade. I will check the regex before merging. But please feel free to discuss and research so that the process of checking and merging will be faster.
This regex is used to validate domain names. Exactly that was my intention. I looked more closely into RFC 1035 and created this regex with the goal of adhering to the specifications as much as possible so that tools like Netscaler or DNS processes will consistently accept these domains. There is a lot of guidance in RFC 1035, especially concerning octets and the conditions for when and how certain characters are allowed.
Hereās a detailed breakdown of how the regex works:
/^(?=.{1,253}$)(?![-])(?!.+[-].)([\w-]{1,63}.)+[a-z]{2,63}$/iu
^(?=.{1,253}$)
: This part checks that the total length of the input is between 1 and 253 characters, which is the maximum allowable length for domains.
(?![-_])
: This ensures that the domain name does not begin with a hyphen (-
) or an underscore (_
).
(?!.+[-_]\.)
: This negative lookahead ensures there is no combination of a hyphen or underscore immediately before a period (.
). This prevents invalid patterns like domain-.com
.
([\w-]{1,63}\.)+
: This section validates each label (segment) of the domain name:
[\w-]{1,63}
allows between 1 and 63 alphanumeric characters or hyphens (but no _
, which is not allowed in domain names).\.
at the end ensures each label ends with a period.+
at the end allows one or more repetitions of this pattern, supporting subdomains.[a-z]{2,63}$
: This section at the end checks if the TLD (top-level domain) contains only lowercase letters and is between 2 and 63 characters long, as is typical for TLDs (e.g., .com
, .online
).
iu
: The flags i
and u
represent:
i
(ignore case): Case-insensitive, so .COM
and .com
are both accepted as valid.u
(unicode): Enables Unicode support for character classes like \w
.example.com
sub.domain.co.uk
my-site123.org
-example.com
(begins with a hyphen)example-.com
(hyphen before a period)toolongtoolongtoolongtoolongtoolongtoolongtoolong.com
(label is longer than 63 characters)This regex is designed to follow DNS standards closely, enabling reliable domain validation across different DNS systems and tools.
This way, I can still help a bit :)
Sorry @JacKyDev I didn't mean to steal your thunder. I just had some spare time left and since it been a few days since you responded I figured I'd step in instead.
I will look into the regex you proposed when I get the chance today!
@paksa92 All good, I'll survive :D
But since I already put together the regex yesterday based on what Iād read, I thought it might be a way to help out. In the end, it saves me time too :)
I'm not sure if this is common practice at Valibot, so Iāll just mention it. I would assume that the documentation could also be updated so that domainName
not only exists but can also be displayed in the documentation in the future. Iāve also asked Fabian about this in the issue.
Additionally, I would suggest that something similar to email
should also be included in the necessary documentation files.
If I haven't overlooked anything, those would be:
index.mdx
(website/src/routes/api/(actions)/domainName & website/src/routes/api/(async)/customAsync & website/src/routes/api/(async)/fallbackAsync & website/src/routes/api/(async)/intersectAsync & website/src/routes/api/(async)/lazyAsync & website/src/routes/api/(async)/unionAsync)properties.ts
(website/src/routes/api/(actions)/domainName)Also, the menu.md
in routes/api
to find domainName
in the docs.
But I'm not sure if this is common practice at Valibot or if I might be overthinking it. Perhaps you should wait for Fabian to respond, as I also asked him this question here: https://github.com/fabian-hiller/valibot/issues/894.
@JacKyDev yeah, I already mentioned that it is missing in the PR, but I will add it.
CONTRIBUTING.md
does not mention a "definition of done" for new features, maybe it should? @fabian-hiller
I suppose devs rarely look at the CONTRIBUTING.md
. So it is not too important. But feel free to create a PR.
/^(?=.{1,253}$)(?![-])(?!.+[-].)([\w-]{1,63}.)+[a-z]{2,63}$/iu
@JacKyDev I'm currently testing this regex and I think it's wrong. First off, TLD's may not exceed a length of 6. Changing the regex to:
/^(?=.{1,253}$)(?![-])(?!.+[-].)([\w-]{1,63}.)+[a-z]{2,6}$/iu
But, this regex still allows TLD's exceeding a length of 6. I suspect one of the negative lookaheads (?![-])(?!.+[-].)
causes the problem, but I'm not sure. I can't seem to fix it.
I tweaked the original regex in my PR to account for the max length of a domain name, which now should cover all the cases you mentioned:
/^(?=.{1,253}$)(?!-)([a-z0-9-]{1,63}(?<!-)\.)+[a-z]{2,6}$/iu
Please test it and let me know if you find any issues.
In the RFC, I donāt see any limitation on the domain suffix itself but rather on the entire domain and its segments.
Generally, the value is that a domain, without a .
separator,
is at least 2 characters long.
Itās true that the usual domains are about 6 characters long, but when I check online, I see valid domains like:
.saarland
.cologne
.immobilien
.computer
.international
.website
So, I believe the restriction isnāt quite correct. A domain suffix is an official label that, as far as I know, can theoretically be up to 63 octets as long as it doesnāt exceed 255 octets for the entire domain.
Correct me if you have other information, but in that case,
the regex should look like this:
/^(?=.{1,253}$)(?!-)([a-z0-9-]{1,63}(?<!-)\.)+[a-z]{2,63}$/iu;
Anyway, roughly speaking, Iāll take another look later so we can also cover this better with tests.
I also did some additional research and couldnāt find anything that limits it to 6 characters.
From my understanding, that would mean the regex should match as specified. Do you have anything that suggests otherwise?
In the meantime, I ran your tests, and apart from the one mentioned below, all tests are passing. The failing one makes sense, though, since a TLD could now be longer than 6 characters after the change.
test("should not match 'example.commmerce' - TLD too long (more than 6 characters)", () => {
expectActionIssue(action, baseIssue, ['example.commmerce']);
});
Fun fact: thereās also a .shopping TLD, which is offered for ecommerce
If you donāt find anything else, then the regex should be all set, and the next step would be to get feedback from Fabian hopefully, weāve managed to save him some work :)
You're right @JacKyDev, there is no limit of 6 characters on the TLD. My assumption was based on some stackoverflow posts I've seen.
I think the regex is all set now! I will commit it today :)
^(?=.{1,253}$)
references the number of UTF-32 code points if u
flag is added. length
property of the input string references UTF-16.
I don't know which is better for domain validation.
If both can be used, I recommend to check length
first to avoid cheking the regex for too long strings.
/^(?=.{1,253})/u.test("š".repeat(253))
yields true
while "š".repeat(253).length <= 253
yields false
.
@tats-u : I understand what you mean, but when I test this regex with emojis or other Unicode characters, it consistently flags them as invalid. This aligns with what I expected, since domains technically only allow ASCII characters. However, I can't replicate what you're describing in a unit testāhere, all tests fail as soon as characters outside of a-z, 0-9, or -
are included. So, I'm not entirely sure what you're aiming for.
From a technical standpoint, this is exactly what I'd expect, as any characters outside the ASCII range should be represented in Punycode to be valid in a domain. Converting to Punycode means that an emoji, for example, doesn't count as a single character, but as multiple ASCII characters, which would also increase the string lengthāespecially since Punycode includes a prefix.
Do you have a test example we can try? Currently, emojis and special characters like "Ć¼" are rejected, just as intended.
since domains technically only allow ASCII characters. However, I can't replicate what you're describing in a unit testāhere, all tests fail as soon as characters outside ofĀ a-z, 0-9, or -Ā are included.
If so we don't have to care about surrogate pairs. You designed this validator on the premise that we need to make Unicode domains punycoded in advance.
I was reminded that users care about bundle size. The current regex is small compared to code using length
property and sufficient.
I think weāre in a good place by strictly requiring specific values... This way, the domain itself is validated very precisely. And we wouldnāt be offering any special exceptions. Separately, one could consider adding a toPunyCode
action in the future to initially standardize a value, making it usable for everyone. This approach, I think, creates more potential.
I wouldnāt really be against the additional check, because youāre rightāitās not much more code.
I still lack experience with the process here, as a regex action might feel messy if it includes code checks. But a length Check ist really small... :) @fabian-hiller could probably answer that better, I think.
toPunyCode
I agree with it. There are 2 strategies:
toPunyCode
by the regex, then pass the return value of toPunyCode
to the return value of parse
toPunyCode
by the regex, then pass the input to the return value of parse
as isUsers pass the punycoded domains to other libraries in 1. while they utilize unpunycoded Unicode domains as are.
I think in the end it might need to be more of a combination of all options to cover every use case. :)
domainName validates a string as a real domain as the status quo. Thereās an additional action, punycodeDomainName, which implements Strategy 2. toPunyCode converts a domain, enabling both options 1 and 2.
Background:
Point 1: Only valid domains, excluding Punycode.
Point 2: The action converts the string into a Punycode string, validates it against the domain, but still returns the specified domain as the result.
Basically, as you described, itās for cases where someone wants to pass on the original domain without processing it into Punycode.
Point 3: For someone validating a domain for Netscaler processes, they would likely use the following function:
domainSchema = v.pipe(v.string(), v.toPunyCode(), v.domainName())
The aim is to check the actual domain in DNS or another service.
This approach would cover all use cases, and if someone wants to use toPunyCode for other applications, like email, that would also work.
My suggestion would be to rename the current action to rfcDomain, so we can later introduce punycodeDomain and toPunyCode to handle further use cases. Iād like some feedback from Fabian on this. Then we could finish up the current work, and consider developing the following actions as potential future features:
The advantage here would be that we offer a variety of options, covering all use cases comprehensively. Additionally, this approach would still support tree-shaking and other optimizations.
Hey š I am pretty busy at the moment. However, I will try to give you feedback this week to point you in the right direction if necessary. Thanks for your contribution!
Because I love valibot so much I wanted to do something in return for all the headache she saved me from so here's a PR that adds domain name validation as requested by #894.