fabian-hiller / valibot

The modular and type safe schema library for validating structural data šŸ¤–
https://valibot.dev
MIT License
6.32k stars 204 forks source link

feat: add domain name action #907

Open paksa92 opened 2 weeks ago

paksa92 commented 2 weeks ago

Because I love valibot so much I wanted to do something in return for all the headache she saved me from so here's a PR that adds domain name validation as requested by #894.

paksa92 commented 2 weeks ago

It still needs docs for the website

fabian-hiller commented 2 weeks ago

Thank you so much! šŸ™ I am focusing on our v1 release first before reviewing this PR. In the meantime, I recommend using the regex with our regex action as a workaround.

JacKyDev commented 2 weeks ago

Too bad, I wanted to do it right away :( But isn't the following regex better suited for this purpose?

/^(?=.{1,253}$)(?![-_])(?!.+[-_]\.)([\w-]{1,63}\.)+[a-z]{2,63}$/iu

It better aligns with RFC 1035, doesn't it? This way, it also includes the restrictions that apply to a domain and its subdomains.

fabian-hiller commented 2 weeks ago

I haven't checked the regex yet. Almost every regex of Valibot is handmade. I will check the regex before merging. But please feel free to discuss and research so that the process of checking and merging will be faster.

JacKyDev commented 2 weeks ago

This regex is used to validate domain names. Exactly that was my intention. I looked more closely into RFC 1035 and created this regex with the goal of adhering to the specifications as much as possible so that tools like Netscaler or DNS processes will consistently accept these domains. There is a lot of guidance in RFC 1035, especially concerning octets and the conditions for when and how certain characters are allowed.

Hereā€™s a detailed breakdown of how the regex works:

/^(?=.{1,253}$)(?![-])(?!.+[-].)([\w-]{1,63}.)+[a-z]{2,63}$/iu

Explanation of each part:

  1. ^(?=.{1,253}$): This part checks that the total length of the input is between 1 and 253 characters, which is the maximum allowable length for domains.

  2. (?![-_]): This ensures that the domain name does not begin with a hyphen (-) or an underscore (_).

  3. (?!.+[-_]\.): This negative lookahead ensures there is no combination of a hyphen or underscore immediately before a period (.). This prevents invalid patterns like domain-.com.

  4. ([\w-]{1,63}\.)+: This section validates each label (segment) of the domain name:

    • [\w-]{1,63} allows between 1 and 63 alphanumeric characters or hyphens (but no _, which is not allowed in domain names).
    • \. at the end ensures each label ends with a period.
    • The + at the end allows one or more repetitions of this pattern, supporting subdomains.
  5. [a-z]{2,63}$: This section at the end checks if the TLD (top-level domain) contains only lowercase letters and is between 2 and 63 characters long, as is typical for TLDs (e.g., .com, .online).

  6. iu: The flags i and u represent:

    • i (ignore case): Case-insensitive, so .COM and .com are both accepted as valid.
    • u (unicode): Enables Unicode support for character classes like \w.

Examples of valid domains:

Examples of invalid domains:

This regex is designed to follow DNS standards closely, enabling reliable domain validation across different DNS systems and tools.

This way, I can still help a bit :)

paksa92 commented 2 weeks ago

Sorry @JacKyDev I didn't mean to steal your thunder. I just had some spare time left and since it been a few days since you responded I figured I'd step in instead.

I will look into the regex you proposed when I get the chance today!

JacKyDev commented 2 weeks ago

@paksa92 All good, I'll survive :D
But since I already put together the regex yesterday based on what Iā€™d read, I thought it might be a way to help out. In the end, it saves me time too :)

JacKyDev commented 2 weeks ago

I'm not sure if this is common practice at Valibot, so Iā€™ll just mention it. I would assume that the documentation could also be updated so that domainName not only exists but can also be displayed in the documentation in the future. Iā€™ve also asked Fabian about this in the issue.

Additionally, I would suggest that something similar to email should also be included in the necessary documentation files.

If I haven't overlooked anything, those would be:

Also, the menu.md in routes/api to find domainName in the docs.

But I'm not sure if this is common practice at Valibot or if I might be overthinking it. Perhaps you should wait for Fabian to respond, as I also asked him this question here: https://github.com/fabian-hiller/valibot/issues/894.

paksa92 commented 2 weeks ago

@JacKyDev yeah, I already mentioned that it is missing in the PR, but I will add it.

CONTRIBUTING.md does not mention a "definition of done" for new features, maybe it should? @fabian-hiller

fabian-hiller commented 2 weeks ago

I suppose devs rarely look at the CONTRIBUTING.md. So it is not too important. But feel free to create a PR.

paksa92 commented 2 weeks ago

/^(?=.{1,253}$)(?![-])(?!.+[-].)([\w-]{1,63}.)+[a-z]{2,63}$/iu

@JacKyDev I'm currently testing this regex and I think it's wrong. First off, TLD's may not exceed a length of 6. Changing the regex to:

/^(?=.{1,253}$)(?![-])(?!.+[-].)([\w-]{1,63}.)+[a-z]{2,6}$/iu

But, this regex still allows TLD's exceeding a length of 6. I suspect one of the negative lookaheads (?![-])(?!.+[-].) causes the problem, but I'm not sure. I can't seem to fix it.

I tweaked the original regex in my PR to account for the max length of a domain name, which now should cover all the cases you mentioned:

/^(?=.{1,253}$)(?!-)([a-z0-9-]{1,63}(?<!-)\.)+[a-z]{2,6}$/iu

Please test it and let me know if you find any issues.

JacKyDev commented 2 weeks ago

In the RFC, I donā€™t see any limitation on the domain suffix itself but rather on the entire domain and its segments.

Generally, the value is that a domain, without a . separator, is at least 2 characters long.

Itā€™s true that the usual domains are about 6 characters long, but when I check online, I see valid domains like:

So, I believe the restriction isnā€™t quite correct. A domain suffix is an official label that, as far as I know, can theoretically be up to 63 octets as long as it doesnā€™t exceed 255 octets for the entire domain.

Correct me if you have other information, but in that case, the regex should look like this: /^(?=.{1,253}$)(?!-)([a-z0-9-]{1,63}(?<!-)\.)+[a-z]{2,63}$/iu;

Anyway, roughly speaking, Iā€™ll take another look later so we can also cover this better with tests.

JacKyDev commented 2 weeks ago

I also did some additional research and couldnā€™t find anything that limits it to 6 characters.

From my understanding, that would mean the regex should match as specified. Do you have anything that suggests otherwise?

In the meantime, I ran your tests, and apart from the one mentioned below, all tests are passing. The failing one makes sense, though, since a TLD could now be longer than 6 characters after the change.


test("should not match 'example.commmerce' - TLD too long (more than 6 characters)", () => {
    expectActionIssue(action, baseIssue, ['example.commmerce']);
});

Fun fact: thereā€™s also a .shopping TLD, which is offered for ecommerce

If you donā€™t find anything else, then the regex should be all set, and the next step would be to get feedback from Fabian hopefully, weā€™ve managed to save him some work :)

paksa92 commented 1 week ago

You're right @JacKyDev, there is no limit of 6 characters on the TLD. My assumption was based on some stackoverflow posts I've seen.

I think the regex is all set now! I will commit it today :)

tats-u commented 1 week ago

^(?=.{1,253}$) references the number of UTF-32 code points if u flag is added. length property of the input string references UTF-16. I don't know which is better for domain validation. If both can be used, I recommend to check length first to avoid cheking the regex for too long strings.

/^(?=.{1,253})/u.test("šŸ˜€".repeat(253)) yields true while "šŸ˜€".repeat(253).length <= 253 yields false.

JacKyDev commented 1 week ago

@tats-u : I understand what you mean, but when I test this regex with emojis or other Unicode characters, it consistently flags them as invalid. This aligns with what I expected, since domains technically only allow ASCII characters. However, I can't replicate what you're describing in a unit testā€”here, all tests fail as soon as characters outside of a-z, 0-9, or - are included. So, I'm not entirely sure what you're aiming for.

From a technical standpoint, this is exactly what I'd expect, as any characters outside the ASCII range should be represented in Punycode to be valid in a domain. Converting to Punycode means that an emoji, for example, doesn't count as a single character, but as multiple ASCII characters, which would also increase the string lengthā€”especially since Punycode includes a prefix.

Do you have a test example we can try? Currently, emojis and special characters like "Ć¼" are rejected, just as intended.

tats-u commented 1 week ago

since domains technically only allow ASCII characters. However, I can't replicate what you're describing in a unit testā€”here, all tests fail as soon as characters outside ofĀ a-z, 0-9, or -Ā are included.

If so we don't have to care about surrogate pairs. You designed this validator on the premise that we need to make Unicode domains punycoded in advance. I was reminded that users care about bundle size. The current regex is small compared to code using length property and sufficient.

JacKyDev commented 1 week ago

I think weā€™re in a good place by strictly requiring specific values... This way, the domain itself is validated very precisely. And we wouldnā€™t be offering any special exceptions. Separately, one could consider adding a toPunyCode action in the future to initially standardize a value, making it usable for everyone. This approach, I think, creates more potential.

I wouldnā€™t really be against the additional check, because youā€™re rightā€”itā€™s not much more code.

I still lack experience with the process here, as a regex action might feel messy if it includes code checks. But a length Check ist really small... :) @fabian-hiller could probably answer that better, I think.

tats-u commented 1 week ago

toPunyCode

I agree with it. There are 2 strategies:

  1. Test the return value of toPunyCode by the regex, then pass the return value of toPunyCode to the return value of parse
  2. Test the return value of toPunyCode by the regex, then pass the input to the return value of parse as is

Users pass the punycoded domains to other libraries in 1. while they utilize unpunycoded Unicode domains as are.

JacKyDev commented 1 week ago

I think in the end it might need to be more of a combination of all options to cover every use case. :)

domainName validates a string as a real domain as the status quo. Thereā€™s an additional action, punycodeDomainName, which implements Strategy 2. toPunyCode converts a domain, enabling both options 1 and 2.

Background:

Point 1: Only valid domains, excluding Punycode.

Point 2: The action converts the string into a Punycode string, validates it against the domain, but still returns the specified domain as the result.

Basically, as you described, itā€™s for cases where someone wants to pass on the original domain without processing it into Punycode.

Point 3: For someone validating a domain for Netscaler processes, they would likely use the following function:

domainSchema = v.pipe(v.string(), v.toPunyCode(), v.domainName())

The aim is to check the actual domain in DNS or another service.

This approach would cover all use cases, and if someone wants to use toPunyCode for other applications, like email, that would also work.

My suggestion would be to rename the current action to rfcDomain, so we can later introduce punycodeDomain and toPunyCode to handle further use cases. Iā€™d like some feedback from Fabian on this. Then we could finish up the current work, and consider developing the following actions as potential future features:

The advantage here would be that we offer a variety of options, covering all use cases comprehensively. Additionally, this approach would still support tree-shaking and other optimizations.

fabian-hiller commented 1 week ago

Hey šŸ‘‹ I am pretty busy at the moment. However, I will try to give you feedback this week to point you in the right direction if necessary. Thanks for your contribution!