Invalid zip/postal codes when using locale en_GB

mikeybinns commented 1 year ago

Pre-Checks

[X] Follow our Code of Conduct.
[X] Read the Contributing Guidelines.
[X] Read the docs.
[X] Check that there isn't already an issue that reports the same bug to avoid creating a duplicate.
[X] Make sure this is a Faker issue and not related to a combination with another package.
[X] Check that this is a concrete bug. For Q&A open a GitHub Discussion or join our Discord Chat Server.
[X] The provided reproduction is a minimal reproducible example of the bug.
[ ] I am willing to provide a PR.

Describe the bug

Very closely related to #1416 but for a different locale, so I though a new ticket would be appropriate.

Great Britain post codes also have some specific rules which means that some of the generated postcodes are invalid.

I have a REGEX string for validating if a GB post code is valid, so I'm sharing it here as hopefully it will help, even if only with automated testing. This regex is created based on this gov.uk document

/^([A-Z]{1,2}[0-9]{1,2}|[A-Z]{1,2}[0-9]{1}[A-Z]{1})([0-9]{1}[ABDEFGHJLNPQRSTUWXYZ]{2})$/gi,

I think the biggest tripping block for Faker here is that the final two letters cannot be C I K M O V.

For those who come across this issue, here's a workaround to get you going in modern TypeScript:

/**
 * Generate a valid UK postcode using faker.
 * 
 * Faker's implementation is too generic to work with strict 
 * post code validators, so we must check the values returned 
 * to see if they are valid
 * @see https://github.com/faker-js/faker/issues/2390
 * 
 * @returns A promise which resolves to a valid UK postcode
 */
async function generatePostCode() {
  return new Promise<string>((resolve) => {
    let postcode = "";
    while (
      postcode
        // Remove all characters except numbers and letters (including spaces as spaces are optional)
        .replaceAll(/[^A-Z0-9]/gi, "")
        // Check value against REGEX
        .match(
          /^([A-Z]{1,2}[0-9]{1,2}|[A-Z]{1,2}[0-9]{1}[A-Z]{1})([0-9]{1}[ABDEFGHJLNPQRSTUWXYZ]{2})$/gi,
        )
    ) {
      postcode = faker.location.zipCode();
    }
    resolve(postcode);
  });
}

Minimal reproduction code

import { fakerEN_GB as faker } from "@faker-js/faker"; console.log(faker.location.zipCode());

Additional Context

In case the link breaks, here is that gov.uk document I mentioned. Appendix_C_ILR_2017_to_2018_v1_Published_28April17.pdf

Environment Info

System:
  OS: macOS 13.5.1
  CPU: (8) arm64 Apple M1
  Memory: 480.53 MB / 16.00 GB
  Shell: 5.9 - /bin/zsh
Binaries:
  Node: 18.17.1 - ~/.nvm/versions/node/v18.17.1/bin/node
  npm: 9.6.7 - ~/.nvm/versions/node/v18.17.1/bin/npm
Browsers:
  Chrome: 116.0.5845.179
  Safari: 16.6
npmPackages:
  @faker-js/faker: ^8.0.1 => 8.0.2

Which module system do you use?

[ ] CJS
[X] ESM

Used Package Manager

npm

matthewmayer commented 1 year ago

I think the tricky thing is knowing when to stop!

Because in reality the first one-two chars can't be ANY A-Z, they can only be one of the https://en.wikipedia.org/wiki/List_of_postcode_areas_in_the_United_Kingdom valid postcode areas. So AB... and AL... are valid but not AA.. or AX...

So then its tempting to hardcode those options.

But then AL only goes up from 1...10, not 1...99. So maybe we should encode them, etc etc.

The only true way to validate if a postcode is valid is to compare it against the full https://en.wikipedia.org/wiki/Postcode_Address_File

In some cases i think its OK to return plausible but invalid data if the complexity costs of adding more validity checks are too high.

I don't really know on which side of this arbitrary line this suggestion falls though!

mikeybinns commented 1 year ago

I understand what you're saying, but I think my suggestion is only related to format.

The document I've linked to (from gov.uk, which in the UK means it's an official resource) documents the format of the post code which includes possible newly created post codes. It doesn't mention specific area codes (except for examples), only the format for what can and cannot be a post code.

I understand that post codes created by faker may not match up to a real post code that actually exists, but I think it should not provide a post code that can never match because the format is incorrect.

I fully agree that hardcoding current post code areas etc. is a step too far for what the library attempts to achieve, but I do believe that my request falls on the "worth doing" side of the line, but also I understand that this may add a maintenance burden, so I understand if you decide you don't agree :)

matthewmayer commented 1 year ago

i think you're right, however at the moment zipCode() uses faker.helpers.replaceSymbols under the hood which only supports a very basic set of substitutions: #, ? and *, so i dont think this is feasible by changing the patterns alone.

ST-DDT commented 1 year ago

IIRC we (I) already considered removing replaceSymbols* in favor a more standardized approach such as string replace, regex or fake patterns.

ST-DDT commented 7 months ago

Team Decision

We should replace this usage of replaceSymbols with a call to fake (and potentially then to fromRegexp)

ST-DDT commented 2 weeks ago

I'll tackle this soon.

Would you consider the switch from replaceSymbols to fake a breaking change?

ST-DDT commented 2 weeks ago

Here is a workaround, you can use until this is fixed:

const zipCode = faker.helpers.fake([
  '{{helpers.fromRegExp("[A-Z]{1,2}[0-9]{1,2} [0-9][ABDEFGHJLNPQRSTUWXYZ]{2}")}}',
  '{{helpers.fromRegExp("[A-Z]{1,2}[0-9][A-Z] [0-9][ABDEFGHJLNPQRSTUWXYZ]{2}")}}',
]);

faker-js / faker