False positive mixed_case_recommended_field for numeric values

MKuranowski commented 1 year ago

Describe the bug

The mixed_case_recommended_field warning is always triggered when a field contains digits, like route_short_name=A1 or trip_short_name=101.

Characters which are not letters should not be considered at all when comparing case.

Steps/Code to Reproduce

Validate a file which uses numbers in fields like route_short_name or trip_short_name.

Expected Results

No mixed_case_recommended_field warning for those fileds.

Actual Results

{
      "code": "mixed_case_recommended_field",
      "severity": "WARNING",
      "totalNotices": 360,
      "sampleNotices": [
        {
          "filename": "routes.txt",
          "fieldName": "route_short_name",
          "fieldValue": "A1",
          "csvRowNumber": 2
        },
        {
          "filename": "routes.txt",
          "fieldName": "route_short_name",
          "fieldValue": "ZA12",
          "csvRowNumber": 3
        },
        {
          "filename": "trips.txt",
          "fieldName": "trip_short_name",
          "fieldValue": "301",
          "csvRowNumber": 2
        },
        {
          "filename": "trips.txt",
          "fieldName": "trip_short_name",
          "fieldValue": "5301",
          "csvRowNumber": 3
        },
        {
          "filename": "trips.txt",
          "fieldName": "trip_short_name",
          "fieldValue": "103",
          "csvRowNumber": 4
        },
        ...
}

Screenshots

No response

Files used

https://mkuran.pl/gtfs/wkd.zip

Validator version

4.1.0

Operating system

Debian 12

Java version

openjdk version "17.0.6" 2023-01-17

Additional notes

No response

welcome[bot] commented 1 year ago

Thanks for opening your first issue in this project! If you haven't already, you can join our slack and join the #gtfs-validators channel to meet our awesome community. Come say hi :wave:!

Welcome to the community and thank you for your engagement in open source! :tada:

briandonahue commented 1 year ago

@isabelle-dr How do we want to approach this? It looks like in some cases when numbers are present, we don't want to throw the warning: A1 103B 1205

In those cases we could skip the mixed case validation altogether, but how about longer strings? Should we have a length minimum or number of letters in a string before mixed case is considered? Thinking about strings like:

MY ROUTE 100 route 1 boulevard

etc

isabelle-dr commented 1 year ago

@briandonahue thanks for having a look at this. After giving this some thought, here is a proposal: We could modify the logic so we only run this check if the field value contains at least one " " character and if it doesn't contain only numbers. This means that the following values would be valid: ZA12, 1205, 1205 1.

cc @tzujenchanmbd

briandonahue commented 1 year ago

@isabelle-dr I had a couple thoughts. 1) I think the current implementation has another flaw - I believe it only checks for the 26 letters of the basic alphabet /[a-zA-Z]/ which will not match characters such as àâçéèêëîïôûùüÿñæœ (and probably others).

2) Given the above, I think maybe a better check would be:

the value contains more than one "letter" (defined as non-numeric, and non-white-space characters)
AND the original value is the same as the value after being converted to all upper OR all lower case (this will of course not affect numbers and white space, but should handle other characters more efficiently than a regex)

We may still get some false positives if there are values such as AA-103 or B-34-C or D-E etc. But it is a recommendation/warning not an error, and it shoudl cut down significantly on false positives for numeric values or single-letter + numbers.

// if more than one non-numeric, non-whitespace character
if((value.length() - countNumericChars(value) - countWhiteSpaceChars(value)) > 1) &&
// AND the value is all upper OR lower case characters
(value.toLowerCase() == value || value.toUpperCase() == value){
    notices.add(new MixedFieldNotice());
}

isabelle-dr commented 1 year ago

I like this idea! We would still get the notice for something like ZA12, right? Should we also add the following condition so that we differentiate acronyms (such as ZA12) from the natural text (such as Route 1B)?

the field value contains at least one " " character

briandonahue commented 1 year ago

I like this idea! We would still get the notice for something like ZA12, right? Should we also add the following condition so that we differentiate acronyms (such as ZA12) from the natural text (such as Route 1B)?

the field value contains at least one " " character

@isabelle-dr I thought about this, but I think we want to keep it for long words with no spaces such as Lancaster?

Another approach would be to validate any values with strings of more than 1 letter, regardless of spaces. In this case A 101 and 342B would be ignored but RTE 101, RTE23 and route 1B would be flagged. If we don't want something like RTE23 to be flagged, we could be more strict and ignore strings of mixed letters and numbers (but no spaces). So A12387abi would be ignored but ROUTE A2348B would be flagged.

briandonahue commented 1 year ago

After further thought, I think the simplest logic that will cover most cases is to validate any value that has more than one consecutive letter, regardless of numbers or spaces. It will flag things like 123ABC but that seems like it would be uncommon, and it's only a warning... we could also re-visit if users complain, and add some more complex logic, to ignore mixed number/letter strings or something.

Pseudo-code for simpler logic:

// any consecutive non-numeric, non-space characters (2 or more sequential letters)
if (/[0-9 ]{2,}/.test(value)  &&
// AND the value is all upper OR lower case characters
(value.toLowerCase() == value || value.toUpperCase() == value)){
    notices.add(new MixedFieldNotice());
}

isabelle-dr commented 1 year ago

These are very good points, thank you for giving it some thought @briandonahue! What you shared in the last message sounds great 👍

briandonahue commented 1 year ago

@isabelle-dr I've got this working (modified to allow non-English characters, etc) but I noticed the original issue lists ZA12 as a potential false positive. Thoughts? Should we leave it as a warning that can be ignored, or should we consider some larger allowable number of characters mixed with numbers in a single string 😅

MKuranowski commented 1 year ago

more than one consecutive letter

Why not more than two?

Two-letter codes for routes are almost too common. Apart from the example from Warsaw Commuter Rail, rapid suburban railway services in Berlin use them (RE and RB), Singapore MTR (EW, NS, NE, CC, DT and TE), Taipei Metro (BL and BR), rail services in Tokyo (too many to list), and many more.

I'd even go as far as too say that 3-letter codes are also relatively common, see Paris RER or Gemarny's ICE. Asian networks also use acronyms like MRT, LRT pretty often.

The posted pseudocode earlier is arguably even worse. It flags things without letters at all, or flags a very common numbering scheme where a number is prefixed or suffixed with a single letter.

obraz

The idea of value.toLowerCase() == value || value.toUpperCase() == value makes no sense if value contains characters which don't have the notion of case - which is most of characters. Digits, and pretty much every other alphabet other than Latin, Greek or Cyrylic will always return true on such a test.

MKuranowski commented 1 year ago

A much more reliable strategy would be to rely on Unicode character properties.

Catching 3 consecutive upper case letters can be done with something like /\p{Lu}{3,}/u.test(value). 3 consecutive lower case letters are more tricky, as something like /\p{Ll}{3,}/u.test(value) also catches things like Orange. I think something like (?<!\p{LC})\p{Ll}{3,}/u.test(value) (3 consecutive lower case letters not preceded by any other cased letter) works: https://regex101.com/r/GVPMmO/1.

briandonahue commented 1 year ago

@MKuranowski Thank you for the feedback! I realized some of this during my testing, and changed the approach in the draft PR. Here is a link to the current implementation, it is generated so a little hard to read, but it ends up being this:

      if (value.matches(".*\\p{L}{2}.*") && 
          (value.toLowerCase() == value || value.toUpperCase() == value)) {
      // add notice
      }

We can adjust the requirement such that we only check sequential letters of 4 or more, or whatever we choose. I can also add better test cases like the ones you included above and any other suggestions.

I simplified the check to compare upper cased and lower cased values to the original, let me know if you see any flaw with that approach.

MKuranowski commented 1 year ago

The main problem with ZA12 is still not solved though; for that 2 consecutive letters with the same case should be allowed. As I mentioned earlier, there are a lot of systems which use 2-letter codes for routes; so I don't think this should be an issue.

The v.toLowerCase() == v || v.toUpperCase() == v is not able to catch mixed-case situations (like route 1B), and, more importantly, still breaks for non-case scripts (급행12 or 東西線). The second situation could be mitigating to only matching on cased letters (\p{LC}), not just any letter from any script.

My regex flags route 1B correctly, but then has false positives on Another good value and Avenue des Champs-Élysées. Those contain all-lower-case words (good or des). I don't think it's possible to correctly categorize all 3 strings in a straightforward way, though.

v.matches(".*\\p{LC}{3}.*") && (v.toLowerCase() == v || v.toUpperCase() == v) seems to correctly categorize all mine and already-existing test cases, with the sole exception of route 1B.

briandonahue commented 1 year ago

@MKuranowski Another possibility would be to match all words/tokens greater than 3 cased characters, (no numbers, no non-case script characters), and then check that each conforms to some rules (off the top of my head, no all-caps, and at least some with initial caps?). I don't know though... at some point this is just a warning / suggestion and I think as long as we catch the bulk of the fields that should be corrected, users can ignore the warning for fields that are falsely flagged.

Also, as @isabelle-dr mentions here, in the future rules may be configurable so users that are dealing with a lot of false flags for this warning can opt to disable it.

@davidgamez @bdferris-v2 @isabelle-dr curious what your thoughts are.

briandonahue commented 1 year ago

@MKuranowski @davidgamez

This is the compromise I've come up with so far...

Split text into letter-only tokens (split on white-space, numbers, and punctuation).
Ignore any tokens that are 3 characters or less
Any tokens that are > 3 chars, must start with an uppper case letter and the rest of the token must be lower case (Title Case -ish, but TIL that Title Case means different things in different alphabets)

This validates almost every case presented as desired with a couple exceptions: Another good value will get flagged, it should be Another Good Value

MixedCase will get flagged - I'm not sure how common an upper case letter is in the middle of a token without punctuation or other characters, but this seemed like a good general rule.

Find the current test set here

and the generated Java logic looks like this:

String[] tokens = value.split("[^\\p{L}]+");
boolean isValid = true;
for (String token : tokens) {
  if (!((token.length() <= 3) || 
      (Character.isUpperCase(token.charAt(0)) && 
       token.substring(1).toLowerCase().equals(token.substring(1))))) {
    isValid = false;
    break;
  }
}

laurentg commented 1 year ago

Why not dropping altogether the mixed-case policy for route_short_name and trip_short_name? There is already a length warning on route_short_name for example, and given the data I've seen I think it's hard to determine easily and reliably if upper or lower case are OK or not, especially using non-Latin charsets (where the concept of "lowercase" is sometimes unknown).

Anyway the current state of this rule is raising way too much warnings to be useful, something has to be done.

isabelle-dr commented 1 year ago

@laurentg thanks for the ping and reminder that these false positives should be removed quickly.

Before making the decision to remove this rule, I would like to see how the logic implemented by @briandonahue in PR#1430 is doing. I suggest we do:

Look at for how many datasets from the Mobility Database this notice is disappearing with PR#1430.
Look at the list of remaining datasets that have this notice, after PR#1430.
Among them, what is the estimated % of false positives? (approx, no need for an exact number)
If more than ~5%, remove this notice from the validator until we find a better logic.

briandonahue commented 1 year ago

@isabelle-dr Do we have an existing tool/process for querying the validation results for the Mobility Dataset? I'm not sure how to go about that other than pulling down results from a test run on Github and using a JSON CLI query tool to try to query the JSON results?

szjozsef commented 1 year ago

Hello, besides the false positives on the _short_name fields, I also have issue with this :

{
    "filename": "stops.txt",
    "fieldName": "stop_name",
    "fieldValue": "\u00cenv\u0103\u021b\u0103torilor",
    "csvRowNumber": 436
}

which is an unicode-escaped string for: Învățătorilor, which has the correct mixed case format.

isabelle-dr commented 1 year ago

@briandonahue I am not sure your question is still relevant, but here is a vague answer to it 😅. We've been doing this ad-hoc, I believe a little differently every time.

Since we built the acceptance tests, we've been leveraging them by doing things like this, and then running analytics on the acceptance test report with python & pandas in a Google Colab. Then, Brian opened this PR to avoid doing that, WIP.

MobilityData / gtfs-validator