Closed MKuranowski closed 1 year ago
Thanks for opening your first issue in this project! If you haven't already, you can join our slack and join the #gtfs-validators channel to meet our awesome community. Come say hi :wave:!
Welcome to the community and thank you for your engagement in open source! :tada:
@isabelle-dr How do we want to approach this? It looks like in some cases when numbers are present, we don't want to throw the warning:
A1
103B
1205
In those cases we could skip the mixed case validation altogether, but how about longer strings? Should we have a length minimum or number of letters in a string before mixed case is considered? Thinking about strings like:
MY ROUTE 100
route 1 boulevard
etc
@briandonahue thanks for having a look at this.
After giving this some thought, here is a proposal:
We could modify the logic so we only run this check if the field value contains at least one " " character and if it doesn't contain only numbers.
This means that the following values would be valid: ZA12
, 1205
, 1205 1
.
cc @tzujenchanmbd
@isabelle-dr I had a couple thoughts. 1) I think the current implementation has another flaw - I believe it only checks for the 26 letters of the basic alphabet /[a-zA-Z]/
which will not match characters such as àâçéèêëîïôûùüÿñæœ
(and probably others).
2) Given the above, I think maybe a better check would be:
We may still get some false positives if there are values such as AA-103
or B-34-C
or D-E
etc. But it is a recommendation/warning not an error, and it shoudl cut down significantly on false positives for numeric values or single-letter + numbers.
// if more than one non-numeric, non-whitespace character
if((value.length() - countNumericChars(value) - countWhiteSpaceChars(value)) > 1) &&
// AND the value is all upper OR lower case characters
(value.toLowerCase() == value || value.toUpperCase() == value){
notices.add(new MixedFieldNotice());
}
I like this idea!
We would still get the notice for something like ZA12
, right?
Should we also add the following condition so that we differentiate acronyms (such as ZA12
) from the natural text (such as Route 1B
)?
I like this idea! We would still get the notice for something like
ZA12
, right? Should we also add the following condition so that we differentiate acronyms (such asZA12
) from the natural text (such asRoute 1B
)?
- the field value contains at least one " " character
@isabelle-dr I thought about this, but I think we want to keep it for long words with no spaces such as Lancaster
?
Another approach would be to validate any values with strings of more than 1 letter, regardless of spaces. In this case A 101
and 342B
would be ignored but RTE 101
, RTE23
and route 1B
would be flagged. If we don't want something like RTE23
to be flagged, we could be more strict and ignore strings of mixed letters and numbers (but no spaces). So A12387abi
would be ignored but ROUTE A2348B
would be flagged.
After further thought, I think the simplest logic that will cover most cases is to validate any value that has more than one consecutive letter, regardless of numbers or spaces. It will flag things like 123ABC
but that seems like it would be uncommon, and it's only a warning... we could also re-visit if users complain, and add some more complex logic, to ignore mixed number/letter strings or something.
Pseudo-code for simpler logic:
// any consecutive non-numeric, non-space characters (2 or more sequential letters)
if (/[0-9 ]{2,}/.test(value) &&
// AND the value is all upper OR lower case characters
(value.toLowerCase() == value || value.toUpperCase() == value)){
notices.add(new MixedFieldNotice());
}
These are very good points, thank you for giving it some thought @briandonahue! What you shared in the last message sounds great 👍
@isabelle-dr I've got this working (modified to allow non-English characters, etc) but I noticed the original issue lists ZA12
as a potential false positive. Thoughts? Should we leave it as a warning that can be ignored, or should we consider some larger allowable number of characters mixed with numbers in a single string 😅
more than one consecutive letter
Why not more than two?
Two-letter codes for routes are almost too common. Apart from the example from Warsaw Commuter Rail, rapid suburban railway services in Berlin use them (RE and RB), Singapore MTR (EW, NS, NE, CC, DT and TE), Taipei Metro (BL and BR), rail services in Tokyo (too many to list), and many more.
I'd even go as far as too say that 3-letter codes are also relatively common, see Paris RER or Gemarny's ICE. Asian networks also use acronyms like MRT, LRT pretty often.
The posted pseudocode earlier is arguably even worse. It flags things without letters at all, or flags a very common numbering scheme where a number is prefixed or suffixed with a single letter.
The idea of value.toLowerCase() == value || value.toUpperCase() == value
makes no sense if value contains characters which don't have the notion of case - which is most of characters. Digits, and pretty much every other alphabet other than Latin, Greek or Cyrylic will always return true on such a test.
A much more reliable strategy would be to rely on Unicode character properties.
Catching 3 consecutive upper case letters can be done with something like /\p{Lu}{3,}/u.test(value)
.
3 consecutive lower case letters are more tricky, as something like /\p{Ll}{3,}/u.test(value)
also catches things like Orange
. I think something like (?<!\p{LC})\p{Ll}{3,}/u.test(value)
(3 consecutive lower case letters not preceded by any other cased letter) works: https://regex101.com/r/GVPMmO/1.
@MKuranowski Thank you for the feedback! I realized some of this during my testing, and changed the approach in the draft PR. Here is a link to the current implementation, it is generated so a little hard to read, but it ends up being this:
if (value.matches(".*\\p{L}{2}.*") &&
(value.toLowerCase() == value || value.toUpperCase() == value)) {
// add notice
}
We can adjust the requirement such that we only check sequential letters of 4 or more, or whatever we choose. I can also add better test cases like the ones you included above and any other suggestions.
I simplified the check to compare upper cased and lower cased values to the original, let me know if you see any flaw with that approach.
The main problem with ZA12
is still not solved though; for that 2 consecutive letters with the same case should be allowed. As I mentioned earlier, there are a lot of systems which use 2-letter codes for routes; so I don't think this should be an issue.
The v.toLowerCase() == v || v.toUpperCase() == v
is not able to catch mixed-case situations (like route 1B
), and, more importantly, still breaks for non-case scripts (급행12
or 東西線
). The second situation could be mitigating to only matching on cased letters (\p{LC}
), not just any letter from any script.
My regex flags route 1B
correctly, but then has false positives on Another good value
and Avenue des Champs-Élysées
. Those contain all-lower-case words (good
or des
). I don't think it's possible to correctly categorize all 3 strings in a straightforward way, though.
v.matches(".*\\p{LC}{3}.*") && (v.toLowerCase() == v || v.toUpperCase() == v)
seems to correctly categorize all mine and already-existing test cases, with the sole exception of route 1B
.
@MKuranowski Another possibility would be to match all words/tokens greater than 3 cased characters, (no numbers, no non-case script characters), and then check that each conforms to some rules (off the top of my head, no all-caps, and at least some with initial caps?). I don't know though... at some point this is just a warning / suggestion and I think as long as we catch the bulk of the fields that should be corrected, users can ignore the warning for fields that are falsely flagged.
Also, as @isabelle-dr mentions here, in the future rules may be configurable so users that are dealing with a lot of false flags for this warning can opt to disable it.
@davidgamez @bdferris-v2 @isabelle-dr curious what your thoughts are.
@MKuranowski @davidgamez
This is the compromise I've come up with so far...
This validates almost every case presented as desired with a couple exceptions:
Another good value
will get flagged, it should be Another Good Value
MixedCase
will get flagged - I'm not sure how common an upper case letter is in the middle of a token without punctuation or other characters, but this seemed like a good general rule.
Find the current test set here
and the generated Java logic looks like this:
String[] tokens = value.split("[^\\p{L}]+");
boolean isValid = true;
for (String token : tokens) {
if (!((token.length() <= 3) ||
(Character.isUpperCase(token.charAt(0)) &&
token.substring(1).toLowerCase().equals(token.substring(1))))) {
isValid = false;
break;
}
}
Why not dropping altogether the mixed-case policy for route_short_name and trip_short_name? There is already a length warning on route_short_name for example, and given the data I've seen I think it's hard to determine easily and reliably if upper or lower case are OK or not, especially using non-Latin charsets (where the concept of "lowercase" is sometimes unknown).
Anyway the current state of this rule is raising way too much warnings to be useful, something has to be done.
@laurentg thanks for the ping and reminder that these false positives should be removed quickly.
Before making the decision to remove this rule, I would like to see how the logic implemented by @briandonahue in PR#1430 is doing. I suggest we do:
@isabelle-dr Do we have an existing tool/process for querying the validation results for the Mobility Dataset? I'm not sure how to go about that other than pulling down results from a test run on Github and using a JSON CLI query tool to try to query the JSON results?
Hello, besides the false positives on the _short_name fields, I also have issue with this :
{
"filename": "stops.txt",
"fieldName": "stop_name",
"fieldValue": "\u00cenv\u0103\u021b\u0103torilor",
"csvRowNumber": 436
}
which is an unicode-escaped string for: Învățătorilor, which has the correct mixed case format.
@briandonahue I am not sure your question is still relevant, but here is a vague answer to it 😅. We've been doing this ad-hoc, I believe a little differently every time.
Since we built the acceptance tests, we've been leveraging them by doing things like this, and then running analytics on the acceptance test report with python & pandas in a Google Colab. Then, Brian opened this PR to avoid doing that, WIP.
Describe the bug
The
mixed_case_recommended_field
warning is always triggered when a field contains digits, like route_short_name=A1
or trip_short_name=101
.Characters which are not letters should not be considered at all when comparing case.
Steps/Code to Reproduce
Validate a file which uses numbers in fields like
route_short_name
ortrip_short_name
.Expected Results
No
mixed_case_recommended_field
warning for those fileds.Actual Results
Screenshots
No response
Files used
https://mkuran.pl/gtfs/wkd.zip
Validator version
4.1.0
Operating system
Debian 12
Java version
openjdk version "17.0.6" 2023-01-17
Additional notes
No response