Syntax for numbers with units

Bagolly commented 1 year ago

Regarding syntactical changes:

Case sensitivity?
Whitespace sensitivity?
Rules for separation from other tokens? What is considered the end of a unit?

Whitespace sensitivity:

Case 1: Exactly one whitespace

  string expression = "3 cm";

✔️ Easy to validate
❌ Inflexible regarding input
❌ Can be irritating to conform to this rule, especially in longer expressions

Case 2: No whitespace

  string expression = "3cm";

✔️ Convenient, easy-to-remember syntax
❌ Can be tricky to validate due to other formats (like hexadecimal numbers)

Case 3: Whitespace insensitive

   string expression = "3                 cm";

✔️ Most flexible regarding input
✔️ Whitespace insensitive like all other types of input
❌ Likely requires the most validation logic
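To make the three whitespace policies concrete, here is a minimal Python sketch of how each could be validated with a regular expression. The unit table, the `PATTERNS` names, and `parse_quantity` are hypothetical and only illustrate the trade-off; none of this is taken from WHISKR itself.

```python
import re

# Hypothetical unit table, for illustration only.
UNITS = ("cm", "mm", "m", "kg", "s")
# Longest symbols first so that "cm" is tried before "m".
UNIT_ALT = "|".join(sorted(UNITS, key=len, reverse=True))
NUMBER = r"\d+(?:\.\d+)?"

PATTERNS = {
    # Case 1: exactly one space between number and unit.
    "exactly_one": re.compile(rf"({NUMBER}) ({UNIT_ALT})"),
    # Case 2: no whitespace at all.
    "none": re.compile(rf"({NUMBER})({UNIT_ALT})"),
    # Case 3: any amount of spaces or tabs, including none.
    "insensitive": re.compile(rf"({NUMBER})[ \t]*({UNIT_ALT})"),
}

def parse_quantity(text, policy):
    """Return (value, unit) if the whole text is a number with a unit, else None."""
    match = PATTERNS[policy].fullmatch(text)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)
```

In isolation all three patterns are about equally short. The extra validation cost of Cases 2 and 3 shows up once other literal formats, such as hexadecimal numbers, share the same tokenizer.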

Case sensitivity:

Case 1: Case-sensitive; all lowercase

   string expression = "3 cm";

✔️ Easy to remember
✔️ Easy to validate
❌ Inflexible regarding input
❌ Not everyone prefers all lowercase

Case 2: Case-sensitive; all uppercase

  string expression = "3 CM";

✔️ Easy to remember
✔️ Easy to validate
❌ Inflexible regarding input
❌ Might be harder to read, especially in longer expressions
❌ Not everyone prefers all uppercase

Case 3: Case-insensitive

  string expression = "3 cM";

✔️ Most flexible regarding input
✔️ No need to remember casing
❌ Can cause collisions because casing often matters in units
❌ Requires the most code (not by much overall, however)
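The collision risk of case-insensitive matching can be shown with a small Python sketch. The symbol table and both lookup functions are hypothetical. The symbols themselves follow standard SI casing, where `m` is milli and `M` is mega.

```python
# Hypothetical symbol table; the symbols follow SI casing rules.
UNITS = {
    "mm": "millimeter",
    "Mm": "megameter",
    "mW": "milliwatt",
    "MW": "megawatt",
    "cm": "centimeter",
}

def lookup_case_sensitive(symbol):
    """Return the unit name for an exact symbol match, or None."""
    return UNITS.get(symbol)

def lookup_case_insensitive(symbol):
    """Return every unit whose symbol matches when casing is ignored."""
    folded = symbol.lower()
    return sorted(name for sym, name in UNITS.items() if sym.lower() == folded)
```

Here `lookup_case_insensitive("mw")` returns two candidates that differ by nine orders of magnitude, while `"cM"` happens to be unambiguous. A case-insensitive parser would need an extra rule to break such ties.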

Mimicicax commented 1 year ago

Inflexibility is not really an issue; we are all familiar with the drawbacks of certain scripting languages with lenient parsers.
Whether the units should be case-sensitive or not is answered by NIST at: https://www.nist.gov/pml/owm/writing-si-metric-system-units. Specifically:

The names of all units start with a lower case letter except, of course, at the beginning of the sentence

Unit symbols are written in lower case letters except for liter and those units derived from the name of a person (m for meter, but W for watt, Pa for pascal, etc.)

Symbols of prefixes that mean a million or more are capitalized and those less than a million are lower case (M for mega (millions), m for milli (thousandths))

My suggestion: _unitName (e.g. 25_cm), where the _ is part of the suffix. This does not cause issues with hexadecimals and is in fact how C++ implements user-defined literals. (Built-in integer literal suffixes do not require the _, but none of them start with a letter in the range A-F, so confusion with hexadecimal literals is not an issue there.)

Unless we can guarantee that no literals will be added in the future that start with the letters A-F, the _unitName syntax is probably the best solution.
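A short Python sketch of why the underscore helps. The `classify` function, the token shapes, and the use of `cd` (candela) as the example unit are assumptions made for illustration, not part of WHISKR.

```python
import re

HEX = re.compile(r"0[xX][0-9a-fA-F]+")
# Assumed shape of the proposed syntax: a decimal or hex integer,
# then "_" followed by the unit name.
SUFFIXED = re.compile(r"(0[xX][0-9a-fA-F]+|\d+)_([A-Za-z]+)")

def to_int(digits):
    """Convert a decimal or 0x-prefixed hexadecimal digit string to an int."""
    return int(digits, 16) if digits[:2].lower() == "0x" else int(digits)

def classify(token):
    """Classify a token as a plain hex literal, a unit literal, or unknown."""
    if HEX.fullmatch(token):
        return ("hex", to_int(token))
    match = SUFFIXED.fullmatch(token)
    if match:
        return ("quantity", to_int(match.group(1)), match.group(2))
    return ("unknown", token)
```

Without a separator, `0x1cd` can only be read as the hexadecimal number 461, never as "0x1 candela". With the underscore, `0x1_cd` and `25_cm` are unambiguous because `_` is not a hex digit.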

Bagolly commented 10 months ago

Nah, too much C++. For now, it's case-sensitive, and exactly one whitespace sounds like the best compromise between readability (and thereby leniency) and easy parsing.

One potential problem with parsing is that associating the unit with the literal at tokenization time would require a unit-token check after each number parse, even though units are unlikely to be used all that much. A possible solution is to parse them separately, but then deciding whether the token following a number is a variable or a unit would add a lot of computation for an 'interpreted' language like WHISKR.
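The tokenization-time approach can be sketched in Python as follows. WHISKR's actual tokenizer is not shown in this thread, so the token names, the `UNITS` set, and the exactly-one-space rule are assumptions made for illustration.

```python
import re

UNITS = {"cm", "m", "kg"}  # hypothetical unit table

# One token: a number, an identifier, or any other single character.
TOKEN = re.compile(r"\s*(?:(\d+(?:\.\d+)?)|([A-Za-z_]\w*)|(\S))")
# Unit check: exactly one space, then a run of letters ending at a word boundary.
UNIT_AFTER_NUMBER = re.compile(r" ([A-Za-z]+)\b")

def tokenize(source):
    """Split source into tokens, attaching a unit to a number when one follows."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            break  # only trailing whitespace is left
        pos = match.end()
        number, name, symbol = match.groups()
        if number is not None:
            # This lookahead runs after every number, whether or not units are used.
            unit = UNIT_AFTER_NUMBER.match(source, pos)
            if unit and unit.group(1) in UNITS:
                tokens.append(("QUANTITY", float(number), unit.group(1)))
                pos = unit.end()
            else:
                tokens.append(("NUMBER", float(number)))
        elif name is not None:
            tokens.append(("IDENT", name))
        else:
            tokens.append(("SYMBOL", symbol))
    return tokens
```

In this sketch `tokenize("3 cm + x")` yields a single `QUANTITY` token followed by `+` and `x`, while in `tokenize("2 * m")` the `m` stays an identifier. The cost described above is the extra lookahead after every number.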

So I will be dropping this feature for now and will consider it an idea to be integrated into the new WHISKR.

Mimicicax commented 10 months ago

Noob

Bagolly commented 8 months ago

Closed; too open.

Bagolly / WHISKR-A