agentm / project-m36

Project: M36 Relational Algebra Engine
The Unlicense

utf8 attribute names can be defined but not projected. #304

Closed YuMingLiao closed 2 years ago

YuMingLiao commented 2 years ago

TutorialD (master/main): n :: {食品分類 Text}
TutorialD (master/main): :showexpr n
┌──────────────┐
│食品分類::Text│
└──────────────┘
TutorialD (master/main): :showexpr n{食品分類}
____^
ERR:offset=12:
unexpected '食'
expecting "all but", "all from", "intersection of", "union of", '}', or lowercase letter

agentm commented 2 years ago

I found the inconsistency in the parser, but there is another issue: in Haskell at least, the parser uses the first letter of an identifier to determine whether the identifier is a type or a name. That's what we are trying to achieve here, but that doesn't make sense in the context of a language which has neither capitalization nor letters.

The problem would arise, for example, if you were to create a type called "食品分類":

TutorialD (master/main): data 食品分類 = 蔬菜 | 肉 
_____________________________^
ERR:offset=5:
unexpected "食品"
expecting "::", ":=", or uppercase letter

This is in line with how Haskell operates:

Prelude> data 食品分類 = 蔬菜 | 肉 

<interactive>:1:13: error: Not a data constructor: ‘蔬菜’

but prefixing the type and data constructors with a Latin capital letter works:

Prelude> data D食品分類 = D蔬菜 | D肉

I'll push a fix to make the parsing rules consistent, but what should we do long term?

Would you expect some sort of special quoting?

Is the Latin-based workaround which we would use in Haskell acceptable?

Is there another Latin-based language which supports non-Latin names better which we could emulate?
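
The first-letter rule described in this comment can be illustrated with a minimal sketch. This is not the actual Project:M36 parser code. The `IdentClass` type and the `classify` function are hypothetical names introduced only for illustration, and the sketch assumes the rule is "uppercase first character means type, lowercase first character means name", using the standard `Data.Char` predicates.

```haskell
import Data.Char (isLower, isUpper)

-- | Hypothetical result type: how a first-letter rule sees an identifier.
data IdentClass = TypeLike | NameLike | Unclassifiable
  deriving (Eq, Show)

-- | Classify an identifier by the case of its first character only.
-- Characters from scripts without case (such as CJK ideographs) are
-- neither upper nor lower case, so they fall through to Unclassifiable.
classify :: String -> IdentClass
classify [] = Unclassifiable
classify (c:_)
  | isUpper c = TypeLike
  | isLower c = NameLike
  | otherwise = Unclassifiable
```

Under these assumptions, `classify "食品分類"` evaluates to `Unclassifiable`, while `classify "D食品分類"` is `TypeLike` and `classify "d食品分類"` is `NameLike`, which is why a Latin prefix lets such a rule make a decision.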

agentm commented 2 years ago

To be clear, here is the current, rather ugly workaround:

TutorialD (master/main): n :: {d食品分類 Text}
TutorialD (master/main): :showexpr n{d食品分類}
┌───────────────┐
│d食品分類::Text │
└───────────────┘

It's not ideal and I would like to work with you to come up with a better solution.

YuMingLiao commented 2 years ago

I haven't thought about it. I think it's a great solution. After all, there is no consistent way that I can think of to distinguish if a non-Latin word is a type or name or whatever. Even if I can use non-Latin names, I may need to understand their usage when reading code. I might as well just keep it explicit and conventional. Thanks!