arxlang / astx

https://astx.arxlang.org/
Other
1 stars 4 forks source link

Add support for utf8 string #40

Open xmnlab opened 8 months ago

xmnlab commented 8 months ago

from gpt:

1. Define EncodingKind Enum

First, ensure that you have an enumeration to represent the different encoding formats supported by your Char and String literals.

In astx/enums.py, add:

# astx/enums.py

from enum import Enum
from public import public

@public
class EncodingKind(Enum):
    """Enumeration of supported encoding formats."""
    ASCII = 1
    UTF8 = 2
    UTF16 = 3
    UTF32 = 4
    # Add more encodings as needed

2. Update ASTKind Enum

Add new kinds for CharType, StringType, LiteralChar, and LiteralString to your ASTKind enumeration.

In astx/base.py, add:

# astx/base.py

@public
class ASTKind(Enum):
    """The expression kind class used for downcasting."""

    # ... existing kinds ...

    CharTypeKind = -820          # Add this line
    StringTypeKind = -821        # Add this line
    LiteralCharKind = -822        # Add this line
    LiteralStringKind = -823      # Add this line

    # ... rest of the code ...

3. Define Char and String Classes

Create classes to represent character and string types with support for different encodings.

In astx/datatypes.py, add:

# astx/datatypes.py

from typing import Optional
from public import public
from astx.base import DataType, ASTKind, SourceLocation
from astx.enums import EncodingKind
from astx.types import ReprStruct

@public
class Char(DataTypeOps):
    """Character data type expression with encoding support."""

    encoding: EncodingKind
    nbytes: int

    def __init__(
        self,
        encoding: EncodingKind = EncodingKind.UTF8,  # Default encoding
        loc: SourceLocation = NO_SOURCE_LOCATION,
    ) -> None:
        """Initialize the Char type."""
        super().__init__()
        self.encoding = encoding
        self.kind = ASTKind.CharTypeKind

        # Define the number of bytes based on encoding
        encoding_nbytes = {
            EncodingKind.ASCII: 1,
            EncodingKind.UTF8: 1,   # Variable-length in reality, but default to 1 for simplicity
            EncodingKind.UTF16: 2,
            EncodingKind.UTF32: 4,
        }
        self.nbytes = encoding_nbytes.get(encoding, 1)

    def __str__(self) -> str:
        """Return a string representation of the Char type."""
        return f"Char(encoding={self.encoding.name})"

    def get_struct(self, simplified: bool = False) -> ReprStruct:
        """Return the AST structure of the Char type."""
        key = "CharType"
        value = {
            "encoding": self.encoding.name,
            "nbytes": self.nbytes,
        }
        return self._prepare_struct(key, value, simplified)

@public
class String(DataTypeOps):
    """String data type expression with encoding support."""

    encoding: EncodingKind

    def __init__(
        self,
        encoding: EncodingKind = EncodingKind.UTF8,  # Default encoding
        loc: SourceLocation = NO_SOURCE_LOCATION,
    ) -> None:
        """Initialize the String type."""
        super().__init__()
        self.encoding = encoding
        self.kind = ASTKind.StringTypeKind

    def __str__(self) -> str:
        """Return a string representation of the String type."""
        return f"String(encoding={self.encoding.name})"

    def get_struct(self, simplified: bool = False) -> ReprStruct:
        """Return the AST structure of the String type."""
        key = "StringType"
        value = {
            "encoding": self.encoding.name,
        }
        return self._prepare_struct(key, value, simplified)

4. Define LiteralChar and LiteralString Classes

Create classes to represent character and string literals with encoding information.

In astx/datatypes.py, add:

# astx/datatypes.py

@public
class LiteralChar(Literal):
    """LiteralChar data type class."""

    value: str
    encoding: EncodingKind

    def __init__(
        self,
        value: str,
        encoding: EncodingKind = EncodingKind.UTF8,  # Default encoding
        loc: SourceLocation = NO_SOURCE_LOCATION,
    ) -> None:
        """Initialize LiteralChar."""
        super().__init__(loc)
        self.value = value
        self.encoding = encoding
        self.type_ = Char(encoding=encoding)
        self.loc = loc

    def __str__(self) -> str:
        """Return a string representation of the character literal."""
        return f"LiteralChar(value='{self.value}', encoding={self.encoding.name})"

    def get_struct(self, simplified: bool = False) -> ReprStruct:
        """Return the AST representation for the character literal."""
        key = f"LiteralChar[{self.encoding.name}]: '{self.value}'"
        value = {
            "value": self.value,
            "encoding": self.encoding.name,
        }
        return self._prepare_struct(key, value, simplified)

@public
class LiteralString(Literal):
    """LiteralString data type class."""

    value: str
    encoding: EncodingKind

    def __init__(
        self,
        value: str,
        encoding: EncodingKind = EncodingKind.UTF8,  # Default encoding
        loc: SourceLocation = NO_SOURCE_LOCATION,
    ) -> None:
        """Initialize LiteralString."""
        super().__init__(loc)
        self.value = value
        self.encoding = encoding
        self.type_ = String(encoding=encoding)
        self.loc = loc

    def __str__(self) -> str:
        """Return a string representation of the string literal."""
        return f"LiteralString(value=\"{self.value}\", encoding={self.encoding.name})"

    def get_struct(self, simplified: bool = False) -> ReprStruct:
        """Return the AST representation for the string literal."""
        key = f"LiteralString[{self.encoding.name}]: \"{self.value}\""
        value = {
            "value": self.value,
            "encoding": self.encoding.name,
        }
        return self._prepare_struct(key, value, simplified)

5. Update astx/__init__.py

Ensure that the new classes are exported from your module by updating the __init__.py file.

In astx/__init__.py, add:

# astx/__init__.py

from astx.expressions import (
    # ... existing imports ...
    TypeExpr,
    # ... other expressions ...
)
from astx.literals import (
    CharLiteral,
    StringLiteral,
    # ... existing literals ...
)
from astx.types import (
    Char,
    String,
)
from astx.enums import (
    EncodingKind,
)

__all__ = [
    # ... existing exports ...
    "Char",
    "String",
    "CharLiteral",
    "StringLiteral",
    "EncodingKind",
    # ... other exports ...
]

6. Example Usage

Here's how you might use the newly created Char, String, LiteralChar, and LiteralString classes in your AST.

a. Creating a UTF8 String Literal

from astx.datatypes import LiteralString
from astx.enums import EncodingKind
from astx.base import SourceLocation

utf8_string_literal = LiteralString(
    value="Hello, World!",
    encoding=EncodingKind.UTF8,
    loc=SourceLocation(line=1, col=0)
)

print(utf8_string_literal)

Output:

LiteralString(value="Hello, World!", encoding=UTF8)

b. Creating a UTF16 Char Literal

from astx.datatypes import LiteralChar
from astx.enums import EncodingKind
from astx.base import SourceLocation

utf16_char_literal = LiteralChar(
    value="😊",
    encoding=EncodingKind.UTF16,
    loc=SourceLocation(line=2, col=0)
)

print(utf16_char_literal)

Output:

LiteralChar(value='😊', encoding=UTF16)

c. Defining a String Type with UTF32 Encoding

from astx.datatypes import String
from astx.enums import EncodingKind
from astx.base import SourceLocation

utf32_string_type = String(
    encoding=EncodingKind.UTF32,
    loc=SourceLocation(line=3, col=0)
)

print(utf32_string_type)

Output:

String(encoding=UTF32)

7. Best Practices and Recommendations

  1. Single Class with Encoding Field:

    • Simplicity: Using a single class with an encoding field avoids class proliferation and keeps the type hierarchy manageable.
    • Flexibility: Easily supports adding new encodings without modifying the class structure.
    • Maintainability: Simplifies maintenance and understanding of the type system.
  2. Consistency with Existing Classes:

    • Ensure that Char and String classes follow the same structure and conventions as your numeric and boolean types.
    • Maintain similar naming conventions and initialization patterns to facilitate ease of use and understanding.
  3. Encoding Validation:

    • Implement validation logic (if applicable) to ensure that the value of literals conforms to the specified encoding. This can be handled during semantic analysis or type checking phases.
  4. Integration with Code Generation:

    • During code generation, utilize the encoding field to emit correctly encoded literals and handle type-specific behaviors based on encoding.
    • Ensure that the code generation phase correctly interprets the encoding to produce valid output code.
  5. Extensibility:

    • Design the EncodingKind enum to be easily extendable with additional encoding formats as needed, ensuring future-proofing of your AST.
  6. Documentation:

    • Add comprehensive docstrings to the new classes to explain their purpose, usage, and attributes.
    • Update any relevant documentation or tutorials to include examples of using Char and String types and their literals.

Summary of Code Fragments

1. astx/enums.py

@public
class EncodingKind(Enum):
    """Enumeration of supported encoding formats."""
    ASCII = 1
    UTF8 = 2
    UTF16 = 3
    UTF32 = 4
    # Add more encodings as needed

2. astx/base.py

@public
class ASTKind(Enum):
    """The expression kind class used for downcasting."""

    # ... existing kinds ...

    CharTypeKind = -820          # Add this line
    StringTypeKind = -821        # Add this line
    LiteralCharKind = -822        # Add this line
    LiteralStringKind = -823      # Add this line

    # ... rest of the code ...

3. astx/datatypes.py

@public
class Char(DataTypeOps):
    """Character data type expression with encoding support."""

    encoding: EncodingKind
    nbytes: int

    def __init__(
        self,
        encoding: EncodingKind = EncodingKind.UTF8,  # Default encoding
        loc: SourceLocation = NO_SOURCE_LOCATION,
    ) -> None:
        """Initialize the Char type."""
        super().__init__()
        self.encoding = encoding
        self.kind = ASTKind.CharTypeKind

        # Define the number of bytes based on encoding
        encoding_nbytes = {
            EncodingKind.ASCII: 1,
            EncodingKind.UTF8: 1,   # Variable-length in reality, but default to 1 for simplicity
            EncodingKind.UTF16: 2,
            EncodingKind.UTF32: 4,
        }
        self.nbytes = encoding_nbytes.get(encoding, 1)

    def __str__(self) -> str:
        """Return a string representation of the Char type."""
        return f"Char(encoding={self.encoding.name})"

    def get_struct(self, simplified: bool = False) -> ReprStruct:
        """Return the AST structure of the Char type."""
        key = "CharType"
        value = {
            "encoding": self.encoding.name,
            "nbytes": self.nbytes,
        }
        return self._prepare_struct(key, value, simplified)

@public
class String(DataTypeOps):
    """String data type expression with encoding support."""

    encoding: EncodingKind

    def __init__(
        self,
        encoding: EncodingKind = EncodingKind.UTF8,  # Default encoding
        loc: SourceLocation = NO_SOURCE_LOCATION,
    ) -> None:
        """Initialize the String type."""
        super().__init__()
        self.encoding = encoding
        self.kind = ASTKind.StringTypeKind

    def __str__(self) -> str:
        """Return a string representation of the String type."""
        return f"String(encoding={self.encoding.name})"

    def get_struct(self, simplified: bool = False) -> ReprStruct:
        """Return the AST structure of the String type."""
        key = "StringType"
        value = {
            "encoding": self.encoding.name,
        }
        return self._prepare_struct(key, value, simplified)

@public
class LiteralChar(Literal):
    """LiteralChar data type class."""

    value: str
    encoding: EncodingKind

    def __init__(
        self,
        value: str,
        encoding: EncodingKind = EncodingKind.UTF8,  # Default encoding
        loc: SourceLocation = NO_SOURCE_LOCATION,
    ) -> None:
        """Initialize LiteralChar."""
        super().__init__(loc)
        self.value = value
        self.encoding = encoding
        self.type_ = Char(encoding=encoding)
        self.loc = loc

    def __str__(self) -> str:
        """Return a string representation of the character literal."""
        return f"LiteralChar(value='{self.value}', encoding={self.encoding.name})"

    def get_struct(self, simplified: bool = False) -> ReprStruct:
        """Return the AST representation for the character literal."""
        key = f"LiteralChar[{self.encoding.name}]: '{self.value}'"
        value = {
            "value": self.value,
            "encoding": self.encoding.name,
        }
        return self._prepare_struct(key, value, simplified)

@public
class LiteralString(Literal):
    """LiteralString data type class."""

    value: str
    encoding: EncodingKind

    def __init__(
        self,
        value: str,
        encoding: EncodingKind = EncodingKind.UTF8,  # Default encoding
        loc: SourceLocation = NO_SOURCE_LOCATION,
    ) -> None:
        """Initialize LiteralString."""
        super().__init__(loc)
        self.value = value
        self.encoding = encoding
        self.type_ = String(encoding=encoding)
        self.loc = loc

    def __str__(self) -> str:
        """Return a string representation of the string literal."""
        return f"LiteralString(value=\"{self.value}\", encoding={self.encoding.name})"

    def get_struct(self, simplified: bool = False) -> ReprStruct:
        """Return the AST representation for the string literal."""
        key = f"LiteralString[{self.encoding.name}]: \"{self.value}\""
        value = {
            "value": self.value,
            "encoding": self.encoding.name,
        }
        return self._prepare_struct(key, value, simplified)

4. Update astx/__init__.py

Ensure that the new classes are exported from your module by updating the __init__.py file.

In astx/__init__.py, add:

# astx/__init__.py

from astx.expressions import (
    # ... existing imports ...
    TypeExpr,
    # ... other expressions ...
)
from astx.literals import (
    CharLiteral,
    StringLiteral,
    # ... existing literals ...
)
from astx.types import (
    Char,
    String,
)
from astx.enums import (
    EncodingKind,
)

__all__ = [
    # ... existing exports ...
    "Char",
    "String",
    "CharLiteral",
    "StringLiteral",
    "EncodingKind",
    # ... other exports ...
]

5. Additional Considerations

a. Integration with Existing Operations

Since your DataTypeOps class already overloads various magic methods for operations like __add__, __eq__, etc., ensure that Char and String types can participate in these operations if applicable.

For example, you might want to define how strings can be concatenated (__add__), or how characters can be compared.

b. Validation of Encodings

Depending on your compiler's requirements, you might need to validate that the value of literals conforms to the specified encoding. This can be handled during the semantic analysis phase.

c. Code Generation

During the code generation phase, utilize the encoding field to emit correctly encoded literals and handle type-specific behaviors based on encoding. This ensures that the generated code correctly represents the intended literals.


6. Example Usage

Here's how you might use the newly created Char, String, LiteralChar, and LiteralString classes in your AST.

a. Creating a UTF8 String Literal

from astx.datatypes import LiteralString
from astx.enums import EncodingKind
from astx.base import SourceLocation

utf8_string_literal = LiteralString(
    value="Hello, World!",
    encoding=EncodingKind.UTF8,
    loc=SourceLocation(line=1, col=0)
)

print(utf8_string_literal)

Output:

LiteralString(value="Hello, World!", encoding=UTF8)

b. Creating a UTF16 Char Literal

from astx.datatypes import LiteralChar
from astx.enums import EncodingKind
from astx.base import SourceLocation

utf16_char_literal = LiteralChar(
    value="😊",
    encoding=EncodingKind.UTF16,
    loc=SourceLocation(line=2, col=0)
)

print(utf16_char_literal)

Output:

LiteralChar(value='😊', encoding=UTF16)

c. Defining a String Type with UTF32 Encoding

from astx.datatypes import String
from astx.enums import EncodingKind
from astx.base import SourceLocation

utf32_string_type = String(
    encoding=EncodingKind.UTF32,
    loc=SourceLocation(line=3, col=0)
)

print(utf32_string_type)

Output:

String(encoding=UTF32)

7. Conclusion

By following this structured approach, you've extended your astx module to include Char and String types along with their corresponding literals, supporting various encodings. This design aligns with best practices observed in compiler design, ensuring flexibility, maintainability, and consistency across your AST representation.