exaloop / codon

A high-performance, zero-overhead, extensible Python compiler using LLVM
https://docs.exaloop.io/codon

stdlib: xml.etree.ElementTree #314

Closed imgurbot12 closed 1 year ago

imgurbot12 commented 1 year ago

Hello, I love this project so far and I'm hoping to use it for my own projects, but the lack of existing standard library modules makes it difficult to integrate with most existing code bases. The xml tools are among the primary libraries I use, and many other important libraries depend on them.

I would be happy to contribute on helping build out the xml stdlib but it's primarily backed by a cpython parser called pyexpat which is just imported here.
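To make that dependency concrete, here is a minimal sketch in standard CPython (not Codon). It parses the same document through the high-level ElementTree API and through `xml.parsers.expat`, the documented stdlib interface to the `pyexpat` C module. The sample document and the handler name `on_start` are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat

SAMPLE = "<root><item id='1'>hello</item></root>"  # illustrative input

# High-level API: ElementTree builds a tree of Element objects.
root = ET.fromstring(SAMPLE)
item_text = root.find("item").text

# Low-level API: the expat binding reports parse events through callbacks.
seen_tags = []

def on_start(name, attrs):  # hypothetical handler name
    seen_tags.append(name)

parser = expat.ParserCreate()
parser.StartElementHandler = on_start
parser.Parse(SAMPLE, True)  # True marks this as the final chunk

print(item_text)
print(seen_tags)
```

A Codon port would need its own replacement for the callback layer, since that part lives in a C extension.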

Is there a way to include cpython components to allow for the build of this library or what's the best alternative and approach for implementation?

elisbyberi commented 1 year ago

@imgurbot12 Although using CPython modules goes against the spirit of Codon itself, why not simply import the modules from Python? See: https://docs.exaloop.io/codon/interoperability/python

imgurbot12 commented 1 year ago

I suppose I could, but that sort of defeats the purpose of the project, as you mentioned. My hope is to improve the performance of my code base, but if the majority of the libraries I use are still in native python, there isn't much opportunity for improvement.

Appreciate the suggestion nonetheless. Thanks

elisbyberi commented 1 year ago

@imgurbot12 Importing this particular module from Python is equivalent to building its Python C Extensions known as "pyexpat" in Codon. This is because Python C Extensions only function with the CPython interpreter. Essentially, this means that you will need to compile the entire CPython interpreter in Codon.

If you happen to have a pure Python implementation of this library, Codon can optimize it.

imgurbot12 commented 1 year ago

That sounds less than ideal. Doesn't that just mean importing it directly is an even worse idea? Lol

elisbyberi commented 1 year ago

@imgurbot12 Yes, that is correct. I have tried implementing some pure Python algorithms in Codon, and the speed is comparable to C, and perhaps even better.

imgurbot12 commented 1 year ago

So it sounds like the best alternative would be to write an xml parser from scratch and include it as part of a new version of the standard library?

elisbyberi commented 1 year ago

@imgurbot12 Yes, that is the spirit of Codon. It will leverage the full power of Codon.

imgurbot12 commented 1 year ago

I have a very basic tokenizer complete, working on building out a complete parser. Code is available here if anyone would like to contribute: https://github.com/imgurbot12/codon-xml
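As a rough idea of what a "very basic tokenizer" can look like in pure Python, here is a minimal sketch. It is not the code from the codon-xml repository: the function name `tokenize` and the token kinds `open`, `close`, and `text` are hypothetical, and it assumes a small well-formed subset of XML (no comments, CDATA, or processing instructions).

```python
from typing import List, Tuple

def tokenize(xml: str) -> List[Tuple[str, str]]:
    """Split a small XML subset into (kind, value) tokens."""
    tokens: List[Tuple[str, str]] = []
    pos = 0
    while pos < len(xml):
        if xml[pos] == "<":
            # A tag runs up to the next '>'.
            end = xml.find(">", pos)
            if end == -1:
                raise ValueError(f"unterminated tag at offset {pos}")
            body = xml[pos + 1:end]
            if body.startswith("/"):
                tokens.append(("close", body[1:].strip()))
            else:
                tokens.append(("open", body.strip()))
            pos = end + 1
        else:
            # Character data runs up to the next '<' or the end of input.
            end = xml.find("<", pos)
            if end == -1:
                end = len(xml)
            text = xml[pos:end]
            if text.strip():  # drop whitespace-only text between tags
                tokens.append(("text", text))
            pos = end
    return tokens

tokens = tokenize("<a><b>hi</b></a>")
print(tokens)
```

The scanner uses only string indexing and `str.find`, which keeps it free of C-extension dependencies.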

My intention is to first build it in standard python and then convert it to use codon's extra features, but I have quite a few questions about implementation.

First, is there a way to implement C-Like (or perhaps even better Rust-Like) Enums in python with Codon?

Second, I noticed that the typing library is completely empty. I couldn't find any docs related to the Optional type. Is there support for the <type> | None syntax?

Last, I noticed in the docs that there weren't many instances where a return type was explicitly declared. It seems pretty obvious that whatever compilation is going on infers the type based on the code and what it returns, but will it be able to handle processing a function with multiple returns? Is there even a reason or advantage to explicitly stating the return?
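For reference, these are the standard CPython constructs the three questions refer to, shown as a minimal sketch. The names `Weekday`, `find_tag`, and `parse_scalar` are made up for illustration, and nothing here is a claim about what Codon accepts.

```python
from enum import IntEnum
from typing import Dict, Optional, Union

# 1. A C-like enum in standard Python.
class Weekday(IntEnum):
    MON = 0
    TUE = 1

# 2. An optional value: either a str or None.
def find_tag(name: str, known: Dict[str, str]) -> Optional[str]:
    return known.get(name)

# 3. Multiple return statements with different types, declared as a Union.
def parse_scalar(raw: str) -> Union[int, float, str]:
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass  # try the next, more permissive type
    return raw

print(Weekday.TUE, int(Weekday.TUE))
print(find_tag("a", {"a": "anchor"}), find_tag("b", {}))
print(parse_scalar("42"), parse_scalar("1.5"), parse_scalar("abc"))
```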

elisbyberi commented 1 year ago

@imgurbot12 Everything is the same as in Python, except for static typing. Here are some examples that address a few of your questions:

class Klass:

    def __repr__(self):
        return 'Hello klass!'

    def sum(self, a, b):
        return a + b

# Return Union
def initialize(a=None) -> Union:  # a is string or None
    match a:
        case 'bool': return True
        case 'int': return 1
        case 'float': return 0.11
        case 'str': return 'string'
        case 'list': return [1, 2, 3, 4, 5]
        case 'dict': return {'one': 1, 'two': 2}
        case 'Klass': return Klass()
        case _: return 'type not supported'  # or you may throw exception

print('initialize(\'list\'): ', initialize('list'))
print('initialize(): ', initialize())

# Optional[type]
a = List[Optional[Klass]]()
a.append(Klass())
a.append(None)
print('List[Optional[Klass]](): ', a)

# C enum (that's it; no more, no less)
Mon = 0
Tue = 1
Wed = 2
Thur = 3
Fri = 4
Sat = 5
Sun = 6
print('C enum: ', Mon, Tue, Wed, Thur, Fri, Sat, Sun)

# C++ enum (use tuple, because Codon dict is unordered)
@tuple
class Week:
    Mon = 0
    Tue = 1
    Wed = 2
    Thur = 3
    Fri = 4
    Sat = 5
    Sun = 6

print('Week.Mon: ', Week.Mon)

# This emulates tagged union (Rust enum)
@tuple
class Enum:
    Constant = 1
    Klass = Klass()

print('Enum.Constant: ', Enum.Constant)
print('Enum.Klass.sum(1, 2): ', Enum.Klass.sum(1, 2))

# <type> | None is not supported yet
def optional(string: Optional[str] = None) -> Optional[str]:
    return string

print('optional(\'optional string\'): ', optional('optional string'))
print('optional(): ', optional())

# class reference
def new(a: type = Klass) -> Klass:
    return a()

print('new(Klass): ', new(Klass))

# callable reference (undocumented)
def function(i: int) -> int:
    return i + 1

f: Callable[[int], int] = function
print('f(1): ', f(1))

imgurbot12 commented 1 year ago

Good to know, thanks for the reference. Progress is steady on the XML lib. I have to build a lexer/parser for XPath as well for the search features lol.
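For context, the search features in question are the limited XPath subset that CPython's `xml.etree.ElementTree` accepts in `find` and `findall`. A minimal sketch of that behavior in standard Python follows; the sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

DOC = """
<library>
  <book id="1"><title>First</title></book>
  <book id="2"><title>Second</title></book>
</library>
"""

root = ET.fromstring(DOC)

# Child path: every <title> directly under a <book> child of the root.
all_titles = [t.text for t in root.findall("./book/title")]

# Descendant search with an attribute predicate.
second = root.find(".//book[@id='2']/title").text

print(all_titles)
print(second)
```

A from-scratch port has to lex and evaluate path expressions like these itself, which is why a separate XPath lexer/parser is needed.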

imgurbot12 commented 1 year ago

Hello, I tried to begin my port to codon after writing most of the base library, but I came across a few issues I'm not sure how to handle.

First: using references before they're defined within the same file. For example:

def get_a():
   return A()

class A:
    pass

print(get_a())

This returns:

test.codon:2:11-12: error: name 'A' is not defined

The same issue occurs when importing a library at the bottom of a file to avoid circular imports. Example:

# file1.codon:
from file2 import func_c

def func_a():
  func_c()
  print('Hello World!')

# file2.codon:
def func_b():
  func_a()

def func_c():
  print('Function C!')

from file1 import func_a
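For contrast, here is a minimal sketch of why the first pattern is valid in CPython: a global name used inside a function body is looked up when the function is called, not when the `def` statement runs. The `__repr__` method is added only to make the output readable.

```python
def get_a():
    # The global name A is resolved here, at call time.
    return A()

class A:
    def __repr__(self):
        return "A()"

# By the time get_a() is called, A has been defined.
result = get_a()
print(result)
```

The bottom-of-file import trick relies on the same late lookup, which is why both patterns fail together under the Codon error reported above.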

Second, I'm not sure how Codon handles strings, but the lack of a bytes or bytearray type makes it difficult to work with them. Each has its own specific uses separate from a standard string type, and bytearray especially is easy to append to like a list while still acting as a standard bytes object.

These two things are keeping me from progressing any further on converting the library, since I can't compile anything at the moment. Any help or workaround would be appreciated, thanks :)

elisbyberi commented 1 year ago

@imgurbot12 The Codon global scope is parsed from top to bottom, which means that you cannot use a variable before declaring it:

class A:  # A must exist before using it in get_a()
    def __repr__(self):
        return 'This is class A.'

def get_a():
    return A()

print(get_a())

The same rule applies for avoiding circular importing - variables must be declared before they are used:

# file1.codon:
def func_file1():   # this function must exist before this line:  from file2 import func_file2
    print('func_file1')

def main():
    from file2 import func_file2

    func_file2()

    func_file1()

main()

# file2.codon:
def func_file2():    # this function must exist before this line:  from file1 import func_file1
    print('func_file2')

def main():
    from file1 import func_file1

    func_file1()

    func_file2()

main()

Regarding bytes and bytearray in Python, we can utilize Codon List since they are essentially lists of unsigned 8-bit integers. To replicate their functionality in Codon, we will continue our work in the codon-xml repository. I have been waiting for a stable state of the library before commencing work on type hinting. In the meantime, I will work on emulating bytes and bytearray in Codon.
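As a rough illustration of that idea, here is a minimal pure-Python sketch of a bytearray-like buffer backed by a list of integers in the range 0..255. The class name `ByteBuffer` and its methods are hypothetical; this is not the emulation from the codon-xml repository, and it is written for standard Python rather than verified against Codon.

```python
from typing import Iterable, List

class ByteBuffer:
    """Hypothetical bytearray stand-in backed by a list of ints in 0..255."""

    def __init__(self, data: Iterable[int] = ()):
        self._data: List[int] = []
        self.extend(data)

    def append(self, value: int) -> None:
        # Enforce the unsigned 8-bit range that bytes/bytearray guarantee.
        if not 0 <= value <= 255:
            raise ValueError("byte must be in range(0, 256)")
        self._data.append(value)

    def extend(self, values: Iterable[int]) -> None:
        for value in values:
            self.append(value)

    def __len__(self) -> int:
        return len(self._data)

    def __getitem__(self, index: int) -> int:
        return self._data[index]

    def decode_latin1(self) -> str:
        # Each byte maps to the code point with the same number.
        return "".join(chr(b) for b in self._data)

buf = ByteBuffer(ord(c) for c in "<a")
buf.append(ord(">"))
print(len(buf), buf.decode_latin1())
```

Range checking on every append is the main behavior a plain list does not give for free.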

imgurbot12 commented 1 year ago

The scope parsing from top to bottom seems rather restrictive, but I understand that implementing something more complicated is likely a difficult process. I'll see if it's possible to reorganize the project to avoid import loops then, but it will become a bit tricky when objects depend on or reuse one another.

As for the typehints and bytes/bytearray, I look forward to hearing from you. Thanks!

imgurbot12 commented 1 year ago

I'm going to close this issue. This project ultimately inspired the creation of my new pyxml library, which was originally intended to be ported into a Codon implementation of the stdlib xml library. However, my motivation has waned due to my frustration with Codon's lack of proper compatibility with Python 3's standard type annotations, which seem to be mutually exclusive with the way Codon operates.

As a massive lover of typehints, my hope was that codon would be taking the existing typehints in all of my projects and making use of the extra information provided to make compiled optimizations on them but that doesn't seem to be the way codon ultimately behaves or what is intended, which is a shame.

Ultimately, anyone is free to take my pure Python implementation and use it as a reference, or use it outright as the basis for an implementation in Codon's stdlib, but I'm concluding my efforts until typehints are properly supported, if that ever happens. Best of luck to the Codon team and contributors!

arshajii commented 1 year ago

Hi @imgurbot12 — appreciate the feedback, and sorry to hear that. We are still working on closing the gap with Python and there are definitely a few places that need more attention.

If possible could you please let me know what specific things/incompatibilities gave you trouble? I want to see if/how we can address them.

arshajii commented 1 year ago

Looking through the repo you linked… I can see a couple cases we don’t have support for yet but overall this definitely looks like the type of thing that should work in Codon.

Also mentioning @inumanag so he can take a look.

imgurbot12 commented 1 year ago

@arshajii, thank you for your response. Happy to include some of the things I noticed and had difficulty with during my implementation efforts. I've ranked everything I noticed in order from what I would consider most important to least:

Hope that helps. Thanks again :)