Quantco / pytsql

Run mssql scripts from Python.
BSD 3-Clause "New" or "Revised" License
14 stars 3 forks source link

Add C++ ANTLR runtime #28

Closed RokasUrbonasQC closed 2 years ago

RokasUrbonasQC commented 2 years ago

Adding C++ ANTLR code with bindings created using speedy-antlr-tool to significantly improve performance on large inputs.

Included changes:

Note for reviewers: All .cpp and .h files are autogenerated or directly taken from ATNLR repo.

ivergara commented 2 years ago

Is adding/vendor the ~300 files from the C++ parser really the only option?

RokasUrbonasQC commented 2 years ago

Is adding/vendor the ~300 files from the C++ parser really the only option?

Yes, I think so. ANTLR Python target is slow, and our grammar definition is massive.

ivergara commented 2 years ago

Is adding/vendor the ~300 files from the C++ parser really the only option?

Yes, I think so. ANTLR Python target is slow, and our grammar definition is massive.

I'm not saying that using a C++-based parser is the bad part, just vendoring it into the repository.

Would it be possible to use git submodules here? This would require that the C++ parser source code is also a git repository somewhere.

RokasUrbonasQC commented 2 years ago

Would it be possible to use git submodules here? This would require that the C++ parser source code is also a git repository somewhere.

I could create a separate repo that would hold just the C++ runtime and parser files. Then we could use it as a submodule in pytsql. Does this approach sound good?

Alternatively, it might be possible to have all the C++ and speedy-antlr-tool code as a separate Python and Conda package, that would be a dependency of pytsql. (I might even be able to make the C++ version an "extra" for the package (e.g. pip install pytsql[cpp-runtime])

RokasUrbonasQC commented 2 years ago

Would it be possible to use git submodules here? This would require that the C++ parser source code is also a git repository somewhere.

I could create a separate repo that would hold just the C++ runtime and parser files. Then we could use it as a submodule in pytsql. Does this approach sound good?

Alternatively, it might be possible to have all the C++ and speedy-antlr-tool code as a separate Python and Conda package, that would be a dependency of pytsql. (I might even be able to make the C++ version an "extra" for the package (e.g. pip install pytsql[cpp-runtime])

After discussion with @ivergara via Slack, I have decided to leave the C++ files as they are right now, as other approaches are too difficult to implement or maintain, while providing little benefit.

RokasUrbonasQC commented 2 years ago

Thanks a bunch for your work - in particular for extending the documentation and writing tests for explicitly comparing our Python and C++ parsing!

Do you have thoughts on what ought to happen with the Python parser? Are you aware of cases in which in might still prove useful?

The Python parser code just has to stay. It's still used in converting the C++ parse tree to the Python equivalent. I don't think there will be any other use for it.

I had an idea of using the Python parser code as a fallback in case no binary distribution exists for some specific OS + Python version combination. However, such cases are unlikely to occur with proper cibuildwheels and conda distribution workflows. I discussed this fallback idea briefly with @ivergara, and we decided that it would be too much effort to maintain.