apache / datafusion-python

Apache DataFusion Python Bindings
https://datafusion.apache.org/python
Apache License 2.0

Add remaining non-wrapped functions #767

Open timsaucer opened 3 months ago

timsaucer commented 3 months ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

We still have a few classes that do not yet have wrappers, namely those in datafusion.object_store and datafusion.common. Additionally, datafusion.substrait references LogicalPlan, which is not exposed.

It is also worth reviewing the excellent PR https://github.com/apache/datafusion-python/pull/751 to see how it fits in with the updated Python wrappers.

Describe the solution you'd like

Add the missing wrappers and validate the namespace corrections.

Describe alternatives you've considered

None

Additional context

This is follow-on work to https://github.com/apache/datafusion-python/pull/750

Michael-J-Ward commented 3 months ago

Question: Have you ever used, or do you know of, a tool to run queries over Python / Rust codebases?

It would be nice if we could generate a concrete report of what is not exposed.

timsaucer commented 3 months ago

No, but I did write a small script to check and this is what I see missing:

Missing attribute. Object name: datafusion, Attribute name: Catalog
Missing attribute. Object name: datafusion, Attribute name: Database
Missing attribute. Object name: datafusion, Attribute name: ExecutionPlan
Missing attribute. Object name: datafusion, Attribute name: LogicalPlan
Missing attribute. Object name: datafusion, Attribute name: RecordBatch
Missing attribute. Object name: datafusion, Attribute name: RecordBatchStream
Missing attribute. Object name: datafusion, Attribute name: Table
Missing value in list. Object name: datafusion, Attribute name: __all__, Value: runtime
Missing value in list. Object name: datafusion, Attribute name: __all__, Value: Catalog
Missing value in list. Object name: datafusion, Attribute name: __all__, Value: Database
Missing value in list. Object name: datafusion, Attribute name: __all__, Value: Table
Missing value in list. Object name: datafusion, Attribute name: __all__, Value: AggregateUDF
Missing value in list. Object name: datafusion, Attribute name: __all__, Value: LogicalPlan
Missing value in list. Object name: datafusion, Attribute name: __all__, Value: ExecutionPlan
Missing value in list. Object name: datafusion, Attribute name: __all__, Value: RecordBatch
Missing value in list. Object name: datafusion, Attribute name: __all__, Value: RecordBatchStream
Missing value in list. Object name: datafusion, Attribute name: __all__, Value: common
Missing value in list. Object name: datafusion, Attribute name: __all__, Value: expr
Missing value in list. Object name: datafusion, Attribute name: __all__, Value: functions
Missing value in list. Object name: datafusion, Attribute name: __all__, Value: object_store
Missing value in list. Object name: datafusion, Attribute name: __all__, Value: substrait
Missing attribute. Object name: datafusion.common, Attribute name: DFSchema
Missing attribute. Object name: datafusion.common, Attribute name: DataType
Missing attribute. Object name: datafusion.common, Attribute name: DataTypeMap
Missing attribute. Object name: datafusion.common, Attribute name: NullTreatment
Missing attribute. Object name: datafusion.common, Attribute name: PythonType
Missing attribute. Object name: datafusion.common, Attribute name: RexType
Missing attribute. Object name: datafusion.common, Attribute name: SqlFunction
Missing attribute. Object name: datafusion.common, Attribute name: SqlSchema
Missing attribute. Object name: datafusion.common, Attribute name: SqlStatistics
Missing attribute. Object name: datafusion.common, Attribute name: SqlTable
Missing attribute. Object name: datafusion.common, Attribute name: SqlType
Missing attribute. Object name: datafusion.common, Attribute name: SqlView
Missing attribute. Object name: datafusion.common, Attribute name: __all__
Missing attribute. Object name: datafusion.expr, Attribute name: EmptyRelation
Missing attribute. Object name: Expr, Attribute name: __radd__
Missing attribute. Object name: Expr, Attribute name: __rand__
Missing attribute. Object name: Expr, Attribute name: __rmod__
Missing attribute. Object name: Expr, Attribute name: __rmul__
Missing attribute. Object name: Expr, Attribute name: __ror__
Missing attribute. Object name: Expr, Attribute name: __rsub__
Missing attribute. Object name: Expr, Attribute name: __rtruediv__
Missing attribute. Object name: datafusion.expr, Attribute name: IsNull
Missing attribute. Object name: datafusion.expr, Attribute name: Unnest
Missing attribute. Object name: datafusion.expr, Attribute name: Window
Missing attribute. Object name: datafusion.expr, Attribute name: __all__
Missing attribute. Object name: datafusion.functions, Attribute name: __all__
Missing attribute. Object name: datafusion.object_store, Attribute name: AmazonS3
Missing attribute. Object name: datafusion.object_store, Attribute name: GoogleCloud
Missing attribute. Object name: datafusion.object_store, Attribute name: LocalFileSystem
Missing attribute. Object name: datafusion.object_store, Attribute name: MicrosoftAzure
Missing attribute. Object name: datafusion.object_store, Attribute name: __all__
Missing attribute. Object name: datafusion, Attribute name: runtime
Missing attribute. Object name: datafusion.substrait, Attribute name: __all__

Code to generate:

import datafusion
import datafusion.functions
import datafusion.object_store
import datafusion.substrait

def missing_exports(internal_obj, wrapped_obj):
    """Print every name on internal_obj that wrapped_obj does not expose."""
    for attr in dir(internal_obj):
        if attr not in dir(wrapped_obj):
            print(f"Missing attribute. Object name: {wrapped_obj.__name__}, Attribute name: {attr}")
            continue
        internal_attr = getattr(internal_obj, attr)
        wrapped_attr = getattr(wrapped_obj, attr)
        if internal_attr is not None and wrapped_attr is None:
            print(f"Attribute exists but is None. Object name: {wrapped_obj.__name__}, Attribute name: {attr}")
            # Nothing left to compare, and `val not in None` below would raise TypeError.
            continue

        # Skip attributes that refer back to the object or its type.
        if attr in ["__self__", "__class__"]:
            continue
        if isinstance(internal_attr, list):
            # Lists such as __all__: report each entry the wrapper's list lacks.
            for val in internal_attr:
                if val not in wrapped_attr:
                    print(f"Missing value in list. Object name: {wrapped_obj.__name__}, Attribute name: {attr}, Value: {val}")
        elif hasattr(internal_attr, '__dict__'):
            # Recurse into submodules and classes.
            missing_exports(internal_attr, wrapped_attr)

missing_exports(datafusion._internal, datafusion)

I can work on adding these tomorrow morning and I can also add this code as a unit test.
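
A minimal sketch of how the check could be shaped for a unit test, assuming it returns its findings instead of printing them. The names find_missing_exports and make_module are hypothetical, as are the stand-in modules, which only mimic the internal/wrapper pair so the example runs without datafusion installed. Unlike the script above, this sketch skips dunder names other than __all__; that is an assumption made to keep the example small, not a decision from this thread.

```python
import types


def find_missing_exports(internal_obj, wrapped_obj, _seen=None):
    """Collect messages for names on internal_obj that wrapped_obj lacks."""
    seen = set() if _seen is None else _seen
    if id(internal_obj) in seen:  # guard against cyclic references
        return []
    seen.add(id(internal_obj))

    name = getattr(wrapped_obj, "__name__", repr(wrapped_obj))
    problems = []
    for attr in dir(internal_obj):
        # Assumption: dunder names other than __all__ are not checked here.
        if attr.startswith("__") and attr != "__all__":
            continue
        if not hasattr(wrapped_obj, attr):
            problems.append(
                f"Missing attribute. Object name: {name}, Attribute name: {attr}"
            )
            continue
        internal_attr = getattr(internal_obj, attr)
        wrapped_attr = getattr(wrapped_obj, attr)
        if isinstance(internal_attr, list):
            # Treat a non-collection on the wrapper side as an empty list.
            present = wrapped_attr if isinstance(wrapped_attr, (list, tuple, set)) else ()
            for val in internal_attr:
                if val not in present:
                    problems.append(
                        f"Missing value in list. Object name: {name}, "
                        f"Attribute name: {attr}, Value: {val}"
                    )
        elif hasattr(internal_attr, "__dict__"):
            # Recurse into submodules and classes.
            problems.extend(find_missing_exports(internal_attr, wrapped_attr, seen))
    return problems


def make_module(name, **attrs):
    """Build a throwaway module object (hypothetical helper for this demo)."""
    mod = types.ModuleType(name)
    for key, value in attrs.items():
        setattr(mod, key, value)
    return mod


class AmazonS3:  # stand-in class, not the real binding
    pass


class LocalFileSystem:  # stand-in class, not the real binding
    pass


internal = make_module(
    "internal_demo",
    AmazonS3=AmazonS3,
    LocalFileSystem=LocalFileSystem,
    __all__=["AmazonS3", "LocalFileSystem"],
)
wrapped = make_module("wrapped_demo", AmazonS3=AmazonS3, __all__=["AmazonS3"])

problems = find_missing_exports(internal, wrapped)
for message in problems:
    print(message)
```

In a real test the stand-ins would presumably be replaced by datafusion._internal and datafusion, with an assertion that the returned list is empty.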

timsaucer commented 3 months ago

FWIW I don't know if all of these need to be exported. It's probably worth looking through each one.
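
One way to keep track of that review, sketched under assumptions: parse the "Missing attribute" lines from the report and drop any (object, attribute) pair that has been reviewed and deliberately left unexported. The function needs_review and the allowlist contents are hypothetical; the SqlView entry is only an illustration, not a decision anyone in this thread has made.

```python
import re

# Pattern for the "Missing attribute" lines in the report format shown above.
MISSING_ATTR = re.compile(
    r"^Missing attribute\. Object name: (?P<obj>[\w.]+), Attribute name: (?P<attr>\w+)$"
)


def needs_review(report_lines, intentionally_hidden):
    """Return (object, attribute) pairs from the report that are not allowlisted."""
    pending = []
    for line in report_lines:
        match = MISSING_ATTR.match(line.strip())
        if match is None:
            continue  # "Missing value in list" lines are ignored in this sketch
        pair = (match["obj"], match["attr"])
        if pair not in intentionally_hidden:
            pending.append(pair)
    return pending


report = [
    "Missing attribute. Object name: datafusion.common, Attribute name: SqlView",
    "Missing attribute. Object name: datafusion.object_store, Attribute name: AmazonS3",
    "Missing value in list. Object name: datafusion, Attribute name: __all__, Value: runtime",
]

# Hypothetical allowlist entry, for illustration only.
hidden = {("datafusion.common", "SqlView")}

pending = needs_review(report, hidden)
for obj, attr in pending:
    print(f"Still to review: {obj}.{attr}")
```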