feast-dev / feast

The Open Source Feature Store for Machine Learning
https://feast.dev
Apache License 2.0

fix: Fixed SparkSource docstrings so it wouldn't use inherited class docstrings #4722

Closed: dandawg closed this 2 weeks ago

dandawg commented 3 weeks ago

What this PR does / why we need it:

This PR adds a docstring to the SparkSource class and its __init__ method.

The current behavior is that SparkSource inherits its docstrings from the DataSource class. This is wrong: DataSource has different functionality and parameters than SparkSource.

When calling help(SparkSource) or inspect.getdoc(SparkSource), the DataSource docstring is (wrongly) returned. Accessing SparkSource.__doc__ or SparkSource.__init__.__doc__ directly returns None.
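
For reference, this fallback is standard Python behavior rather than anything Feast-specific: since Python 3.5, inspect.getdoc() walks the MRO when a class has no docstring of its own, while direct attribute access does not. A toy sketch (Base/Child are illustrative names):

import inspect

class Base:
    """Base docstring."""

class Child(Base):
    pass  # defines no docstring of its own

print(Child.__doc__)          # None -- __doc__ is not inherited on attribute access
print(inspect.getdoc(Child))  # 'Base docstring.' -- getdoc() falls back to the parent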

In [1]: from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import SparkSource
In [2]: import inspect
In [3]: print(inspect.getdoc(SparkSource))
DataSource that can be used to source features.

Args:
    name: Name of data source, which should be unique within a project
    timestamp_field (optional): Event timestamp field used for point-in-time joins of
        feature values.
    created_timestamp_column (optional): Timestamp column indicating when the row
        was created, used for deduplicating rows.
    field_mapping (optional): A dictionary mapping of column names in this data
        source to feature names in a feature table or view. Only used for feature
        columns, not entity or timestamp columns.
    description (optional): A human-readable description.
    tags (optional): A dictionary of key-value pairs to store arbitrary metadata.
    owner (optional): The owner of the data source, typically the email of the primary
        maintainer.
    date_partition_column (optional): Timestamp column used for partitioning. Not supported by all offline stores.

^ There are args missing! This list differs from the arguments in the SparkSource source code and in the reference documentation.
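
For anyone reproducing this, a quick way to see the mismatch without opening the source is to compare the constructor's real signature against the inherited docstring above:

import inspect
from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import SparkSource

# Prints the actual __init__ parameters (name, table, query, path, file_format, ...);
# several of these never appear in the inherited DataSource docstring.
print(inspect.signature(SparkSource.__init__))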

With the update from this PR:

In [1]: from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import SparkSource
In [2]: import inspect
In [3]: print(inspect.getdoc(SparkSource))
A SparkSource object defines a data source that a Spark offline store can use
In [4]: print(SparkSource.__init__.__doc__)
Creates a SparkSource object.

        Args:
            name: The name of the data source, which should be unique within a project.
            table: The name of a Spark table.
            query: The query to be executed in Spark.
            path: The path to file data.
            file_format: The format of the file data.
            created_timestamp_column: Timestamp column indicating when the row
                was created, used for deduplicating rows.
            field_mapping: A dictionary mapping of column names in this data
                source to feature names in a feature table or view.
            description: A human-readable description.
            tags: A dictionary of key-value pairs to store arbitrary metadata.
            owner: The owner of the DataSource, typically the email of the primary
                maintainer.
            timestamp_field: Event timestamp field used for point-in-time joins of
                feature values.
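
For reviewers: the change itself is just conventional docstring placement on the subclass and its __init__. A trimmed, self-contained sketch of the pattern (parameters abbreviated; this is not the actual diff):

from typing import Optional

class DataSource:
    """DataSource that can be used to source features."""

class SparkSource(DataSource):
    """A SparkSource object defines a data source that a Spark offline store can use"""

    def __init__(self, name: str, table: Optional[str] = None, query: Optional[str] = None):
        """Creates a SparkSource object.

        Args:
            name: The name of the data source, which should be unique within a project.
            table: The name of a Spark table.
            query: The query to be executed in Spark.
        """
        self.name = name
        self.table = table
        self.query = query

With docstrings defined on SparkSource itself, help(), inspect.getdoc(), and direct __doc__ access all agree.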

Which issue(s) this PR fixes:

Misc

dandawg commented 3 weeks ago

I've found other data sources with similar issues. I figured I would submit this one first to see if there are any patterns I need to follow. Once everything is good, I intend to fix the others.
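
If it helps with triage, a small hypothetical helper (not part of Feast; the DataSource import path is assumed) can list subclasses that would fall back to an inherited docstring:

from feast.data_source import DataSource

# Contrib sources only appear in __subclasses__() once their modules are imported.
from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import SparkSource  # noqa: F401

def missing_own_docstring(base):
    """Recursively yield subclasses whose own class dict lacks a docstring,
    so help()/inspect.getdoc() fall back to an inherited one."""
    for cls in base.__subclasses__():
        if not cls.__dict__.get("__doc__"):
            yield cls
        yield from missing_own_docstring(cls)

for cls in missing_own_docstring(DataSource):
    print(cls.__name__)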