What this PR does / why we need it:

This PR adds a docstring to the SparkSource class and its __init__ method.
The current behavior is that SparkSource inherits its docstrings from the DataSource class. This is wrong, because DataSource has different functionality and parameters than SparkSource.

When calling help(SparkSource) or inspect.getdoc(SparkSource), the DataSource docstring is (incorrectly) returned. And when accessing SparkSource.__doc__ or SparkSource.__init__.__doc__ directly, no docstring is returned at all (the attribute is None).
In [1]: from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import SparkSource

In [2]: import inspect

In [3]: print(inspect.getdoc(SparkSource))
DataSource that can be used to source features.

Args:
    name: Name of data source, which should be unique within a project
    timestamp_field (optional): Event timestamp field used for point-in-time joins of
        feature values.
    created_timestamp_column (optional): Timestamp column indicating when the row
        was created, used for deduplicating rows.
    field_mapping (optional): A dictionary mapping of column names in this data
        source to feature names in a feature table or view. Only used for feature
        columns, not entity or timestamp columns.
    description (optional): A human-readable description.
    tags (optional): A dictionary of key-value pairs to store arbitrary metadata.
    owner (optional): The owner of the data source, typically the email of the primary
        maintainer.
    date_partition_column (optional): Timestamp column used for partitioning. Not supported by all offline stores.
^ Args are missing here (table, query, path, file_format), and the args that are listed differ from those in the SparkSource source code and in the reference documentation.
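This fallback comes from inspect itself rather than from feast: since Python 3.5, inspect.getdoc() (and therefore help()) searches the class hierarchy when an object has no docstring of its own, while __doc__ only reflects what is set on the object itself. A minimal standalone sketch with hypothetical classes:

import inspect

class Base:
    """Base docstring."""

class Child(Base):
    # Child defines no docstring of its own.
    pass

print(Child.__doc__)          # None: a class does not inherit __doc__
print(inspect.getdoc(Child))  # 'Base docstring.': getdoc falls back along the MRO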
With the update from this PR:
In [1]: from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import SparkSource

In [2]: import inspect

In [3]: print(inspect.getdoc(SparkSource))
A SparkSource object defines a data source that a Spark offline store can use.

In [4]: print(SparkSource.__init__.__doc__)
Creates a SparkSource object.

        Args:
            name: The name of the data source, which should be unique within a project.
            table: The name of a Spark table.
            query: The query to be executed in Spark.
            path: The path to file data.
            file_format: The format of the file data.
            created_timestamp_column: Timestamp column indicating when the row
                was created, used for deduplicating rows.
            field_mapping: A dictionary mapping of column names in this data
                source to feature names in a feature table or view.
            description: A human-readable description.
            tags: A dictionary of key-value pairs to store arbitrary metadata.
            owner: The owner of the DataSource, typically the email of the primary
                maintainer.
            timestamp_field: Event timestamp field used for point-in-time joins of
                feature values.
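For anyone applying the same fix elsewhere, the shape of the change is just two literal docstrings, so nothing falls through to DataSource anymore. A minimal sketch with a stand-in base class and an abridged __init__ signature (the real one takes more parameters):

class DataSource:  # stand-in for feast.data_source.DataSource
    """DataSource that can be used to source features."""

class SparkSource(DataSource):
    """A SparkSource object defines a data source that a Spark offline store can use."""

    def __init__(self, name, table=None, query=None, path=None, file_format=None):
        """Creates a SparkSource object.

        Args:
            name: The name of the data source, which should be unique within a project.
            table: The name of a Spark table.
            ...
        """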
I've found other data sources with similar issues. I'm submitting this one first to see whether there are any patterns I need to follow; once everything looks good, I intend to fix the others (a quick way to find them is sketched below).
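As an illustration (a hypothetical helper, not part of this PR), the affected sources can be listed by checking which DataSource subclasses lack a class docstring of their own:

from feast.data_source import DataSource

def sources_missing_docstrings(base=DataSource):
    # __subclasses__() only sees direct subclasses that have already been
    # imported, so import the contrib sources of interest first.
    for cls in base.__subclasses__():
        if cls.__dict__.get("__doc__") is None:
            yield cls.__name__

print(list(sources_missing_docstrings()))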
Which issue(s) this PR fixes:
Misc