apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
14.37k stars 3.49k forks source link

[Python][Docs] `RecordBatch.from_pydict()` and `RecordBatch.from_pylist()` example code references `pa.Table` instead of `pa.RecordBatch` #41907

Open d33bs opened 4 months ago

d33bs commented 4 months ago

Describe the enhancement requested

Thanks for the fantastic work on Arrow and PyArrow! I noticed a minor typo in the docs for code examples with RecordBatch.from_pydict() and RecordBatch.from_pylist() which mention pa.Table.from_{x} instead of pa.RecordBatch.from_{x}. This may result in possible user confusion, needing to look up the source code, or try things out to make sure the Python API can handle the calls.

For example, under pa.RecordBatch.from_pydict():

Currently shows:

>>> my_metadata={"n_legs": "Number of legs per animal"}
>>> pa.Table.from_pydict(pydict, metadata=my_metadata).schema
n_legs: int64
animals: string
-- schema metadata --
n_legs: 'Number of legs per animal'

Should be:

>>> my_metadata={"n_legs": "Number of legs per animal"}
>>> pa.RecordBatch.from_pydict(pydict, metadata=my_metadata).schema
n_legs: int64
animals: string
-- schema metadata --
n_legs: 'Number of legs per animal'

The same stands for examples in pa.RecordBatch.from_pylist().

Thanks in advance for help with correcting this.

Component(s)

Documentation, Python

llama90 commented 4 months ago

Good catch. I'll try to fix it.

llama90 commented 4 months ago

After a quick look, it seems that the examples for Table and RecordBatch are being used redundantly. This is because a Table is made up of multiple RecordBatch objects, so they work with the same logic internally, and the documentation is generated through docstrings.

From a user perspective, without understanding the concepts (What is a Table? What is a RecordBatch?), or when copying and pasting code to use with other code, it might lead to unintended results.

On the other hand, from a development and documentation management perspective, the current form doesn’t seem incorrect.

What are your thoughts on this? cc @AlenkaF

AlenkaF commented 4 months ago

These docstrings were united between Table and RecordBatch when the parent class _Tabular was added to consolidate Table and RecordBatch functionality (see https://github.com/apache/arrow/issues/36129). In these consolidated methods docstrings there is a text added:

Table (works similarly for RecordBatch)

that should help with understanding why Table example is used for RecordBatch also. If you look in the RecordBatch class examples, you have the "correct" class usage.

For me this is quite clear but can imagine from a user's perspective it might be confusing. If you have a good proposal for clearer content in the consolidated methods examples, I am definitely open for improvements.

d33bs commented 4 months ago

Thanks @llama90 and @AlenkaF for looking into this! Perhaps this could benefit from Sphinx conditionals (or something similar, if Sphinx isn't used). I'm not certain if these would work within the context of docstrings and any other special configurations which may be at hand. Thoughts?

AlenkaF commented 4 months ago

We use Sphinx but I do not think Sphinx conditionals will work for code docstrings.