apache / arrow-nanoarrow

Helpers for Arrow C Data & Arrow C Stream interfaces
https://arrow.apache.org/nanoarrow
Apache License 2.0

python: Schema repr should not get truncated #466

Open jorisvandenbossche opened 1 month ago

jorisvandenbossche commented 1 month ago

Looking at this example, I think it is reasonable to truncate the schema repr when it is embedded in the Array repr, but when inspecting the schema itself I would expect to see the full repr (or at least the truncation could be more relaxed and show more by default):

In [29]: url = "https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows"
    ...: array = na.ArrayStream.from_url(url).read_all()

In [30]: array
Out[30]: 
nanoarrow.Array<non-nullable struct<commit: string, time: timestamp('...>[15487]
{'commit': '49cdb0fe4e98fda19031c864a18e6156c6edbf3c', 'time': datetime.datet...
{'commit': '1d966e98e41ce817d1f8c5159c0b9caa4de75816', 'time': datetime.datet...
{'commit': '96f26a89bd73997f7532643cdb27d04b70971530', 'time': datetime.datet...
{'commit': 'ee1a8c39a55f3543a82fed900dadca791f6e9f88', 'time': datetime.datet...
{'commit': '3d467ac7bfae03cf2db09807054c5672e1959aec', 'time': datetime.datet...
{'commit': 'ef6ea6beed071ed070daf03508f4c14b4072d6f2', 'time': datetime.datet...
{'commit': '53e0c745ad491af98a5bf18b67541b12d7790beb', 'time': datetime.datet...
{'commit': '3ba6d286caad328b8572a3b9228045da8c8d2043', 'time': datetime.datet...
{'commit': '4ce9a5edd2710fb8bf0c642fd0e3863b01c2ea20', 'time': datetime.datet...
{'commit': '2445975162905bd8d9a42ffc9cd0daa0e19d3251', 'time': datetime.datet...
...and 15477 more items

In [31]: array.schema
Out[31]: <Schema> non-nullable struct<commit: string, time: timestamp('us', 'UTC'), fi...

paleolimbot commented 1 month ago

Good point! Maybe include the output of schema.fields by default (perhaps truncated to a reasonable number of lines?)

import nanoarrow as na

url = "https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows"
schema = na.ArrayStream.from_url(url).schema

schema.fields
#> [<Schema> 'commit': string,
#>  <Schema> 'time': timestamp('us', 'UTC'),
#>  <Schema> 'files': int32,
#>  <Schema> 'merge': bool,
#>  <Schema> 'message': string]
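
To make the two behaviours discussed above concrete, here is a minimal pure-Python sketch of a repr helper that truncates to one line when embedded and lists the fields when the schema is inspected on its own. `schema_repr`, `max_width` and `max_fields` are hypothetical names invented for illustration; this is not nanoarrow's implementation or API, and it works on plain strings rather than on a real `Schema` object.

```python
def schema_repr(type_str, field_strs, max_width=None, max_fields=10):
    """Hypothetical helper (not part of nanoarrow) sketching the proposal.

    type_str:   the schema's type string, e.g. "non-nullable struct<...>"
    field_strs: one string per child field, e.g. "'commit': string"
    max_width:  if given, return a single line truncated to this width
                (the "embedded in an Array repr" case)
    max_fields: maximum number of child fields listed in the full repr
    """
    header = f"<Schema> {type_str}"

    if max_width is not None:
        # Embedded case: one line, cut off with "..." if it is too long.
        if len(header) > max_width:
            header = header[: max(max_width - 3, 0)] + "..."
        return header

    # Standalone case: untruncated header, then one line per child field,
    # limited to max_fields lines so very wide schemas stay readable.
    lines = [header]
    shown = field_strs[:max_fields]
    lines.extend(f"- {field}" for field in shown)
    hidden = len(field_strs) - len(shown)
    if hidden > 0:
        lines.append(f"...and {hidden} more fields")
    return "\n".join(lines)


# Field strings taken from the schema.fields output shown above.
fields = [
    "'commit': string",
    "'time': timestamp('us', 'UTC')",
    "'files': int32",
    "'merge': bool",
    "'message': string",
]
type_str = (
    "non-nullable struct<commit: string, time: timestamp('us', 'UTC'), "
    "files: int32, merge: bool, message: string>"
)

embedded = schema_repr(type_str, fields, max_width=40)  # single truncated line
full = schema_repr(type_str, fields)                    # header plus all fields
capped = schema_repr(type_str, fields, max_fields=2)    # header, 2 fields, summary

print(embedded)
print(full)
```

Whether the full repr should cap the number of field lines, and at what value, is the open question in the comment above; the default of 10 here is an arbitrary placeholder.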