apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Website] Improve project description #44474

Closed by ianmcook 2 days ago

ianmcook commented 6 days ago

The Apache Arrow project descriptions that appear prominently at the top of the website and GitHub repo do not match and have not been updated in quite some time. The description on the website is:

A cross-language development platform for in-memory analytics

and the description on GitHub is:

A multi-language toolbox for accelerated data interchange and in-memory processing

Given the immense growth in the adoption of Arrow that has occurred since we last updated these descriptions, and the current status of the Arrow format as a de facto standard with no directly comparable alternatives, I think it would be appropriate for us to be somewhat bolder in how we introduce the project. I also think that the description should include some mention of the fact that Arrow is a format in addition to a toolbox. And I think we should prefer simpler words ("fast" over "accelerated"; "toolbox" over "development platform").

Following this rationale, I propose that we change the description on both the website and GitHub to:

The universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics

Thoughts?

Component(s)

Website

kou commented 5 days ago

+1

(We may want to add "zero-copy" as a columnar format modifier.)

ianmcook commented 4 days ago

> (We may want to add "zero-copy" as a columnar format modifier.)

I agree that it's important to highlight the fact that Arrow can enable zero-copy data interchange. But it might be difficult to incorporate "zero-copy" into this succinct description in a way that is accurate. Many successful applications of Arrow for data interchange are not truly "zero-copy"; instead they minimize the number of copies made while eliminating slow and computationally expensive data serialization/deserialization and transposition steps. But that's too many words to say in a succinct description. So I think we might be better off explaining this in other text below the description (which we already do to some extent, although maybe it could be improved).

kou commented 4 days ago

It makes sense.

kou commented 2 days ago

Issue resolved by pull request 44492 https://github.com/apache/arrow/pull/44492