data-apis / dataframe-api

RFC document, tooling and other content related to the dataframe API standard
https://data-apis.org/dataframe-api/draft/index.html
MIT License

Row ordering design choice #356

Open anmyachev opened 6 months ago

anmyachev commented 6 months ago

Why do I even think it is necessary to maintain order?

This matches the definition of dataframes from the article. If we take the approach of defining based on articles about dataframes and their algebra, then we can also look for new articles and do a more in-depth comparative analysis (since the article I cited is from 2020).

Are there any use cases where this is important?

I think it is clear that there are workloads for which the order of the data matters. For example, suppose values were recorded over time in some domain, but timestamps were left out to reduce the size of the dataset. Any operation that does not preserve row order then invalidates the trends that could be derived from the data.
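A minimal sketch of that scenario in plain Python, with made-up readings for illustration: the trend is encoded only by row position, so an operation that returns the same rows in a different order changes the result.

```python
# Hypothetical readings taken at a fixed interval. No timestamp column is stored,
# so the row position is the only record of time.
readings = [1.0, 2.0, 4.0, 7.0, 11.0]


def successive_differences(values):
    """Change between each row and the row before it."""
    return [b - a for a, b in zip(values, values[1:])]


# With the original order, the differences show a steadily accelerating rise.
ordered_trend = successive_differences(readings)

# Stand-in for an engine that does not guarantee row order:
# the same rows come back, but in a different order.
reordered = [readings[i] for i in (3, 0, 4, 1, 2)]
reordered_trend = successive_differences(reordered)

print(ordered_trend)    # [1.0, 2.0, 3.0, 4.0]
print(reordered_trend)  # [-6.0, 10.0, -9.0, 2.0]
```

The set of rows is identical in both cases, so a purely relational check would consider the two results equivalent, yet the derived trend is not.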

Why not come up with a new concept that has characteristics of both dataframes and relational tables?

For ease of DataFrame API adoption, it seems that all that is needed is to combine the current concepts well enough that they can coexist conveniently in one interface (at least for the first stable release). With this approach, libraries belonging to one of these groups may need to implement the characteristics of the other group. With a new concept, the number of foreign groups of characteristics a library has to implement may increase to two.

Solution.

These two concepts have existed for a long time without being fully unified, and today there are many hybrids that implement the interface of the opposite group on top of their own set of operations. Given that, I believe the solution does not need to be ideal, just reasonably flexible.

So let's allow the order to be preserved or not, based on the user's choice, whether that is expressed through additional function parameters, environment variables, or context managers.

This gives enough flexibility to libraries that implement the relational approach (they also stay performant, since they do not need to maintain order with an additional index column or other tricks), and at the same time the standard covers a greater number of use cases.

kkraus14 commented 6 months ago

This matches the definition of dataframes from the article. If we take the approach of defining based on articles about dataframes and their algebra, then we can also look for new articles and do a more in-depth comparative analysis (since the article I cited is from 2020).

This paper was brought up early in this effort, and I believe there was consensus that it is one person's or group's definition of dataframes rather than a universal one, and that we did not want to follow all of the semantics defined within it.