apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0
[DISCUSSION] Proposal move `object_store` to its own github repo? #6183

Open alamb opened 2 months ago

alamb commented 2 months ago

Which part is this question about

This GitHub repository contains implementations of arrow, parquet, and object_store, which are related but live in separate crates and could reasonably live in separate repos.

arrow and parquet are still released in lockstep (they have to be released at the same time and use the same Arrow voting thread).

However, object_store is released on a different schedule, with a different voting thread and a different process, and has a non-trivially different set of maintainers and a substantial number of other users.

While we have tags to separate issues and PRs of object_store I still find it confusing that this repo has content related to object_store

I believe the reason object_store is in the arrow-rs repo in the first place was convenience for the maintainers after it was first donated: https://github.com/apache/arrow-rs/issues/2030

Now that we have settled the API down and its development and release cycle has become decoupled from the other crates in this repo, I think the overhead of keeping it in the same repo is greater than the value we get from keeping it in the same one

Describe your question

  1. Do you think we should move the object_store crate and associated dev process (tickets, etc.) to its own repository?
  2. If so, to which one (perhaps apache/arrow-rs-object-store)?
  3. Are you willing to help 🎣 make it happen?

Additional context

cc @tustvold and @crepererum

Xuanwo commented 2 months ago

Hi, @alamb, thanks a lot for raising this discussion.

Do you think we should move the object_store crate and associated dev process (tickets, etc.) to its own repository?

Most of my contributions, both in coding and reviewing, were focused on object_store. I feel that object_store is distinct from other arrow-related crates. I believe it would be beneficial to move object_store and its development process to a separate repository.

If so, to which one (perhaps apache/arrow-rs-object-store)?

Since object_store is a subproject of arrow-rs, the natural name for it would be apache/arrow-rs-object-store.

Are you willing to help 🎣 make it happen?

I'm willing to help do this.

But I'm guessing we need:

alamb commented 2 months ago

But I'm guessing we need:

I agree -- thank you. I happen to be both a PMC member of Arrow / committer so I have the relevant permissions.

As long as there are no concerns I'll send a note to the arrow list in a few days

A committer to move all object_store related issues/discussions to new repo

I recently learned about "collaborators" in .asf.yaml https://github.com/apache/infrastructure-asfyaml?tab=readme-ov-file#triage where we could give someone rights to manage issues, so in theory someone else could also help with this.

alamb commented 2 months ago

I'm willing to help do this.

(thank you!)

Xuanwo commented 2 months ago

I recently learned about "collaborators" in .asf.yaml https://github.com/apache/infrastructure-asfyaml?tab=readme-ov-file#triage where we could give someone rights to manage issues, so in theory someone else could also help with this.

I'm happy to help with this.

One question: do we need to migrate already closed issues/discussions?

tustvold commented 2 months ago

Since object_store is a subproject of arrow-rs, the natural name for it would be apache/arrow-rs-object-store.

Given it lives under the arrow PMC and not a separate arrow-rs PMC, I think it would be arrow-object-store. This would be consistent with DataFusion, which was arrow-datafusion.

I think the overhead of keeping it in the same repo is greater than the value we get from keeping it in the same one

Perhaps we could articulate what these overheads are? There might be ways to alleviate them that require less effort/churn

Do you think we should move the object_store crate and associated dev process (tickets, etc.) to its own repository?

I don't feel very strongly either way; I think it really depends on what the long-term goals for the project are. My vision was for it to parallel the arrow filesystem abstractions present in arrow-cpp / pyarrow. This would naturally entail it residing long-term as part of the arrow project, and being largely developed within that context. With this as the goal splitting it into a separate repository seems like a lot of work and disruption for relatively minor return, if not actively detrimental to that goal. If there is instead an initiative to split it out into its own top-level apache project, splitting it out into a new repository would be a natural next step.

alamb commented 2 months ago

Perhaps we could articulate what these overheads are? There might be ways to alleviate them that require less effort/churn

In my mind it is:

  1. The release scripts for arrow / object_store require filtering on tags / are close but not quite the same
  2. The tag name convention for object store is annoying
  3. When I look at the list of issues in arrow-rs I also see a bunch of object store things which is distracting

We could certainly make this better by making separate release documentation and I could add some filters on tags to my view.
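As a rough illustration of the kind of filtering described above, here is a minimal Python sketch that splits a list of closed issues by label. It is only a sketch: the label name `object_store`, the function `split_issues_by_label`, and the sample issues are assumptions made for illustration, not the actual release scripts or real issues from this repository.

```python
from typing import Iterable

# Assumption: object_store issues carry this label; the exact label name is illustrative.
OBJECT_STORE_LABEL = "object_store"


def split_issues_by_label(issues: Iterable[dict], label: str = OBJECT_STORE_LABEL):
    """Return (matching, others): issues that carry `label` and issues that do not."""
    matching, others = [], []
    for issue in issues:
        if label in issue.get("labels", []):
            matching.append(issue)
        else:
            others.append(issue)
    return matching, others


# Hypothetical sample data, not real issues from the repository.
closed_issues = [
    {"number": 1, "labels": ["arrow", "bug"]},
    {"number": 2, "labels": ["object_store"]},
    {"number": 3, "labels": ["parquet"]},
]
object_store_issues, arrow_issues = split_issues_by_label(closed_issues)
```

With separate repositories this step would presumably not be needed at all, since each issue tracker would only contain issues for its own crates.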

With this as the goal splitting it into a separate repository seems like a lot of work and disruption for relatively minor return, if not actively detrimental to that goal. If there is instead an initiative to split it out into its own top-level apache project, splitting it out into a new repository would be a natural next step.

The "what is the vision of the project" is a great discussion.

I agree if the vision is that object_store will mostly be used by arrow-rs and related technologies (like DataFusion) splitting into a new repo would likely be more detrimental

An alternate vision for object_store is a Rust ecosystem wide library for interacting with object stores, where arrow-rs / DataFusion were one of many downstream projects.

The reverse dependencies would suggest DataFusion and related projects are the largest users at the moment: https://crates.io/crates/object_store/reverse_dependencies

crepererum commented 2 months ago

I'm not super involved w/ release cutting and admin things of object_store, so this is mostly a user/contributor PoV: it's rather strange that object_store is in the arrow repo. I know the historical cause of this, but it is fully independent. Having a dedicated repo would IMHO make the following things easier:

However I also understand that an extra repo may come w/ extra overhead.

I think my answer is mostly in line with @alamb.

alamb commented 2 months ago

Another annoyance of mine is that the commit history is intermixed

Specifically, if you look at https://github.com/apache/arrow-rs/commits/master/

it is not clear which commits modify object_store and which modify arrow
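One way to tell the two apart today is to look at the paths each commit changes. The Python sketch below assumes only that the crate lives under the `object_store/` directory of this repo (as the changelog path suggests); the function name `classify_commit` and the example paths are illustrative, not part of any existing tooling.

```python
from typing import Iterable

# The object_store crate lives in the `object_store/` directory of the arrow-rs repository.
OBJECT_STORE_DIR = "object_store/"


def classify_commit(changed_paths: Iterable[str]) -> str:
    """Classify a commit as 'object_store', 'arrow', or 'both' from its changed paths.

    A commit with no changed paths falls through to 'arrow'.
    """
    paths = list(changed_paths)
    touches_store = any(p.startswith(OBJECT_STORE_DIR) for p in paths)
    touches_other = any(not p.startswith(OBJECT_STORE_DIR) for p in paths)
    if touches_store and touches_other:
        return "both"
    return "object_store" if touches_store else "arrow"


# Illustrative paths only.
print(classify_commit(["object_store/src/lib.rs"]))
print(classify_commit(["arrow-array/src/lib.rs", "object_store/Cargo.toml"]))
```

The changed paths could be collected with something like `git log --name-only`, and `git log -- object_store/` already limits the history to that directory, but neither helps when browsing the commit list on GitHub.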

tustvold commented 2 months ago

I think if there is a group of people willing to undertake the work of splitting it out, and to foster, maintain and build a community around said project, I think that is a very exciting path forward. My concern would be that it gets split out without a very clear community around it, and this then hampers its ongoing development and maintenance. I'm especially wary given my reduced capacity going forwards, which would leave the repository with very few active maintainers.

it is not clear which commits modify object_store and which modify arrow

One could make the argument that is what the changelogs are for, but I take your point

edmondop commented 2 months ago

+1

andygrove commented 2 months ago

  1. +1 for moving to a new repo
  2. The suggested repo name LGTM
  3. I don't have bandwidth to help with this, and I am not a regular contributor to arrow-rs or object-store

alamb commented 2 months ago

I think if there is a group of people willing to undertake the work of splitting it out, and to foster, maintain and build a community around said project, I think that is a very exciting path forward.

I am willing to help with this

I'm especially wary given my reduced capacity going forwards, which would leave the repository with very few active maintainers.

I agree this is a risk.

However, I think it may actually be that moving to a new repo makes it easier to attract new contributors and maintainers (I am thinking about @Xuanwo for example who has been happy to help) -- I think the number of other things in the arrow-rs repo could make the barrier to contribution of object_store higher

crepererum commented 2 months ago

I can also help with maintenance

Xuanwo commented 2 months ago

Yes 🙌, I'm willing to help build this project

edmondop commented 2 months ago

Also happy to help

alamb commented 2 months ago

Awesome. It seems as if we have reached consensus here that moving to a new repo will be a good idea.

What I would like to do is to complete the current release 0.11.0 https://github.com/apache/arrow-rs/issues/6121 using the existing process and then work to migrate to a new repo

It seems to me the only potentially unresolved issue is the new name for the repo. Here are the options so far

I'll also send a note to the dev list asking about this too from the broader community.

edmondop commented 2 months ago

Awesome. It seems as if we have reached consensus here that moving to a new repo will be a good idea.

What I would like to do is to complete the current release 0.11.0 #6121 using the existing process and then work to migrate to a new repo

It seems to me the only potentially unresolved issue is the new name for the repo. Here are the options so far

I'll also send a note to the dev list asking about this too from the broader community.

Is the arrow prefix something that we want to keep, or need to keep? I found it confusing that the crate doesn't have anything to do with columnar storage. I was reading the docs here https://docs.rs/object_store/latest/object_store/ and the word "arrow" wasn't mentioned even once

alamb commented 2 months ago

Is the arrow prefix something that we want to keep, or need to keep? I found it confusing that the crate doesn't have anything to do with columnar storage. I was reading the docs here https://docs.rs/object_store/latest/object_store/ and the word "arrow" wasn't mentioned even once

Yes, we need to use the arrow- prefix as this repo is in the apache organization and under the Arrow project

It is a reasonable (though larger) discussion whether there is a better organizational home for this crate

alamb commented 2 months ago

Here is another reason to move to a new repo. The Full Changelog link in the changelog https://github.com/apache/arrow-rs/blob/master/object_store/CHANGELOG.md

https://github.com/apache/arrow-rs/compare/object_store_0.10.1...object_store_0.10.2

Shows commits from both object_store and arrow which is confusing

Xuanwo commented 2 months ago

  • apache/arrow-rs-object-store (suggested in the original description)

I like the idea of arrow-rs-object-store. It's clearly stated that object_store is from arrow-rs.

Xuanwo commented 2 months ago

Is the arrow prefix something that we want to keep, or need to keep? I found it confusing that the crate doesn't have anything to do with columnar storage. I was reading the docs here docs.rs/object_store/latest/object_store and the word "arrow" wasn't mentioned even once

Hi, it's a pity that GitHub doesn't support multi-level project layouts, so we have to use prefixes to indicate PMC ownership. However, it is still possible for us to be elevated to a top-level project once we are mature enough. We can use apache/object_store at that time :heart:

alamb commented 1 month ago

Another annoyance while creating the arrow/parquet release candidate is that I had to filter the object_store related closed issues such as https://github.com/apache/arrow-rs/issues/6122 by tagging them with object_store

ByteBaker commented 3 weeks ago

I have a fair understanding of object_store. I can help on this too.

I see that Object Store Python already exists, so we can't use the name, but do we want a Python wrapper for this, just like we have for Arrow?

alamb commented 2 weeks ago

I see that Object Store Python already exists, so we can't use the name, but do we want a Python wrapper for this, just like we have for Arrow?

I don't know of any use case for such an API, but maybe @roeap would know better as the owner of Object Store Python

ByteBaker commented 2 weeks ago

I have asked him on his repo. Let's see what he says.

kylebarron commented 1 week ago

I don't know of any use case for such an API

Probably best to discuss on a separate issue, but there's quite a lot of interest in exposing object-store to Python. Both as a user-facing Python API and as a way to construct ObjectStore instances for Rust-Python libraries that want to do their own data fetching.

criccomini commented 1 week ago

Just my 2c, but this would be great! (moving to its own repo)