apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0
2.48k stars 1.37k forks source link

PARQUET-2495: Update merge script to python3 #1373

Open emkornfield opened 2 weeks ago

emkornfield commented 2 weeks ago

It is possible JIRA updates still have some work here, as it errored out while trying to do those.

Make sure you have checked all steps below.

Jira

Tests

Commits

Style

Documentation

wgtmac commented 2 weeks ago

I'm not an expert in python. Just want to ask a general question on whether we should enforce using the merge script for parquet-java (or other parquet-xxx projects). It has been broken for a long time.

@gszadovszky @julienledem Thoughts?

emkornfield commented 2 weeks ago

I think the main thing the script does (which I haven't fully tested) is allow for easy merging of backports, and JIRA closing (which given current ML thread might not be useful for super long).

gszadovszky commented 2 weeks ago

There is a description about it in the dev/README. It seems this is for the time when we were not able to simply push the PRs from github. Github was a read only mirror for a time being and we used the apache git repo to push the commits into.

I don't remember using this script. I think we do not need it anymore.

emkornfield commented 2 weeks ago

For reference there is a similar script in Arrow which I think probably has similar origins. The additional benefit even for github, is it looks like it tags issues with appropriate fix versions. Not sure if parquet-java has other ways of tracking this?

wgtmac commented 2 weeks ago

There is a similar script in Apache ORC: https://github.com/apache/orc/blob/main/dev/merge_orc_pr.py. It can help keep the PR description, update JIRA state and backport commit to other branches in a single shot. Perhaps we do not want to enforce using the script but it might help in the above cases.

Fokko commented 2 weeks ago

I haven't used this script and would prefer to see if we can have a more Github-oriented workflow. Rok just raised a vote to migrate the issues from Jira to GitHub.

emkornfield commented 1 week ago

@Fokko in your experience how have issues been tagged with fixes for GH oriented workflows?

emkornfield commented 5 days ago

@Fokko in your experience how have issues been tagged with fixes for GH oriented workflows?

@Fokko ping, want to understand what you where thinking here?

Fokko commented 4 days ago

@emkornfield Sorry for the late reply here, the mailbox is a bit swamped, thanks for the ping.

For Iceberg we make these Github releases (example for 1.5.0). Github will generate this list for you, and then the release manager groups and curates this a bit to make it easier to read. Merging would just go through the Github UI since we don't need to keep track of things in Jira, and this way we don't need to use the script.

emkornfield commented 3 days ago

@Fokko Thanks, I still think the Arrow issues which use the tool might keep better records.

For instance if you look at a recently resolved issue in Arrow you see it is tagged with a fix version (I believe this is due to using a script). I don't see something similar in a recent Iceberg. I might be missing something but I think the Arrow approach is nicer as it allows for me to directly find an issue and determine which release I might need to use to make use of it.

I also think being able to easily backport fixes to other branches via one step process is nicer even if it doesn't use the UI (but this is likely a rarer use-case).

Thoughts?

Fokko commented 3 days ago

In Iceberg we create backports by hand. We checkout the branch, and backport using cherry-picking the commits, and create a PR. Can I ask how you resolve merge conflicts with older versions?

wgtmac commented 3 days ago

In Iceberg we create backports by hand. We checkout the branch, and backport using cherry-picking the commits, and create a PR. Can I ask how you resolve merge conflicts with older versions?

In the merge script of Apache ORC, it asks which branch to cherry-pick the commit and tries to do that automatically. If it succeeds, then we are done. Or when it fails, it prints error message and asks the developer to resolve the cherry-pick conflict manually and then come back to the script by answering Y to the question it asked to proceed: https://github.com/apache/orc/blob/main/dev/merge_orc_pr.py#L128-L136

BTW, as I have asked in https://github.com/apache/parquet-java/issues/2932, it would be good to automatically close the Github issue related to the PR.

emkornfield commented 3 days ago

My biggest concern is properly tagging issues to milestones. I think the main question we should answer here is do we want a script. If we do, I think we should check this in and I can cherry-pick the necessary features from some of the other merge scripts (we should also ideally consolidate with the script in parquet-format with maybe convenience wrappers for running both)