ImagingDataCommons / idc-index-data

Python package providing the index to query and download data hosted by the NCI Imaging Data Commons
MIT License
1 stars 4 forks source link

cd: create a gha that generate an index for each query in the src/sql folder #5

Closed vkt1414 closed 5 months ago

vkt1414 commented 5 months ago

This PR aims to address part of #2.

jcfr commented 5 months ago

Well done moving forward, I will review and help out tmrw.

fedorov commented 5 months ago

I will spend more time reviewing, but few comments for now:

In light of the reogranization of the repositories by @jcfr, I think we should revisit the original plan outlined in https://github.com/ImagingDataCommons/idc-index-data/issues/2#issuecomment-1989574256.

In particular, as I understand it, there is no need anymore to attach index files as release artifacts. They can be uploaded directly to PyPI as part of the package. This would also make it unnecessary to create an issue

Maybe we should have one workflow with manual trigger to generate results of running the query, as you did it - this will be useful to verify everything works as expected before preparing the package - and then another one that would be triggered on the creation of the release (as done in idc-index) that will upload the package to PyPI?

jcfr commented 5 months ago

Suggestions:

jcfr commented 5 months ago

This first version should package v17 without parquet, once this is working, we will then release the version 17.0.0 on PyPI.

After that, we will add the bump-version.yml workflow to effectively detect that a new version if available and release it as well.

fedorov commented 5 months ago

After that, we will add the bump-version.yml workflow to effectively detect that a new version if available and release it as well.

Can you clarify what you mean by this? Effectively detect new version of what and in which package?

jcfr commented 5 months ago

Can you clarify what you mean by this? Effectively detect new version of what and in which package?

The bump-idc-data-index-version.yml workflow would be scheduled to run at given frequency (daily, weekly or monthly), it would then do the following:

fedorov commented 5 months ago

The bump-idc-data-index-version.yml workflow would be scheduled to run at given frequency (daily, weekly or monthly), it would then do the following:

I see. We considered this approach before, but at this point I do not believe this is a good approach. First, IDC releases do not happen frequently - once every 1-3 months. So there is no significant burden to respond to IDC version updates manually. Second, new IDC releases will have new data or schema updates, that may break assumptions, or produce index that is too large, or I do not know what else that we can't anticipate. Because of this, it is safer to manually test the queries, and consider the result of the query before publishing.

I would prefer to have:

  1. a manually-triggered workflow that would run the queries and generate artifacts for examination and release preparation
  2. automatic workflow that would be triggered on creation of a release that would generate the index based on the current content of the queries and publish the package on PyPI
vkt1414 commented 5 months ago

@jcfr I was finally able to call the datamanager directly from cmake, cleaned up data manager and cd heavily. I realized I was very far from your vision about using cmake to handle index generation but am also slowly understanding how pyproject.toml, cmake, and hynek/build-and-inspect-python-package gha combination are working. Thanks for your patience. The latest iteration addresses most of not all of the things Andrey would like. I hardcoded to generate both CSV and parquet as I realized we will be hardcoding even if we use cmake lists. But with somehelp, we can fix it if we should ideally parameterize cmake instead of the python file

latest successful run: https://github.com/vkt1414/idc-index-data/actions/runs/8366868268

vkt1414 commented 5 months ago

One more update: After including the GCP authentication step, CI is also working on all combinations from python 3.8-.12 on latest mac, ubuntu, and windows. However pypi 3.10 on latest ubuntu did not work and logs did help to troubleshoot, so I disabled pypi-3.10 for now. latest CI run: https://github.com/vkt1414/idc-index-data/actions/runs/8376050860

fedorov commented 5 months ago

For the reference, this PR was needed to fix the CI failure due to inability to access GHA secrets from the PR submitted from a fork of the repo: https://github.com/ImagingDataCommons/idc-index-data/pull/7.

fedorov commented 5 months ago

The issue with the current approach is that wheels are not reproducible and the content of the index file being generated depend on time at which they are generated.

@jcfr this is a very good point. I did not consider that it is important for the wheel packaging process to be reproducible.

revisit the versioning from being "number" based to be date based (e.g YYYY.MM.DD.N)

The current approach of using IDC data release version as the major version of the package is exactly the purpose of enabling reproducibility. I would argue that mapping the date to the IDC release version is not trivial. But if the user knows the IDC data version, this is all that is needed to make queries reproducible.

Thinking about this a bit more, we could modify the queries to have, for example {idc_versioned_dataset} in place of idc_current, and replace that placeholder with the IDC version based on the idc-index-data major release number. So, for example, if the user generates release tag 17.2.10, GHA workflow would pick 17, replace the dataset name in each of the queries with idc_v17, run queries, and package+publish the result into idc-index-data v17.2.10.

jcfr commented 5 months ago

for example {idc_versioned_dataset} in place of

Indeed, that would be similar to the use of @IDC_TABLENAME@ discussed in https://github.com/ImagingDataCommons/idc-index-data/pull/5#discussion_r1527111484

For convenience, supporting --output-sql-query would also be nice to have.

version

Extracting the major version and map it to the name of the table is an option. For this to work, we would need to also customize the versioning with the following settings:

[tool.setuptools_scm]
version_scheme = "no-guess-dev"

The "challenge" with relying only on the tag is that we would have no way of testing the generation of a wheel associated with a new version of the index by simply creating a pull request, indeed a new tag and release would have to be created first without if it actually work.

To address this I would instead suggest to hard-code the major version in the CMakeLists.txt as we originally discussed.

vkt1414 commented 5 months ago

The "challenge" with relying only on the tag is that we would have no way of testing the generation of a wheel associated with a new version of the index by simply creating a pull request, indeed a new tag and release would have to be created first without if it actually work.

To address this I would instead suggest to hard-code the major version in the CMakeLists.txt as we originally discussed.

If all of the solutions we discussed involve manually updating the version, can't we simply hardcode the idc version in the sql query itself?

jcfr commented 5 months ago

If all of the solutions we discussed involve manually updating the version, can't we simply hardcode the idc version in the sql query itself?

I have a draft of bump function for addressing this and allow updating files using nox -s bump (query the latest) or nox -s bump -- <version>

Before moving forward, we really need @fedorov to integrate https://github.com/ImagingDataCommons/idc-index-data/pull/8 as none of the code in this pull request is being tested.

fedorov commented 5 months ago

@jcfr that PR is integrated now, please go ahead or let us know what we should do to fix this.

fedorov commented 5 months ago

@jcfr I recall when we met last time you mentioned that one could handle packaging with just python, and CMake is not critical. Maybe you can comment on that a bit. I do not want to derail the current development, but also I do not know if learning CMake is a worthwhile investment of effort for Vamsi. Just wanted to raise the idea if he could understand the alternatives better.

jcfr commented 5 months ago

re: CMake vs Python

Ditto. A pure python plugin can be written to implement similar behavior, there is some hints in the "Discussions" section of one of our recent scikit-build meeting^1

Given my limited experience with such hatch plugin, I suggest we move forward with CMakeLists.txt and consider revisiting afterward. Revisiting later will be much easier as we will have a working system.

Also worth noting that I expect the CMakeLists.txt to remain small and be straightforward to maintain.

jcfr commented 5 months ago

re: permission error

For now, it suggest we create the branch feat-cd-2 directly into the repo and not from a fork.

jcfr commented 5 months ago

Closing. This pull request is superseded by https://github.com/ImagingDataCommons/idc-index-data/pull/9