ImagingDataCommons / idc-index-data

Python package providing the index to query and download data hosted by the NCI Imaging Data Commons
MIT License

Automate index updates and packaging #2

Open · fedorov opened this issue 9 months ago

fedorov commented 9 months ago

@vkt1414 I suggest we add the following GitHub Actions:

  1. Daily action that checks whether the IDC release BQ tables have been updated (see the sketch after this list). If an update is detected, it will
    1. run all queries in the queries folder, and save the result of each query as <query_prefix>.csv.zip in the "latest" release
    2. make a PR to update the IDC version in https://github.com/ImagingDataCommons/idc-index/blob/main/idc_index/index.py#L65.
  2. Commit-triggered action that looks for release tags following our versioning pattern. When a matching tag is detected, it will:
    1. create a GitHub release with the release tag
    2. attach the indices from "latest" to the new release
    3. trigger the PyPI package release
  3. Commit-triggered action that will
    1. run all the tests
    2. if queries are updated, re-run them and update the CSVs in the "latest" release
  4. PR-triggered action that will
    1. if queries are updated, re-run the queries first and save the resulting CSVs in a location accessible during the tests
    2. run all the tests (for this, I think it would be beneficial to be able to run the tests against a manually configured table location passed via the constructor)
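
For item 1, a minimal sketch of how the "new IDC release" check could work using the BigQuery Python client; the pinned version constant and the follow-up actions are placeholders for illustration, not the actual implementation:

```python
# Minimal sketch of the daily "new IDC release?" check (names and values are illustrative).
import re

from google.cloud import bigquery

# The IDC version the package currently targets; in the real workflow this would
# be read from idc_index/index.py rather than hard-coded here.
CURRENT_IDC_VERSION = 17  # hypothetical value


def latest_idc_version() -> int:
    """Return the highest idc_v* dataset number published in bigquery-public-data."""
    client = bigquery.Client()
    return max(
        int(match.group(1))
        for dataset in client.list_datasets(project="bigquery-public-data")
        if (match := re.fullmatch(r"idc_v(\d+)", dataset.dataset_id))
    )


if __name__ == "__main__":
    latest = latest_idc_version()
    if latest > CURRENT_IDC_VERSION:
        print(f"Update detected: idc_v{latest}")
        # ... run the queries and open the version-bump PR from here
    else:
        print("No new IDC release")
```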

What do you think? Did I miss anything?

vkt1414 commented 9 months ago

re task 1: How should we handle the case where a pull request is not attended to within a day?

fedorov commented 9 months ago

Good question! I think we should overwrite the branch corresponding to the PR. Also, now that I think about it, the "latest" release and its attachments should be committed only on merge, not when the PR is created.

fedorov commented 6 months ago

Based on further thought and discussions, here is the revised proposed behavior of the GitHub Action for facilitating index updates:

  1. Manual trigger only for now.
  2. Take all of the queries in the queries folder and run them.
  3. Create artifacts containing the result of each query, saved as both CSV and Parquet. Files should be named consistently with the query file name and should include the IDC version, determined by resolving what idc_current maps to at the time the query is executed (see the sketch after this list).
  4. To get the number of the latest IDC version, list all of the datasets in the bigquery-public-data project and pick the latest idc_v*.
  5. Create an issue that includes links to the artifacts generated by the GHA, with a title along the lines of "[github-action] Index updates IDC v...".
  6. Replace idc_current with the actual version in each query and save the pinned queries as GHA artifacts.

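To make steps 2, 3, and 6 concrete, here is a rough sketch of what the query-running part of the workflow could look like; the folder layout, file-naming convention, and pinned version are assumptions rather than a final design:

```python
# Illustrative sketch of steps 2, 3, and 6: run every query in the queries folder
# against a concrete IDC version and save CSV/Parquet artifacts.
from pathlib import Path

from google.cloud import bigquery

idc_version = "idc_v17"  # would come from the version-resolution step (item 4)
client = bigquery.Client()

for query_file in sorted(Path("queries").glob("*.sql")):
    sql = query_file.read_text()

    # Step 6: pin the query to the concrete release instead of idc_current
    pinned_sql = sql.replace("idc_current", idc_version)

    # Step 2: run the query (requires pandas plus db-dtypes/pyarrow to be installed)
    df = client.query(pinned_sql).to_dataframe()

    # Step 3: name outputs after the query file and include the IDC version
    stem = f"{query_file.stem}_{idc_version}"
    df.to_csv(f"{stem}.csv", index=False)
    df.to_parquet(f"{stem}.parquet")

    # Keep the pinned query itself as an artifact as well
    Path(f"{stem}.sql").write_text(pinned_sql)
```
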
fedorov commented 5 months ago

@vkt1414 we discussed this with JC, and with the new layout of the repositories, it makes sense to move the queries from idc-index to this repo and to upload the resulting CSV/Parquet files to PyPI as part of the package. We won't need to attach the zip file to the release.
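
For illustration, if the CSV/Parquet files are shipped as package data on PyPI, the consuming code in idc-index might locate them roughly like this (the package contents and file name below are hypothetical, not the final API):

```python
# Hypothetical example of loading the packaged index at runtime, assuming
# idc-index-data ships a file such as idc_index.parquet as package data.
from importlib.resources import as_file, files

import pandas as pd

with as_file(files("idc_index_data") / "idc_index.parquet") as index_path:
    index_df = pd.read_parquet(index_path)

print(index_df.shape)
```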