ImagingDataCommons / idc-index

Python package to simplify access to the data available from NCI Imaging Data Commons
https://idc-index.readthedocs.io/
MIT License
11 stars 5 forks source link

list_indices and fetch_index implemented #101

Closed DanielaSchacherer closed 2 months ago

DanielaSchacherer commented 3 months ago

I made a suggestion on how to implement

There is certainly room for improvement, but we can use it as basis for further discussion.

fedorov commented 3 months ago

@DanielaSchacherer to help take care of the CI issues, you can install pre-commit hooks with pre-commit install in the repo folder. This will apply those checks and auto-corrections every time you try to commit locally, so you do not need to wait to see what happens with CI.

fedorov commented 3 months ago

Daniela, you didn't need to close the PR - all of those issues can be resolved by subsequent commits to the same branch.

fedorov commented 3 months ago

@DanielaSchacherer also here's a useful hint - if you are working on a PR and want to mark it as not ready for review, you can set it to "Draft" status - upper right corner of the page "Convert to draft" link.

image
DanielaSchacherer commented 3 months ago

Okay, sorry. I am not yet very familiar with CI tools in Github. Very useful tipps! Thanks!

fedorov commented 3 months ago

@DanielaSchacherer you will have to bear with one more tip!

As you go over small refinements, you will inevitably be adding commits that add small tweaks but do not bring any value for the code development history purposes (such as in your case last the last 4 commits).

In such cases, good practice is to squash those commits into the one commit. In your case, it would be git rebase -i HEAD~5 (you can read about interactive rebase here, for example https://itexus.com/glossary/git-rebase-interactive/ - and I am sure there are better/more comprehensive articles).

Also, the commit message has to be more informative, and, when PR corresponds to an existing issue, it should link to that issue (you can do it easily using github notation and just mention #97). After you squashed those 5 commits into 1, you can modify the commit message with git commit --amend. You would then need to force-push your branch to github with git push -f.

Finally, we use the convention to have prefix indicating the nature of the contribution. This is a good guide to follow: https://slicer.readthedocs.io/en/latest/developer_guide/contributing.html#how-to-write-commit-messages.

fedorov commented 3 months ago

@DanielaSchacherer I am happy to discuss the above on a call - I understand it can be confusing.

DanielaSchacherer commented 3 months ago

Note: I experienced a problem with accessing github assets, as there appears to be an API limit (error 403). That's why one of the test fails. Also as a note: I changed the code to only consider parquet files, instead of parquet and csv. Is that alright?

vkt1414 commented 3 months ago

Note: I experienced a problem with accessing github assets, as there appears to be an API limit (error 403). That's why one of the test fails.

There is a limit of 60 requests per hour.

https://docs.github.com/en/rest/using-the-rest-api/rate-limits-for-the-rest-api?apiVersion=2022-11-28#primary-rate-limit-for-unauthenticated-users

Also as a note: I changed the code to only consider parquet files, instead of parquet and csv. Is that alright?

CSV is too inefficient to revert to. Currently, all indices are generated in the parquet file format only.

fedorov commented 3 months ago

Definitely, no need to consider CSV.

fedorov commented 3 months ago

Note: I experienced a problem with accessing github assets, as there appears to be an API limit (error 403). That's why one of the test fails.

I believe it is failing because you are trying to access sm_index, which is indeed not installed. idc-index currently depends on idc-index-data, and specifically on version 18.0.1 of the latter, which does not contain sm_index release attachment. It was added in 18.1.0.

I am also now questioning whether we should install all of the parquet attachments blindly. If we do that, we need to think how to communicate the descriptions. It might be more practical to hard-code installation of the additional indices. It might be easier this way to explain to the users what they are.

DanielaSchacherer commented 2 months ago

I believe it is failing because you are trying to access sm_index, which is indeed not installed. idc-index currently depends on idc-index-data, and specifically on version 18.0.1 of the latter, which does not contain sm_index release attachment. It was added in 18.1.0.

I think you are right. Additionally, I encountered some API limits before when using specifically the latest idc-index-data version. I am trying to reproduce it now.

I am also now questioning whether we should install all of the parquet attachments blindly. If we do that, we need to think how to communicate the descriptions. It might be more practical to hard-code installation of the additional indices. It might be easier this way to explain to the users what they are.

I am not concerned so much about whether we download all parquet attachments blindly or on user's request. I am struggling more with how to explain available indices and their use (especially, which one is used when calling a function).

Regarding what you said in Slack:

I don't recall when the decision to download all assets was made.

Did we make that decision at all? I thought we made the decision to download them by request? But as said above, in either case we will need good descriptions on the intended use and I am not sure how to do that.

fedorov commented 2 months ago

Did we make that decision at all? I thought we made the decision to download them by request?

Sorry, I was not precise. What I meant is that right now the availability of indices is discovered on the fly by examining all release attachments that match a pattern. An alternative to this could be maintaining a list of known indices within the package, and interacting with GitHub only to retrieve (but not to discover) extra indices. Let's meet to discuss this @DanielaSchacherer.

fedorov commented 2 months ago

Looking at the CI logs, it is error 403

INFO     idc_index.index:index.py:183 Fetching the list of indices from idc-index-data 18.1.0 release.
ERROR    idc_index.index:index.py:199 Failed to fetch releases: 403

At this point, I would just switch to the fixed list of external indices. I personally do not see the value in debugging this. I would do that independently from this error.

DanielaSchacherer commented 2 months ago

@fedorov I'll give it a last try, but you are probably right.

fedorov commented 2 months ago

@DanielaSchacherer I forgot to mention it earlier, but I noticed you are working in the main branch of your fork. This, in my experience, is not a good idea, since usually you would want the main branch of your fork to track main upstream. This also creates challenges for collaborators, since they would need to deal with two main branches - yours and upstream.

Next time, I recommend you first make a dedicated branch for your PR.

This and earlier issues motivated me to add contribution guidelines to the repo - if you see anything missing there, please let me know: https://github.com/ImagingDataCommons/idc-index/blob/main/CONTRIBUTING.md.

fedorov commented 2 months ago

@DanielaSchacherer I resolved the conflicts locally, and rebased to the current main branch. I have no idea why the author has changed. I also have no idea why the tests are now failing .... To be safe, I copied your main branch here (before rebase and push into your current main) so you can compare the content: https://github.com/fedorov/idc-index/tree/100-main-daniela-backup.

Maybe you can take a look first, and then I will continue with my review? Or you can make a separate from main branch, we can close this PR and open a new one, if we think it will take some more time to iterate on this...

Sorry for not providing the guidance how to set up dev process from the start. This contributed a lot to the current confusion...

vkt1414 commented 2 months ago

I also have no idea why the tests are now failing

The tests are failing because.. the url is not correct. If we are hardcoding the urls..there is no need to use API anymore.. we need to modify.. asset_endpoint_url to https://github.com/ImagingDataCommons/idc-index-data/releases/download/{idc_index_data.__version__}

https://api.github.com/repos/ImagingDataCommons/idc-index-data/releases/tags/18.0.1

image

fedorov commented 2 months ago

I also have no idea why the tests are now failing

What I meant to say is that I could not explain how the tests were succeeding before I resolved the conflict, but are failing now. Perhaps I messed something up while resolving the conflicts, and maybe that is the API URL. I was thinking Daniela would be best to review and confirm. I admit I did not have the time to actually review this PR.

DanielaSchacherer commented 2 months ago

@fedorov thank you for setting up the contribution README, that's very helpful and hopefully it will prevent more merge conflicts.

It was a combination of the URL (what Vamsi said) and some confusion about whether indices_overview is a dictionary or dataframe.

DanielaSchacherer commented 2 months ago

We are still missing descriptions of the indices. Is that something we should target here or does it make sense to have a separate PR for that?

fedorov commented 2 months ago

I will take a shot at that as part of my review!

fedorov commented 2 months ago

@DanielaSchacherer I guess I lost track of the original purpose of this PR while reviewing yesterday. Are you planning to update the API to allow download of the instances based on the instance index as part of this PR?

DanielaSchacherer commented 2 months ago

@fedorov For that very reason, I would prefer to later make a new PR for this and for now continue the discussion that we started in https://github.com/ImagingDataCommons/idc-index/issues/97.