bluesky / databroker

Unified API pulling data from multiple sources
https://blueskyproject.io/databroker
BSD 3-Clause "New" or "Revised" License
33 stars 45 forks source link

Accept numpy.int64 as databroker catalog key #804

Closed padraic-shafer closed 5 months ago

padraic-shafer commented 5 months ago

Description

Databroker will now accept a numpy.int64 value as a key for accessing scans in a databroker catalog, as long as that value can be safely converted to a python int (int32). Otherwise a KeyError is raised. Behavior of using the int-valued key is unchanged by this PR: The key is treated as a scan_id value if positive, or the Nth-most recent scan index if negative. Suitable unit tests have been added to verify this.

Motivation and Context

Databroker is frequently used in conjunction with pandas. By default pandas imports integer columns as numpy.int64; databroker then raises an exception when someone tries to use the int64-valued scan_id as a catalog key.

Although type conversion could be done manually by the calling code, this is such a common scenario that it would improve out-of-the-box user experience to handle this case within databroker.

### Note: Stacked PR This PR builds upon outstanding PR https://github.com/bluesky/databroker/pull/803. Please wait until https://github.com/bluesky/databroker/pull/803 has been accepted before merging this PR.

How Has This Been Tested?

Passes linter and unit tests in GH actions of forked repo.

EDITED to clarify the intention that the int64-valued key be treated as an int-valued key; other than that, databroker behaves no differently.

padraic-shafer commented 5 months ago

@ambarb I believe this fixes the issue that you observed

padraic-shafer commented 5 months ago

Rebased this PR onto main to simplify the PR review. No other changes since previous review.

tacaswell commented 5 months ago

I don't understand why we have to check the value of the int:

python
Python 3.11.8 (main, Feb 12 2024, 14:50:05) [GCC 13.2.1 20230801] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> int(2**456)
186070713419675363980626894819329160794532188335953423432061490990243657757029868371504908982723472783555205531204141550984858016925351936
>>>

Python's native int does not have a fixed bit length (it is one of the BigInt versions...but I don't remember off the top of my head how it works).

padraic-shafer commented 5 months ago

Python's native int does not have a fixed bit length (it is one of the BigInt versions...but I don't remember off the top of my head how it works).

You make a good point. I thought I had a good reason for this, but some changes in the approach were made since then. I'll go back and try without size cutoff under the current scenario.

padraic-shafer commented 5 months ago

@tacaswell Of course you are right about the variable-length int in python 3. This makes this PR vastly simpler.

I've added a separate test to distinguish between key values that are large vs. out-of-range for the catalog (as opposed to overflow of the binary int representation).

I'll go back and update the PR description above.

padraic-shafer commented 5 months ago

After the latest updates: Net changes to databroker code are very minimal now. Test cases have been expanded.