ICB-DCM / pyABC

distributed, likelihood-free inference
https://pyabc.rtfd.io
BSD 3-Clause "New" or "Revised" License

Develop branch with pandas>=2.2.0 fails with "Query must be a string unless using sqlalchemy" #632

Closed: omsai closed this issue 8 months ago

omsai commented 8 months ago

Bug description

The develop branch of pyabc currently requires SQLAlchemy<2.0.0. However, pandas now supports only SQLAlchemy>=2.0.0 (see https://pandas.pydata.org/docs/getting_started/install.html#sql-databases). Running the develop branch with pandas>=2.2.0 produces this traceback:

$ hatch -v run test:pytest
cmd [1] | pytest --pdb -s
============================= test session starts ==============================
platform darwin -- Python 3.11.7, pytest-8.0.2, pluggy-1.4.0
rootdir: /Users/pnanda/src/pyabc-bug
collected 2 items                                                              

src/test_modelcal.py .ABC.History INFO: Start <ABCSMC id=1, start_time=2024-03-08 15:54:53>
ABC INFO: Calibration sample t = -1.
ABC.History INFO: Done <ABCSMC id=1, duration=0:00:00.082585, end_time=2024-03-08 15:54:53>
F
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> captured log >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
INFO     ABC.History:history.py:488 Start <ABCSMC id=1, start_time=2024-03-08 15:54:53>
INFO     ABC:smc.py:550 Calibration sample t = -1.
INFO     ABC.History:history.py:696 Done <ABCSMC id=1, duration=0:00:00.082585, end_time=2024-03-08 15:54:53>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> traceback >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

    def test_fit():
>       fit(max_nr_populations=10)

src/test_modelcal.py:17: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
src/modelcal.py:79: in fit
    abc.run(**kwargs)
../../Library/Application Support/hatch/env/virtual/calibrate/62I2aHIQ/test/lib/python3.11/site-packages/pyabc/inference/smc.py:62: in wrapped_run
    ret = run(self, *args, **kwargs)
../../Library/Application Support/hatch/env/virtual/calibrate/62I2aHIQ/test/lib/python3.11/site-packages/pyabc/inference/smc.py:687: in run
    t0: int = self.initialize_components_before_run(
../../Library/Application Support/hatch/env/virtual/calibrate/62I2aHIQ/test/lib/python3.11/site-packages/pyabc/inference/smc.py:768: in initialize_components_before_run
    self._initialize_dist_eps_acc(t0)
../../Library/Application Support/hatch/env/virtual/calibrate/62I2aHIQ/test/lib/python3.11/site-packages/pyabc/inference/smc.py:463: in _initialize_dist_eps_acc
    self.distance_function.initialize(
../../Library/Application Support/hatch/env/virtual/calibrate/62I2aHIQ/test/lib/python3.11/site-packages/pyabc/distance/aggregate.py:271: in initialize
    sample = get_sample()
../../Library/Application Support/hatch/env/virtual/calibrate/62I2aHIQ/test/lib/python3.11/site-packages/pyabc/inference/smc.py:445: in get_initial_sample
    population = self._get_initial_population(t - 1)
../../Library/Application Support/hatch/env/virtual/calibrate/62I2aHIQ/test/lib/python3.11/site-packages/pyabc/inference/smc.py:525: in _get_initial_population
    population = self._sample_from_prior(t)
../../Library/Application Support/hatch/env/virtual/calibrate/62I2aHIQ/test/lib/python3.11/site-packages/pyabc/inference/smc.py:559: in _sample_from_prior
    ana_vars=self._vars(t=t),
../../Library/Application Support/hatch/env/virtual/calibrate/62I2aHIQ/test/lib/python3.11/site-packages/pyabc/inference/smc.py:1074: in _vars
    prev_eps=eps_from_hist(history=self.history, t=t - 1),
../../Library/Application Support/hatch/env/virtual/calibrate/62I2aHIQ/test/lib/python3.11/site-packages/pyabc/inference_util/inference_util.py:622: in eps_from_hist
    pops = history.get_all_populations()
../../Library/Application Support/hatch/env/virtual/calibrate/62I2aHIQ/test/lib/python3.11/site-packages/pyabc/storage/history.py:42: in f_wrapper
    res = f(self, *args, **kwargs)
../../Library/Application Support/hatch/env/virtual/calibrate/62I2aHIQ/test/lib/python3.11/site-packages/pyabc/storage/history.py:414: in get_all_populations
    df = pd.read_sql_query(query.statement, self._engine)
../../Library/Application Support/hatch/env/virtual/calibrate/62I2aHIQ/test/lib/python3.11/site-packages/pandas/io/sql.py:526: in read_sql_query
    return pandas_sql.read_query(
../../Library/Application Support/hatch/env/virtual/calibrate/62I2aHIQ/test/lib/python3.11/site-packages/pandas/io/sql.py:2736: in read_query
    cursor = self.execute(sql, params)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <pandas.io.sql.SQLiteDatabase object at 0x14b00a250>
sql = <sqlalchemy.sql.selectable.Select object at 0x14b4efcd0>, params = None

    def execute(self, sql: str | Select | TextClause, params=None):
        if not isinstance(sql, str):
>           raise TypeError("Query must be a string unless using sqlalchemy.")
E           TypeError: Query must be a string unless using sqlalchemy.

../../Library/Application Support/hatch/env/virtual/calibrate/62I2aHIQ/test/lib/python3.11/site-packages/pandas/io/sql.py:2668: TypeError
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> entering PDB >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> PDB post_mortem >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> /Users/pnanda/Library/Application Support/hatch/env/virtual/calibrate/62I2aHIQ/test/lib/python3.11/site-packages/pandas/io/sql.py(2668)execute()
-> raise TypeError("Query must be a string unless using sqlalchemy.")

Expected behavior

Remove the SQLAlchemy<2.0.0 restriction to avoid this error and other potentially undefined behavior with pandas. The problem is present only in the develop branch, not in the main branch.
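For context on the traceback above: `self` is a `pandas.io.sql.SQLiteDatabase` object while `sql` is a SQLAlchemy `Select`, which suggests pandas did not treat the installed SQLAlchemy 1.4.52 as usable and fell back to its plain sqlite3 code path, where only string queries are accepted. The following is a simplified, standard-library-only sketch of that kind of version-gated fallback. It is not pandas' actual code; `Select`, `SQLiteFallback`, `SQLAlchemyBackend`, `build_backend` and `parse_version` are hypothetical names used for illustration only.

```python
import itertools

# Minimum SQLAlchemy version that pandas>=2.2.0 documents as supported.
MIN_SQLALCHEMY = (2, 0, 0)


class Select:
    """Hypothetical stand-in for sqlalchemy.sql.selectable.Select."""

    def __init__(self, text):
        self.text = text


def parse_version(version):
    """Turn '1.4.52' into (1, 4, 52); stop at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


class SQLiteFallback:
    """Illustrative fallback backend that understands only SQL strings."""

    def execute(self, sql):
        if not isinstance(sql, str):
            raise TypeError("Query must be a string unless using sqlalchemy.")
        return f"executed: {sql}"


class SQLAlchemyBackend:
    """Illustrative backend that also accepts query objects."""

    def execute(self, sql):
        text = sql if isinstance(sql, str) else sql.text
        return f"executed: {text}"


def build_backend(installed_sqlalchemy):
    """Pick a backend from the installed SQLAlchemy version (None = absent)."""
    if (
        installed_sqlalchemy is not None
        and parse_version(installed_sqlalchemy) >= MIN_SQLALCHEMY
    ):
        return SQLAlchemyBackend()
    # A too-old SQLAlchemy is handled as if it were not installed at all.
    return SQLiteFallback()
```

Under these assumptions, `build_backend("1.4.52").execute(Select("SELECT 1"))` raises the same `TypeError` message as in the traceback, while `build_backend("2.0.28")` accepts the query object. This is why lifting the SQLAlchemy<2.0.0 pin is the natural fix.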

To reproduce

$ tree -f
.
├── ./pyproject.toml
└── ./src
    ├── ./src/modelcal.py
    └── ./src/test_modelcal.py

2 directories, 3 files
$ grep pandas pyproject.toml
    "pandas<2.2.0",
$ hatch -v run test:pytest --pdb -s
...
============================== 2 passed in 5.40s ===============================
$ rm -f pyabc.db  # otherwise hatch will throw "ValueError: Unable to determine which files to ship inside the wheel ..."
$ rm -r "$(hatch config show | sed -E -n '/data/ s/data = .(.+).$/\1/p')"/env/virtual/calibrate # remove the environment to reinstall with new dependencies
$ grep pandas pyproject.toml
    "pandas>=2.2.0",
$ hatch -v run test:pytest --pdb -s
...
-> raise TypeError("Query must be a string unless using sqlalchemy.")

# pyproject.toml
[build-system]
build-backend = "hatchling.build"
requires = [
    "hatchling",
]

[project]
name = "calibrate"
version = "0.1"
description = "Calibrate example model to available datasets"
dependencies = [
    "pyabc[pyarrow] @git+https://github.com/ICB-DCM/pyABC@develop",
    "pandas>=2.2.0",
]

[tool.hatch.envs.test]
dependencies = [
    "pytest>=8.0.0",
]

[tool.hatch.metadata]
# To use develop git branch of pyabc.
allow-direct-references = true

# src/modelcal.py
"""Fit simple predator-prey model from DOI: 10.3389/fams.2023.1256443"""

import numpy as np
import pandas as pd
import pyabc
from scipy.integrate import odeint

# Initial conditions.
X0 = [10.0, 5.0]
# Timepoints to solve.
N_TIME = 100
TIME = np.linspace(0, 15, N_TIME)
# Fixed parameters.
C = 1.5
D = 0.75

def arr_predator_prey(X, t, a, b):
    '''Return growth rate of prey and predator populations.'''
    return np.array([a*X[0] - b*X[0]*X[1],
                     -C*X[1] + D*b*X[0]*X[1]])

def observed_sum_stat():
    '''Return DataFrame of predator and prey simulated data with noise.'''
    # Ground truth parameters to create data before adding any noise.
    a = 1.0
    b = 0.1
    soln = odeint(arr_predator_prey, X0, TIME, rtol=0.01,
                  args=(a, b, ))
    # Add noise.
    noise = np.random.normal(size=(N_TIME, 2, ))
    noisy = soln + noise
    # The variables should never be negative.
    noisy[noisy < 0] = 0
    return {
        'obs_stats': pd.DataFrame(noisy,
                                  columns=('predator', 'prey', )),
    }

def distance(sim, obs, column):
    diff = np.abs(
        sim['stats'].loc[:, [column]].to_numpy() -
        obs['obs_stats'].loc[:, [column]].to_numpy()
    )
    return np.median(diff)

def distance_predator(sim, obs):
    return distance(sim, obs, 'predator')

def distance_prey(sim, obs):
    return distance(sim, obs, 'prey')

def model_wrap(params):
    soln = odeint(arr_predator_prey, X0, TIME, rtol=0.01,
                  args=(params['a'], params['b'], ))
    return {
        'stats': pd.DataFrame(soln, columns=('predator', 'prey', )),
    }

def fit(**kwargs):
    priors = pyabc.Distribution({
        'a': pyabc.RV('uniform', 0.01, 2),
        'b': pyabc.RV('uniform', 0.01, 2),
    })
    model = model_wrap
    distance = pyabc.distance.AdaptiveAggregatedDistance([
        distance_predator, distance_prey])
    sampler = pyabc.sampler.SingleCoreSampler()
    try:
        abc = pyabc.ABCSMC(model, priors, distance, sampler=sampler)
        abc.new("sqlite:///pyabc.db", observed_sum_stat())
        abc.run(**kwargs)
    finally:
        # https://github.com/ICB-DCM/pyABC/issues/386#issuecomment-753325664
        sampler.stop()

# src/test_modelcal.py
import numpy as np
import pyabc

from modelcal import observed_sum_stat, N_TIME, fit

def test_observed_sum_stat():
    dict_df = observed_sum_stat()
    df = dict_df['obs_stats']
    assert all(df.columns == ['predator', 'prey'])
    assert len(df) == N_TIME
    assert df.to_numpy().size == N_TIME * 2
    assert df.to_numpy().min() >= 0

def test_fit():
    fit(max_nr_populations=10)
    history = pyabc.History("sqlite:///pyabc.db")
    assert history.total_nr_simulations
    assert history.n_populations
    params = history.get_distribution()[0].median().to_numpy()
    assert all(np.isclose(params, [1.0, 0.1], rtol=0.1))

Environment

$ sw_vers 
ProductName:        macOS
ProductVersion:     14.3.1
BuildVersion:       23D60

$ hatch shell

$ python --version
Python 3.11.7

$ pip list
Package            Version
------------------ -----------
async-timeout      4.0.3
click              8.1.7
cloudpickle        3.0.0
contourpy          1.2.0
cycler             0.12.1
dask               2024.2.1
distributed        2024.2.1
fonttools          4.49.0
fsspec             2024.2.0
gitdb              4.0.11
GitPython          3.1.42
importlib_metadata 7.0.2
jabbar             0.0.16
Jinja2             3.1.3
joblib             1.3.2
kiwisolver         1.4.5
locket             1.0.0
MarkupSafe         2.1.5
matplotlib         3.8.3
msgpack            1.0.8
numpy              1.26.4
packaging          23.2
pandas             2.2.1
partd              1.4.1
pillow             10.2.0
pip                24.0
psutil             5.9.8
pyabc              0.12.13
pyarrow            15.0.1
pyparsing          3.1.2
python-dateutil    2.9.0.post0
pytz               2024.1
PyYAML             6.0.1
redis              5.0.2
scikit-learn       1.4.1.post1
scipy              1.12.0
setuptools         69.1.0
six                1.16.0
smmap              5.0.1
sortedcontainers   2.4.0
SQLAlchemy         1.4.52
tblib              3.0.0
threadpoolctl      3.3.0
toolz              0.12.1
tornado            6.4
tzdata             2024.1
urllib3            2.2.1
wheel              0.42.0
zict               3.0.0
zipp               3.17.0
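
The two entries that matter in the list above are pandas 2.2.1 and SQLAlchemy 1.4.52. As a rough aid for spotting the same combination in another environment, here is a small sketch that uses only the standard library. The helper names (`installed_version`, `parse_version`, `sqlalchemy_ok_for_pandas`) are hypothetical, and the check encodes only the single rule discussed in this issue: pandas>=2.2.0 expects SQLAlchemy>=2.0.0.

```python
import itertools
from importlib import metadata


def installed_version(name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def parse_version(version):
    """Turn '2.2.1' into (2, 2, 1); stop at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def sqlalchemy_ok_for_pandas(pandas_version, sqlalchemy_version):
    """Check only the rule from this issue: pandas>=2.2.0 needs SQLAlchemy>=2.0.0."""
    if parse_version(pandas_version) < (2, 2, 0):
        # Outside the scope of this check. The issue reports that the tests
        # pass with pandas<2.2.0 and SQLAlchemy 1.4.52.
        return True
    return parse_version(sqlalchemy_version) >= (2, 0, 0)


if __name__ == "__main__":
    pandas_v = installed_version("pandas")
    sqlalchemy_v = installed_version("SQLAlchemy")
    if pandas_v is None or sqlalchemy_v is None:
        print("pandas and/or SQLAlchemy is not installed")
    elif sqlalchemy_ok_for_pandas(pandas_v, sqlalchemy_v):
        print(f"OK: pandas {pandas_v} with SQLAlchemy {sqlalchemy_v}")
    else:
        print(f"Mismatch: pandas {pandas_v} with SQLAlchemy {sqlalchemy_v}")
```

For the environment listed here, `sqlalchemy_ok_for_pandas("2.2.1", "1.4.52")` returns `False`, matching the failure reported above.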

stephanmg commented 8 months ago

@omsai, please check the latest commit on the develop branch; it should be fixed.

omsai commented 8 months ago

Yes, I checked that it's fixed now - thank you.