aphp / eds-scikit

eds-scikit is a Python library providing tools to process and analyse OMOP data
https://aphp.github.io/eds-scikit
BSD 3-Clause "New" or "Revised" License

Allow person_ids arg in HiveData read_table #47

Closed TheooJ closed 1 year ago

TheooJ commented 1 year ago

Description

When HiveData is created with the person_ids argument, accessing a table (e.g. data.visit_occurrence) raises an error: AttributeError: 'set' object has no attribute 'columns'.

This is because we're attempting to join df (the table) on self.person_ids, which is a set, instead of self.person_ids_df, which is the DataFrame of the person_ids to keep.
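The difference between the two attributes can be illustrated with a small sketch. It uses plain pandas as a stand-in for Koalas/Spark, and the variable names and toy values are assumptions made for illustration, not eds-scikit code: a set exposes no `.columns` attribute (the attribute the Koalas join looks up in the traceback), while a one-column DataFrame of identifiers can be used for an inner join.

```python
import pandas as pd

# Stand-in for self.person_ids: a plain Python set of identifiers.
person_ids = {"person_id_1", "person_id_2"}

# Stand-in for self.person_ids_df: the same identifiers as a one-column DataFrame.
person_ids_df = pd.DataFrame({"person_id": sorted(person_ids)})

# Toy table standing in for data.visit_occurrence (values are made up).
visits = pd.DataFrame(
    {
        "person_id": ["person_id_1", "person_id_2", "person_id_3"],
        "visit_occurrence_id": [10, 11, 12],
    }
)

# A set has no `.columns`; a DataFrame does.
set_has_columns = hasattr(person_ids, "columns")
df_has_columns = hasattr(person_ids_df, "columns")

# An inner merge against the DataFrame keeps only the requested patients.
filtered = visits.merge(person_ids_df, on="person_id", how="inner")
```

Here pandas `merge` plays the role of the Koalas/Spark `join`; the point is only that the right-hand side must be a DataFrame, not a set.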

Fix

https://github.com/aphp/eds-scikit/blob/001fe9bd139fdee10ffc78129bdafcfd9fcfbad8/eds_scikit/io/hive.py#L228
Replace self.person_ids with self.person_ids_df.

https://github.com/aphp/eds-scikit/blob/001fe9bd139fdee10ffc78129bdafcfd9fcfbad8/eds_scikit/io/hive.py#L226
This line should come after df = df.join; otherwise we're joining a Spark DataFrame with a Koalas DataFrame.

How to reproduce the bug before fix

import pandas as pd
from eds_scikit.io import HiveData
from eds_scikit import improve_performances

spark, sc, sql = improve_performances()

# This will run and return "Number of unique patients: 2"
data = HiveData(
    database_name="cse_210037_20220831",
    database_type="I2B2",
    person_ids=pd.Series(["person_id_1", "person_id_2"])
)

# This fails with AttributeError: 'set' object has no attribute 'columns'
len(data.visit_occurrence)

Error log

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_8025/3547872499.py in <module>
----> 1 len(data.note_deid)

~/.user_conda/miniconda/envs/myenv/lib/python3.7/site-packages/eds_scikit/io/hive.py in __getattr__(self, table_name)
    374         elif table_name in self.available_tables:
    375             # Add to cache dictionnary during the first call.
--> 376             table = self._read_table(table_name)
    377             self._tables[table_name] = table
    378             return table

~/.user_conda/miniconda/envs/myenv/lib/python3.7/site-packages/eds_scikit/io/hive.py in _read_table(self, table_name, person_ids, to_koalas)
    228         person_ids = person_ids or self.person_ids
    229         if "person_id" in df.columns and person_ids is not None:
--> 230             df = df.join(person_ids, on="person_id", how="inner")
    231
    232         df = clean_dates(df)

~/.user_conda/miniconda/envs/myenv/lib/python3.7/site-packages/databricks/koalas/frame.py in join(self, right, on, how, lsuffix, rsuffix)
   7889             common = list(self.columns.intersection([right.name]))
   7890         else:
-> 7891             common = list(self.columns.intersection(right.columns))
   7892         if len(common) > 0 and not lsuffix and not rsuffix:
   7893             raise ValueError(

AttributeError: 'set' object has no attribute 'columns'


codecov-commenter commented 1 year ago

Codecov Report

Patch and project coverage have no change.

Comparison is base (001fe9b) 83.84% compared to head (62f463c) 83.84%.


Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main      #47   +/-   ##
=======================================
  Coverage   83.84%   83.84%
=======================================
  Files          82       82
  Lines        2494     2494
=======================================
  Hits         2091     2091
  Misses        403      403
```

| [Impacted Files](https://app.codecov.io/gh/aphp/eds-scikit/pull/47?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=aphp) | Coverage Δ | |
|---|---|---|
| [eds\_scikit/io/hive.py](https://app.codecov.io/gh/aphp/eds-scikit/pull/47?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=aphp#diff-ZWRzX3NjaWtpdC9pby9oaXZlLnB5) | `100.00% <ø> (ø)` | |
