[BUG] Issues when storing/loading Qrels from a dataframe and a parquet file.

AmenRa / ranx

⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍

https://amenra.github.io/ranx

MIT License

427 stars, 23 forks

[BUG] Issues when storing/loading Qrels from a dataframe and a parquet file. #53

Closed knife982000 closed 11 months ago

knife982000 commented 11 months ago

Describe the bug: Reconstructing Qrels from a pandas DataFrame fails. This also affects reading Qrels from a parquet file, because the pandas-to-Qrels conversion is used internally.

Pandas version: 1.5.2. ranx: latest pip version.

To reproduce, run this code:

from ranx import Qrels
qrels = Qrels({'1':{'1':1}, '2':{'2': 1}})
df = qrels.to_dataframe()
Qrels.from_df(df)

Output:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...\lib\site-packages\ranx\data_structures\qrels.py", line 300, in from_df
    assert df[score_col].dtype == int, "DataFrame scores column dtype must be `int`"
AssertionError: DataFrame scores column dtype must be `int`

About the DataFrame dtypes: the score column is int64, which does not compare equal to int here.

>>> df.dtypes
q_id      object
doc_id    object
score      int64
dtype: object

>>> import numpy as np
>>> df.dtypes['score'] == np.int64
True

AmenRa commented 11 months ago

Hi, thanks for reporting this issue. I will fix it in the next release.

AmenRa commented 11 months ago

Hi, I cannot reproduce.

int64 is standard Python int.

With ranx:

from ranx import Qrels

qrels = Qrels({"1": {"1": 1}, "2": {"2": 1}})
df = qrels.to_dataframe()

df.dtypes["score"] == int

Output:

True
True

With Pandas, to be sure there is no dtype conversion:

from pandas import DataFrame

df = DataFrame.from_dict({"q_id": ["1", "2"], "doc_id": ["1", "2"], "score": [1, 1]})

df.dtypes["score"] == int

Output:

True
True

knife982000 commented 11 months ago

Hi,

I looked more into the problem, and the issue is most likely related to the operating system.

I ran the following code on the different machines I have access to.

import pandas as pd

df = pd.DataFrame(data=[[1,2],[2,3]], columns=['a', 'b'])

df.dtypes['a'] == int

The output changes with the OS. On Linux and macOS it is True; on Windows it is False.

All the installations are based on a conda environment. That might be why you cannot reproduce the bug.

NumPy's default integer dtype is OS dependent: https://numpy.org/doc/stable/user/basics.creation.html
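To make the platform dependence easier to see, here is a minimal sketch (not taken from ranx) of what a comparison against the built-in `int` does. As far as I understand it, NumPy converts `int` with `np.dtype(int)` before comparing, and the width of that dtype depends on the platform and the NumPy version. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Same kind of frame as in the reproduction: an integer "score" column.
df = pd.DataFrame({"score": [1, 1]})
column_dtype = df["score"].dtype

# `np.dtype(int)` is the dtype NumPy uses for the built-in `int`.
# Its width is platform dependent, so it is inspected here, not assumed.
platform_int = np.dtype(int)

# Comparing against `int` is equivalent to comparing against np.dtype(int).
matches_builtin_int = column_dtype == int
matches_platform_int = column_dtype == platform_int

print("column dtype:       ", column_dtype)
print("np.dtype(int):      ", platform_int)
print("dtype == int:       ", matches_builtin_int)
print("dtype == np.int64:  ", column_dtype == np.int64)
```

On a machine where `np.dtype(int)` is 32 bits wide, the third line would print False even though the column holds integers, which matches the assertion failure in the traceback.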

AmenRa commented 11 months ago

Sorry to hear that. Yes, I work only with MacOS and Linux machines. Have you tried using WSL?

knife982000 commented 11 months ago

The WSL output is consistent with Linux, so it is only a Windows issue. Anyway, I think that testing for np.int64 instead of int should work, and it is platform independent.

Thank you for answering. I really like your library.
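For anyone hitting this on an older version, the suggested check can be sketched as below. This is an assumption about what a platform-independent check could look like, not the actual ranx code: `has_int64_scores` and `has_integer_scores` are hypothetical helper names, and `score_col` only mirrors the variable seen in the traceback. The second helper uses `pandas.api.types.is_integer_dtype`, which accepts any integer width.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_integer_dtype


def has_int64_scores(df: pd.DataFrame, score_col: str = "score") -> bool:
    """Hypothetical strict check: the column must be exactly int64."""
    return bool(df[score_col].dtype == np.int64)


def has_integer_scores(df: pd.DataFrame, score_col: str = "score") -> bool:
    """Hypothetical lenient check: any integer dtype (int32, int64, ...) passes."""
    return bool(is_integer_dtype(df[score_col]))


# Frames with explicitly chosen dtypes, so the result does not depend on the OS.
df_int64 = pd.DataFrame({"score": np.array([1, 1], dtype=np.int64)})
df_int32 = pd.DataFrame({"score": np.array([1, 1], dtype=np.int32)})
df_float = pd.DataFrame({"score": np.array([1.0, 1.0], dtype=np.float64)})

for name, frame in [("int64", df_int64), ("int32", df_int32), ("float64", df_float)]:
    print(name, has_int64_scores(frame), has_integer_scores(frame))
```

The strict check keeps the original intent of the assertion (scores must be 64-bit integers), while the lenient one would also accept narrower integer columns. Which of the two fits better depends on what the rest of the library expects.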

AmenRa commented 11 months ago

I changed the data type check as you suggested. Please update to v0.3.18.

Thank you.