astronomy-commons / hipscat-import

HiPSCat import - generate HiPSCat-partitioned catalogs
https://hipscat-import.readthedocs.io
BSD 3-Clause "New" or "Revised" License
5 stars 3 forks source link

Decode utf8 when using astropy.table.Table.to_pandas() #281

Closed Schwarzam closed 3 months ago

Schwarzam commented 4 months ago

Feature request or suggestion

The current implementation of converting FITS data to pandas DataFrames using astropy.table.Table.to_pandas() results in string columns being returned as byte strings. Example:

Field ID
b'STRIPE82-0001' b'iDR4_3_STRIPE82-0001_0018523'
b'STRIPE82-0001' b'iDR4_3_STRIPE82-0001_0018586'
b'STRIPE82-0001' b'iDR4_3_STRIPE82-0001_0018701'
b'STRIPE82-0001' b'iDR4_3_STRIPE82-0001_0018856'
b'STRIPE82-0001' b'iDR4_3_STRIPE82-0001_0018876'

This could be solved by adding this on FitsReader.read():

while read_rows < total_rows:
  # - Changes -
  df_chunk = table[read_rows : read_rows + self.chunksize].to_pandas()
  for column in df_chunk.columns:
      if df_chunk[column].dtype == object and df_chunk[column].apply(lambda x: isinstance(x, bytes)).any():
          df_chunk[column] = df_chunk[column].apply(lambda x: x.decode('utf-8') if isinstance(x, bytes) else x)

  yield df_chunk
  # --------

  read_rows += self.chunksize

I ran some tests to check the performance on some ways to decode 2 columns with 10mi objects.

import pandas as pd
import numpy as np
import timeit

# Generate a large DataFrame
np.random.seed(0)
size = 100_000_000
data = {
    'bytes_col1': [b'hello', b'world', b'test', b'data'] * (size // 4),
    'bytes_col2': [b'python', b'pandas', b'numpy', b'astropy'] * (size // 4),
    'int_col': np.random.randint(1, 100, size),
    'float_col': np.random.random(size)
}
df = pd.DataFrame(data)

def method_stack_unstack(df):
    str_columns = df.select_dtypes([object])
    df[str_columns.columns] = str_columns.stack().str.decode('utf-8').unstack()
    return df

def method_direct_apply(df):
    for column in df.columns:
        if df[column].dtype == object:
            df[column] = df[column].apply(lambda x: x.decode('utf-8') if isinstance(x, bytes) else x)
    return df

def method_vectorized_str_decode(df):
    for column in df.columns:
        if df[column].dtype == object:
            df[column] = df[column].str.decode('utf-8')
    return df

def method_list_comprehension(df):
    for column in df.columns:
        if df[column].dtype == object:
            df[column] = [x.decode('utf-8') for x in df[column]]
    return df

# Function to measure execution time
def measure_time(method, df):
    setup_code = f"from __main__ import df, {method.__name__}"
    stmt = f"{method.__name__}(df.copy())"
    times = timeit.repeat(setup=setup_code, stmt=stmt, repeat=3, number=1)
    return min(times)

# Comparing performance
methods = [method_stack_unstack, method_direct_apply, method_vectorized_str_decode, method_list_comprehension]
results = {method.__name__: measure_time(method, df) for method in methods}

for name, time in results.items():
    print(f"Time taken by {name}: {time:.5f} seconds")

This is what I got on my M1.

Time taken by method_stack_unstack: 7.13616 seconds Time taken by method_direct_apply: 2.56532 seconds Time taken by method_vectorized_str_decode: 2.86572 seconds Time taken by method_list_comprehension: 2.42810 seconds

nevencaplar commented 4 months ago

@Schwarzam This solutions that you propose sounds good. Would you like to contribute it to the codebase? We made you a member of hipscat-friends group which should make it possible for you to contribute to the github directly.