Open sarakh1999 opened 1 month ago
Thanks so much @sarakh1999 for raising this issue! I've noticed that parallel or multithreaded processing does sometimes have issues within Jupyter environments (whether through Parsl, built-ins, or otherwise). That said, I've regularly used the latest revisions of CytoTable inside of local Jupyter environments and also through Google Colab.
Could I ask for more detail about your Python version, Jupyter environment, OS, and other system information that might be helpful for debugging? If you use environment files (like a conda `.yml` or a `pyproject.toml` file) or lockfiles (such as `poetry.lock` or `conda-lock.yml`), these could also help in navigating any dependency issues that may be happening.
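If it helps, a quick way to gather most of that information from inside the same notebook kernel is something like the following (stdlib only; the package names checked are just the ones relevant here):

```python
import platform
import sys
from importlib import metadata

# Report interpreter and OS details from inside the running kernel.
info = {
    "python": sys.version.split()[0],
    "implementation": platform.python_implementation(),
    "os": platform.platform(),
    "machine": platform.machine(),
}
for key, value in info.items():
    print(f"{key}: {value}")

# Report installed versions of the packages involved, if present.
for pkg in ("cytotable", "parsl", "notebook"):
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed")
```

Pasting that output into the issue would cover most of the environment questions above.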
Take a look at the following function:
```python
import random
import sqlite3

import pandas as pd
import polars as pl
import pyarrow as pa
import pyarrow.parquet as pq
from cytotable import convert
from parsl.config import Config
from parsl.executors import ThreadPoolExecutor

# Constants for columns
COLUMNS = (
    "TableNumber",
    "ImageNumber",
    "ObjectNumber",
    "Metadata_Well",
    "Metadata_Plate",
    "Cytoplasm_Parent_Cells",
    "Cytoplasm_Parent_Nuclei",
)

# Modified convert_parquet function to read data chunk by chunk
def convert_parquet(
    input_file,
    output_file,
    cols=COLUMNS,
    chunk_size=150000,
    thread=2,
    initial_offset=0,
    offset_step=100,
):
    ...
```
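For context on the `chunk_size` / `initial_offset` parameters: they suggest `LIMIT`/`OFFSET` pagination over the SQLite source. A minimal stdlib sketch of that pattern (the table name and demo data below are made up for illustration, not CytoTable's actual internals):

```python
import os
import sqlite3
import tempfile

def read_in_chunks(db_path, table, chunk_size=150000, initial_offset=0):
    """Yield rows from `table` in fixed-size chunks via LIMIT/OFFSET."""
    conn = sqlite3.connect(db_path)
    try:
        offset = initial_offset
        while True:
            rows = conn.execute(
                f"SELECT * FROM {table} LIMIT ? OFFSET ?",
                (chunk_size, offset),
            ).fetchall()
            if not rows:
                break
            yield rows
            offset += chunk_size
    finally:
        conn.close()

# Build a throwaway database with 10 rows to demonstrate the chunking.
path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE Cells (ObjectNumber INTEGER)")
conn.executemany("INSERT INTO Cells VALUES (?)", [(i,) for i in range(10)])
conn.commit()
conn.close()

chunks = list(read_in_chunks(path, "Cells", chunk_size=4))
print([len(c) for c in chunks])  # → [4, 4, 2]
```

Paginating this way keeps memory bounded at roughly one chunk of rows at a time, which matches the intent described for the modified function.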
When I run it in a Jupyter notebook, the function does not work and raises an error related to threading, but when I run the exact same code as a Python script it works without issue. My guess is that the problem lies in how this library handles multithreading within the Jupyter kernel.
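Without the exact traceback it's hard to pin down the root cause, but one workaround that often sidesteps notebook-kernel threading issues (not specific to CytoTable, just a sketch) is to run the conversion in a freshly spawned child process. `run_conversion` here is a hypothetical stand-in for the `convert_parquet` call above:

```python
import multiprocessing as mp

def run_conversion(input_file, output_file):
    # Hypothetical stand-in: the real convert_parquet(...) call would go here.
    return f"converted {input_file} -> {output_file}"

if __name__ == "__main__":
    # "spawn" gives the child a clean interpreter, so it does not inherit
    # threads or locks from the notebook kernel's process.
    ctx = mp.get_context("spawn")
    with ctx.Pool(1) as pool:
        result = pool.apply(run_conversion, ("input.sqlite", "output.parquet"))
    print(result)
```

Whether this actually avoids the error would depend on what the traceback shows, so sharing that output would still be the most useful next step.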