cytomining / CytoTable

Transform CellProfiler and DeepProfiler data for processing image-based profiling readouts with Pycytominer and other Cytomining tools.
https://cytomining.github.io/CytoTable/
BSD 3-Clause "New" or "Revised" License
7 stars 5 forks source link

Chunk_size parameter when using with preset + joins variables causes ordering of `Metadata_ImageNumber` column #85

Open jenna-tomkinson opened 1 year ago

jenna-tomkinson commented 1 year ago

Problem

When using the chunk_size parameter above 1000 with my code, it is causing the "ImageNumber" column in the per_nuclei.parquet file to start from 695 and not 1. This causes the file after concat to only have 695 to the last image number, but it has the correct number of rows for nuclei compared to the SQLite file. I had to change from LEFT JOIN to INNER so that I could get the correct number of rows based on the SQLite file. Using LEFT or RIGHT caused chaos because it was generating extra rows where the nuclei features were all NaN.

Here is the code I was using:

# Only select from the two compartments that we have and merge them
preset_join = """
SELECT
    *
FROM
    read_parquet('per_image.parquet') as per_image
INNER JOIN read_parquet('per_nuclei.parquet') AS per_nuclei ON
    per_nuclei.Metadata_ImageNumber = per_image.Metadata_ImageNumber
"""

for plate_folder in sqlite_dir.iterdir():
    output_path = pathlib.Path(f"{output_dir}/{plate_folder.stem}_converted.parquet")
    # merge single cells and output as parquet file
    convert(
        source_path=str(plate_folder),
        dest_path=str(output_path),
        dest_datatype=dest_datatype,
        preset=preset,
        joins=preset_join,
        parsl_config=local_htex,
        chunk_size=10000
    )
    break

Solution

If I don't use the chunk_size parameter, it looks like it is fixed. But, it goes from running for 1 minute to 8 minutes without the chunk_size increased. I am not sure what is going on at all.

Please feel free to change the title to make it more concise and I am happy to explain and demonstrate with my dataset what is happening.

jenna-tomkinson commented 1 year ago

Update

It looks like when using the unique() function, it was printing all of the unique values in the column "Metadata_ImageNumber" in the same order as the row index, so that is why I saw 695 at the start:

image

Once I sorted in order by the "Metadata_ImageNumber" column, I could see that the values for this did not start from 695 and leave out all numbers before:

image

But, I am find that I am missing ImageNumbers and wells. This can be a different issue, but I wanted to make sure this is known. I would wish to see that the image numbers are in order and not starting at a random number.

d33bs commented 1 year ago

Thank you for opening this issue @jenna-tomkinson ! From what I read, you highlighted two challenges in this issue. I'll try to organize these below and respond to them afterwards. Please don't hesitate to let me know if these align with what you're seeing. I'd be interested in getting your input on these.

  1. __The CytoTable parquet output column Metadata_ImageNumber is not sorted.__

I don't believe CytoTable currently sorts values of the output. The reason for the mixed sorting is due to indeterministic unsorted chunked concatenation of smaller tables. The values aren't currently sorted as this would require additional complexity and I wasn't certain about how opinionated CytoTable should be when it came to this.

That said, should the values be sorted in a particular way here? If so, which column(s) should be used and which sorting direction would be preferred (ascending, descending, etc)?

A pattern used in testing for sorting values and determining table equivalency is to sort by all columns ascendingly (see below).

table = parquet.read_table("file.parquet")
sorted_table = table.sort_by(
    [(name, "ascending") for name in test_table.column_names]
)

  1. __There are values missing from the CytoTable parquet output column Metadata_ImageNumber and Wells (unsure of specific column naming here).__

When performing a SQL INNER JOIN we're telling the engine to only return values for which join key values exist in both tables. While I can't be certain without seeing the dataset, I think there may be different Metadata_ImageNumber values in the Nuclei and Image tables being referenced, which would result in possible missing values. A LEFT or RIGHT JOIN would show missing values due to the same reasons. This webpage offers a good description (especially the visual, midway down the page) of how different SQL JOIN's operate.

Could you reference the dataset you're using so we may further troubleshoot this here?

Just in case, I wrote the following example using Pandas DataFrames and DuckDB SQL queries over these DataFrames to simulate / demonstrate the difference between SQL JOIN types. While these are DataFrames instead of SQLite tables being queried, the functionality acts the same.

import duckdb
import pandas as pd

# create an image dataframe from a dictionary
df_image = pd.DataFrame(
    {
        "ImageNumber": [1, 3, 4, 5],
        "Image_Feature_x": [0.1, 0.3, 0.4, 0.5],
    }
)

# create an nuclei dataframe from a dictionary
df_nuclei = pd.DataFrame(
    {
        "ImageNumber": [1, 2, 3, 4],
        "Nuclei_Feature_y": [0.5, 0.4, 0.3, 0.2],
    }
)

# show the initial dataframes
print(df_image, end="\n\n")
print(df_nuclei, end="\n\n")

# show the mismatching ImageNumber series
print(
    "df_image.ImageNumber equals df_nuclei.ImageNumber: ",
    df_image["ImageNumber"].equals(df_nuclei["ImageNumber"]),
    end="\n\n",
)

# demonstrate the result of an inner join
print("Example INNER JOIN:")
sql_stmt = """
SELECT
    image.ImageNumber,
    image.Image_Feature_x,
    nuclei.Nuclei_Feature_y
FROM df_image image
INNER JOIN df_nuclei AS nuclei ON
    nuclei.ImageNumber = image.ImageNumber
"""
print(
    duckdb.connect().execute(sql_stmt).df(),
    end="\n\n",
)

# demonstrate the result of a left join
print("Example LEFT JOIN:")
sql_stmt = """
SELECT
    image.ImageNumber,
    image.Image_Feature_x,
    nuclei.Nuclei_Feature_y
FROM df_image image
LEFT JOIN df_nuclei AS nuclei ON
    nuclei.ImageNumber = image.ImageNumber
"""
print(
    duckdb.connect().execute(sql_stmt).df(),
    end="\n\n",
)

# demonstrate the result of a right join
print("Example RIGHT JOIN:")
sql_stmt = """
SELECT
    image.ImageNumber,
    image.Image_Feature_x,
    nuclei.Nuclei_Feature_y
FROM df_image image
RIGHT JOIN df_nuclei AS nuclei ON
    nuclei.ImageNumber = image.ImageNumber
"""
print(
    duckdb.connect().execute(sql_stmt).df(),
    end="\n\n",
)
jenna-tomkinson commented 1 year ago

@d33bs

Thank you for the response!

Regarding the sorting of the values from ImageNumber column, that makes a lot of sense why the values aren't sorted in ascending order! I didn't know that and it isn't a huge deal, I was expecting to see ascending order but I can easily reorder the rows in ascending order after.

For the missing values issue, I have gone through the SQLite file a bunch of times and I am seeing that there are ImageNumbers associated to wells that are not included in the final conversion. It is even more weird because the number of rows matches that of the SQLite file so how is there missing values?

It is very confusing. The dataset comes from a private repository, and I had to rerun the analysis so the data wouldn't be available if I gave access. I think having an in-person/remote discussion would help.

Thank you for the response!

d33bs commented 1 year ago

Moving forward, I propose dedicating focus on mismatching or missing data in #86. This issue has some relevant discussion beyond that of #86 which could turn into new issues for: