Closed — shntnu closed this issue 7 months ago.
Hi @shntnu thanks for opening this issue (sorry to hear it's happening) and for the thorough notes. Aside, I'm so excited to hear about Parquet being used for Cell Painting datasets in the future!
I took a look at what you provided and tried to reproduce the findings. Here's what I noticed:
When performing a SQL-based JOIN operation with DuckDB, the result can include extra copies of columns when the same column names appear in multiple tables. This is standard SQL behavior, which preserves the full data output and leans on the developer to filter or rename columns explicitly. In DuckDB, this behavior is outlined here; the documentation mentions that an integer suffix is appended to every duplicate column name.
We can see an alignment in what you shared through the column names:
Metadata_TableNumber: string <-----
Metadata_ImageNumber: int64 <-----
...
Metadata_TableNumber_1: string <-----
Metadata_ImageNumber_1: int64 <-----
And also in the column values (notice the repeated values for 'dd77885d07028e67dc9bcaaba4df34c6', 1):
('dd77885d07028e67dc9bcaaba4df34c6', 1, 'A01', 'SQ00014613', 'dd77885d07028e67dc9bcaaba4df34c6', 1, ...
('1e5d8facac7508cfd4086f3e3e950182', 2, 'A01', 'SQ00014613', '1e5d8facac7508cfd4086f3e3e950182', 2, ...
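The suffixing behavior can be mimicked with a small pure-Python sketch (not DuckDB's actual implementation): each repeated column name gains an integer suffix.

```python
def dedupe(names):
    """Append an integer suffix to each repeated name, as duplicate join columns get."""
    seen, out = {}, []
    for n in names:
        if n in seen:
            seen[n] += 1
            out.append(f"{n}_{seen[n]}")
        else:
            seen[n] = 0
            out.append(n)
    return out

print(dedupe(["Metadata_TableNumber", "Metadata_ImageNumber",
              "Metadata_TableNumber", "Metadata_ImageNumber"]))
# ['Metadata_TableNumber', 'Metadata_ImageNumber',
#  'Metadata_TableNumber_1', 'Metadata_ImageNumber_1']
```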
My first instinct was to mention the above and a workaround which can be implemented within DuckDB SQL: the EXCLUDE clause. EXCLUDE allows you to specify certain columns which should be excluded from a star expansion (in this case, perhaps the cells.Metadata_TableNumber and cells.Metadata_ImageNumber columns). I found that I didn't need EXCLUDE to reach a successful run (see below).
I tried adding the code you provided to see what would happen within a Google Colab environment. Here's a link to the Google Colab notebook and a gist with similar content (as a backup).
I didn't see the same errors, which made me think that DuckDB, Arrow, or something in between might be operating differently depending on your environment. When you have the chance, could you share an environment lockfile, the output from pip freeze > env-freeze.txt, or similar to help observe any differences which may be contributing to this? It could also be that EXCLUDE could serve well here, but I'm not certain without more detail.
@d33bs Thank you so much for looking into this. I'll drop in my env right away and then follow up later in case I have more to add.
EXCLUDE sounds promising.
Thank you @shntnu ! I found that Google Colab was using duckdb==0.9.2 and your environment shows the use of duckdb==0.10.0. I imagine a quick workaround may be to pin duckdb==0.9.2. When you have the chance, would you mind testing this to confirm?
I'm digging into the specifics of the issue and giving EXCLUDE a try. I'll follow up with more detail as soon as I have a better understanding of this.
As a quick follow up, I found that EXCLUDE may work here using the following modified snippet. Relatedly, I'm finding that we're no longer passing tests with duckdb==0.10.0 in CytoTable, which seems to be for related reasons (a change in how SQL is processed for existing joins).
from cytotable import convert
import logging
logging.basicConfig(level=logging.ERROR)
identifying_cols = (
"TableNumber",
"ImageNumber",
"ObjectNumber",
"Metadata_Well",
"Metadata_Plate",
"Parent_Cells",
"Parent_Nuclei",
"Cytoplasm_Parent_Cells",
"Cytoplasm_Parent_Nuclei",
)
join_command = """
WITH Image_Filtered AS (
SELECT
Metadata_TableNumber,
Metadata_ImageNumber,
Image_Metadata_Well,
Image_Metadata_Plate
FROM
read_parquet('image.parquet')
)
SELECT
image.Metadata_TableNumber,
image.Metadata_ImageNumber,
image.Image_Metadata_Well,
image.Image_Metadata_Plate,
cells.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber
)
FROM
Image_Filtered AS image
LEFT JOIN read_parquet('cytoplasm.parquet') AS cytoplasm ON
cytoplasm.Metadata_TableNumber = image.Metadata_TableNumber
AND cytoplasm.Metadata_ImageNumber = image.Metadata_ImageNumber
LEFT JOIN read_parquet('cells.parquet') AS cells ON
cells.Metadata_TableNumber = cytoplasm.Metadata_TableNumber
AND cells.Metadata_ImageNumber = cytoplasm.Metadata_ImageNumber
AND cells.Metadata_ObjectNumber = cytoplasm.Metadata_Cytoplasm_Parent_Cells
LEFT JOIN read_parquet('nuclei.parquet') AS nuclei ON
nuclei.Metadata_TableNumber = cytoplasm.Metadata_TableNumber
AND nuclei.Metadata_ImageNumber = cytoplasm.Metadata_ImageNumber
AND nuclei.Metadata_ObjectNumber = cytoplasm.Metadata_Cytoplasm_Parent_Nuclei
"""
source_path = "test_SQ00014613.sqlite"
dest_path = "test_SQ00014613.parquet"
x = convert(
source_path=source_path,
dest_path=dest_path,
identifying_columns=identifying_cols,
dest_datatype="parquet",
chunk_size=5000,
preset="cell-health-cellprofiler-to-cytominer-database",
joins=join_command,
)
When you have the chance, would you mind testing this to confirm?
Thank you for looking this up. I can confirm that the code worked fine when I pegged it to v0.9.2 using mamba install python-duckdb=0.9.2:
sqlite3 test_SQ00014613.sqlite .schema | grep -E 'Cells_|Cytoplasm_|Nuclei_'| awk -F\" '{print $2}'|sort > feats_sqlite.txt
python -c "import polars as pl; df = pl.read_parquet('test_SQ00014613.parquet'); print('\n'.join([name for name in df.columns if name.startswith(('Cells', 'Cytoplasm', 'Nuclei'))]))"|sort > feats_parquet.txt
diff feats_sqlite.txt feats_parquet.txt
907,908d906
< Cytoplasm_Parent_Cells
< Cytoplasm_Parent_Nuclei
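The same schema check can also be done from Python's stdlib sqlite3 instead of shell tools. This sketch uses an in-memory stand-in table with made-up column names rather than the real test_SQ00014613.sqlite:

```python
import sqlite3

# In-memory stand-in for the real SQLite file; columns are illustrative.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE Cells ("
    "TableNumber TEXT, Cells_AreaShape_Area REAL, Cytoplasm_Parent_Cells BIGINT)"
)

# PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) rows;
# keep only feature columns, mirroring the grep/awk/sort pipeline above.
feature_cols = sorted(
    row[1]
    for row in con.execute("PRAGMA table_info(Cells)")
    if row[1].startswith(("Cells_", "Cytoplasm_", "Nuclei_"))
)
print(feature_cols)  # ['Cells_AreaShape_Area', 'Cytoplasm_Parent_Cells']
```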
I can also confirm that the EXCLUDE fix worked with v0.10, although it needed an extra exclude of Metadata_ObjectNumber to allow all compartments.
This gives us everything we need to proceed. Thank you @d33bs!
Some notes/questions:
- The API docs were not as helpful as the presets.py file in understanding how to configure.
- I am still confused about what columns to include in identifying_columns and what to exclude. E.g., Cells_ObjectNumber and Nuclei_ObjectNumber was not in there before, and it worked fine. But then I added it after seeing the presets.py file, and it still worked fine :)
- Should we create a new preset based on this config below and create a PR?
- Anything we should keep in mind as we attempt to convert ~3000 plates using this?

Note: Updated version is in https://github.com/cytomining/CytoTable/issues/163#issuecomment-2028053389
# Author: Zitong (Sam) Chen, Broad Institute, 2023
#
# Download sample SQLite file:
# wget https://raw.githubusercontent.com/d33bs/pycytominer/43cf984067700aa52f0b6752e3490d9e12d60170/tests/test_data/cytominer_database_example_data/test_SQ00014613.sqlite -O test_SQ00014613.sqlite
from cytotable import convert
import logging
logging.basicConfig(level=logging.ERROR)
identifying_cols = (
"TableNumber",
"ImageNumber",
"ObjectNumber",
"Metadata_Well",
"Metadata_Plate",
"Parent_Cells",
"Parent_Nuclei",
"Cytoplasm_Parent_Cells",
"Cytoplasm_Parent_Nuclei",
"Cells_ObjectNumber",
"Nuclei_ObjectNumber",
)
join_command = """
WITH Image_Filtered AS (
SELECT
Metadata_TableNumber,
Metadata_ImageNumber,
Image_Metadata_Well,
Image_Metadata_Plate
FROM
read_parquet('image.parquet')
)
SELECT
image.*,
cells.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber
),
nuclei.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber,
Metadata_ObjectNumber
),
cytoplasm.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber,
Metadata_ObjectNumber
),
FROM
Image_Filtered AS image
LEFT JOIN read_parquet('cytoplasm.parquet') AS cytoplasm ON
cytoplasm.Metadata_TableNumber = image.Metadata_TableNumber
AND cytoplasm.Metadata_ImageNumber = image.Metadata_ImageNumber
LEFT JOIN read_parquet('cells.parquet') AS cells ON
cells.Metadata_TableNumber = cytoplasm.Metadata_TableNumber
AND cells.Metadata_ImageNumber = cytoplasm.Metadata_ImageNumber
AND cells.Metadata_ObjectNumber = cytoplasm.Metadata_Cytoplasm_Parent_Cells
LEFT JOIN read_parquet('nuclei.parquet') AS nuclei ON
nuclei.Metadata_TableNumber = cytoplasm.Metadata_TableNumber
AND nuclei.Metadata_ImageNumber = cytoplasm.Metadata_ImageNumber
AND nuclei.Metadata_ObjectNumber = cytoplasm.Metadata_Cytoplasm_Parent_Nuclei
"""
source_path = "test_SQ00014613.sqlite"
dest_path = "test_SQ00014613.parquet"
x = convert(
source_path=source_path,
dest_path=dest_path,
identifying_columns=identifying_cols,
dest_datatype="parquet",
chunk_size=5000,
preset="cell-health-cellprofiler-to-cytominer-database",
joins=join_command,
)
Thanks @shntnu ! I've opened duckdb/duckdb#11157 as a result of our findings here.
Addressing your comments:
- The API docs were not as helpful as the presets.py file in understanding how to configure.
Thank you for the feedback here! Would you have any recommendations for how to improve or what you found most helpful as you found what you needed? Generally we hope to improve documentation via, for example, #25, but we could get more specific here.
- I am still confused about what columns to include in identifying_columns and what to exclude. E.g., Cells_ObjectNumber and Nuclei_ObjectNumber was not in there before, and it worked fine. But then I added it after seeing the presets.py file, and it still worked fine :)
identifying_columns are used to help CytoTable understand which columns should not be renamed during renaming operations. These are used as a type of flag, where if provided and they exist in the source data the columns are named differently. If an identifying column is provided and it doesn't exist in the source, nothing will happen. Generally, I feel that this functionality needs improvement in the form of documentation and procedural sequence (these occur before JOIN operations, so column names must currently be inferred from CytoTable instead of from the source). From a provenance standpoint, I also wonder if/how we could improve things here to make the functionality more automatic or induce less friction from decision making.
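As a loose sketch of the observed behavior (not CytoTable's actual implementation, and with hypothetical column names): entries in identifying_columns that exist in the source gain a "Metadata_" prefix, everything else keeps its name, and entries that don't exist in the source are simply ignored.

```python
identifying_cols = {"TableNumber", "ImageNumber", "Nonexistent_Column"}

def apply_metadata_prefix(source_columns, identifying):
    # Prefix only identifying columns that actually exist in the source;
    # identifying entries absent from the source have no effect.
    return [
        f"Metadata_{col}" if col in identifying else col
        for col in source_columns
    ]

renamed = apply_metadata_prefix(["TableNumber", "Cells_AreaShape_Area"], identifying_cols)
print(renamed)  # ['Metadata_TableNumber', 'Cells_AreaShape_Area']
```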
- Should we create a new preset based on this config below and create a PR?
My feeling is that we need feedback on whether the new behavior in duckdb==0.10.0 is expected and will remain consistent in the future. I think we should treat this as a bug in the new duckdb release (until proven otherwise) because I couldn't find documentation to fully support the new behavior. I recommend pinning to duckdb==0.9.2 when using cytotable==0.0.4 until we have more information. Something I don't know based on the data you plan to send through CytoTable: would this enable you to use the existing join SQL from the preset? If it doesn't, please don't hesitate to advise about or open a PR to add a new preset which would more directly meet the needs.
- Anything we should keep in mind as we attempt to convert ~3000 plates using this?
- Consider continuing with duckdb==0.9.2, as mentioned above.
- Consider increasing your chunk_size to a higher number to improve time duration performance, contingent on what might work best for the data involved. What would work best will depend on the amount of system memory available where you run CytoTable, the shape of the source data (row and column size), and the complexity of the join operations. It may be worth a quick test using chunk sizes [10000, 100000, 500000, 1000000] to see what might perform the best.

Hi @shntnu - thanks to review from @gwaybio in #166 and #167, we have now released a fix for the duckdb==0.10.0 issues by setting a version constraint within the latest version of CytoTable, cytotable==0.0.5 (avoiding the need to manually pin duckdb). Please feel free to use this new release to help address the challenges you were facing in this issue.
I'd like to leave this issue open for the time being to acknowledge that we will eventually need to update the constraint once a fix is available in a new duckdb release. It looks like a fix might land in upcoming releases, per the linked items in https://github.com/duckdb/duckdb/issues/11157.
Thank you @d33bs – I can't believe how quickly this got addressed, both by you as well as the duckdb team!
Thanks as well for all your notes in https://github.com/cytomining/CytoTable/issues/163#issuecomment-1997734313. I'll read those carefully and get back to you.
Hi @shntnu - as a heads up, in addition to what I mentioned earlier, I recommend using cytotable==0.0.6 to help address issues with memory during post-join concatenation (as per #168). I've also noticed there may be issues with the default Parsl executor for CytoTable, the HighThroughputExecutor (HTE), documented as part of #169 (these errors may have resulted from the HTE or perhaps from resource constraints, but I'm not certain at this time).
As a result, I might recommend using the ThreadPoolExecutor instead which may be configured as follows:
import cytotable
import parsl
from parsl.config import Config
from parsl.executors import ThreadPoolExecutor
cytotable.convert(
...
parsl_config=parsl.load(
Config(
executors=[
ThreadPoolExecutor(
# set maximum number of threads at any time, for example 3.
# if not set, the default is 2.
max_threads=3,
)
]
)
),
)
@d33bs -- just a heads up that I haven't been able to return to this and I might need to push it out a couple of weeks. Would that block you?
Thank you again for the quick and thorough response!
Thank you @shntnu for the updates; I don't feel blocked here.
Once #174 is merged, I feel we should move the remaining questions/considerations here into new issues to help uncover any additional insights which may benefit your work with CytoTable (the original challenges with DuckDB and Arrow will, I feel, have been resolved). The following are issues / additional focus areas I can think of related to our discussion here (please feel free to add / suggest / etc):
Closing this out with the creation of #176 and merge of #174
Thank you for the feedback here! Would you have any recommendations for how to improve or what you found most helpful as you found what you needed? Generally we hope to improve documentation via, for example, #25, but we could get more specific here.
I found this hard to pin down, but in general I'd say it's the difference between experiential learning vs. cognitive learning. Examples make things more concrete, allow you to apply them immediately, help you learn the API within a real-world context, and boy do they reduce cognitive load :D API docs pair well if they are concise and complete, so it's perfectly fine that I couldn't get it all from the API docs. I've forgotten the context to be more specific, but I hope that helps.
`identifying_columns` are used to help CytoTable understand which columns should not be renamed during renaming operations.
Could you clarify what these renaming operations are? Maybe the notes below will help explain my confusion.
The exercise made me realize that our lab should carefully think through the join because a lot could go wrong (this is not related to CytoTable per se)
That is, we should think through this snippet below and revise based on how our data are structured.
- The first join (with image) should be an inner join, I think. We don't want rows that are image-only.
- image should be joined with nuclei first because of the way the cells, nuclei, cytoplasm object hierarchy is structured. That is, join image with nuclei, then cytoplasm to nuclei, and cells to nuclei. (In the snippet here, we are anchoring on cytoplasm.)

WITH Image_Filtered AS (
SELECT
Metadata_TableNumber,
Metadata_ImageNumber,
Image_Metadata_Well,
Image_Metadata_Plate
FROM
read_parquet('image.parquet')
)
SELECT
image.*,
cells.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber
),
nuclei.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber,
Metadata_ObjectNumber
),
cytoplasm.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber,
Metadata_ObjectNumber
),
FROM
Image_Filtered AS image
LEFT JOIN read_parquet('cytoplasm.parquet') AS cytoplasm ON
cytoplasm.Metadata_TableNumber = image.Metadata_TableNumber
AND cytoplasm.Metadata_ImageNumber = image.Metadata_ImageNumber
LEFT JOIN read_parquet('cells.parquet') AS cells ON
cells.Metadata_TableNumber = cytoplasm.Metadata_TableNumber
AND cells.Metadata_ImageNumber = cytoplasm.Metadata_ImageNumber
AND cells.Metadata_ObjectNumber = cytoplasm.Metadata_Cytoplasm_Parent_Cells
LEFT JOIN read_parquet('nuclei.parquet') AS nuclei ON
nuclei.Metadata_TableNumber = cytoplasm.Metadata_TableNumber
AND nuclei.Metadata_ImageNumber = cytoplasm.Metadata_ImageNumber
AND nuclei.Metadata_ObjectNumber = cytoplasm.Metadata_Cytoplasm_Parent_Nuclei
I ran the code below not knowing what I should include, and then I checked the output to see which of the columns were present or renamed:
[
"TableNumber",
"ImageNumber",
"ObjectNumber",
"Metadata_Well",
"Metadata_Plate",
"Parent_Cells",
"Parent_Nuclei",
"Cytoplasm_Parent_Cells",
"Cytoplasm_Parent_Nuclei",
"Cells_ObjectNumber",
"Nuclei_ObjectNumber"
]
{
"Metadata_TableNumber": true,
"Metadata_ImageNumber": true,
"Metadata_ObjectNumber": true,
"Metadata_Cytoplasm_Parent_Cells": true,
"Metadata_Cytoplasm_Parent_Nuclei": true
}
{
"Image_Metadata_Well": true,
"Image_Metadata_Plate": true
}
{}
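A presence check like the one above can be sketched in plain Python (the column lists here are illustrative, not the full output schema):

```python
# For each identifying column, record whether a "Metadata_"-prefixed
# version appears in the output schema.
output_cols = {"Metadata_TableNumber", "Metadata_ImageNumber", "Image_Metadata_Well"}
identifying = ["TableNumber", "ImageNumber", "Parent_Cells"]

present = {f"Metadata_{c}": f"Metadata_{c}" in output_cols for c in identifying}
print(present)
# {'Metadata_TableNumber': True, 'Metadata_ImageNumber': True,
#  'Metadata_Parent_Cells': False}
```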
From that, it looks like these columns are not present in any of the tables in the SQLite, which is true
"Parent_Cells",
"Parent_Nuclei",
"Cells_ObjectNumber",
"Nuclei_ObjectNumber"
sqlite3 test_SQ00014613.sqlite .schema|grep -e "Parent_Cells" -e "Parent_Nuclei" -e "Cells_ObjectNumber" -e "Nuclei_ObjectNumber"
"Cytoplasm_Parent_Cells" BIGINT NOT NULL,
"Cytoplasm_Parent_Nuclei" BIGINT NOT NULL,
"Cells_Parent_Nuclei" BIGINT NOT NULL,
So that means I should change my cols to
identifying_cols = (
"TableNumber",
"ImageNumber",
"ObjectNumber",
"Metadata_Well",
"Metadata_Plate",
"Cytoplasm_Parent_Cells",
"Cytoplasm_Parent_Nuclei",
)
Please feel free to use this new release to help address the challenges you were facing in this issue.
It worked great!
I factored in all your advice in https://github.com/cytomining/CytoTable/issues/163#issuecomment-1997734313 to create a new script
# Author: Zitong (Sam) Chen, Broad Institute, 2023
#
# Download sample SQLite file:
# wget https://raw.githubusercontent.com/d33bs/pycytominer/43cf984067700aa52f0b6752e3490d9e12d60170/tests/test_data/cytominer_database_example_data/test_SQ00014613.sqlite -O test_SQ00014613.sqlite
from cytotable import convert
import logging
import parsl
from parsl.config import Config
from parsl.executors import ThreadPoolExecutor
logging.basicConfig(level=logging.ERROR)
identifying_cols = (
"TableNumber",
"ImageNumber",
"ObjectNumber",
"Metadata_Well",
"Metadata_Plate",
"Cytoplasm_Parent_Nuclei",
"Cells_Parent_Nuclei",
)
join_command = """
WITH Image_Filtered AS (
SELECT
Metadata_TableNumber,
Metadata_ImageNumber,
Image_Metadata_Well,
Image_Metadata_Plate
FROM
read_parquet('image.parquet')
)
SELECT
image.*,
nuclei.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber
),
cells.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber,
Metadata_ObjectNumber
),
cytoplasm.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber,
Metadata_ObjectNumber
),
FROM
Image_Filtered AS image
INNER JOIN read_parquet('nuclei.parquet') AS nuclei ON
nuclei.Metadata_TableNumber = image.Metadata_TableNumber
AND nuclei.Metadata_ImageNumber = image.Metadata_ImageNumber
INNER JOIN read_parquet('cytoplasm.parquet') AS cytoplasm ON
cytoplasm.Metadata_TableNumber = image.Metadata_TableNumber
AND cytoplasm.Metadata_ImageNumber = image.Metadata_ImageNumber
AND cytoplasm.Metadata_Cytoplasm_Parent_Nuclei = nuclei.Metadata_ObjectNumber
INNER JOIN read_parquet('cells.parquet') AS cells ON
cells.Metadata_TableNumber = cytoplasm.Metadata_TableNumber
AND cells.Metadata_ImageNumber = cytoplasm.Metadata_ImageNumber
AND cells.Metadata_Cells_Parent_Nuclei = nuclei.Metadata_ObjectNumber
"""
source_path = "test_SQ00014613.sqlite"
dest_path = "test_SQ00014613.parquet"
x = convert(
source_path=source_path,
dest_path=dest_path,
identifying_columns=identifying_cols,
dest_datatype="parquet",
chunk_size=5000,
preset="cell-health-cellprofiler-to-cytominer-database",
joins=join_command,
parsl_config=parsl.load(
Config(
executors=[
ThreadPoolExecutor(
# set maximum number of threads at any time, for example 3.
# if not set, the default is 2.
max_threads=3,
)
]
)
),
)
This does not factor in this advice:
- Consider increasing your chunk_size to a higher number to improve time duration performance, contingent on what might work best for the data involved. What would work best will depend on the amount of system memory available where you run CytoTable, the shape of the source data (row and column size), and the complexity of the join operations. It may be worth a quick test using chunk sizes [10000, 100000, 500000, 1000000] to see what might perform the best.
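That sweep could be scripted along these lines. This is a sketch: convert_fn stands in for a wrapper around the real conversion call and is an assumption, not CytoTable API.

```python
import time

def fastest_chunk_size(convert_fn, chunk_sizes=(10_000, 100_000, 500_000, 1_000_000)):
    """Time convert_fn at each chunk size and return the fastest one."""
    timings = {}
    for size in chunk_sizes:
        start = time.perf_counter()
        convert_fn(chunk_size=size)  # e.g. a partial of cytotable.convert
        timings[size] = time.perf_counter() - start
    return min(timings, key=timings.get)

# Example with a dummy stand-in for the real conversion call:
best = fastest_chunk_size(lambda chunk_size: None)
print(best in (10_000, 100_000, 500_000, 1_000_000))  # True
```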
We are planning to convert several SQLite files to Parquet, and (potentially) make the joined Parquet the de facto standard for Cell Painting datasets in the future.
I get this error cytotable_error.txt when running the code below (CytoTable==0.0.4). There's something funky happening; from the log:
There is no error if I change
to
I adapted the script that @Zitong-Chen-16 had written; it seemed to have worked for her
https://github.com/broadinstitute/2021_09_01_VarChAMP/blob/main/6.downstream_analysis/scripts/0.convert_to_parquet.py
The script errors out after producing the [cells|nuclei|cytoplasm|image].parquet files. So I took advantage of this and poked around to see if the join query works when I try to do it directly like this below; it does.

Output: