Closed — shntnu closed this issue 7 months ago.
Hi @shntnu thanks for opening this issue (sorry to hear it's happening) and for the thorough notes. Aside, I'm so excited to hear about Parquet being used for Cell Painting datasets in the future!
I took a look at what you provided and tried to reproduce the findings. Here's what I noticed:
When performing a SQL-based JOIN operation with DuckDB, the result can include extra copies of columns when the same column names appear in multiple tables. This is standard SQL behavior, which preserves the full data output and leans on the developer to filter or rename columns explicitly. In DuckDB, this behavior is outlined here; the documentation mentions that an integer suffix is appended to every duplicate column name.
We can see an alignment in what you shared through the column names:
Metadata_TableNumber: string <-----
Metadata_ImageNumber: int64 <-----
...
Metadata_TableNumber_1: string <-----
Metadata_ImageNumber_1: int64 <-----
And also in the column values (notice the repeated values for 'dd77885d07028e67dc9bcaaba4df34c6', 1):
('dd77885d07028e67dc9bcaaba4df34c6', 1, 'A01', 'SQ00014613', 'dd77885d07028e67dc9bcaaba4df34c6', 1, ...
('1e5d8facac7508cfd4086f3e3e950182', 2, 'A01', 'SQ00014613', '1e5d8facac7508cfd4086f3e3e950182', 2, ...
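The suffixing behavior can be mimicked with a small pure-Python sketch (not DuckDB's actual implementation): each repeated column name gains an integer suffix.

```python
def dedupe(names):
    """Append an integer suffix to each repeated name, as duplicate join columns get."""
    seen, out = {}, []
    for n in names:
        if n in seen:
            seen[n] += 1
            out.append(f"{n}_{seen[n]}")
        else:
            seen[n] = 0
            out.append(n)
    return out

print(dedupe(["Metadata_TableNumber", "Metadata_ImageNumber",
              "Metadata_TableNumber", "Metadata_ImageNumber"]))
# ['Metadata_TableNumber', 'Metadata_ImageNumber',
#  'Metadata_TableNumber_1', 'Metadata_ImageNumber_1']
```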
My first instinct was to mention the above and a workaround which can be implemented within DuckDB SQL: the EXCLUDE clause. EXCLUDE allows you to specify certain columns which should be excluded from a star expansion (in this case, perhaps the cells.Metadata_TableNumber and cells.Metadata_ImageNumber columns). I found that I didn't need EXCLUDE to reach a successful run (see below).
I tried adding the code you provided to see what would happen within a Google Colab environment. Here's a link to the Google Colab notebook and a gist with similar content (as a backup).
I didn't see the same errors, which made me think that DuckDB, Arrow, or something in between might be operating differently depending on your environment. When you have the chance, could you share an environment lockfile, the output from pip freeze > env-freeze.txt, or similar to help observe any differences which may be contributing to this? It could also be that EXCLUDE could serve well here, but I'm not certain without more detail.
@d33bs Thank you so much for looking into this. I'll drop in my env right away and then follow up later in case I have more to add.
EXCLUDE sounds promising.
Thank you @shntnu ! I found that Google Colab was using duckdb==0.9.2 and your environment shows the use of duckdb==0.10.0. I imagine a quick workaround may be to pin duckdb==0.9.2. When you have the chance, would you mind testing this to confirm?
I'm digging into the specifics of the issue and giving EXCLUDE a try. I'll follow up with more detail as soon as I have a better understanding of this.
As a quick follow up, I found that EXCLUDE may work here using the following modified snippet. Relatedly, I'm finding that we're no longer passing tests with duckdb==0.10.0 in CytoTable, which seems to be for related reasons (a change in how SQL is processed for existing joins).
from cytotable import convert
import logging
logging.basicConfig(level=logging.ERROR)
identifying_cols = (
"TableNumber",
"ImageNumber",
"ObjectNumber",
"Metadata_Well",
"Metadata_Plate",
"Parent_Cells",
"Parent_Nuclei",
"Cytoplasm_Parent_Cells",
"Cytoplasm_Parent_Nuclei",
)
join_command = """
WITH Image_Filtered AS (
SELECT
Metadata_TableNumber,
Metadata_ImageNumber,
Image_Metadata_Well,
Image_Metadata_Plate
FROM
read_parquet('image.parquet')
)
SELECT
image.Metadata_TableNumber,
image.Metadata_ImageNumber,
image.Image_Metadata_Well,
image.Image_Metadata_Plate,
cells.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber
)
FROM
Image_Filtered AS image
LEFT JOIN read_parquet('cytoplasm.parquet') AS cytoplasm ON
cytoplasm.Metadata_TableNumber = image.Metadata_TableNumber
AND cytoplasm.Metadata_ImageNumber = image.Metadata_ImageNumber
LEFT JOIN read_parquet('cells.parquet') AS cells ON
cells.Metadata_TableNumber = cytoplasm.Metadata_TableNumber
AND cells.Metadata_ImageNumber = cytoplasm.Metadata_ImageNumber
AND cells.Metadata_ObjectNumber = cytoplasm.Metadata_Cytoplasm_Parent_Cells
LEFT JOIN read_parquet('nuclei.parquet') AS nuclei ON
nuclei.Metadata_TableNumber = cytoplasm.Metadata_TableNumber
AND nuclei.Metadata_ImageNumber = cytoplasm.Metadata_ImageNumber
AND nuclei.Metadata_ObjectNumber = cytoplasm.Metadata_Cytoplasm_Parent_Nuclei
"""
source_path = "test_SQ00014613.sqlite"
dest_path = "test_SQ00014613.parquet"
x = convert(
source_path=source_path,
dest_path=dest_path,
identifying_columns=identifying_cols,
dest_datatype="parquet",
chunk_size=5000,
preset="cell-health-cellprofiler-to-cytominer-database",
joins=join_command,
)
When you have the chance, would you mind testing this to confirm?
Thank you for looking this up. I can confirm that the code worked fine when I pegged it to v0.9.2 using mamba install python-duckdb=0.9.2:
sqlite3 test_SQ00014613.sqlite .schema | grep -E 'Cells_|Cytoplasm_|Nuclei_'| awk -F\" '{print $2}'|sort > feats_sqlite.txt
python -c "import polars as pl; df = pl.read_parquet('test_SQ00014613.parquet'); print('\n'.join([name for name in df.columns if name.startswith(('Cells', 'Cytoplasm', 'Nuclei'))]))"|sort > feats_parquet.txt
diff feats_sqlite.txt feats_parquet.txt
907,908d906
< Cytoplasm_Parent_Cells
< Cytoplasm_Parent_Nuclei
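The same schema check can also be done from Python's stdlib sqlite3 instead of shell tools. This sketch uses an in-memory stand-in table with made-up column names rather than the real test_SQ00014613.sqlite:

```python
import sqlite3

# In-memory stand-in for the real SQLite file; columns are illustrative.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE Cells ("
    "TableNumber TEXT, Cells_AreaShape_Area REAL, Cytoplasm_Parent_Cells BIGINT)"
)

# PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) rows;
# keep only feature columns, mirroring the grep/awk/sort pipeline above.
feature_cols = sorted(
    row[1]
    for row in con.execute("PRAGMA table_info(Cells)")
    if row[1].startswith(("Cells_", "Cytoplasm_", "Nuclei_"))
)
print(feature_cols)  # ['Cells_AreaShape_Area', 'Cytoplasm_Parent_Cells']
```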
I can also confirm that the EXCLUDE fix worked with v0.10, although it needed an extra exclude of Metadata_ObjectNumber to allow all compartments.
This gives us everything we need to proceed. Thank you @d33bs!
Some notes/questions:
- The API docs were not as helpful as the presets.py file in understanding how to configure.
- I am still confused about what columns to include in identifying_columns and what to exclude. E.g., Cells_ObjectNumber and Nuclei_ObjectNumber was not in there before, and it worked fine. But then I added it after seeing the presets.py file, and it still worked fine :)
- Should we create a new preset based on this config below and create a PR?
- Anything we should keep in mind as we attempt to convert ~3000 plates using this?

Note: Updated version is in https://github.com/cytomining/CytoTable/issues/163#issuecomment-2028053389
# Author: Zitong (Sam) Chen, Broad Institute, 2023
#
# Download sample SQLite file:
# wget https://raw.githubusercontent.com/d33bs/pycytominer/43cf984067700aa52f0b6752e3490d9e12d60170/tests/test_data/cytominer_database_example_data/test_SQ00014613.sqlite -O test_SQ00014613.sqlite
from cytotable import convert
import logging
logging.basicConfig(level=logging.ERROR)
identifying_cols = (
"TableNumber",
"ImageNumber",
"ObjectNumber",
"Metadata_Well",
"Metadata_Plate",
"Parent_Cells",
"Parent_Nuclei",
"Cytoplasm_Parent_Cells",
"Cytoplasm_Parent_Nuclei",
"Cells_ObjectNumber",
"Nuclei_ObjectNumber",
)
join_command = """
WITH Image_Filtered AS (
SELECT
Metadata_TableNumber,
Metadata_ImageNumber,
Image_Metadata_Well,
Image_Metadata_Plate
FROM
read_parquet('image.parquet')
)
SELECT
image.*,
cells.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber
),
nuclei.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber,
Metadata_ObjectNumber
),
cytoplasm.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber,
Metadata_ObjectNumber
),
FROM
Image_Filtered AS image
LEFT JOIN read_parquet('cytoplasm.parquet') AS cytoplasm ON
cytoplasm.Metadata_TableNumber = image.Metadata_TableNumber
AND cytoplasm.Metadata_ImageNumber = image.Metadata_ImageNumber
LEFT JOIN read_parquet('cells.parquet') AS cells ON
cells.Metadata_TableNumber = cytoplasm.Metadata_TableNumber
AND cells.Metadata_ImageNumber = cytoplasm.Metadata_ImageNumber
AND cells.Metadata_ObjectNumber = cytoplasm.Metadata_Cytoplasm_Parent_Cells
LEFT JOIN read_parquet('nuclei.parquet') AS nuclei ON
nuclei.Metadata_TableNumber = cytoplasm.Metadata_TableNumber
AND nuclei.Metadata_ImageNumber = cytoplasm.Metadata_ImageNumber
AND nuclei.Metadata_ObjectNumber = cytoplasm.Metadata_Cytoplasm_Parent_Nuclei
"""
source_path = "test_SQ00014613.sqlite"
dest_path = "test_SQ00014613.parquet"
x = convert(
source_path=source_path,
dest_path=dest_path,
identifying_columns=identifying_cols,
dest_datatype="parquet",
chunk_size=5000,
preset="cell-health-cellprofiler-to-cytominer-database",
joins=join_command,
)
Thanks @shntnu ! I've opened duckdb/duckdb#11157 as a result of our findings here.
Addressing your comments:
- The API docs were not as helpful as the presets.py file in understanding how to configure.
Thank you for the feedback here! Would you have any recommendations for how to improve or what you found most helpful as you found what you needed? Generally we hope to improve documentation via, for example, #25, but we could get more specific here.
- I am still confused about what columns to include in identifying_columns and what to exclude. E.g., Cells_ObjectNumber and Nuclei_ObjectNumber was not in there before, and it worked fine. But then I added it after seeing the presets.py file, and it still worked fine :)
identifying_columns are used to help CytoTable understand which columns should not be renamed during renaming operations. These are used as a type of flag, where if provided and they exist in the source data the columns are named differently. If an identifying column is provided and it doesn't exist in the source, nothing will happen. Generally, I feel that this functionality needs improvement in the form of documentation and procedural sequence (these occur before JOIN operations, so column names must currently be inferred from CytoTable instead of from the source). From a provenance standpoint, I also wonder if/how we could improve things here to make the functionality more automatic or induce less friction from decision making.
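As a loose sketch of the observed behavior (not CytoTable's actual implementation, and with hypothetical column names): entries in identifying_columns that exist in the source gain a "Metadata_" prefix, everything else keeps its name, and entries that don't exist in the source are simply ignored.

```python
identifying_cols = {"TableNumber", "ImageNumber", "Nonexistent_Column"}

def apply_metadata_prefix(source_columns, identifying):
    # Prefix only identifying columns that actually exist in the source;
    # identifying entries absent from the source have no effect.
    return [
        f"Metadata_{col}" if col in identifying else col
        for col in source_columns
    ]

renamed = apply_metadata_prefix(["TableNumber", "Cells_AreaShape_Area"], identifying_cols)
print(renamed)  # ['Metadata_TableNumber', 'Cells_AreaShape_Area']
```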
- Should we create a new preset based on this config below and create a PR?
My feeling is that we need feedback on whether the new behavior in duckdb==0.10.0 is expected and will remain consistent in the future. I think we should treat this as a bug in the new duckdb release (until proven otherwise) because I couldn't find documentation to fully support the new behavior. I recommend pinning to duckdb==0.9.2 when using cytotable==0.0.4 until we have more information. Something I don't know based on the data you plan to send through CytoTable: would this enable you to use the existing join SQL from the preset? If it doesn't, please don't hesitate to advise about or open a PR to add a new preset which would more directly meet the needs.
- Anything we should keep in mind as we attempt to convert ~3000 plates using this?
- Consider continuing with duckdb==0.9.2, as mentioned above.
- Consider increasing your chunk_size to a higher number to improve time duration performance, contingent on what might work best for the data involved. What would work best will depend on the amount of system memory available where you run CytoTable, the shape of the source data (row and column size), and the complexity of the join operations. It may be worth a quick test using chunk sizes [10000, 100000, 500000, 1000000] to see what might perform the best.

Hi @shntnu - thanks to review from @gwaybio in #166 and #167, we have now released a fix for the duckdb==0.10.0 issues by setting a version constraint within the latest version of CytoTable, cytotable==0.0.5 (avoiding the need to manually pin duckdb). Please feel free to use this new release to help address the challenges you were facing in this issue.
I'd like to leave this issue open for the time being to acknowledge that we will eventually need to update the constraint once a fix is available in a new duckdb release. It looks like a fix might land in upcoming releases, per the linked items in https://github.com/duckdb/duckdb/issues/11157.
Thank you @d33bs – I can't believe how quickly this got addressed, both by you as well as the duckdb team!
Thanks as well for all your notes in https://github.com/cytomining/CytoTable/issues/163#issuecomment-1997734313. I'll read those carefully and get back to you.
Hi @shntnu - as a heads up, in addition to what I mentioned earlier, I recommend using cytotable==0.0.6 to help address issues with memory during post-join concatenation (as per #168). I've also noticed there may be issues with the default Parsl executor for CytoTable, the HighThroughputExecutor (HTE), documented as part of #169 (these errors may have resulted from the HTE or perhaps from resource constraints, but I'm not certain at this time).
As a result, I might recommend using the ThreadPoolExecutor instead which may be configured as follows:
import cytotable
import parsl
from parsl.config import Config
from parsl.executors import ThreadPoolExecutor
cytotable.convert(
...
parsl_config=parsl.load(
Config(
executors=[
ThreadPoolExecutor(
# set maximum number of threads at any time, for example 3.
# if not set, the default is 2.
max_threads=3,
)
]
)
),
)
@d33bs -- just a heads up that I haven't been able to return to this and I might need to push it out a couple of weeks. Would that block you?
Thank you again for the quick and thorough response!
Thank you @shntnu for the updates; I don't feel blocked here.
Once #174 is merged, I feel we should move the remaining questions/considerations here into new issues to help uncover any additional insights which may benefit your work with CytoTable (the original challenges with DuckDB and Arrow will, I feel, have been resolved). The following are issues / additional focus areas I can think of related to our discussion here (please feel free to add / suggest / etc):
Closing this out with the creation of #176 and merge of #174
Thank you for the feedback here! Would you have any recommendations for how to improve or what you found most helpful as you found what you needed? Generally we hope to improve documentation via, for example, #25, but we could get more specific here.
I found this hard to pin down, but in general I'd say it's the difference between experiential learning vs. cognitive learning. Examples make things more concrete, allow you to apply them immediately, help you learn the API within a real-world context, and boy do they reduce cognitive load :D API docs pair well if they are concise and complete, so it's perfectly fine that I couldn't get it all from the API docs. I've forgotten the context to be more specific, but I hope that helps.
`identifying_columns` are used to help CytoTable understand which columns should not be renamed during renaming operations.
Could you clarify what these renaming operations are? Maybe the notes below will help explain my confusion.
The exercise made me realize that our lab should carefully think through the join because a lot could go wrong (this is not related to CytoTable per se)
That is, we should think through this snippet below and revise based on how our data are structured.
- The first join (with image) should be an inner join, I think. We don't want rows that are image-only.
- image should be joined with nuclei first because of the way the cells, nuclei, cytoplasm object hierarchy is structured. That is, join image with nuclei, then cytoplasm to nuclei, and cells to nuclei. (In the snippet here, we are anchoring on cytoplasm.)

WITH Image_Filtered AS (
SELECT
Metadata_TableNumber,
Metadata_ImageNumber,
Image_Metadata_Well,
Image_Metadata_Plate
FROM
read_parquet('image.parquet')
)
SELECT
image.*,
cells.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber
),
nuclei.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber,
Metadata_ObjectNumber
),
cytoplasm.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber,
Metadata_ObjectNumber
),
FROM
Image_Filtered AS image
LEFT JOIN read_parquet('cytoplasm.parquet') AS cytoplasm ON
cytoplasm.Metadata_TableNumber = image.Metadata_TableNumber
AND cytoplasm.Metadata_ImageNumber = image.Metadata_ImageNumber
LEFT JOIN read_parquet('cells.parquet') AS cells ON
cells.Metadata_TableNumber = cytoplasm.Metadata_TableNumber
AND cells.Metadata_ImageNumber = cytoplasm.Metadata_ImageNumber
AND cells.Metadata_ObjectNumber = cytoplasm.Metadata_Cytoplasm_Parent_Cells
LEFT JOIN read_parquet('nuclei.parquet') AS nuclei ON
nuclei.Metadata_TableNumber = cytoplasm.Metadata_TableNumber
AND nuclei.Metadata_ImageNumber = cytoplasm.Metadata_ImageNumber
AND nuclei.Metadata_ObjectNumber = cytoplasm.Metadata_Cytoplasm_Parent_Nuclei
I ran the code below not knowing what I should include, and then I checked the output to see which of the columns were present or renamed:
[
"TableNumber",
"ImageNumber",
"ObjectNumber",
"Metadata_Well",
"Metadata_Plate",
"Parent_Cells",
"Parent_Nuclei",
"Cytoplasm_Parent_Cells",
"Cytoplasm_Parent_Nuclei",
"Cells_ObjectNumber",
"Nuclei_ObjectNumber"
]
{
"Metadata_TableNumber": true,
"Metadata_ImageNumber": true,
"Metadata_ObjectNumber": true,
"Metadata_Cytoplasm_Parent_Cells": true,
"Metadata_Cytoplasm_Parent_Nuclei": true
}
{
"Image_Metadata_Well": true,
"Image_Metadata_Plate": true
}
{}
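A presence check like the one above can be sketched in plain Python (the column lists here are illustrative, not the full output schema):

```python
# For each identifying column, record whether a "Metadata_"-prefixed
# version appears in the output schema.
output_cols = {"Metadata_TableNumber", "Metadata_ImageNumber", "Image_Metadata_Well"}
identifying = ["TableNumber", "ImageNumber", "Parent_Cells"]

present = {f"Metadata_{c}": f"Metadata_{c}" in output_cols for c in identifying}
print(present)
# {'Metadata_TableNumber': True, 'Metadata_ImageNumber': True,
#  'Metadata_Parent_Cells': False}
```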
From that, it looks like these columns are not present in any of the tables in the SQLite, which is true
"Parent_Cells",
"Parent_Nuclei",
"Cells_ObjectNumber",
"Nuclei_ObjectNumber"
sqlite3 test_SQ00014613.sqlite .schema|grep -e "Parent_Cells" -e "Parent_Nuclei" -e "Cells_ObjectNumber" -e "Nuclei_ObjectNumber"
"Cytoplasm_Parent_Cells" BIGINT NOT NULL,
"Cytoplasm_Parent_Nuclei" BIGINT NOT NULL,
"Cells_Parent_Nuclei" BIGINT NOT NULL,
So that means I should change my cols to
identifying_cols = (
"TableNumber",
"ImageNumber",
"ObjectNumber",
"Metadata_Well",
"Metadata_Plate",
"Cytoplasm_Parent_Cells",
"Cytoplasm_Parent_Nuclei",
)
Please feel free to use this new release to help address the challenges you were facing in this issue.
It worked great!
I factored in all your advice in https://github.com/cytomining/CytoTable/issues/163#issuecomment-1997734313 to create a new script
# Author: Zitong (Sam) Chen, Broad Institute, 2023
#
# Download sample SQLite file:
# wget https://raw.githubusercontent.com/d33bs/pycytominer/43cf984067700aa52f0b6752e3490d9e12d60170/tests/test_data/cytominer_database_example_data/test_SQ00014613.sqlite -O test_SQ00014613.sqlite
from cytotable import convert
import logging
import parsl
from parsl.config import Config
from parsl.executors import ThreadPoolExecutor
logging.basicConfig(level=logging.ERROR)
identifying_cols = (
"TableNumber",
"ImageNumber",
"ObjectNumber",
"Metadata_Well",
"Metadata_Plate",
"Cytoplasm_Parent_Nuclei",
"Cells_Parent_Nuclei",
)
join_command = """
WITH Image_Filtered AS (
SELECT
Metadata_TableNumber,
Metadata_ImageNumber,
Image_Metadata_Well,
Image_Metadata_Plate
FROM
read_parquet('image.parquet')
)
SELECT
image.*,
nuclei.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber
),
cells.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber,
Metadata_ObjectNumber
),
cytoplasm.* EXCLUDE(
Metadata_TableNumber,
Metadata_ImageNumber,
Metadata_ObjectNumber
),
FROM
Image_Filtered AS image
INNER JOIN read_parquet('nuclei.parquet') AS nuclei ON
nuclei.Metadata_TableNumber = image.Metadata_TableNumber
AND nuclei.Metadata_ImageNumber = image.Metadata_ImageNumber
INNER JOIN read_parquet('cytoplasm.parquet') AS cytoplasm ON
cytoplasm.Metadata_TableNumber = image.Metadata_TableNumber
AND cytoplasm.Metadata_ImageNumber = image.Metadata_ImageNumber
AND cytoplasm.Metadata_Cytoplasm_Parent_Nuclei = nuclei.Metadata_ObjectNumber
INNER JOIN read_parquet('cells.parquet') AS cells ON
cells.Metadata_TableNumber = cytoplasm.Metadata_TableNumber
AND cells.Metadata_ImageNumber = cytoplasm.Metadata_ImageNumber
AND cells.Metadata_Cells_Parent_Nuclei = nuclei.Metadata_ObjectNumber
"""
source_path = "test_SQ00014613.sqlite"
dest_path = "test_SQ00014613.parquet"
x = convert(
source_path=source_path,
dest_path=dest_path,
identifying_columns=identifying_cols,
dest_datatype="parquet",
chunk_size=5000,
preset="cell-health-cellprofiler-to-cytominer-database",
joins=join_command,
parsl_config=parsl.load(
Config(
executors=[
ThreadPoolExecutor(
# set maximum number of threads at any time, for example 3.
# if not set, the default is 2.
max_threads=3,
)
]
)
),
)
This does not factor in this advice:
- Consider increasing your chunk_size to a higher number to improve time duration performance, contingent on what might work best for the data involved. What would work best will depend on the amount of system memory available where you run CytoTable, the shape of the source data (row and column size), and the complexity of the join operations. It may be worth a quick test using chunk sizes [10000, 100000, 500000, 1000000] to see what might perform the best.
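That sweep could be scripted along these lines. This is a sketch: convert_fn stands in for a wrapper around the real conversion call and is an assumption, not CytoTable API.

```python
import time

def fastest_chunk_size(convert_fn, chunk_sizes=(10_000, 100_000, 500_000, 1_000_000)):
    """Time convert_fn at each chunk size and return the fastest one."""
    timings = {}
    for size in chunk_sizes:
        start = time.perf_counter()
        convert_fn(chunk_size=size)  # e.g. a partial of cytotable.convert
        timings[size] = time.perf_counter() - start
    return min(timings, key=timings.get)

# Example with a dummy stand-in for the real conversion call:
best = fastest_chunk_size(lambda chunk_size: None)
print(best in (10_000, 100_000, 500_000, 1_000_000))  # True
```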
We are planning to convert several SQLite files to Parquet, and (potentially) make the joined Parquet the de facto standard for Cell Painting datasets in the future.
I get this error cytotable_error.txt when running the code below (CytoTable==0.0.4). There's something funky happening; from the log:
There is no error if I change
to
I adapted the script that @Zitong-Chen-16 had written; it seemed to have worked for her
https://github.com/broadinstitute/2021_09_01_VarChAMP/blob/main/6.downstream_analysis/scripts/0.convert_to_parquet.py
The script errors out after producing the [cells|nuclei|cytoplasm|image].parquet files. So I took advantage of this and poked around to see if the join query works when I try to do it directly like this below; it does.

Output: