How does CytoTable handle duplicate ImageNumbers?

shntnu commented 1 year ago

ImageNumber is guaranteed to be unique only within an Image.csv but not across all Image.csv files in a dataset.

Does pycytominer-transform handle this correctly?

In cytominer-database, we create a column, TableNumber, to identify rows in the Image table uniquely. TableNumber is set to be the hash of the Image.csv file:

https://github.com/cytomining/cytominer-database/blob/5aa00f58e4a31bbbd2a3779c87e7a3620b0030db/cytominer_database/ingest_variable_engine.py#L99

d33bs commented 1 year ago

Thank you @shntnu ! I don't believe this has been addressed yet in this project and I'll look into how this might be implemented. In addition to TableNumber, and acknowledging some of the s3://cellpainting-gallery object structures, would it make sense to utilize "directory" paths as another way to retain provenance and uniqueness across ImageNumber's for compartment objects? I'm wondering about this especially for scenarios where we may not have access to the image files themselves.

shntnu commented 1 year ago

would it make sense to utilize "directory" paths as another way to retain provenance and uniqueness across ImageNumber's for compartment objects? I'm wondering about this especially for scenarios where we may not have access to the image files themselves.

I think that would be fine.

To clarify, both these hashes serve the goal of creating a unique ID for the compartments.

md5sum <(echo s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A01-1/)
# fcc94af68878c557192a6d8c3d069cb7  /dev/fd/63

md5sum <(aws s3 cp s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A01-1/Image.csv -)
# 9a7a2deed81a80ebacb139e7a249e3ab  /dev/fd/63

We'd prefer the second when possible because although both are unique, the second one is path invariant (but it doesn't matter too much, just that in the first option the TableNumber is a function of the path, which is a bit odd but not wrong)

cc'ing @bethac07 to keep her posted because this relates to CellProfiler output.

d33bs commented 1 year ago

Hi @shntnu and @gwaybio, I've been thinking about how to implement new features into CytoTable in order to add TableNumber. I wanted to ask for your input on what would be best when it comes to SQLite data sources and generating hashes. I can see two approaches:

Create a unique hash per SQLite table, meaning we must read the data inside the table in order to generate the hash (as it is not a standalone file).
Create a unique hash per SQLite database, using the same hash (from the .sqlite or .db file) for each compartment and metadata table result.

Option 1. seems most in alignment with current procedures but could come with performance impacts (unsure of the most performant way to generate a hash from the SQLite data before an export to other formats). Option 2. would deviate from "table-focused" alignment of the hash data, but could come with a performance benefit.

Any thoughts or preferences here? Thank you in advance for your consideration on this.

shntnu commented 1 year ago

@d33bs hm - I don't think I followed what you proposed.

Neither of the options you propose would generate a hash per Image.csv, correct?

Note that typically, but not always, an Image.csv file generated by CellProfiler has multiple rows, one corresponding to each image processed in that set.

We want to create a hash per Image.csv so that the hash, together with the ImageNumber, can uniquely identify each row of the Image table that would result from ingesting all the Image.csvs from a dataset.

Does that make sense?

d33bs commented 1 year ago

Thanks @shntnu for the greater clarity here! I wanted to outline what I understand of your thoughts and how these might extrapolate to SQLite data sources in addition to CSV's (attempting to maintain uniformity in behavior).

flowchart LR

images[(images)]
cellprofiler["CellProfiler"]
subgraph cellprofiler_csv
  image_csv[("image.csv")]
  compartment_csv[("compartment(s).csv")]
end
subgraph cellprofiler_sqlite
  image_tbl[("image.tbl")]
  compartment_tbl[("compartment(s).tbl")]
end

cytotable_hash_extract["CytoTable \n creates a hash"]
cytotable_extract["CytoTable \n extracts full data \n with appended hash column"]

image_parquet[(image.parquet)]
compartments_parquet[("compartments(s).parquet")]

images --> cellprofiler

cellprofiler --> cellprofiler_csv
cellprofiler --> cellprofiler_sqlite

image_csv --> | infer a tablenumber hash \n from image.csv | cytotable_hash_extract
image_tbl --> | infer a tablenumber hash \n from image table | cytotable_hash_extract

cytotable_hash_extract -.-> | afterwards, extract all data \n from cellprofiler sources | cytotable_extract

cytotable_extract --> | image file or table hash is \n added to image data | image_parquet
cytotable_extract --> | image file or table hash is \n added to compartment data | compartments_parquet

With this diagram I'm trying to illustrate that the "tablenumber" hash from the image csv or SQLite table is added to image and compartment data in an effort to retain data relationships (especially when parsing multiple nested groups of files or SQLite files). I've left out some lines connecting the cellprofiler data sources to elements further on the right side to help highlight the hashing elements specifically.

One challenge with the SQLite based image table approach is that we may be forced to read the data from the table in order to generate the hash (incurring performance costs and branching complexity within the code).

Does this align with your thoughts on how this should work? Very open to input here and seeking to help make this work the best for you and the community.

shntnu commented 1 year ago

@d33bs Thanks for the diagram and for clarifying how all this fits in.

I now understand the logic of Option 2 in your previous comment https://github.com/cytomining/CytoTable/issues/30#issuecomment-1553493360.

Indeed, Option 2 will work fine. Here's why:

CellProfiler's ExportToDatabase module will create a single SQLite file (assuming we are exporting to SQLite; we can ignore server-based backends entirely) for each run, just as it would create a single set of CSVs for each run in ExportToSpreadsheet.

Therefore, the SQLite file hash will be just as identifiable as the Image table hash.

I was previously confused because I thought the SQLite file we are talking about here is the one we create after ingesting, but it is, in fact, the SQLite file that is the output of ExportToDatabase.

Please proceed with Option 2:

Create a unique hash per SQLite database, using the same hash (from the .sqlite or .db file) for each compartment and metadata table result.

d33bs commented 1 year ago

Will do, thank you @shntnu !

d33bs commented 1 year ago

Hi @shntnu, in working towards changes related to this I found a discrepancy in the current way CytoTable operates vs Cytominer-database. Cytominer-database utilizes various test data and techniques for eliminating candidates for data processing. Some of these tests are used with CytoTable to help ensure similar functionality where desired. One of these test datasets, data_b, shows discrepancies between Cytominer-database and CytoTable functionality.

Cytominer-database removes directory J21-2 entirely due to Cells.csv being invalid (I believe due to csvkit flagging the file as having mismatched row value counts on line 2 here). CytoTable removes only Cells.csv (and not the other compartment or image tables) due to similar issues with the file (via DuckDB errors in reading the file, which manifest as read errors and are filtered from output downstream of here).

I noticed this when testing and comparing results of Cytominer-database and CytoTable, which I'm using to judge to ensure data output matches expectations (where possible). The difference in these processing methods results in a mismatch for TableNumber production. Question: Should CytoTable remove the entirety of a directory of source data (all compartments + image table data) when a single file shows issues in that directory? If it makes more sense to discuss this outside of this issue, I can open a new issue to discuss which data are invalid vs not.

shntnu commented 7 months ago

Sorry for taking so long to respond to this

Should CytoTable remove the entirety of a directory of source data (all compartments + image table data) when a single file shows issues in that directory? If it makes more sense to discuss this outside of this issue, I can open a new issue to discuss which data are invalid vs not.

I think no, it should not drop. When analyzing the data downstream, one can always explicitly drop such cases.

d33bs commented 7 months ago

No worries, and thanks @shntnu ! We will stick with the existing functionality in CytoTable to drop individual compartments (files or tables) where there are errors and not an entire directory of image + compartment data.

shntnu commented 2 weeks ago

Hooray for #188! 🎉

d33bs commented 2 weeks ago

Cheers thanks @shntnu for your help in understanding the solution for this!

cytomining / CytoTable

How does CytoTable handle duplicate ImageNumbers? #30