cytomining / pycytominer

Python package for processing image-based profiling data
https://pycytominer.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Should pycytominer generate additional metadata columns by default? #130

Open gwaybio opened 3 years ago

gwaybio commented 3 years ago

In #129, @niranjchandrasekaran proposed that pycytominer create a metadata column, called Metadata_fields_count, that will be added to all future pycytominer-derived aggregate profiles. This represents the first time (IIRC) that pycytominer is actively creating a metadata column, instead of passively using those provided in the platemap files.

Adding more default metadata columns that would be helpful to all profiling experiments is beyond the scope of #129, but we can track such columns in this issue.

niranjchandrasekaran commented 3 years ago

#129 adds two new columns to the profiles: Metadata_Site_Count and Metadata_Object_Count.

Metadata_Site_Count reports the number of sites aggregated to create the well-level profile, and Metadata_Object_Count reports the number of objects in the first compartment (usually cells, so in practice it reports the cell count).
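To make the two counts concrete, here is a minimal sketch of how they could be derived from single-cell records. It is not the implementation in #129: the record layout, the `count_metadata` helper, and the well and site values are all hypothetical, and plain Python is used only to keep the example dependency-free.

```python
from collections import defaultdict

# Hypothetical single-cell records: one (well, site) pair per object in the
# first compartment. The layout and values are assumptions for illustration.
records = [
    ("A01", 1), ("A01", 1), ("A01", 2),
    ("A02", 1), ("A02", 1),
]


def count_metadata(records):
    """Return per-well site and object counts (hypothetical helper)."""
    sites = defaultdict(set)    # distinct sites seen in each well
    objects = defaultdict(int)  # number of objects seen in each well
    for well, site in records:
        sites[well].add(site)
        objects[well] += 1
    return {
        well: {
            "Metadata_Site_Count": len(sites[well]),
            "Metadata_Object_Count": objects[well],
        }
        for well in sites
    }


well_counts = count_metadata(records)
```

In this toy data, well A01 has three objects spread over two sites, and well A02 has two objects from a single site.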

With respect to Metadata_Object_Count, there was a discussion in #129 regarding how many such features should be created and what their names should be.

  1. Number of features - @shntnu mentioned in https://github.com/cytomining/pycytominer/pull/129#issuecomment-799370869 that sometimes the object count could be different for different compartments. This would argue in favor of creating a count feature for each compartment. This could result in a number of redundant count features, though this might not really matter.

  2. Name of the feature - Let's say that we go with a single object count feature, then if cells is by default the first compartment, we could rename the feature Metadata_Cells_Count. But if the compartment order changes, this feature name will be meaningless. @gwaygenomics suggested renaming the feature f"Metadata_Object_Count_{compartment}" so whatever the first compartment is, the feature would take its name. This would mean that changing the order of the compartments will generate profiles with different metadata columns (not sure if this is ok).

Based on everything above, which one of the following options should we go with?

  a. Keep the feature name as is and stop overthinking, Niranj (Metadata_Object_Count).

  b. Rename the feature to Metadata_Cell_Count (as cells is by default the first compartment).

  c. Create a separate count feature for each compartment, as the count could be different (for example Metadata_Cells_Count, Metadata_Cytoplasm_Count and Metadata_Nucleus_Count).

  d. Rename the feature f"Metadata_Object_Count_{compartment}" so that it takes the name of the first compartment (typically "cells").
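The four naming options can be compared side by side with a small sketch. The `count_column_names` helper and the compartment lists are hypothetical and not part of pycytominer; the per-compartment names in option (c) are assumed to be built by capitalizing the compartment name, so a compartment called "nuclei" gives Metadata_Nuclei_Count rather than the Metadata_Nucleus_Count spelling used above.

```python
def count_column_names(compartments, option):
    """Return the count column name(s) a naming option would produce.

    Hypothetical helper for illustration only; `compartments` is the ordered
    list of compartment names and `option` is one of "a", "b", "c", "d".
    """
    first = compartments[0]
    if option == "a":
        # Fixed generic name, independent of compartment order.
        return ["Metadata_Object_Count"]
    if option == "b":
        # Fixed name that assumes cells is the first compartment.
        return ["Metadata_Cell_Count"]
    if option == "c":
        # One count column per compartment (capitalization is an assumption).
        return [f"Metadata_{c.capitalize()}_Count" for c in compartments]
    if option == "d":
        # Name follows whichever compartment happens to be listed first.
        return [f"Metadata_Object_Count_{first}"]
    raise ValueError(f"unknown option: {option!r}")


default_order = ["cells", "cytoplasm", "nuclei"]
reordered = ["nuclei", "cells", "cytoplasm"]

names_d_default = count_column_names(default_order, "d")
names_d_reordered = count_column_names(reordered, "d")
names_c_default = count_column_names(default_order, "c")
```

With these inputs, option (d) yields Metadata_Object_Count_cells for the default order but Metadata_Object_Count_nuclei after reordering, which is the column-name instability raised in point 2. Options (a) and (b) return the same name regardless of order, and option (c) returns the same set of names in a different order.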

cc @AnneCarpenter

gwaybio commented 3 years ago

Very clear summary. Now that it's laid out like this, I can see that my suggestion (d) is not ideal!

My vote is for option (a). We don't know how we're going to use the information in this column yet (maybe just to summarize?), so until we have more clarity, option (a) is totally sufficient and we can put this on the back burner.

gwaybio commented 1 year ago

Related to #80.