Open gwaybio opened 4 years ago
Adding a question closely related to https://github.com/cytomining/DeepProfiler/issues/229#issuecomment-674193206
The reason to allow for both is for backwards compatibility with legacy DeepProfiler datasets that only have
index.csv
files.
For these files specifically (for example, the ones in DeepProfilerExperiments), is the only way to extract Plate, Well, and Site metadata is to parse the file name? Or is there a better way?
We can easily recompute DeepProfiler features. We're happy to do so to test the new format. Computing features is not as expensive as training a model. Given that we are constantly training and evaluating features, the feature computation part is kind of routine and can be repeated any time.
So I would suggest to ignore backwards compatibility issues or rescuing old feature files already computed. It's easier to delete these files and generate new ones with the best format that we agree to have 🙂
I missed on the feature file list before. I think this needs additional implementation in DeepProfiler. I will add this comment to our other discussion.
So I would suggest to ignore backwards compatibility issues or rescuing old feature files already computed. It's easier to delete these files and generate new ones with the best format that we agree to have
Great! This will make the code in each DeepProfiler experiment notebook (for each dataset) much cleaner and more streamlined.
We are recomputing features for Cell Painting datasets. I will make a note in this thread when features are available to start integrating pycytominer in the downstream analysis.
@jccaicedo I set aside time today to push the DeepProfiler-pycytominer integration further along. Two questions:
.npz
file features here https://github.com/cytomining/DeepProfiler/issues/229#issuecomment-675143654 the finalized DeepProfiler output files?Also, one quick note: I went through all of our existing discussion once more and it was fun
@jccaicedo - The DeepProfiler and CellProfiler comparison analysis keeps popping into my head. I am wondering if there are any updates?
I am writing in this thread b/c of the questions I had a couple months ago. I'd like to finalize the DeepProfiler integration in cytomining/pycytominer#78, and I think this is where I can contribute most to your project.
Hey @jccaicedo, @michaelbornholdt and I just chatted about the remaining steps to add DeepProfiler integration into pycytominer. I think we're very close! Here is a summary of our current plan - please feel free to modify.
Week1_22123_B02_s1.npz
in pycytominer as an example data file for testing.
Once these things happen, then Michael will be more readily able to benchmark his DeepProfiler comparison experiments!
@gwaygenomics You can add that file to your tests. The plan looks great!
Awesome! I added the file in https://github.com/cytomining/pycytominer/pull/78/commits/d237b41891e47dba546b6d85ccfe9072168c62e3
@michaelbornholdt - I just now realized that pycytominer.cyto_utils.DeepProfiler_processing.AggregateDeepProfiler
requires testing in cytomining/pycytominer#78, which will likely require you to add the remaining .npz
files and index.csv
that I sent you on slack in the appropriate folder
In cytomining/pycytominer#78 I am working towards integrating DeepProfiler processing into pycytominer. Currently and by default, DeepProfiler outputs
.npz
files storing numpy arrays of single cell profiles. In cytomining/DeepProfiler#229 we discuss a potential update to the.npz
file output to also include metadata information.There are a couple of decision points that we need to make to move the integration forward, which will be partially driven by the goals in the DeepProfilerExperiments repo. In https://github.com/cytomining/DeepProfiler/issues/229#issuecomment-673839751 I bring up two different points of consideration: 1) How to use
index.csv
and 2) Feature prefix style.I think both of these decision points are relatively minor, and any pycytominer code will be flexible to handle multiple metadata options and enable a customizable feature prefix. The question about feature prefix is most directly related to what we think the default prefix should be (
DP
orDP_
are two options)Additional topics
I think that these topics are more pressing than the first two listed above: Will the profiles be updated for each dataset to include the metadata
.npz
format? Or, will we proceed without recalculating? If we proceed without recalculating (which I think is the likely scenario), we need to settle on pycytominer strategy.Strategy
I do not think that pycytominer should include code to parse plate, well, and site information from filenames. This is a very fragile way of storing these variables - I believe that they should come from an internal source or be stored in an external file that includes file path information pointing to files with corresponding metadata. The latter is also fragile (file names are mutable!), but not as fragile as the metadata-in-file name paradigm.
However, since we probably won't recompute profiles, we require a strategy to incorporate metadata from file names. Therefore, I propose that we take multiple pycytominer steps to integrate these metadata (instead of dealing with all of the processing internally in pycytominer).
The proposed workflow is as follows:
.npz
files in pycytominerload_npz()
outputI will proceed with this strategy for now, but please do suggest alternatives! We can always pivot strategies later on if this ends up being clunky or doesn't reduce code.