broadinstitute / DeepProfilerExperiments

7 stars 5 forks source link

pycytominer integration #2

Open gwaybio opened 4 years ago

gwaybio commented 4 years ago

In cytomining/pycytominer#78 I am working towards integrating DeepProfiler processing into pycytominer. Currently and by default, DeepProfiler outputs .npz files storing numpy arrays of single cell profiles. In cytomining/DeepProfiler#229 we discuss a potential update to the .npz file output to also include metadata information.

There are a couple of decision points that we need to make to move the integration forward, which will be partially driven by the goals in the DeepProfilerExperiments repo. In https://github.com/cytomining/DeepProfiler/issues/229#issuecomment-673839751 I bring up two different points of consideration: 1) How to use index.csv and 2) Feature prefix style.

I think both of these decision points are relatively minor, and any pycytominer code will be flexible to handle multiple metadata options and enable a customizable feature prefix. The question about feature prefix is most directly related to what we think the default prefix should be (DP or DP_ are two options)

Additional topics

I think that these topics are more pressing than the first two listed above: Will the profiles be updated for each dataset to include the metadata .npz format? Or, will we proceed without recalculating? If we proceed without recalculating (which I think is the likely scenario), we need to settle on pycytominer strategy.

Strategy

I do not think that pycytominer should include code to parse plate, well, and site information from filenames. This is a very fragile way of storing these variables - I believe that they should come from an internal source or be stored in an external file that includes file path information pointing to files with corresponding metadata. The latter is also fragile (file names are mutable!), but not as fragile as the metadata-in-file name paradigm.

However, since we probably won't recompute profiles, we require a strategy to incorporate metadata from file names. Therefore, I propose that we take multiple pycytominer steps to integrate these metadata (instead of dealing with all of the processing internally in pycytominer).

The proposed workflow is as follows:

  1. Ingest current .npz files in pycytominer
  2. Extract out plate, well, and site from file name
  3. Append these metadata to a pycytominer load_npz() output
  4. Reingest this file with metadata back into pycytominer and proceed with standard downstream processing

I will proceed with this strategy for now, but please do suggest alternatives! We can always pivot strategies later on if this ends up being clunky or doesn't reduce code.

gwaybio commented 4 years ago

Adding a question closely related to https://github.com/cytomining/DeepProfiler/issues/229#issuecomment-674193206

The reason to allow for both is for backwards compatibility with legacy DeepProfiler datasets that only have index.csv files.

For these files specifically (for example, the ones in DeepProfilerExperiments), is the only way to extract Plate, Well, and Site metadata is to parse the file name? Or is there a better way?

jccaicedo commented 4 years ago

We can easily recompute DeepProfiler features. We're happy to do so to test the new format. Computing features is not as expensive as training a model. Given that we are constantly training and evaluating features, the feature computation part is kind of routine and can be repeated any time.

So I would suggest to ignore backwards compatibility issues or rescuing old feature files already computed. It's easier to delete these files and generate new ones with the best format that we agree to have 🙂

I missed on the feature file list before. I think this needs additional implementation in DeepProfiler. I will add this comment to our other discussion.

gwaybio commented 4 years ago

So I would suggest to ignore backwards compatibility issues or rescuing old feature files already computed. It's easier to delete these files and generate new ones with the best format that we agree to have

Great! This will make the code in each DeepProfiler experiment notebook (for each dataset) much cleaner and more streamlined.

jccaicedo commented 4 years ago

We are recomputing features for Cell Painting datasets. I will make a note in this thread when features are available to start integrating pycytominer in the downstream analysis.

gwaybio commented 3 years ago

@jccaicedo I set aside time today to push the DeepProfiler-pycytominer integration further along. Two questions:

  1. Are the example .npz file features here https://github.com/cytomining/DeepProfiler/issues/229#issuecomment-675143654 the finalized DeepProfiler output files?
  2. If so, can I use one of them in the pycytominer tests?

Also, one quick note: I went through all of our existing discussion once more and it was fun

gwaybio commented 3 years ago

@jccaicedo - The DeepProfiler and CellProfiler comparison analysis keeps popping into my head. I am wondering if there are any updates?

I am writing in this thread b/c of the questions I had a couple months ago. I'd like to finalize the DeepProfiler integration in cytomining/pycytominer#78, and I think this is where I can contribute most to your project.

gwaybio commented 3 years ago

Hey @jccaicedo, @michaelbornholdt and I just chatted about the remaining steps to add DeepProfiler integration into pycytominer. I think we're very close! Here is a summary of our current plan - please feel free to modify.

  1. You provide us permission to include the file Week1_22123_B02_s1.npz in pycytominer as an example data file for testing.
    • I've not yet added the file in cytomining/pycytominer#78, but it is ready to go here.
    • If you cannot make that data public, can you suggest a more suitable alternative?
  2. I will add it to that pull request
  3. Michael will troubleshoot, add/modify tests, and complete the integration

Once these things happen, then Michael will be more readily able to benchmark his DeepProfiler comparison experiments!

jccaicedo commented 3 years ago

@gwaygenomics You can add that file to your tests. The plan looks great!

gwaybio commented 3 years ago

Awesome! I added the file in https://github.com/cytomining/pycytominer/pull/78/commits/d237b41891e47dba546b6d85ccfe9072168c62e3

@michaelbornholdt - I just now realized that pycytominer.cyto_utils.DeepProfiler_processing.AggregateDeepProfiler requires testing in cytomining/pycytominer#78, which will likely require you to add the remaining .npz files and index.csv that I sent you on slack in the appropriate folder