adding metadata information to saving .npz files

gwaybio commented 4 years ago

np.savez_compressed() can receive multiple arguments. We should consider saving metadata information here in addition to encoding the info in the file name. File names can be overwritten and can be tough to extract from.
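
As a rough illustration of the proposal, the sketch below passes a feature array and a metadata dictionary to np.savez_compressed() as separate keyword arguments. The feature values, the output file name, and the metadata values are made up for the example; this is not DeepProfiler's actual saving code.

```python
import os
import tempfile

import numpy as np

# Hypothetical single-cell features for one site: 5 cells x 4 features.
features = np.random.default_rng(0).normal(size=(5, 4)).astype(np.float32)

# Illustrative metadata values (assumed for this example).
metadata = {
    "Metadata_Plate": "Week1_22123",
    "Metadata_Well": "B03",
    "Metadata_Site": "s2",
}

out_path = os.path.join(tempfile.mkdtemp(), "example_features.npz")

# Each keyword argument becomes a named entry in the archive. The dict is
# wrapped in a 0-dimensional object array and pickled inside the .npz file.
np.savez_compressed(out_path, features=features, metadata=metadata)

# Reading a pickled object back requires allow_pickle=True.
with np.load(out_path, allow_pickle=True) as data:
    loaded_features = data["features"]
    loaded_metadata = data["metadata"].item()  # unwrap the 0-d object array
```

Because the metadata travels inside the archive, renaming the file does not break the link between features and their plate, well, and site.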

gwaybio commented 4 years ago

Note this command is used here.

jccaicedo commented 4 years ago

This is a very simple change, and it is implemented in branch issue-229 bb442238e9ebc665b1112dbcf1bc244827ea24ab

The following is the new output format:

object{
    "features": array,
    "metadata": {
        "Metadata_Plate": value,
        "Metadata_Well": value,
        "Metadata_Site": value,
        "other_fields": other_values
    }
}

@gwaygenomics I'll send you example outputs to get your feedback. @Arkkienkeli I changed the key names of the features, so this will break the downstream analysis for future experiments. If we decide to keep this format, it's best to recompute features in the experiments that we have done already.

gwaybio commented 4 years ago

this looks great! I see lots of potential in storing data this way - it creates a permanent link between the DeepProfiler features and metadata.

Two clarification questions:

How to use `index.csv`

In the example .npz file you sent over, the "metadata" object contains the following elements:

{
    "TableNumber": 1,
    "ImageNumber": 6,
    "Metadata_Plate": "Week1_22123",
    "Metadata_Well": "B03",
    "Metadata_Site": "s2",
    "Plate_Map_Name": "Week1",
    "DNA": "Week1_22123/Week1_150607_B03_s2_w1B41C8265-7501-433B-B901-C57F0A1A39B7.tif",
    "Tubulin": "Week1_22123/Week1_150607_B03_s2_w25CEC2D43-E105-42BB-BC00-6962B3ADEBED.tif",
    "Actin": "Week1_22123/Week1_150607_B03_s2_w45787A3F4-4DBD-45E1-B229-32BA1BFACAC6.tif",
    "compound": "cytochalasin B",
    "concentration": 30.0,
    "Replicate": 1,
    "Compound_Concentration": "cytochalasin B_30.0",
    "moa": "Actin disruptors",
}

these elements are also in the index.csv file. Would you prefer to handle the metadata using the index.csv file or by embedding it directly in the .npz?

Style

I think the easiest thing for us to do to integrate DeepProfiler .npz output with pycytominer is to mirror the CellProfiler/cytominer-database output dataframe style. By this I mean a dataframe starting with a handful of metadata columns prefixed with Metadata_ followed by feature columns prefixed by compartment. What do you want the prefix for DeepProfiler features? I am thinking DP0, DP1, etc. or DP_0, DP_1. Happy to go with what you prefer!
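
To make the proposed layout concrete, here is a minimal sketch that turns one feature array plus its metadata into a dataframe with Metadata_ columns first and prefixed feature columns after. The helper name to_single_cell_frame and the default prefix "DP_" are assumptions for illustration only; the prefix is exactly the open question above.

```python
import numpy as np
import pandas as pd


def to_single_cell_frame(features, metadata, prefix="DP_"):
    """Build a metadata-first dataframe from a (cells x features) array.

    Hypothetical helper; not part of DeepProfiler or pycytominer.
    """
    feature_cols = [f"{prefix}{i}" for i in range(features.shape[1])]
    frame = pd.DataFrame(features, columns=feature_cols)
    # Insert the metadata columns at the front; scalar values are
    # broadcast so that every cell (row) carries the same metadata.
    for position, (key, value) in enumerate(metadata.items()):
        frame.insert(position, key, value)
    return frame


example_features = np.arange(12, dtype=float).reshape(4, 3)  # 4 cells, 3 features
example_metadata = {
    "Metadata_Plate": "Week1_22123",
    "Metadata_Well": "B03",
    "Metadata_Site": "s2",
}
single_cell_frame = to_single_cell_frame(example_features, example_metadata)
```

Keeping the prefix as an argument means the naming decision can change later without touching the rest of the code.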

gwaybio commented 4 years ago

Perhaps I can include both index.csv and .npz metadata options for now, but I think making things consistent would be good.

gwaybio commented 4 years ago

I realized this discussion belongs in https://github.com/broadinstitute/DeepProfilerExperiments - I will cross post and link. We can continue the decisions there.

jccaicedo commented 4 years ago

That's correct, this implementation is copying the record of the index.csv file into the .npz file.

To clarify, the index.csv file is required for training models and extracting features. For DeepProfiler to operate, we need that file long before the features are stored. It is meant to be a master file that guides the organization of images, treatments, plates and so on. At the moment this file is important for any type of processing with DeepProfiler, so we cannot get rid of it. If you think of the .npz files as the ultimate output of DeepProfiler, we could make these files independent of the index.csv file if we want. But the way I think DeepProfiler features should be distributed is with the set of individual .npz files plus the master index.csv to navigate the large number of files.

So to your questions:

Metadata in the index.csv or in the .npz files: I'd rather have the metadata in the index.csv file instead of in the .npz files. I added all the record to facilitate the integration with pycytominer. But we should leave only the fields that are important for integration. The way DeepProfiler features are meant to be read is by looking into the index.csv file first, querying what you want, and then loading only the features that are necessary.
Style. I'd like to stick to numpy arrays rather than using dataframes or dictionaries for features. This format is computationally and storage efficient. With ~6k features per single cell, one plate takes about 4GB of space, and other models we're using recently generate 256 features per single cell, resulting in about 250MB per plate. Having independent files makes loading very fast. If possible, I'd prefer to keep these optimizations without additional metadata.

I agree that things should be consistent. I think keeping the metadata in the index.csv is the best. I'm going to remove unnecessary metadata from the .npz files. The only ones that I know are consistent across experiments are Metadata_Plate, Metadata_Well and Metadata_Site, and I think this is the minimum set of fields required. This leads me to the question: why do you need metadata in the .npz files?

I wonder if it would be easier to load all DeepProfiler feature files with one routine and then transform them into a format that is more compatible with cytominer-database. We have example code in DeepProfilerExperiments on how to do that: basically, read the index.csv, loop through the rows to load individual .npz files, and put everything together in a single dataframe. Such a script can produce any format that is cytominer friendly. Do you think this makes sense?
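
The top-down routine described above could look roughly like the following sketch. It is not the code from DeepProfilerExperiments. The function name, the "DP_" prefix, and above all the file layout (one file per site at <root>/<plate>/<well>_<site>.npz) are assumptions made so the example can run end to end on synthetic data.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def load_features_from_index(index_path, features_root, prefix="DP_"):
    """Read index.csv, load one .npz per row, and stack all cells."""
    index = pd.read_csv(index_path)
    frames = []
    for row in index.itertuples(index=False):
        # ASSUMPTION: files live at <root>/<plate>/<well>_<site>.npz.
        npz_path = os.path.join(
            features_root,
            str(row.Metadata_Plate),
            f"{row.Metadata_Well}_{row.Metadata_Site}.npz",
        )
        with np.load(npz_path, allow_pickle=True) as data:
            features = data["features"]
        columns = [f"{prefix}{i}" for i in range(features.shape[1])]
        frame = pd.DataFrame(features, columns=columns)
        # Insert in reverse so the final order is Plate, Well, Site.
        frame.insert(0, "Metadata_Site", row.Metadata_Site)
        frame.insert(0, "Metadata_Well", row.Metadata_Well)
        frame.insert(0, "Metadata_Plate", row.Metadata_Plate)
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Synthetic data: two sites with 3 and 2 cells, 4 features each.
root = tempfile.mkdtemp()
records = [("Week1_22123", "B03", "s1", 3), ("Week1_22123", "B03", "s2", 2)]
rng = np.random.default_rng(0)
for plate, well, site, n_cells in records:
    os.makedirs(os.path.join(root, plate), exist_ok=True)
    np.savez_compressed(
        os.path.join(root, plate, f"{well}_{site}.npz"),
        features=rng.normal(size=(n_cells, 4)),
    )
index_path = os.path.join(root, "index.csv")
pd.DataFrame(
    [record[:3] for record in records],
    columns=["Metadata_Plate", "Metadata_Well", "Metadata_Site"],
).to_csv(index_path, index=False)

single_cells = load_features_from_index(index_path, root)
```

Filtering the index before the loop would load only the sites of interest, which is the main advantage of the top-down strategy.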

Thanks for linking the issue in DeepProfilerExperiments. I think we can discuss the format of DeepProfiler features here (which impacts this repository) and the integration with pycytominer there (when we agree how the format should be).

gwaybio commented 4 years ago

Style. I'd like to stick to numpy arrays rather than using dataframes or dictionaries for features.

For sure - I agree that the DeepProfiler output shouldn't be tinkered, except maybe to add the metadata dictionary to the .npz as this issue proposes. What I am talking about is how DeepProfiler features from single cells will flow through the traditional profiling pipeline to get to aggregated/annotated profiles (see figure below). Anything I am referring to (except the metadata dictionary in .npz) is downstream of DeepProfiler output.

[Figure: Cytominer Functionality]

Figure legend: Integrating DeepProfiler output with pycytominer. In the current CellProfiler pipeline, level 1 data (images) are first processed by CellProfiler for segmentation and feature extraction, followed (typically but not always) by cytominer-database to wrangle all of the single cell compartment files by site into a single database (.sqlite file) per plate. The most common entry point for pycytominer is at level 2 data (the .sqlite file). After level 2 data is ingested, we use pycytominer for aggregation (level 3 data), normalization (level 4a data), and for performing feature selection (level 4b data). (See https://github.com/broadinstitute/lincs-cell-painting/issues/1#issuecomment-588411309 for data leveling details.) We are currently working towards building the DeepProfiler integration as an alternative to CellProfiler. Ideally, the DeepProfiler output (level 2 data) will flow through a similar processing pipeline to arrive at aggregated/annotated profile data (level 3 data).

In order for DeepProfiler aggregated profiles (level 3 data) to be compatible with other pycytominer tools, we will need to use pandas DataFrames (for aggregated profiles only, not single cell data) that are in the same format as current level 3 CellProfiler output (metadata columns, followed by feature columns). Am I understanding your concern correctly? Does this make sense?
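
A small sketch of what the level 2 to level 3 step means in practice: single-cell rows are grouped by plate and well and collapsed into one profile per well, with metadata columns first and feature columns after. The data are synthetic, the "DP_" prefix is a placeholder, and the median is used only as an illustrative aggregation; this is plain pandas, not pycytominer's implementation.

```python
import pandas as pd

# Synthetic single-cell table (level 2): six cells from two wells.
single_cells = pd.DataFrame(
    {
        "Metadata_Plate": ["Week1_22123"] * 6,
        "Metadata_Well": ["B03"] * 3 + ["B04"] * 3,
        "Metadata_Site": ["s1", "s1", "s2", "s1", "s2", "s2"],
        "DP_0": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
        "DP_1": [0.0, 0.5, 1.0, 4.0, 5.0, 6.0],
    }
)

# Group by the metadata that identifies a well, then take the median of
# every non-metadata column to obtain one aggregated profile per well.
strata = ["Metadata_Plate", "Metadata_Well"]
feature_cols = [c for c in single_cells.columns if not c.startswith("Metadata_")]
profiles = single_cells.groupby(strata, as_index=False)[feature_cols].median()
```

Only the aggregated table needs to be a dataframe; the per-site feature arrays can stay as numpy arrays on disk.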

The only ones that I know are consistent across experiments are Metadata_Plate, Metadata_Well and Metadata_Site, and I think this is the minimum set of fields required. This leads me to the question: why do you need metadata in the .npz files?

My goal is to have this exact metadata information (Plate, Well, and Site) linked to the single cell profiles. I don't like them being embedded in the file name, but if this is what we decide, that is fine with me too! :) I also totally appreciate how difficult it is (and costly!) to introduce this sort of enhancement, so I am happy to go with what we decide is best, after weighing these factors that are outside my knowledge. Is there an alternative way to link single cell profiles to these three metadata features?

Thanks for linking the issue in DeepProfilerExperiments. I think we can discuss the format of DeepProfiler features here (which impacts this repository) and the integration with pycytominer there (when we agree how the format should be).

👍

gwaybio commented 4 years ago

I wonder if it would be easier to load all DeepProfiler feature files with one routine and then transform them into a format that is more compatible with cytominer-database. We have example code in DeepProfilerExperiments on how to do that: basically, read the index.csv, loop through the rows to load individual .npz files, and put everything together in a single dataframe. Such a script can produce any format that is cytominer friendly. Do you think this makes sense?

Yes! Precisely! Except for this I'd like to sidestep cytominer-database. It is currently hard to use, and I don't see the upside to sqlite or parquet files compared to compressed text files.

But I will look into it... it might be nice to be globally consistent

jccaicedo commented 4 years ago

Thanks for clarifying @gwaygenomics !

In order for DeepProfiler aggregated profiles (level 3 data) to be compatible with other pycytominer tools, we will need to use pandas DataFrames (for aggregated profiles only, not single cell data) that are in the same format as current level 3 CellProfiler output (metadata columns, followed by feature columns). Am I understanding your concern correctly? Does this make sense?

Got it. Yes, this makes sense. I thought you were suggesting to name the features inside the .npz files, but now I understand that you want to have a naming convention for level 3 aggregated profiles. I like that idea, and it actually makes sense to think about the naming convention. Instead of having the prefix DP, we could use the name of the network, such as ResNet50_ or DenseNet121_. In that way, we know that the features come from a specific network, which can make a difference on how we interpret the results downstream. The question here would be, how do we feed this name to pycytominer?

My goal is to have this exact metadata information (Plate, Well, and Site) linked to the single cell profiles. I don't like them being embedded in the file name, but if this is what we decide, that is fine with me too! :) I also totally appreciate how difficult it is (and costly!) to introduce this sort of enhancement, so I am happy to go with what we decide is best, after weighing these factors that are outside my knowledge. Is there an alternative way to link single cell profiles to these three metadata features?

Good point. Here are some alternatives to naming the files with Plate, Well and Site:

Store the name of .npz files in the index.csv. This would make the naming independent of metadata. If the file name changes for any reason, the backup plan is to read the metadata fields that are inside the .npz file. This assumes that the strategy for using DeepProfiler features is top-down: from the metadata file you then find the feature files.
Forget the index.csv file and put all the metadata inside the .npz files. This assumes that the strategy for using DeepProfiler features is bottom-up: read the feature files to reconstruct the experiment metadata.

To be honest, I don't have a preference as long as the feature vectors don't change. I'm happy to keep all the metadata in the .npz files or in the index.csv or a mix of both, if it makes things easier and consistent with the CellProfiler way. At this point, I'd recommend to keep all the metadata in the .npz files and forget about the index.csv for downstream analysis. Do you see any issues or benefits in either strategy?

gwaybio commented 4 years ago

we could use the name of the network, such as ResNet50 or DenseNet121. In that way, we know that the features come from a specific network, which can make a difference on how we interpret the results downstream. The question here would be, how do we feed this name to pycytominer?

Love it! Right now, the way I see it is to introduce the function load_npz() to pycytominer. We can tinker with the prefix using a function argument. The infrastructure that we end up building for pycytominer to convert level 2 DeepProfiler to level 3 will call this function, and therefore most commonly will not be called directly by humans.

How does DeepProfiler typically encode the network used? Is this info in the index.csv? Some other metadata file? If not, a simple (but more fragile) option is to include it as an argument to an aggregate(method="DeepProfiler", network_prefix="ResNet50_") call (or something similar).

I'm happy to keep all the metadata in the .npz files or in the index.csv or a mix of both, if it makes things easier and consistent with the CellProfiler way. At this point, I'd recommend to keep all the metadata in the .npz files and forget about the index.csv for downstream analysis. Do you see any issues or benefits in either strategy?

Cool. My vote is to have the metadata (at least Plate, Well, Site) in the npz file and also include the index.csv since it is a necessary input file anyway. However, one key point is that the metadata key in the npz file must contain a dictionary. This is critical to how load_npz() currently works. I will add a check for a dictionary output for safety.

So, to summarize, the pycytominer process will first look to the .npz file for metadata, and then look to the index.csv. We will include the union of metadata info in the .npz and index.csv in forming the metadata for aggregate profiles. Pycytominer will also check to see if any info in the index.csv contradicts the .npz file and throw some warnings if so.
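
The lookup order and the contradiction check summarized above can be sketched with plain dictionaries. The function name merge_metadata is hypothetical, and letting the .npz value win on conflict is an assumption based on "first look to the .npz file"; the thread does not say which source takes precedence.

```python
import warnings


def merge_metadata(npz_metadata, index_record):
    """Return the union of both metadata sources, warning on conflicts.

    Hypothetical helper; not pycytominer's implementation.
    """
    merged = dict(index_record)  # start from the index.csv record
    for key, value in npz_metadata.items():
        if key in index_record and index_record[key] != value:
            warnings.warn(
                f"{key}: index.csv has {index_record[key]!r} but the .npz "
                f"file has {value!r}; keeping the .npz value"
            )
        merged[key] = value  # ASSUMPTION: the .npz value takes precedence
    return merged


npz_metadata = {
    "Metadata_Plate": "Week1_22123",
    "Metadata_Well": "B03",
    "Metadata_Site": "s2",
}
index_record = {
    "Metadata_Plate": "Week1_22123",
    "Metadata_Well": "B04",  # deliberately contradicts the .npz file
    "Metadata_Site": "s2",
    "compound": "cytochalasin B",
}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    merged = merge_metadata(npz_metadata, index_record)
```

A legacy dataset with no metadata in the .npz file would pass an empty dict as npz_metadata and get the index.csv record back unchanged.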

The reason to allow for both is for backwards compatibility with legacy DeepProfiler datasets that only have index.csv files. How do you feel about this?

jccaicedo commented 4 years ago

Sounds good! The updated format has been implemented in 913fb329078403d90ce446986fc734f51c08e15b. The format has the following fields:

{
    "features": array,
    "metadata": dict({
        'Metadata_Plate': value,
        'Metadata_Well': value,
        'Metadata_Site': value,
        'Metadata_Model': value
    })
}

Note that the dictionary in the metadata entry needs to be accessed in the following way:

import numpy as np

data = np.load("features_file.npz", allow_pickle=True)
metadata = data["metadata"][()]  # Index the 0-dimensional object array to get the dict
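
The empty-tuple index is needed because NumPy wraps a saved dict in a 0-dimensional object array. The sketch below shows one way a reader could unwrap it and also enforce the "metadata must be a dictionary" requirement discussed earlier. The helper name read_npz_metadata and the synthetic files are assumptions for illustration; this is not the actual load_npz() code.

```python
import os
import tempfile

import numpy as np


def read_npz_metadata(path):
    """Return the metadata dict stored in a DeepProfiler-style .npz file.

    Hypothetical helper; not part of DeepProfiler or pycytominer.
    """
    with np.load(path, allow_pickle=True) as data:
        entry = data["metadata"]
    # A dict saved through np.savez becomes a 0-dimensional object array;
    # indexing with an empty tuple returns the original Python object.
    metadata = entry[()] if entry.ndim == 0 else entry
    if not isinstance(metadata, dict):
        raise TypeError(
            f"expected 'metadata' to hold a dict, got {type(metadata).__name__}"
        )
    return metadata


tmp_dir = tempfile.mkdtemp()
features = np.zeros((2, 3))

# A well-formed file: metadata is a dictionary.
good_path = os.path.join(tmp_dir, "good.npz")
np.savez_compressed(
    good_path,
    features=features,
    metadata={"Metadata_Plate": "Week1_22123", "Metadata_Well": "B03"},
)

# A malformed file: metadata was saved as a plain array of strings.
bad_path = os.path.join(tmp_dir, "bad.npz")
np.savez_compressed(
    bad_path, features=features, metadata=np.array(["Week1_22123", "B03"])
)

good_metadata = read_npz_metadata(good_path)
try:
    read_npz_metadata(bad_path)
    bad_file_rejected = False
except TypeError:
    bad_file_rejected = True
```

Note that allow_pickle=True unpickles arbitrary objects, so it should only be used on files from a trusted source.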

I will share example data with this format!

jccaicedo commented 4 years ago

@gwaygenomics note that there is a Metadata_Model field which will have the name of the feature extraction model used to obtain these features.

jccaicedo commented 4 years ago

Here is a compressed file with example features. features.tar.gz

jccaicedo commented 4 years ago

As mentioned in the other discussion thread, we can easily recompute features with the new format that we agree on. I didn't realize you recommended an index file with paths to features, and we can produce that as an output of the feature extraction process. The path to files can also be included in our current metadata file, the index.csv. That would be the easiest as it already contains all the other information required to identify these features.

I will work on implementing this feature as an output of DeepProfiler and will report back here soon.

jccaicedo commented 4 years ago

I assume the paths to feature files should be relative to a root folder, as features can be moved from one environment to another and absolute paths don't transfer well. Is this correct @gwaygenomics ?
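
To illustrate the relative-path idea, the sketch below stores a path relative to a features root and resolves it again under a different root. All directory names and the file name are hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical root folder where features were originally written.
original_root = PurePosixPath("/data/project/outputs/features")
absolute_path = original_root / "Week1_22123" / "B03_s2.npz"

# Store only the part below the root; this is what would go in an index file.
relative_path = absolute_path.relative_to(original_root)

# After moving the features to another environment, resolve the stored
# relative path against the new root.
new_root = PurePosixPath("/mnt/shared/features")
resolved_path = new_root / relative_path
```

The relative path survives moving the whole folder, but it still breaks if individual files are renamed or reorganized below the root.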

gwaybio commented 4 years ago

I didn't realize you recommended an index file with paths to features, and we can produce that as an output of the feature extraction process. The path to files can also be included in our current metadata file, the index.csv. That would be the easiest as it already contains all the other information required to identify these features.

I assume the paths to feature files should be relative to a root folder, as features can be moved from one environment to another and absolute paths don't transfer well. Is this correct

I think that my comment in https://github.com/broadinstitute/DeepProfilerExperiments/issues/2#issue-679189370 (also pasted below) is the source of this recommendation:

I believe that they should come from an internal source or be stored in an external file that includes file path information pointing to files with corresponding metadata. The latter is also fragile (file names are mutable!), but not as fragile as the metadata-in-file name paradigm.

If I'm understanding correctly, I don't actually recommend this approach. Linking file paths is also fragile for the reason you mention about absolute paths, and also because the absolute path can change without any consequence to the index.csv link. We had this happen in the Cell Health project, for example with the load_data.csv file.

I think the right approach is to encode all metadata information in the .npz file so that we can remove the need to also pass along the index.csv from the perspective of pycytominer. I totally see the value of retaining the index.csv file for other uses (for one, it's way easier to look at than an .npz file :joy:). But programmatically, I would be thrilled if we could provide a more permanent link between metadata and profiles by outputting a metadata dictionary to accompany each single cell profile.

jccaicedo commented 4 years ago

OK. Yes, I think we want to make the output of DeepProfiler compatible with pycytominer and we also want to make it easy to integrate. So in conclusion, for the purposes of pycytominer integration:

We will not use the index.csv file.
The .npz files will contain all the existing metadata record that is accessible to DeepProfiler in a dictionary.
We will not try to keep any backwards compatibility. This will be the official and only format supported by pycytominer for DeepProfiler features.
The metadata will include the name of the model for feature renaming in any processing concerning pycytominer.

This output will be the official format for DeepProfiler. I will create a PR to merge it and recompute the features in the datasets that we are working with at the moment. After closing this thread, we will continue the conversation of how to use pycytominer in the downstream analysis, which is maintained in the other repository.

jccaicedo commented 4 years ago

Here is PR #240 with the changes. @Arkkienkeli , can you please review and merge? After this, we can close this issue, generate features for the datasets, and continue the conversation of integrating with pycytominer in the other thread.

jccaicedo commented 4 years ago

@Arkkienkeli has merged the code and the implementation is now part of the master development! The next step is to re-compute features for the datasets and update the downstream analysis code in the DeepProfilerExperiments repository. Closing this issue here and continuing the conversation in there.

gwaybio commented 4 years ago

@gwaygenomics note that there is a Metadata_Model field which will have the name of the feature extraction model used to obtain these features.

One minor clarification question: Should this field be used as the DeepProfiler morphology feature prefix? (like Cells_, Cytoplasm_, Nuclei_ in CellProfiler)

gwaybio commented 2 years ago

Slight modifications to parsing plate, well, site info from filenames fixed in cytomining/pycytominer#210

cytomining / DeepProfiler