Closed gwaybio closed 4 years ago
This is a very simple change, and it is implemented in branch issue-229 (commit bb442238e9ebc665b1112dbcf1bc244827ea24ab).
The following is the new output format:
object{
"features": array,
"metadata": {
"Metadata_Plate": value,
"Metadata_Well": value,
"Metadata_Site": value,
"other_fields": other_values
}
}
@gwaygenomics I'll send you example outputs to get your feedback. @Arkkienkeli I changed the key names of the features, so this will break the downstream analysis for future experiments. If we decide to keep this format, it's best to recompute features in the experiments that we have done already.
this looks great! I see lots of potential in storing data this way - it creates a permanent link between the DeepProfiler features and metadata.
Two clarification questions:
index.csv
In the example .npz file you sent over, the "metadata" object contains the following elements:
{
"TableNumber": 1,
"ImageNumber": 6,
"Metadata_Plate": "Week1_22123",
"Metadata_Well": "B03",
"Metadata_Site": "s2",
"Plate_Map_Name": "Week1",
"DNA": "Week1_22123/Week1_150607_B03_s2_w1B41C8265-7501-433B-B901-C57F0A1A39B7.tif",
"Tubulin": "Week1_22123/Week1_150607_B03_s2_w25CEC2D43-E105-42BB-BC00-6962B3ADEBED.tif",
"Actin": "Week1_22123/Week1_150607_B03_s2_w45787A3F4-4DBD-45E1-B229-32BA1BFACAC6.tif",
"compound": "cytochalasin B",
"concentration": 30.0,
"Replicate": 1,
"Compound_Concentration": "cytochalasin B_30.0",
"moa": "Actin disruptors",
}
These elements are also in the index.csv file. Would you prefer to handle the metadata using the index.csv file or by embedding it directly in the .npz?
I think the easiest thing for us to do to integrate DeepProfiler .npz output with pycytominer is to mirror the CellProfiler/cytominer-database output dataframe style. By this I mean a dataframe starting with a handful of metadata columns prefixed with Metadata_ followed by feature columns prefixed by compartment. What prefix do you want for DeepProfiler features? I am thinking DP0, DP1, etc. or DP_0, DP_1. Happy to go with what you prefer!
Perhaps I can include both index.csv and .npz metadata options for now, but I think making things consistent would be good.
I realized this discussion belongs in https://github.com/broadinstitute/DeepProfilerExperiments - I will cross post and link. We can continue the decisions there.
That's correct, this implementation is copying the record of the index.csv file into the .npz file.
To clarify, the index.csv file is required for training models and extracting features. For DeepProfiler to operate, we need that file long before the features are stored. It is meant to be a master file that guides the organization of images, treatments, plates and so on. At the moment this file is important for any type of processing with DeepProfiler, so we cannot get rid of it. If you think of the .npz files as the ultimate output of DeepProfiler, we could make these files independent of the index.csv file if we want. But the way I think DeepProfiler features should be distributed is with the set of individual .npz files plus the master index.csv to navigate the large number of files.
So to your questions:
index.csv or in the .npz files: I'd rather have the metadata in the index.csv file instead of in the .npz files. I added the whole record to facilitate the integration with pycytominer, but we should keep only the fields that are important for integration. The way DeepProfiler features are meant to be read is by looking into the index.csv file first, querying what you want, and then loading only the features that are necessary.
I agree that things should be consistent. I think keeping the metadata in the index.csv is best. I'm going to remove unnecessary metadata from the .npz files. The only ones that I know are consistent across experiments are Metadata_Plate, Metadata_Well and Metadata_Site, and I think this is the minimum set of fields required. This leads me to the question: why do you need metadata in the .npz files?
I wonder if it would be easier to load all DeepProfiler feature files with one routine and then transform them into a format that is more compatible with cytominer-database. We have example code in DeepProfilerExperiments on how to do that, basically read the index.csv, loop through the rows to load individual .npz files and put everything together in a single dataframe. Such script can produce any format that is cytominer friendly. Do you think this makes sense?
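The routine described above could look roughly like the following. This is a minimal sketch, not the example code from DeepProfilerExperiments: the `<root>/<plate>/<well>_<site>.npz` layout, the function name `load_all_features`, and the `DP_` column prefix are assumptions made for illustration, and the demo data is generated on the fly.

```python
import os
import tempfile

import numpy as np
import pandas as pd

root = tempfile.mkdtemp()

# Build a tiny hypothetical index and matching feature files for the demo.
index = pd.DataFrame(
    {
        "Metadata_Plate": ["P1", "P1"],
        "Metadata_Well": ["B03", "B04"],
        "Metadata_Site": ["s1", "s2"],
    }
)
index.to_csv(os.path.join(root, "index.csv"), index=False)
for row in index.itertuples(index=False):
    plate_dir = os.path.join(root, row.Metadata_Plate)
    os.makedirs(plate_dir, exist_ok=True)
    name = f"{row.Metadata_Well}_{row.Metadata_Site}.npz"
    # Two cells with three features each per site.
    np.savez_compressed(os.path.join(plate_dir, name), features=np.random.rand(2, 3))


def load_all_features(root_dir):
    """Read index.csv, load each site's .npz file, and stack into one DataFrame."""
    index_df = pd.read_csv(os.path.join(root_dir, "index.csv"))
    frames = []
    for row in index_df.itertuples(index=False):
        name = f"{row.Metadata_Well}_{row.Metadata_Site}.npz"
        with np.load(os.path.join(root_dir, row.Metadata_Plate, name)) as data:
            feats = data["features"]
        site_df = pd.DataFrame(feats, columns=[f"DP_{i}" for i in range(feats.shape[1])])
        # Insert in reverse so the metadata columns end up Plate, Well, Site.
        site_df.insert(0, "Metadata_Site", row.Metadata_Site)
        site_df.insert(0, "Metadata_Well", row.Metadata_Well)
        site_df.insert(0, "Metadata_Plate", row.Metadata_Plate)
        frames.append(site_df)
    return pd.concat(frames, ignore_index=True)


single_cells = load_all_features(root)
```

Because the index is read first, it can be filtered before the loop so that only the needed feature files are loaded.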
Thanks for linking the issue in DeepProfilerExperiments. I think we can discuss the format of DeepProfiler features here (which impacts this repository) and the integration with pycytominer there (when we agree how the format should be).
Style. I'd like to stick to numpy arrays rather than using dataframes or dictionaries for features.
For sure - I agree that the DeepProfiler output shouldn't be tinkered with, except maybe to add the metadata dictionary to the .npz as this issue proposes. What I am talking about is how DeepProfiler features from single cells will flow through the traditional profiling pipeline to get to aggregated/annotated profiles (see figure below). Anything I am referring to (except the metadata dictionary in .npz) is downstream of DeepProfiler output.
Figure legend: Integrating DeepProfiler output with pycytominer. In the current CellProfiler pipeline, level 1 data (images) are first processed by CellProfiler for segmentation and feature extraction, followed (typically but not always) by cytominer-database to wrangle all of the single cell compartment files by site into a single database (.sqlite file) per plate. The most common entry point for pycytominer is at level 2 data (the .sqlite file). After level 2 data is ingested, we use pycytominer for aggregation (level 3 data), normalization (level 4a data), and feature selection (level 4b data). (See https://github.com/broadinstitute/lincs-cell-painting/issues/1#issuecomment-588411309 for data leveling details.) We are currently working towards building the DeepProfiler integration as an alternative to CellProfiler. Ideally, the DeepProfiler output (level 2 data) will flow through a similar processing pipeline to arrive at aggregated/annotated profile data (level 3 data).
In order for DeepProfiler aggregated profiles (level 3 data) to be compatible with other pycytominer tools, we will need to use pandas DataFrames (for aggregated profiles only, not single cell data) that are in the same format as current level 3 CellProfiler output (metadata columns, followed by feature columns). Am I understanding your concern correctly? Does this make sense?
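To make the target layout concrete, here is a small sketch of going from a single-cell table to per-well aggregated profiles with metadata columns followed by feature columns. It uses plain pandas rather than pycytominer itself; the values, the `DP_` prefix, and the choice of the median as the aggregation statistic are assumptions for illustration.

```python
import pandas as pd

# Hypothetical single-cell table: metadata columns followed by feature columns.
single_cells = pd.DataFrame(
    {
        "Metadata_Plate": ["P1"] * 4,
        "Metadata_Well": ["B03", "B03", "B04", "B04"],
        "DP_0": [1.0, 3.0, 10.0, 20.0],
        "DP_1": [0.0, 2.0, 5.0, 7.0],
    }
)

strata = ["Metadata_Plate", "Metadata_Well"]
features = [c for c in single_cells.columns if not c.startswith("Metadata_")]

# One row per well: the median of each feature across that well's cells.
profiles = single_cells.groupby(strata, as_index=False)[features].median()
```

The resulting `profiles` frame has one row per well, with the `Metadata_` columns first and the feature columns after them.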
The only ones that I know are consistent across experiments are Metadata_Plate, Metadata_Well and Metadata_Site, and I think this is the minimum set of fields required. This leads me to the question: why do you need metadata in the .npz files?
My goal is to have this exact metadata information (Plate, Well, and Site) linked to the single cell profiles. I don't like them being embedded in the file name, but if this is what we decide, that is fine with me too! :) I also totally appreciate how difficult it is (and costly!) to introduce this sort of enhancement, so I am happy to go with what we decide is best, after weighing these factors that are outside my knowledge. Is there an alternative way to link single cell profiles to these three metadata features?
Thanks for linking the issue in DeepProfilerExperiments. I think we can discuss the format of DeepProfiler features here (which impacts this repository) and the integration with pycytominer there (when we agree how the format should be).
👍
I wonder if it would be easier to load all DeepProfiler feature files with one routine and then transform them into a format that is more compatible with cytominer-database. We have example code in DeepProfilerExperiments on how to do that, basically read the index.csv, loop through the rows to load individual .npz files and put everything together in a single dataframe. Such script can produce any format that is cytominer friendly. Do you think this makes sense?
Yes! Precisely! Except for this I'd like to sidestep cytominer-database. It is currently hard to use, and I don't see the upside to sqlite or parquet files compared to compressed text files.
But I will look into it... it might be nice to be globally consistent
Thanks for clarifying @gwaygenomics !
In order for DeepProfiler aggregated profiles (level 3 data) to be compatible with other pycytominer tools, we will need to use pandas DataFrames (for aggregated profiles only, not single cell data) that are in the same format as current level 3 CellProfiler output (metadata columns, followed by feature columns). Am I understanding your concern correctly? Does this make sense?
Got it. Yes, this makes sense.
I thought you were suggesting to name the features inside the .npz files, but now I understand that you want to have a naming convention for level 3 aggregated profiles. I like that idea, and it actually makes sense to think about the naming convention. Instead of having the prefix DP, we could use the name of the network, such as ResNet50_ or DenseNet121_. In that way, we know that the features come from a specific network, which can make a difference on how we interpret the results downstream. The question here would be, how do we feed this name to pycytominer?
My goal is to have this exact metadata information (Plate, Well, and Site) linked to the single cell profiles. I don't like them being embedded in the file name, but if this is what we decide, that is fine with me too! :) I also totally appreciate how difficult it is (and costly!) to introduce this sort of enhancement, so I am happy to go with what we decide is best, after weighing these factors that are outside my knowledge. Is there an alternative way to link single cell profiles to these three metadata features?
Good point. Here are some alternatives to naming the files with Plate, Well and Site:
.npz files in the index.csv. This would make the naming independent of metadata. If the file name changes for any reason, the backup plan is to read the metadata fields that are inside the .npz file. This assumes that the strategy for using DeepProfiler features is top-down: from the metadata file you then find the feature files.
index.csv file and put all the metadata inside the .npz files. This assumes that the strategy for using DeepProfiler features is bottom-up: read the feature files to reconstruct the experiment metadata.
To be honest, I don't have a preference as long as the feature vectors don't change. I'm happy to keep all the metadata in the .npz files or in the index.csv or a mix of both, if it makes things easier and consistent with the CellProfiler way. At this point, I'd recommend to keep all the metadata in the .npz files and forget about the index.csv for downstream analysis. Do you see any issues or benefits in either strategy?
we could use the name of the network, such as ResNet50 or DenseNet121. In that way, we know that the features come from a specific network, which can make a difference on how we interpret the results downstream. The question here would be, how do we feed this name to pycytominer?
Love it! Right now, the way I see it is to introduce the function load_npz() to pycytominer. We can tinker with the prefix using a function argument. The infrastructure that we end up building for pycytominer to convert level 2 DeepProfiler to level 3 will call this function, so it will most commonly not be called directly by humans.
How does DeepProfiler typically encode the network used? Is this info in the index.csv? Some other metadata file? If not, a simple (but more fragile) option is to include it as an argument to an aggregate(method="DeepProfiler", network_prefix="ResNet50_") call (or something similar).
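As a rough illustration of the prefix-as-argument idea, a loader might look like the sketch below. This is not pycytominer's actual load_npz(); the name `load_npz_sketch`, the `feature_prefix` parameter, and the assumption that the file holds a `features` array plus an optional `metadata` dictionary are all hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def load_npz_sketch(path, feature_prefix="DP"):
    """Hypothetical loader: metadata columns first, then prefixed feature columns."""
    with np.load(path, allow_pickle=True) as data:
        features = data["features"]
        # A dict saved with np.savez comes back as a 0-dimensional object array.
        metadata = data["metadata"].item() if "metadata" in data.files else {}
    columns = [f"{feature_prefix}_{i}" for i in range(features.shape[1])]
    df = pd.DataFrame(features, columns=columns)
    # Insert in reverse so the metadata columns keep their original order.
    for key in reversed(list(metadata)):
        df.insert(0, key, metadata[key])
    return df


# Demo with a throwaway file and illustrative metadata values.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.npz")
np.savez_compressed(
    demo_path,
    features=np.zeros((2, 3)),
    metadata={"Metadata_Plate": "P1", "Metadata_Well": "B03", "Metadata_Site": "s2"},
)
demo_df = load_npz_sketch(demo_path, feature_prefix="ResNet50")
```

With the prefix passed in as an argument, an aggregation routine can forward the network name without the person running the pipeline ever calling the loader directly.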
I'm happy to keep all the metadata in the .npz files or in the index.csv or a mix of both, if it makes things easier and consistent with the CellProfiler way. At this point, I'd recommend to keep all the metadata in the .npz files and forget about the index.csv for downstream analysis. Do you see any issues or benefits in either strategy?
Cool. My vote is to have the metadata (at least Plate, Well, Site) in the npz file and also include the index.csv since it is a necessary input file anyway. However, one key point is that the metadata key in the npz file must contain a dictionary. This is critical to how load_npz() currently works. I will add a check for a dictionary output for safety.
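A dictionary check along these lines might look like the following sketch. The function name `read_metadata` and the choice to raise a `TypeError` are assumptions for illustration, not the check that was actually added to pycytominer.

```python
import os
import tempfile

import numpy as np


def read_metadata(path):
    """Return the metadata dict stored in an .npz file, or raise if it is not a dict."""
    with np.load(path, allow_pickle=True) as data:
        raw = data["metadata"]
    # A dict saved by np.savez comes back as a 0-dimensional object array;
    # anything with more dimensions cannot be a single dict.
    metadata = raw.item() if raw.ndim == 0 else raw
    if not isinstance(metadata, dict):
        raise TypeError(f"expected a dict under 'metadata', got {type(metadata).__name__}")
    return metadata


tmp_dir = tempfile.mkdtemp()

# A well-formed file: metadata is a dictionary.
good_path = os.path.join(tmp_dir, "good.npz")
np.savez_compressed(good_path, features=np.zeros((1, 2)), metadata={"Metadata_Well": "B03"})
good_metadata = read_metadata(good_path)

# A malformed file: metadata is a plain array of strings.
bad_path = os.path.join(tmp_dir, "bad.npz")
np.savez_compressed(bad_path, features=np.zeros((1, 2)), metadata=np.array(["P1", "B03"]))
try:
    read_metadata(bad_path)
    bad_rejected = False
except TypeError:
    bad_rejected = True
```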
So, to summarize, the pycytominer process will first look to the .npz file for metadata, and then look to the index.csv. We will include the union of metadata info in the .npz and index.csv in forming the metadata for aggregate profiles. Pycytominer will also check to see if any info in the index.csv contradicts the .npz file and throw some warnings if so.
The reason to allow for both is for backwards compatibility with legacy DeepProfiler datasets that only have index.csv files. How do you feel about this?
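The union-with-warnings behaviour summarized above can be sketched as follows. The function name `merge_metadata` is hypothetical, and letting the .npz value win on a conflict is an assumption; the thread only says that a contradiction should produce a warning.

```python
import warnings


def merge_metadata(npz_metadata, index_row):
    """Union of both sources; the .npz value wins and conflicts raise a warning."""
    merged = dict(index_row)
    for key, value in npz_metadata.items():
        if key in merged and merged[key] != value:
            warnings.warn(
                f"{key}: index.csv has {merged[key]!r} but .npz has {value!r}; "
                "using the .npz value"
            )
        merged[key] = value
    return merged


# Illustrative records: the well disagrees between the two sources.
npz_metadata = {"Metadata_Plate": "Week1_22123", "Metadata_Well": "B03"}
index_row = {
    "Metadata_Plate": "Week1_22123",
    "Metadata_Well": "B04",
    "compound": "cytochalasin B",
}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    merged = merge_metadata(npz_metadata, index_row)
```

For a legacy dataset with no embedded metadata, passing an empty dictionary as `npz_metadata` simply returns the index.csv record unchanged.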
Sounds good! The updated format has been implemented in 913fb329078403d90ce446986fc734f51c08e15b. The format has the following fields:
{
"features": array,
"metadata": dict({
'Metadata_Plate': value,
'Metadata_Well': value,
'Metadata_Site': value,
'Metadata_Model': value
})
}
Note that the dictionary in the metadata entry needs to be accessed in the following way:
import numpy as np

data = np.load("features_file.npz", allow_pickle=True)
metadata = data["metadata"][()]  # Index the 0-dimensional object array to get the dict
I will share example data with this format!
@gwaygenomics note that there is a Metadata_Model field which will have the name of the feature extraction model used to obtain these features.
Here is a compressed file with example features. features.tar.gz
As mentioned in the other discussion thread, we can easily recompute features with the new format that we agree on. I didn't realize you recommended an index file with paths to features, and we can produce that as an output of the feature extraction process. The path to files can also be included in our current metadata file, the index.csv. That would be the easiest as it already contains all the other information required to identify these features.
I will work on implementing this feature as an output of DeepProfiler and will report back here soon.
I assume the paths to feature files should be relative to a root folder, as features can be moved from one environment to another and absolute paths don't transfer well. Is this correct @gwaygenomics ?
I didn't realize you recommended an index file with paths to features, and we can produce that as an output of the feature extraction process. The path to files can also be included in our current metadata file, the index.csv. That would be the easiest as it already contains all the other information required to identify these features.
I assume the paths to feature files should be relative to a root folder, as features can be moved from one environment to another and absolute paths don't transfer well. Is this correct
I think that my comment in https://github.com/broadinstitute/DeepProfilerExperiments/issues/2#issue-679189370 (also pasted below) is the source of this recommendation:
I believe that they should come from an internal source or be stored in an external file that includes file path information pointing to files with corresponding metadata. The latter is also fragile (file names are mutable!), but not as fragile as the metadata-in-file name paradigm.
If I'm understanding correctly, I don't actually recommend this approach. Linking file paths is also fragile for the reason you mention about absolute paths, and also because the absolute path can change without any consequence to the index.csv link. We had this happen in the Cell Health project, for example with the load_data.csv file.
I think the right approach is to encode all metadata information in the .npz file so that we can remove the need to also pass along the index csv from the perspective of pycytominer. I totally see the value of retaining the index.csv file for other uses (for one, it's way easier to look at than an .npz file :joy:). But programmatically, I would be thrilled if we could provide a more permanent link between metadata and profiles by outputting a metadata dictionary to accompany each single cell profile.
OK. Yes, I think we want to make the output of DeepProfiler compatible with pycytominer and we also want to make it easy to integrate. So in conclusion, for the purposes of pycytominer integration:
index.csv file.
.npz files will contain the entire existing metadata record that is accessible to DeepProfiler, stored as a dictionary. This output will be the official format for DeepProfiler. I will create a PR to merge it and recompute the features in the datasets that we are working with at the moment. After closing this thread, we will continue the conversation of how to use pycytominer in the downstream analysis, which is maintained in the other repository.
Here is PR #240 with the changes. @Arkkienkeli , can you please review and merge? After this, we can close this issue, generate features for the datasets, and continue the conversation of integrating with pycytominer in the other thread.
@Arkkienkeli has merged the code and the implementation is now part of the master development! The next step is to re-compute features for the datasets and update the downstream analysis code in the DeepProfilerExperiments repository. Closing this issue here and continuing the conversation in there.
@gwaygenomics note that there is a Metadata_Model field which will have the name of the feature extraction model used to obtain these features.
One minor clarification question: Should this field be used as the DeepProfiler morphology feature prefix? (like Cells_, Cytoplasm_, Nuclei_ in CellProfiler)
Slight modifications to parsing plate, well, site info from filenames fixed in cytomining/pycytominer#210
np.savez_compressed() can receive multiple arguments. We should consider saving metadata information here in addition to encoding the info in the file name. File names can be overwritten and can be tough to extract from.
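As a minimal sketch of that suggestion, the snippet below stores a feature array and a metadata dictionary in the same archive and reads both back. The array contents, metadata values, and file name are made up for illustration; only np.savez_compressed() and np.load() are real NumPy calls.

```python
import os
import tempfile

import numpy as np

# Hypothetical single-cell feature matrix: 3 cells x 4 features.
features = np.arange(12, dtype=np.float32).reshape(3, 4)

# Illustrative metadata values for one site.
metadata = {
    "Metadata_Plate": "Week1_22123",
    "Metadata_Well": "B03",
    "Metadata_Site": "s2",
}

out_path = os.path.join(tempfile.mkdtemp(), "example_features.npz")

# Each keyword argument becomes a named entry inside the archive.
# The dict is stored as a 0-dimensional object array, which NumPy pickles.
np.savez_compressed(out_path, features=features, metadata=metadata)

# Reading it back requires allow_pickle=True because of the object array.
with np.load(out_path, allow_pickle=True) as data:
    loaded_features = data["features"]
    loaded_metadata = data["metadata"].item()
```

Since the metadata travels inside the file, renaming or moving the .npz no longer loses the plate, well, and site information.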