shntnu opened 11 months ago
@ErinWeisbart @timtreis – please feel free to weigh in on the folder structure below
Location of `segmentation`:
cellpainting-gallery/
└── cpg0016-jump
└── source_4
├── images
└── workspace
├── ...
├── segmentation
└── ...
Structure of `segmentation`:
└── segmentation
├── 2021_04_26_Batch1
│ ├── BR00117035
│ │ └── cellpose_<hash>
│ │ ├── BR00117035-A01-1
│ │ │ └── outlines
│ │ │ ├── A01_s1--cell_outlines.png
│ │ │ └── A01_s1--nuclei_outlines.png
│ │ └── BR00117035-A01-2
│ └── BR00117036
└── 2021_05_31_Batch2
- The `analysis`/`outlines` nesting seems unnecessary, but it makes it symmetric with the `analysis` folder, which also has cell segmentations.
- `hash` in `cellpose_<hash>` is some appropriate identifier for the version of cellpose (or your adaptation of it).

And regarding this:
How would we then best transfer the files to your AWS so that we can then unpack them there into the correct directories?
We can decide on this closer to when you are ready to go but for now, I'll hand this over to @leoank to ponder
I agree that `segmentation` fits in `workspace`.
I don't think we need to force the segmentations to comply with the exact same structure as the CellProfiler `analysis` outputs, as long as the Batch-Plate-PlateWellSite nesting remains the same. If we want to allow for the cellpose model or training data or anything else to be included along with the segmentations, I think we need/want the structure to be a bit different.
What do you think about this @shntnu?
(In this case including anything in `model` or `training` would be optional)
└── segmentation
└── cellpose_<hash>
├── model
├── training
└── outlines
├── 2021_04_26_Batch1
│ ├── BR00117035
│ │ ├── BR00117035-A01-1
│ │ │ ├── BR00117035-A01-1_cells.png
│ │ │ └── BR00117035-A01-1_nuclei.png
│ │ └── BR00117035-A01-2
│ └── BR00117036
└── 2021_05_31_Batch2
What do you think about this @shntnu?
I love it!
I don't think we need to include the CellPose model or its training; we basically use it "off-the-shelf", since it internally scales the cells to the average cell diameter it was trained on and then scales back up again. So that'd probably be a waste of space. We'll make the (snakemake) pipeline we use, and maybe some processing code, public of course, but it mostly just downloads whatever it needs 👍
Thanks @timtreis. No requirement to include anything in `model` or `training` with your data; I mostly want to ensure that we have a robust structure laid out that will work with future data organization as well.
@timtreis We are all set with the folder structure. Let me know when you are ready to do a test run.
Thanks a lot @ErinWeisbart !
Hey @ErinWeisbart and @shntnu, many thanks for already preparing everything! Our trial on optimized CellPose parameters got slightly delayed because we had to modify the pipeline (turns out building a DAG in snakemake with several million files is slightly suboptimal 🥸). I hope to have the results by early next week (will post here) and will then start with one source so that we can test the transfer workflow? Does that make sense?
@timtreis reported:
For the pilot, we're now performing the segmentation with different values for the nucleus and cytosol diameter parameters in CellPose on a stratified sample of the wells (excluding 9 because the data quality always stood out as poor):
The first run with the new setup is currently running. Once that’s done we’ll see if we perform a 3x3 grid or 5x5 grid for defining the final parameters. This here was the initial analysis based on what we had already downloaded for our hackathon:
The current idea would be to do mean +/- sd (3x3) or mean +/- (1x/0.5x) sd (5x5).
excluding 9 because the data quality always stood out as poor
Can you remind me what the issue was here?
Once that’s done we’ll see if we perform a 3x3 grid or 5x5 grid for defining the final parameters.
I am unclear about the terminology – what is this grid you are referring to? Is it in pixel space?
The current idea would be to do mean +/- sd (3x3) or mean +/- (1x/0.5x) sd (5x5).
Can you clarify? I didn't quite get it :D
excluding 9 because the data quality always stood out as poor
Can you remind me what the issue was here?
During basically all the work I've done on integrating JUMP, source 9 always behaved fundamentally differently from the other sources, so I started to exclude it so as not to make method dev harder than it is :D During the later parts of the project(s) I'd of course try to include it, but for this benchmark I was afraid of it skewing the distribution in a weird way.
Once that’s done we’ll see if we perform a 3x3 grid or 5x5 grid for defining the final parameters.
I am unclear about the terminology – what is this grid you are referring to? Is it in pixel space?
The current idea would be to do mean +/- sd (3x3) or mean +/- (1x/0.5x) sd (5x5).
Can you clarify? I didn't quite get it :D
(probably rounding those numbers to integers though)
CellPose accepts the roughly expected diameters of the nucleus and cytosol as parameters. So we'll run it on the stratified sample with different permutations of these and compare the number of segmented cells passing a certain (yet to be determined) quality threshold. Only then do we process the full data. Ideally, we'd want to not just segment JUMP but segment it well :D
Afterthought: If we see that there are big discrepancies, we could also run a binary search on them.
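For illustration, a minimal sketch of what such a parameter grid could look like (the diameter means/sds here are made-up placeholders, not the measured values from the sample):

```python
import itertools

# Hypothetical diameter statistics (in pixels) from the stratified sample.
nuc_mean, nuc_sd = 20.0, 4.0
cyto_mean, cyto_sd = 60.0, 10.0

# 3x3 grid: mean +/- sd for each diameter, rounded to integers;
# a 5x5 grid would additionally include mean +/- 0.5*sd.
nuc_values = [round(nuc_mean + k * nuc_sd) for k in (-1, 0, 1)]
cyto_values = [round(cyto_mean + k * cyto_sd) for k in (-1, 0, 1)]

for nuc_d, cyto_d in itertools.product(nuc_values, cyto_values):
    print(f"run CellPose with nucleus diameter={nuc_d}, cell diameter={cyto_d}")
```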
Also tagging @npeschke, the fantastic student working with me who has written most of the pipeline code.
Thank you very much for the clarifications, @timtreis @npeschke – nice to meet you!
I will post this issue in our internal slack so that people with knowledge and opinions (not me :D) can weigh in
Thank you! Happy to discuss details
Hey @shntnu @ErinWeisbart,
had a meeting with @npeschke yesterday, and during our discussion, a few questions came up. Just throwing them here for discussion.
1) Since we anticipate the primary consumers of the segmented cells to be ML/DL models, we're already splitting them into individual blacked-out tiles, looking like this:

Storage-wise, it'd be more efficient to just store the masks and not the channels, but then again, every user would probably perform that step anyway. So we think it'd be best if we directly provide the tiles one can stream into their dataloader. Wdyt?
2) Your sketch of the folder structure suggests that you were thinking of storing one file each for nucleus/cytosol. Similar to the concern from 1), most use cases I see would then require these images to be chopped into individual tiles again. However, there might be cases in which one would want the non-chopped-up image, so we could provide both? That would drastically increase storage needs and present another problem:
3) Do we need to preserve where a given cell was in the original image? Excuse the bad sketch, but theoretically, the segmentation of the entire image should represent the background as 0 and each area sharing an identical integer as a given cell. When we chop these into tiles, though, we end up with binary masks, which naively do not retain info on their original position unless we, e.g., preserve this original integer in the filename or metadata.
4) How do we best perform the "handover" of the results, however exactly they might look? I assume we'll have to transfer a few TB, where we could already try to imitate your desired structure, but since we don't have access to your S3 filesystem, we couldn't do the final steps there. Would @ErinWeisbart then take over?
I guess the least storage-intensive way would be to provide only the nucleus/cytosol mask files plus code to chop them into individual tiles. But that'd require every user to perform this step again and again.
Cheers, Tim
My thoughts, @shntnu as always feel free to disagree:
1. You have the best intuition of use case and I support making everything as user-friendly as possible! While we don't want to increase data storage thoughtlessly, I think it is quite reasonable to include these crops, and we have precedent of storing other image "intermediates" with other datasets.
2-3. I'm attached to folder nesting but not file structure within the folder nesting so chopping is fine. I do think retaining location information is important for preserving the ability to easily map to other profiling data. Again, reasonable expansion of storage is fine. My initial thought is that you could/should save out the per-site segmentation as well and use those integers in the cropped file name. (I suppose an alternative would be providing the x,y coordinate of the cell center in the file name, but that alone wouldn't be my preference).
My preference would be that original channel naming is preserved with the masks at the end, again for easier mapping between data. So the image example you've given above would be Channel 1-5 (as they are in the raw images) and Channel 6-7 would be masks.
And finally, since I love being pedantic, I think "outlines" isn't really sufficient to capture the breadth of data you're providing, so I suggest changing that to "masked_objects".
So if I'm piecing everything together correctly, it would look something like this:
└── segmentation
└── cellpose_<hash>
├── model
├── training
└── masked_objects
├── 2021_04_26_Batch1
│ ├── BR00117035
│ │ ├── BR00117035-A01-1
│ │ │ ├── BR00117035-A01-1_all_cells.png
│ │ │ ├── BR00117035-A01-1_all_nuclei.png
│ │ │ ├── BR00117035-A01-1_cell_001_c1.png
│ │ │ ├── BR00117035-A01-1_cell_001_c2.png
│ │ │ ├── BR00117035-A01-1_cell_001_c3.png
│ │ │ ├── BR00117035-A01-1_cell_001_c4.png
│ │ │ ├── BR00117035-A01-1_cell_001_c5.png
│ │ │ ├── BR00117035-A01-1_cell_001_c6.png
│ │ │ ├── BR00117035-A01-1_cell_001_c7.png
│ │ │ ├── BR00117035-A01-1_cell_002_c1.png
│ │ │ └── ...
│ │ └── BR00117035-A01-2
│ └── BR00117036
└── 2021_05_31_Batch2
4. The simplest file transfer approach is: if you have your files on an S3 bucket, you can make them public and then I can copy the files to the cpg, and then we don't have to do any fancy credential handling. This is by far the easiest and therefore our preferred approach for getting data into cpg.
Everything @ErinWeisbart said sounds reasonable to me. Some comments:

- `BR00117035-A01-1_all_cells.png` is a label image, and e.g. `BR00117035-A01-1_cell_028_c1.png` is a binary image, so we can look up the coordinates of the latter by `imread("BR00117035-A01-1_all_cells.png", as_gray=True) == 28`. That's good enough, I suppose.
- `BR00117035-A01-1_all_cells.png` will need to be a 16-bit PNG because we can have > 255 objects.

@shntnu Yes, I meant for `_all_cells.png` to be a label image. I have no attachment to exact file format (.png, .tif, .whatever).

Location information could also be exported as a separate locations file (.csv, .txt, .whatever) with each object indexed to an x,y location (center or bounding box). True to form, my first thought is pictures while your first thought is numbers ;)
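For illustration, a minimal sketch of the label-image lookup described above (assuming scikit-image; the filenames are the hypothetical examples from this thread):

```python
import numpy as np
from skimage.io import imread

labels = imread("BR00117035-A01-1_all_cells.png")  # 16-bit label image
cell_28 = labels == 28                             # binary mask of object 28
ys, xs = np.nonzero(cell_28)
print(f"object 28 bounding box: rows {ys.min()}-{ys.max()}, cols {xs.min()}-{xs.max()}")
```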
Thanks for clarifying @ErinWeisbart
@timtreis @npeschke – back to you
@shntnu Nice to meet you too!
Just to get everyone on the same page regarding the output of our pipeline:
Currently we export the single cell images + masks @timtreis mentioned here into a single hdf5 file per `Metadata_Source`.
The hdf5 file itself has the following group & dataset structure:
└── source_8 <Metadata_Source>
├── ACGUYXCXAPNIKK-UHFFFAOYSA-N <Metadata_InChiKey>
│   ├── source_8__J3__A1170544__P19__1 <image id (joining Source, Batch, Plate, Well and site on '__')>
│ │ ├── single_cell_data (66) <Stack of all n cells with shape (n, 7(5+2masks), x_resolution, y_resolution)>
│ │ └── single_cell_index (66)
│ ├── source_8__J3__A1170544__P19__2
│ │ ├── single_cell_data (47)
│ │ └── single_cell_index (47)
├── ACLUEOBQFRYTQS-UHFFFAOYSA-N
│ ├── source_8__J3__A1170541__F21__1
│ │ ├── single_cell_data (119)
│ │ └── single_cell_index (119)
│ ├── source_8__J3__A1170541__F21__2
│ │ ├── single_cell_data (101)
│ │ └── single_cell_index (101)
...
On the image id level, all related data (InChI, Source, Plate, Well, etc.) from the parquet files are also saved as attributes in the hdf5 file. The reason behind accumulating all the data into one hdf5 file per source is efficient access/loading for ML. But this is just the status quo; of course the output can be adapted.
At the moment, we discard the full frame segmentation during our pipeline but that can be changed easily if you want to have the full masks as well.
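For illustration, a minimal h5py sketch of walking this layout (the file name is an assumption; group/dataset names follow the tree above):

```python
import h5py

with h5py.File("source_8.h5", "r") as f:
    for inchikey, compound in f["source_8"].items():
        for image_id, site in compound.items():
            cells = site["single_cell_data"][:]   # (n, 7, y, x): 5 channels + 2 masks
            index = site["single_cell_index"][:]  # maps rows back to cells in the image
            print(image_id, cells.shape, dict(site.attrs))
```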
I suppose an alternative would be providing the x,y coordinates of the cell center in the file name, but that alone wouldn't be my preference
Agree with this.
My preference would be that original channel naming is preserved with the masks at the end, again for easier mapping between data. So the image example you've given above would be Channel 1-5 (as they are in the raw images) and Channel 6-7 would be masks.
I think Nic and I are agnostic on this, just sth we'll need to define and then we'll do it this way.
And finally, since I love being pedantic, I think "outlines" isn't really sufficient to capture the breadth of data you're providing, so I suggest changing that to "masked_objects".
Fine for us 👍
The simplest file transfer approach is if you have your files on an S3 bucket, you can make them public and then I can copy the files to the cpg and then we don't have to do any fancy credential handling. This is by far the easiest and therefore our preferred approach for getting data into cpg.
I'm not sure whether we have an S3 bucket where we can temporarily store that data, but I'll ask around.
I don't know how else one would save the coordinates other than via a separate locations file. Oh, maybe you are saying that `BR00117035-A01-1_all_cells.png` is a label image, and e.g. `BR00117035-A01-1_cell_028_c1.png` is a binary image, so we can look up the coordinates of the latter by `imread("BR00117035-A01-1_all_cells.png", as_gray=True) == 28`. That's good enough, I suppose.
Yeah, that's the idea 👌
For some reason I thought @timtreis and @npeschke were using Parquet, not PNGs, to store the outlines but I might have gotten that wrong.
I think this is something I mentioned in passing. We could theoretically run a pilot in which we store all this data in a (self-ad) https://spatialdata.scverse.org/en/latest/ object, which would e.g. allow us to map an arbitrary number of cells as well as the original image onto a shared coordinate system. But as Nic said, currently we're writing to hdf5 but are, of course, flexible on storage.
Location information could also be exported as a separate locations file (.csv, .txt, .whatever) with each object indexed to an x,y location (center or bounding box). True to form, my first thought is pictures while your first thought is numbers ;)
I'm not necessarily against this, and the files would be tiny, but this information would be redundant when the filenames indicate the integer from the full label image, right?
One aside that is not terribly consequential for this data set, but might be for future ones: because Cellpose does not create overlapping objects, it's possible to create a label image where `imread("BR00117035-A01-1_all_cells.png", as_gray=True) == 28` works. If we want cpg to have data sets with segmentations that may overlap, I'm less excited about a label image than a run-length encoding, a full-sized binary mask per object (which is a lot of files, but with compression they should be small), and/or a cropped mask with center and/or defined corner positions (what @timtreis was discussing). A label image is nice for some downstream tasks, so selfishly, great to have that also, but I don't want to lock us into an "objects never overlap" structure.
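For illustration, a minimal sketch of run-length encoding a binary mask (COCO-style column-major counts; a generic example, not a proposed cpg spec):

```python
import numpy as np

def rle_encode(mask: np.ndarray) -> list:
    """Encode a binary mask as alternating run lengths, starting with zeros."""
    pixels = mask.flatten(order="F").astype(np.int8)  # column-major, like COCO
    changes = np.flatnonzero(np.diff(pixels)) + 1     # indices where runs switch
    runs = np.diff(np.concatenate(([0], changes, [pixels.size])))
    return ([0] if pixels[0] == 1 else []) + runs.tolist()

mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
print(rle_encode(mask))  # [5, 2, 2, 2, 5]
```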
@timtreis @npeschke @ErinWeisbart – I've summarized our discussion and decisions so far, below. Shall we zoom in the new year to finalize? I've sent us all an invite (added Beth as optional)
- `training` - Tim/Nic suggest skipping this for this dataset
- `model` - same as above, but minimally we should have a README.md in here to point to the resource used

Location of `segmentation` (unchanged):
cellpainting-gallery/
└── cpg0016-jump
└── source_4
├── images
└── workspace
├── ...
├── segmentation
└── ...
Structure of `segmentation`:
└── segmentation
└── cellpose_<hash>
├── model
├── training
└── objects
├── 2021_04_26_Batch1
│ ├── BR00117035
│ │ └── BR00117035.zarr
│ ├── ...
│ └── ...
└── 2021_05_31_Batch2
Proposed changes to `<plate>.zarr`: Don't use InChIKeys in the index because those could change, and also that won't generalize to other perturbation types.

Q: what is the `single_cell_index` entity?
└── BR00117035 <Metadata_Plate>
│   ├── source_8__J3__A1170544__P19__1 <image id (joining Source, Batch, Plate, Well and site on '__')>
│ │ ├── single_cell_data (66) <Stack of all n cells with shape (n, 7(5+2masks), x_resolution, y_resolution)>
│ │ └── single_cell_index (66)
│ ├── source_8__J3__A1170544__P19__2
│ │ ├── single_cell_data (47)
│ │ └── single_cell_index (47)
│ ├── source_8__J3__A1170541__F21__1
│ │ ├── single_cell_data (119)
│ │ └── single_cell_index (119)
│ ├── source_8__J3__A1170541__F21__2
│ │ ├── single_cell_data (101)
│ │ └── single_cell_index (101)
...
Structure of `single_cell_data`:

Structure of `single_cell_index`:
Other notes:
@ErinWeisbart said: "My preference would be that original channel naming is preserved with the masks at the end, again for easier mapping between data. So the image example you've given above would be Channel 1-5 (as they are in the raw images) and Channel 6-7 would be masks." @timtreis said: "I think Nic and I are agnostic on this, just sth we'll need to define and then we'll do it this way."
Include a `channel_mapping.json` file which will have the mapping from the index number (of the second dimension of `single_cell_data`) to the channel name. Default: `{0: 'CellMask', 1: 'NucleusMask', 2: 'AGP', 3: 'DNA', 4: 'ER', 5: 'Mito', 6: 'RNA', 7: 'Brightfield', 8: 'BFHigh', 9: 'BFLow'}`.
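For illustration, a minimal sketch of using such a mapping to pull one channel out of the stack (a stand-in array replaces the real `single_cell_data`):

```python
import numpy as np

channel_mapping = {0: 'CellMask', 1: 'NucleusMask', 2: 'AGP', 3: 'DNA',
                   4: 'ER', 5: 'Mito', 6: 'RNA', 7: 'Brightfield',
                   8: 'BFHigh', 9: 'BFLow'}
name_to_index = {v: k for k, v in channel_mapping.items()}

# Stand-in for one site's stack: (n cells, channels, y, x).
single_cell_data = np.zeros((66, 10, 128, 128), dtype=np.uint16)
dna_crops = single_cell_data[:, name_to_index['DNA']]  # all DNA crops
```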
Should we also store a single large label image, or not? If so, would that live with the Zarr file? Presumably yes. Should we store XY locations as well? If so, all this would fit into the structure of `BR00117035.zarr` like this:
│ ├── source_8__J3__A1170541__F21__2
│ │ ├── label_image
│ │ ├── location
│ │ ├── single_cell_data (101)
│ │ └── single_cell_index (101)
We have now started a Slack channel to discuss this project https://join.slack.com/share/enQtNjM4MTk5NTc5MzYxNy0yNDYxYTM4MThiMThiYTMzMDY0ZDI0NzkyYzJhZWVjNGI0OGU4ODA4MjZjMTVhZWJmNjRhMzgyYjI0YjQzZWQ0
Q: what is the `single_cell_index` entity?

In case the full segmentation masks of all cells in an image are kept and saved, the `single_cell_index` can be used to trace single cells back to the full segmentation mask.
Notes from meeting, 2024-01-03:
Should we also store a single large label image, or not? If so, would that live with the Zarr file? Presumably yes. Should we store XY locations as well? If so, all this would fit into the structure of `BR00117035.zarr` like this:

│   ├── source_8__J3__A1170541__F21__2
│   │   ├── label_image
│   │   ├── location
│   │   ├── single_cell_data (101)
│   │   └── single_cell_index (101)
Additionally storing the locations would require changes to SPARCSpy (the framework we use to generate the segmentations and extract single cells) itself. Therefore, I would suggest saving the label_image together with the single_cell_data + _index in the zarr file. This would still leave the possibility to trace single cells back to the original image and, with additional effort, to calculate the centroids again.
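For illustration, a minimal sketch of that trace-back and centroid recomputation (stand-in arrays; in practice these would come from a per-site group in the zarr store):

```python
import numpy as np

label_image = np.zeros((64, 64), dtype=np.uint16)  # full-frame label image
label_image[10:20, 30:40] = 3                      # one labeled cell
single_cell_index = np.array([1, 2, 3])            # crop row -> label

i = 2                                       # the i-th cropped cell
mask = label_image == single_cell_index[i]  # its pixels in the original frame
ys, xs = np.nonzero(mask)
centroid = (ys.mean(), xs.mean())           # recomputed centroid
```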
I am checking against the current snapshot of https://cellpainting-gallery.s3.amazonaws.com/index.html#cpg0016-jump/source_2/workspace/segmentation/cellpose/
Not done:
Done:
Unsure if done or not:
@ErinWeisbart I suppose we should update https://github.com/broadinstitute/cellpainting-gallery/blob/789a5a65d8f8bf653b995b1d176e4adb90885af2/folder_structure.md#segmentation-folder-structure?
If so, I'd request Tim or Nic to draft it if they are able, and we can review
@timtreis @ErinWeisbart – looping back on this
It sounds like Tim is ready to upload more data to the gallery staging bucket.
Before he does:
I updated it when I converted our docs to a Jupyterbook https://broadinstitute.github.io/cellpainting-gallery/data_structure.html#segmentation-folder-structure but it's worth having @timtreis take a look and see 1) if I've introduced any mistakes 2) if there is any other knowledge he can transfer there that would be helpful for others in the future
- [ ] add README.md to `model` with CellPose version
Happy to add, what do we want to include? I assume the version of the CellPose model the pipeline delegates to. I could also include a release version of the pipeline repo I made with a link. Anything else?
- [ ] `cellpose_<some unique string>` to differentiate from potential reruns with a different version
I forgot what we wanted to use for the unique string, but I assume it's arbitrary anyway, right?
- [ ] Ask Sophia if we can spatially trace the crops back to the original
We haven't followed up on this much yet since we weren't sure about the usefulness of this data - wdyt?
- [ ] x/y res is up to pipeline, identical for every image
Don't fully understand this, can you elaborate @shntnu ?
- [ ] blacked out tiles
the pipeline only creates blacked out tiles
- [ ] parse channels from filename, e.g. "xyz_ch1_origDNA" -> ch1
This information is contained in a `channel_mapping.json` which exists next to every .zarr file (representing a given plate):
{"0": "NucleusMask",
"1": "CellMask",
"2": "DNA",
"3": "AGP",
"4": "ER",
"5": "Mito",
"6": "RNA"}
So we no longer have this filename information. Am I addressing your point? 🤔
re: `some unique string`: it can be anything. We often use the date of project start (or a close approximation) in YYYYMMDD format as a unique string identifier, so I would suggest that?
Note that the checklist I created was based on your notes in https://github.com/broadinstitute/cellpainting-gallery/issues/73#issuecomment-1875742288, not necessarily what I had in mind :D
- [ ] add README.md to `model` with CellPose version
Happy to add, what do we want to include? I assume the version of the CellPose model the pipeline delegates to. I could also include a release version of the pipeline repo I made with a link. Anything else?
That seems good enough
- [ ] Ask Sophia if we can spatially trace the crops back to the original
We haven't followed up on this much yet since we weren't sure about the usefulness of this data - wdyt?
Does this mean that the crops cannot be currently used to mask the original images? If so, we should just state that in your readme https://github.com/theislab/jump-cpg0016-segmentation/blob/main/README.md and then leave the data as is (no need to figure it out right now)
- [ ] x/y res is up to pipeline, identical for every image
Don't fully understand this, can you elaborate @shntnu ?
No clue :D It was from your notes
- [ ] blacked out tiles
the pipeline only creates blacked out tiles
Ok by blacked out tiles you are referring to the first image in this comment https://github.com/broadinstitute/cellpainting-gallery/issues/73#issuecomment-1854310285
- [ ] parse channels from filename, e.g. "xyz_ch1_origDNA" -> ch1
This information is contained in a `channel_mapping.json` which exists next to every .zarr file (representing a given plate): `{"0": "NucleusMask", "1": "CellMask", "2": "DNA", "3": "AGP", "4": "ER", "5": "Mito", "6": "RNA"}`
Noted; as long as @ErinWeisbart is good with this, I am good with it
So we no longer have this filename information. Am I addressing your point? 🤔
Yes
@timtreis I'd recommend transferring just one source, `source_7`, for now, so that we can do some manual checks, and then do the rest.

`source_7` is the smallest: https://github.com/jump-cellpainting/datasets/blob/main/stats/cpg0016_source_images_tiff_count.csv

In fact, even one plate would be good enough for now.
I updated it when I converted our docs to a Jupyterbook https://broadinstitute.github.io/cellpainting-gallery/data_structure.html#segmentation-folder-structure but it's worth having @timtreis take a look and see 1) if I've introduced any mistakes 2) if there is any other knowledge he can transfer there that would be helpful for others in the future
@timtreis please have a look when you get the chance
Currently finishing this up :) The output that the pipeline now generates is as follows (simplified for a single batch):
└── segmentation
└── cellpose_202404
├── model
│   └── README.md
└── objects
└── 20210719_Run1
└── CP1-SC1-01
├── channel_mapping.json
└── CP1-SC1-01.zarr
├── <source>__<batch>__<plate>__<well>__<site>
│   ├── label_image/
│   ├── single_cell_data/
│   ├── single_cell_index/
│   └── .zgroup
├── ...
⋮
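For illustration, a minimal sketch of reading one plate store in this layout (assuming the zarr Python package; store path and array names follow the tree above):

```python
import zarr

plate = zarr.open("CP1-SC1-01.zarr", mode="r")
for image_id, site in plate.groups():
    labels = site["label_image"][:]       # full-frame label image
    cells = site["single_cell_data"][:]   # (n, channels, y, x) crops
    index = site["single_cell_index"][:]  # crop row -> label in labels
    print(image_id, cells.shape)
```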
In the README.md, I have now added the following text:
# Notes
These files were created using a Cellpose-based snakemake pipeline. More
information can be found here: https://github.com/theislab/jump-cpg0016-segmentation
## Relevant software versions:
- https://github.com/theislab/jump-cpg0016-segmentation@v0.1.0
- cellpose=2.2.3=pyhd8ed1ab_0
- sparcscore==1.0.0
## Literature
- "Cellpose: a generalist algorithm for cellular segmentation", Stringer et al.,
2021, https://www.nature.com/articles/s41592-020-01018-x
- "Three million images and morphological profiles of cells treated with
matched chemical and genetic perturbations", Chandrasekaran et al., 2024, https://www.biorxiv.org/content/10.1101/2022.01.05.475090v3
wdyt @shntnu @ErinWeisbart ?
Note that the checklist I created was based on your notes in #73 (comment), not necessarily what I had in mind :D
- [ ] add README.md to `model` with CellPose version
Happy to add, what do we want to include? I assume the version of the CellPose model the pipeline delegates to. I could also include a release version of the pipeline repo I made with a link. Anything else?
That seems good enough
I have now included the release-tagged version of the pipeline, the version of SPARCSpy that we delegate to, and the cellpose build version (see comment above)
- [ ] Ask Sophia if we can spatially trace the crops back to the original
We haven't followed up on this much yet since we weren't sure about the usefulness of this data - wdyt?
Does this mean that the crops cannot currently be used to mask the original images? If so, we should just state that in your readme https://github.com/theislab/jump-cpg0016-segmentation/blob/main/README.md and then leave the data as is (no need to figure it out right now)
Yes, we currently cannot do this. I thought you mentioned you already had centroid coordinates from CellProfiler, so extracting a bounding box around those would be fairly comparable (although it'd be a PITA to trace which blacked-out cell belongs to which cropped cell).
- [ ] x/y res is up to pipeline, identical for every image
Don't fully understand this, can you elaborate @shntnu ?
No clue :D It was from your notes
Ah yes, I remember 😅 That was the question of whether we scale the tiles in any way to a desired target resolution; we chose not to.
- [ ] blacked out tiles
the pipeline only creates blacked out tiles
Ok by blacked out tiles you are referring to the first image in this comment #73 (comment)
Yes, except of course as individual images and not a 7x1 strip :)
- [ ] parse channels from filename, e.g. "xyz_ch1_origDNA" -> ch1
This information is contained in a `channel_mapping.json` which exists next to every .zarr file (representing a given plate): `{"0": "NucleusMask", "1": "CellMask", "2": "DNA", "3": "AGP", "4": "ER", "5": "Mito", "6": "RNA"}`
Noted; as long as @ErinWeisbart is good with this, I am good with it
Asked her, she's good with it 👍
So we no longer have this filename information. Am I addressing your point? 🤔
Yes
Cool!
@timtreis thank you so much for your diligence!
Everything looks good to me.
@timtreis I'd recommend transferring just one source, `source_7`, for now, so that we can do some manual checks, and then do the rest. `source_7` is the smallest: https://github.com/jump-cellpainting/datasets/blob/main/stats/cpg0016_source_images_tiff_count.csv In fact, even one plate would be good enough for now.
Just a reminder that a small set would be good to start with.
Yes, going to cook dinner and then try that 👌🏻 Ankur provided me with a tutorial
From @ErinWeisbart
This is what I used to generate the sync commands since they needed to have batch added and be performed on a plate-by-plate basis:
import boto3

# in ~/.aws/config, section named [profile PROFILENAME];
# must have key, secret key, region, output
session = boto3.Session(profile_name='CPGnew')
s3 = session.client('s3')

# list the batch prefixes under the source's images/ folder
batches = s3.list_objects_v2(Bucket='cellpainting-gallery', Prefix='cpg0016-jump/source_8/images/', Delimiter='/')
batches = [x['Prefix'] for x in batches['CommonPrefixes']]

# map each batch to its plate folder names
batchdict = {}
for batch in batches:
    plates = s3.list_objects_v2(Bucket='cellpainting-gallery', Prefix=f"{batch}images/", Delimiter='/')
    plates = [x['Prefix'].rsplit('/', 2)[1] for x in plates['CommonPrefixes']]
    batchdict[batch] = plates

# emit one sync command per plate
for batch in batches:
    b = batch.rsplit('/', 2)[1]
    for plate in batchdict[batch]:
        print(f'aws s3 sync s3://staging-cellpainting-gallery/tim_test/{plate}.zarr/ s3://cellpainting-gallery/cpg0016-jump/source_8/workspace/segmentation/cellpose_202404/objects/{b}/{plate}/{plate}.zarr --profile CPGnew')
@timtreis asked: