shntnu opened 11 months ago
@ErinWeisbart @timtreis – please feel free to weigh in on the folder structure below
Location of `segmentation`:
cellpainting-gallery/
└── cpg0016-jump
└── source_4
├── images
└── workspace
├── ...
├── segmentation
└── ...
Structure of `segmentation`:
└── segmentation
├── 2021_04_26_Batch1
│ ├── BR00117035
│ │ └── cellpose_<hash>
│ │ ├── BR00117035-A01-1
│ │ │ └── outlines
│ │ │ ├── A01_s1--cell_outlines.png
│ │ │ └── A01_s1--nuclei_outlines.png
│ │ └── BR00117035-A01-2
│ └── BR00117036
└── 2021_05_31_Batch2
- The `analysis`/`outlines` nesting seems unnecessary, but it makes it symmetric with the `analysis` folder, which also has cell segmentations.
- `hash` in `cellpose_<hash>` is some appropriate identifier for the version of cellpose (or your adaptation of it).

And regarding this:
How would we then best transfer the files to your AWS so that we can then unpack them there into the correct directories?
We can decide on this closer to when you are ready to go but for now, I'll hand this over to @leoank to ponder
I agree that `segmentation` fits in `workspace`.
I don't think we need to force the segmentations to comply with the exact same structure as the CellProfiler `analysis` outputs, as long as the Batch-Plate-PlateWellSite nesting remains the same. If we want to allow for the cellpose model or training data or anything else to be included along with the segmentations, I think we need/want the structure to be a bit different.
What do you think about this @shntnu?
(In this case including anything in `model` or `training` would be optional)
└── segmentation
└── cellpose_<hash>
├── model
├── training
└── outlines
├── 2021_04_26_Batch1
│ ├── BR00117035
│ │ ├── BR00117035-A01-1
│ │ │ ├── BR00117035-A01-1_cells.png
│ │ │ └── BR00117035-A01-1_nuclei.png
│ │ └── BR00117035-A01-2
│ └── BR00117036
└── 2021_05_31_Batch2
What do you think about this @shntnu?
I love it!
I don't think we need to include the CellPose model or its training; we basically use it "off-the-shelf", since it internally scales the cells to the average cell diameter it was trained on and then scales back up again. So that'd probably be a waste of space. We'll make the (snakemake) pipeline we use, and maybe some processing code, public of course, but it mostly just downloads whatever it needs 👍
Thanks @timtreis. No requirement to include anything in `model` or `training` with your data; I mostly want to ensure that we have a robust structure laid out that will work with future data organization as well.
@timtreis We are all set with the folder structure. Let me know when you are ready to do a test run.
Thanks a lot @ErinWeisbart !
Hey @ErinWeisbart and @shntnu, many thanks for already preparing everything! Our trial on optimized CellPose parameters got slightly delayed because we had to modify the pipeline (turns out building a DAG in snakemake with several million files is slightly suboptimal 🥸). I hope to have the results by early next week (will post here) and will then start with one source so that we can test the transfer workflow? Does that make sense?
@timtreis reported:
For the pilot, we're now performing the segmentation with different values for the nucleus and cytosol diameter parameters in CellPose on a stratified sample of the wells (excluding 9 because the data quality always stood out as poor):
The first run with the new setup is currently running. Once that’s done we’ll see if we perform a 3x3 grid or 5x5 grid for defining the final parameters. This here was the initial analysis based on what we had already downloaded for our hackathon:
The current idea would be to do mean +/- sd (3x3) or mean +/- (1x/0.5x) sd (5x5).
excluding 9 because the data quality always stood out as poor
Can you remind me what the issue was here?
Once that’s done we’ll see if we perform a 3x3 grid or 5x5 grid for defining the final parameters.
I am unclear about the terminology – what is this grid you are referring to? Is it in pixel space?
The current idea would be to do mean +/- sd (3x3) or mean +/- (1x/0.5x) sd (5x5).
Can you clarify? I didn't quite get it :D
excluding 9 because the data quality always stood out as poor
Can you remind me what the issue was here?
During basically all the work I've done on integrating JUMP, source 9 always behaved fundamentally differently from the other sources, so I started to exclude it so as not to make method dev harder than it is :D During the later parts of the project(s) I'd of course try to include it, but for this benchmark I was afraid of it skewing the distribution in a weird way.
Once that’s done we’ll see if we perform a 3x3 grid or 5x5 grid for defining the final parameters.
I am unclear about the terminology – what is this grid you are referring to? Is it in pixel space?
The current idea would be to do mean +/- sd (3x3) or mean +/- (1x/0.5x) sd (5x5).
Can you clarify? I didn't quite get it :D
(probably rounding those numbers to integers though)
CellPose accepts the roughly expected diameters of the nucleus and cytosol as parameters. So we'll run it on the stratified sample with different permutations of these and compare the number of segmented cells passing a certain (yet to be determined) quality threshold. Only then do we process the full data. Ideally, we'd want to not just segment JUMP but segment it well :D
Afterthought: If we see that there are big discrepancies, we could also run a binary search on them.
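For illustration, a minimal sketch of what such a parameter grid could look like (the diameter means/sds here are made-up placeholders, not the measured values from the sample):

```python
import itertools

# Hypothetical diameter statistics (in pixels) from the stratified sample.
nuc_mean, nuc_sd = 20.0, 4.0
cyto_mean, cyto_sd = 60.0, 10.0

# 3x3 grid: mean +/- sd for each diameter, rounded to integers;
# a 5x5 grid would additionally include mean +/- 0.5*sd.
nuc_values = [round(nuc_mean + k * nuc_sd) for k in (-1, 0, 1)]
cyto_values = [round(cyto_mean + k * cyto_sd) for k in (-1, 0, 1)]

for nuc_d, cyto_d in itertools.product(nuc_values, cyto_values):
    print(f"run CellPose with nucleus diameter={nuc_d}, cell diameter={cyto_d}")
```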
Also tagging @npeschke, the fantastic student working with me who has written most of the pipeline code.
Thank you very much for the clarifications, @timtreis @npeschke – nice to meet you!
I will post this issue in our internal slack so that people with knowledge and opinions (not me :D) can weigh in
Thank you! Happy to discuss details
Hey @shntnu @ErinWeisbart,
had a meeting with @npeschke yesterday, and during our discussion, a few questions came up. Just throwing them here for discussion.
1) Since we anticipate the primary consumers of the segmented cells to be ML/DL models, we're already splitting them into individual blacked-out tiles, looking like this:

Storage-wise, it'd be more efficient to just store the masks and not the channels, but then again, every user would probably perform that step anyway. So we think it'd be best if we directly provide the tiles one can stream into their dataloader. Wdyt?
2) Your sketch of the folder structure suggests that you were thinking of storing one file each for nucleus/cytosol. Similar to the concern from 1), most use cases I see would then require these images to be chopped into individual tiles again. However, there might be cases in which one would want the non-chopped-up image, so we could provide both? That would drastically increase storage needs and present another problem:
3) Do we need to preserve where a given cell was in the original image? Excuse the bad sketch, but theoretically, the segmentation of the entire image should represent the background as 0 and each area sharing an identical integer as a given cell. When we chop these into tiles, though, we end up with binary masks, which naively do not retain info on their original position unless we, e.g., preserve this original integer in the filename or metadata.
4) How do we best perform the "handover" of the results, however exactly they might look? I assume we'll have to transfer a few TB, where we could already try to imitate your desired structure, but since we don't have access to your S3 filesystem, we couldn't do the final steps there. Would @ErinWeisbart then take over?
I guess the least storage-intensive way would be to provide only the nucleus/cytosol mask files plus code to chop them into individual tiles. But that'd require every user to perform this step again and again.
Cheers, Tim
My thoughts, @shntnu as always feel free to disagree:
1. You have the best intuition of use case and I support making everything as user-friendly as possible! While we don't want to increase data storage thoughtlessly, I think it is quite reasonable to include these crops, and we have precedent of storing other image "intermediates" with other datasets.
2-3. I'm attached to folder nesting but not file structure within the folder nesting so chopping is fine. I do think retaining location information is important for preserving the ability to easily map to other profiling data. Again, reasonable expansion of storage is fine. My initial thought is that you could/should save out the per-site segmentation as well and use those integers in the cropped file name. (I suppose an alternative would be providing the x,y coordinate of the cell center in the file name, but that alone wouldn't be my preference).
My preference would be that original channel naming is preserved with the masks at the end, again for easier mapping between data. So the image example you've given above would be Channel 1-5 (as they are in the raw images) and Channel 6-7 would be masks.
And finally, since I love being pedantic, I think "outlines" isn't really sufficient to capture the breadth of data you're providing, so I suggest changing that to "masked_objects".
So if I'm piecing everything together correctly, it would look something like this:
└── segmentation
└── cellpose_<hash>
├── model
├── training
└── masked_objects
├── 2021_04_26_Batch1
│ ├── BR00117035
│ │ ├── BR00117035-A01-1
│ │ │ ├── BR00117035-A01-1_all_cells.png
│ │ │ ├── BR00117035-A01-1_all_nuclei.png
│ │ │ ├── BR00117035-A01-1_cell_001_c1.png
│ │ │ ├── BR00117035-A01-1_cell_001_c2.png
│ │ │ ├── BR00117035-A01-1_cell_001_c3.png
│ │ │ ├── BR00117035-A01-1_cell_001_c4.png
│ │ │ ├── BR00117035-A01-1_cell_001_c5.png
│ │ │ ├── BR00117035-A01-1_cell_001_c6.png
│ │ │ ├── BR00117035-A01-1_cell_001_c7.png
│ │ │ ├── BR00117035-A01-1_cell_002_c1.png
│ │ │ └── ...
│ │ └── BR00117035-A01-2
│ └── BR00117036
└── 2021_05_31_Batch2
4. The simplest file transfer approach is: if you have your files on an S3 bucket, you can make them public and then I can copy the files to the cpg, and then we don't have to do any fancy credential handling. This is by far the easiest and therefore our preferred approach for getting data into cpg.
Everything @ErinWeisbart said sounds reasonable to me. Some comments:

- `BR00117035-A01-1_all_cells.png` is a label image, and e.g. `BR00117035-A01-1_cell_028_c1.png` is a binary image, so we can look up the coordinates of the latter by `imread("BR00117035-A01-1_all_cells.png", as_gray=True) == 28`. That's good enough, I suppose.
- `BR00117035-A01-1_all_cells.png` will need to be a 16-bit PNG because we can have > 255 objects.

@shntnu Yes, I meant for `_all_cells.png` to be a label image. I have no attachment to exact file format (.png, .tif, .whatever).

Location information could also be exported as a separate locations file (.csv, .txt, .whatever) with each object indexed to an x,y location (center or bounding box). True to form, my first thought is pictures while your first thought is numbers ;)
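For illustration, a minimal sketch of the label-image lookup described above (assuming scikit-image; the filenames are the hypothetical examples from this thread):

```python
import numpy as np
from skimage.io import imread

labels = imread("BR00117035-A01-1_all_cells.png")  # 16-bit label image
cell_28 = labels == 28                             # binary mask of object 28
ys, xs = np.nonzero(cell_28)
print(f"object 28 bounding box: rows {ys.min()}-{ys.max()}, cols {xs.min()}-{xs.max()}")
```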
Thanks for clarifying @ErinWeisbart
@timtreis @npeschke – back to you
@shntnu Nice to meet you too!
Just to get everyone on the same page regarding the output of our pipeline:
Currently we export the single cell images + masks @timtreis mentioned here into a single hdf5 file per `Metadata_Source`.
The hdf5 file itself has the following group & dataset structure:
└── source_8 <Metadata_Source>
├── ACGUYXCXAPNIKK-UHFFFAOYSA-N <Metadata_InChiKey>
│   ├── source_8__J3__A1170544__P19__1 <image id (joining Source, Batch, Plate, Well and site on '__')>
│ │ ├── single_cell_data (66) <Stack of all n cells with shape (n, 7(5+2masks), x_resolution, y_resolution)>
│ │ └── single_cell_index (66)
│ ├── source_8__J3__A1170544__P19__2
│ │ ├── single_cell_data (47)
│ │ └── single_cell_index (47)
├── ACLUEOBQFRYTQS-UHFFFAOYSA-N
│ ├── source_8__J3__A1170541__F21__1
│ │ ├── single_cell_data (119)
│ │ └── single_cell_index (119)
│ ├── source_8__J3__A1170541__F21__2
│ │ ├── single_cell_data (101)
│ │ └── single_cell_index (101)
...
On the image id level, all related data (InChI, Source, Plate, Well, etc.) from the parquet files are also saved as attributes in the hdf5 file. The reason behind accumulating all the data into one hdf5 file per source is efficient access/loading for ML. But this is just the status quo; of course the output can be adapted.
At the moment, we discard the full frame segmentation during our pipeline but that can be changed easily if you want to have the full masks as well.
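For illustration, a minimal h5py sketch of walking this layout (the file name is an assumption; group/dataset names follow the tree above):

```python
import h5py

with h5py.File("source_8.h5", "r") as f:
    for inchikey, compound in f["source_8"].items():
        for image_id, site in compound.items():
            cells = site["single_cell_data"][:]   # (n, 7, y, x): 5 channels + 2 masks
            index = site["single_cell_index"][:]  # maps rows back to cells in the image
            print(image_id, cells.shape, dict(site.attrs))
```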
I suppose an alternative would be providing the x,y coordinates of the cell center in the file name, but that alone wouldn't be my preference
Agree with this.
My preference would be that original channel naming is preserved with the masks at the end, again for easier mapping between data. So the image example you've given above would be Channel 1-5 (as they are in the raw images) and Channel 6-7 would be masks.
I think Nic and I are agnostic on this, just sth we'll need to define and then we'll do it this way.
And finally, since I love being pedantic, I think "outlines" isn't really sufficient to capture the breadth of data you're providing, so I suggest changing that to "masked_objects".
Fine for us 👍
The simplest file transfer approach is if you have your files on an S3 bucket, you can make them public and then I can copy the files to the cpg and then we don't have to do any fancy credential handling. This is by far the easiest and therefore our preferred approach for getting data into cpg.
I'm not sure whether we have an S3 bucket where we can temporarily store that data, but I'll ask around.
I don't know how else one would save the coordinates other than via a separate locations file. Oh, maybe you are saying that `BR00117035-A01-1_all_cells.png` is a label image, and e.g. `BR00117035-A01-1_cell_028_c1.png` is a binary image, so we can look up the coordinates of the latter by `imread("BR00117035-A01-1_all_cells.png", as_gray=True) == 28`. That's good enough, I suppose.
Yeah, that's the idea 👌
For some reason I thought @timtreis and @npeschke were using Parquet, not PNGs, to store the outlines but I might have gotten that wrong.
I think this is something I mentioned in passing. We could theoretically run a pilot in which we store all this data in a (self-ad) https://spatialdata.scverse.org/en/latest/ object, which would e.g. allow us to map an arbitrary number of cells as well as the original image onto a shared coordinate system. But as Nic said, currently we're writing to hdf5 but are, of course, flexible on storage.
Location information could also be exported as a separate locations file (.csv, .txt, .whatever) with each object indexed to an x,y location (center or bounding box). True to form, my first thought is pictures while your first thought is numbers ;)
I'm not necessarily against this, and the files would be tiny, but this information would be redundant when the filenames indicate the integer from the full label image, right?
One aside that is not terribly consequential for this data set, but might be for future ones: because Cellpose does not create overlapping objects, it's possible to create a label image where `imread("BR00117035-A01-1_all_cells.png", as_gray=True) == 28` works. If we want cpg to have data sets with segmentations that may overlap, I'm less excited about a label image than a run-length encoding, a full-sized binary mask per object (which is a lot of files, but with compression they should be small), and/or a cropped mask with center and/or defined corner positions (what @timtreis was discussing). A label image is nice for some downstream tasks, so selfishly, great to have that also, but I don't want to lock us into an "objects never overlap" structure.
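For illustration, a minimal sketch of run-length encoding a binary mask (COCO-style column-major counts; a generic example, not a proposed cpg spec):

```python
import numpy as np

def rle_encode(mask: np.ndarray) -> list:
    """Encode a binary mask as alternating run lengths, starting with zeros."""
    pixels = mask.flatten(order="F").astype(np.int8)  # column-major, like COCO
    changes = np.flatnonzero(np.diff(pixels)) + 1     # indices where runs switch
    runs = np.diff(np.concatenate(([0], changes, [pixels.size])))
    return ([0] if pixels[0] == 1 else []) + runs.tolist()

mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
print(rle_encode(mask))  # [5, 2, 2, 2, 5]
```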
@timtreis @npeschke @ErinWeisbart – I've summarized our discussion and decisions so far, below. Shall we zoom in the new year to finalize? I've sent us all an invite (added Beth as optional)
- `training` - Tim/Nic suggest skipping this for this dataset
- `model` - same as above, but minimally we should have a README.md in here to point to the resource used

Location of `segmentation` (unchanged):
cellpainting-gallery/
└── cpg0016-jump
└── source_4
├── images
└── workspace
├── ...
├── segmentation
└── ...
Structure of `segmentation`:
└── segmentation
└── cellpose_<hash>
├── model
├── training
└── objects
├── 2021_04_26_Batch1
│ ├── BR00117035
│ │ └── BR00117035.zarr
│ ├── ...
│ └── ...
└── 2021_05_31_Batch2
Proposed changes to `<plate>.zarr`: Don't use InChIKeys in the index because those could change, and also that won't generalize to other perturbation types.

Q: what is the `single_cell_index` entity?
└── BR00117035 <Metadata_Plate>
│   ├── source_8__J3__A1170544__P19__1 <image id (joining Source, Batch, Plate, Well and site on '__')>
│ │ ├── single_cell_data (66) <Stack of all n cells with shape (n, 7(5+2masks), x_resolution, y_resolution)>
│ │ └── single_cell_index (66)
│ ├── source_8__J3__A1170544__P19__2
│ │ ├── single_cell_data (47)
│ │ └── single_cell_index (47)
│ ├── source_8__J3__A1170541__F21__1
│ │ ├── single_cell_data (119)
│ │ └── single_cell_index (119)
│ ├── source_8__J3__A1170541__F21__2
│ │ ├── single_cell_data (101)
│ │ └── single_cell_index (101)
...
Structure of `single_cell_data`:

Structure of `single_cell_index`:
Other notes:
@ErinWeisbart said: "My preference would be that original channel naming is preserved with the masks at the end, again for easier mapping between data. So the image example you've given above would be Channel 1-5 (as they are in the raw images) and Channel 6-7 would be masks." @timtreis said: "I think Nic and I are agnostic on this, just sth we'll need to define and then we'll do it this way."
Include a `channel_mapping.json` file which will have the mapping from the index number (of the second dimension of `single_cell_data`) to the channel name. Default: `{0: 'CellMask', 1: 'NucleusMask', 2: 'AGP', 3: 'DNA', 4: 'ER', 5: 'Mito', 6: 'RNA', 7: 'Brightfield', 8: 'BFHigh', 9: 'BFLow'}`.
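For illustration, a minimal sketch of using such a mapping to pull one channel out of the stack (a stand-in array replaces the real `single_cell_data`):

```python
import numpy as np

channel_mapping = {0: 'CellMask', 1: 'NucleusMask', 2: 'AGP', 3: 'DNA',
                   4: 'ER', 5: 'Mito', 6: 'RNA', 7: 'Brightfield',
                   8: 'BFHigh', 9: 'BFLow'}
name_to_index = {v: k for k, v in channel_mapping.items()}

# Stand-in for one site's stack: (n cells, channels, y, x).
single_cell_data = np.zeros((66, 10, 128, 128), dtype=np.uint16)
dna_crops = single_cell_data[:, name_to_index['DNA']]  # all DNA crops
```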
Should we also store a single large label image, or not? If so, would that live with the Zarr file? Presumably yes. Should we store XY locations as well? If so, all this would fit into the structure of `BR00117035.zarr` like this:
│ ├── source_8__J3__A1170541__F21__2
│ │ ├── label_image
│ │ ├── location
│ │ ├── single_cell_data (101)
│ │ └── single_cell_index (101)
We have now started a Slack channel to discuss this project https://join.slack.com/share/enQtNjM4MTk5NTc5MzYxNy0yNDYxYTM4MThiMThiYTMzMDY0ZDI0NzkyYzJhZWVjNGI0OGU4ODA4MjZjMTVhZWJmNjRhMzgyYjI0YjQzZWQ0
Q: what is the `single_cell_index` entity?

In case the full segmentation masks of all cells in an image are kept and saved, the `single_cell_index` can be used to trace single cells back to the full segmentation mask.
Notes from meeting, 2024-01-03:
Should we also store a single large label image, or not? If so, would that live with the Zarr file? Presumably yes. Should we store XY locations as well? If so, all this would fit into the structure of `BR00117035.zarr` like this:

│   ├── source_8__J3__A1170541__F21__2
│   │   ├── label_image
│   │   ├── location
│   │   ├── single_cell_data (101)
│   │   └── single_cell_index (101)
Additionally storing the locations would require changes to SPARCSpy (the framework we use to generate the segmentations and extract single cells) itself. Therefore, I would suggest saving the label_image together with the single_cell_data + _index in the zarr file. This would still leave the possibility to trace single cells back to the original image and, with additional effort, to calculate the centroids again.
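For illustration, a minimal sketch of that trace-back and centroid recomputation (stand-in arrays; in practice these would come from a per-site group in the zarr store):

```python
import numpy as np

label_image = np.zeros((64, 64), dtype=np.uint16)  # full-frame label image
label_image[10:20, 30:40] = 3                      # one labeled cell
single_cell_index = np.array([1, 2, 3])            # crop row -> label

i = 2                                       # the i-th cropped cell
mask = label_image == single_cell_index[i]  # its pixels in the original frame
ys, xs = np.nonzero(mask)
centroid = (ys.mean(), xs.mean())           # recomputed centroid
```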
I am checking against the current snapshot of https://cellpainting-gallery.s3.amazonaws.com/index.html#cpg0016-jump/source_2/workspace/segmentation/cellpose/
Not done:
Done:
Unsure if done or not:
@ErinWeisbart I suppose we should update https://github.com/broadinstitute/cellpainting-gallery/blob/789a5a65d8f8bf653b995b1d176e4adb90885af2/folder_structure.md#segmentation-folder-structure?
If so, I'd request Tim or Nic to draft it if they are able, and we can review
@timtreis @ErinWeisbart – looping back on this
It sounds like Tim is ready to upload more data to the gallery staging bucket.
Before he does:
I updated it when I converted our docs to a Jupyterbook https://broadinstitute.github.io/cellpainting-gallery/data_structure.html#segmentation-folder-structure but it's worth having @timtreis take a look and see 1) if I've introduced any mistakes 2) if there is any other knowledge he can transfer there that would be helpful for others in the future
- [ ] add README.md to `model` with CellPose version
Happy to add, what do we want to include? I assume the version of the CellPose model the pipeline delegates to. I could also include a release version of the pipeline repo I made with a link. Anything else?
- [ ] `cellpose_<some unique string>` to differentiate from potential reruns with a different version
I forgot what we wanted to use for the unique string, but I assume it's arbitrary anyway, right?
- [ ] Ask Sophia if we can spatially trace the crops back to the original
We haven't followed up on this much yet since we weren't sure about the usefulness of this data - wdyt?
- [ ] x/y res is up to pipeline, identical for every image
Don't fully understand this, can you elaborate @shntnu ?
- [ ] blacked out tiles
the pipeline only creates blacked out tiles
- [ ] parse channels from filename, e.g. "xyz_ch1_origDNA" -> ch1
This information is contained in a `channel_mapping.json` which exists next to every .zarr file (representing a given plate):
{"0": "NucleusMask",
"1": "CellMask",
"2": "DNA",
"3": "AGP",
"4": "ER",
"5": "Mito",
"6": "RNA"}
So we no longer have this filename information. Am I addressing your point? 🤔
re: `some unique string`: it can be anything. We often use the date of project start (or a close approximation) in YYYYMMDD format as a unique string identifier, so I would suggest that?
Note that the checklist I created was based on your notes in https://github.com/broadinstitute/cellpainting-gallery/issues/73#issuecomment-1875742288, not necessarily what I had in mind :D
- [ ] add README.md to `model` with CellPose version
Happy to add, what do we want to include? I assume the version of the CellPose model the pipeline delegates to. I could also include a release version of the pipeline repo I made with a link. Anything else?
That seems good enough
- [ ] Ask Sophia if we can spatially trace the crops back to the original
We haven't followed up on this much yet since we weren't sure about the usefulness of this data - wdyt?
Does this mean that the crops cannot be currently used to mask the original images? If so, we should just state that in your readme https://github.com/theislab/jump-cpg0016-segmentation/blob/main/README.md and then leave the data as is (no need to figure it out right now)
- [ ] x/y res is up to pipeline, identical for every image
Don't fully understand this, can you elaborate @shntnu ?
No clue :D It was from your notes
- [ ] blacked out tiles
the pipeline only creates blacked out tiles
Ok by blacked out tiles you are referring to the first image in this comment https://github.com/broadinstitute/cellpainting-gallery/issues/73#issuecomment-1854310285
- [ ] parse channels from filename, e.g. "xyz_ch1_origDNA" -> ch1
This information is contained in a `channel_mapping.json` which exists next to every .zarr file (representing a given plate): `{"0": "NucleusMask", "1": "CellMask", "2": "DNA", "3": "AGP", "4": "ER", "5": "Mito", "6": "RNA"}`
Noted; as long as @ErinWeisbart is good with this, I am good with it
So we no longer have this filename information. Am I addressing your point? 🤔
Yes
@timtreis I'd recommend transferring just one source, `source_7`, for now, so that we can do some manual checks, and then do the rest.

`source_7` is the smallest: https://github.com/jump-cellpainting/datasets/blob/main/stats/cpg0016_source_images_tiff_count.csv

In fact, even one plate would be good enough for now.
I updated it when I converted our docs to a Jupyterbook https://broadinstitute.github.io/cellpainting-gallery/data_structure.html#segmentation-folder-structure but it's worth having @timtreis take a look and see 1) if I've introduced any mistakes 2) if there is any other knowledge he can transfer there that would be helpful for others in the future
@timtreis please have a look when you get the chance
Currently finishing this up :) The output that the pipeline now generates is as follows (simplified for a single batch):
└── segmentation
└── cellpose_202404
├── model
│   └── README.md
└── objects
└── 20210719_Run1
└── CP1-SC1-01
├── channel_mapping.json
└── CP1-SC1-01.zarr
├── <source>__<batch>__<plate>__<well>__<site>
│   ├── label_image/
│   ├── single_cell_data/
│   ├── single_cell_index/
│   └── .zgroup
├── ...
⋮
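For illustration, a minimal sketch of reading one plate store in this layout (assuming the zarr Python package; store path and array names follow the tree above):

```python
import zarr

plate = zarr.open("CP1-SC1-01.zarr", mode="r")
for image_id, site in plate.groups():
    labels = site["label_image"][:]       # full-frame label image
    cells = site["single_cell_data"][:]   # (n, channels, y, x) crops
    index = site["single_cell_index"][:]  # crop row -> label in labels
    print(image_id, cells.shape)
```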
In the README.md, I have now added the following text:
# Notes
These files were created using a Cellpose-based snakemake pipeline. More
information can be found here: https://github.com/theislab/jump-cpg0016-segmentation
## Relevant software versions:
- https://github.com/theislab/jump-cpg0016-segmentation@v0.1.0
- cellpose=2.2.3=pyhd8ed1ab_0
- sparcscore==1.0.0
## Literature
- "Cellpose: a generalist algorithm for cellular segmentation", Stringer et al.,
2021, https://www.nature.com/articles/s41592-020-01018-x
- "Three million images and morphological profiles of cells treated with
matched chemical and genetic perturbations", Chandrasekaran et al., 2024, https://www.biorxiv.org/content/10.1101/2022.01.05.475090v3
wdyt @shntnu @ErinWeisbart ?
Note that the checklist I created was based on your notes in #73 (comment), not necessarily what I had in mind :D
- [ ] add README.md to `model` with CellPose version
Happy to add, what do we want to include? I assume the version of the CellPose model the pipeline delegates to. I could also include a release version of the pipeline repo I made with a link. Anything else?
That seems good enough
I have now included the release-tagged version of the pipeline, the version of SPARCSpy that we delegate to, and the cellpose build version (see comment above)
- [ ] Ask Sophia if we can spatially trace the crops back to the original
We haven't followed up on this much yet since we weren't sure about the usefulness of this data - wdyt?
Does this mean that the crops cannot currently be used to mask the original images? If so, we should just state that in your readme https://github.com/theislab/jump-cpg0016-segmentation/blob/main/README.md and then leave the data as is (no need to figure it out right now)
Yes, we currently cannot do this. I thought you mentioned you already had centroid coordinates from CellProfiler, so extracting a bounding box around those would be fairly comparable (although it'd be a PITA to trace which blacked-out cell belongs to which cropped cell).
- [ ] x/y res is up to pipeline, identical for every image
Don't fully understand this, can you elaborate @shntnu ?
No clue :D It was from your notes
Ah yes, I remember 😅 That was the question of whether we scale the tiles in any way to a desired target resolution; we chose not to.
- [ ] blacked out tiles
the pipeline only creates blacked out tiles
Ok by blacked out tiles you are referring to the first image in this comment #73 (comment)
Yes, except of course as individual images and not a 7x1 strip :)
- [ ] parse channels from filename, e.g. "xyz_ch1_origDNA" -> ch1
This information is contained in a `channel_mapping.json` which exists next to every .zarr file (representing a given plate): `{"0": "NucleusMask", "1": "CellMask", "2": "DNA", "3": "AGP", "4": "ER", "5": "Mito", "6": "RNA"}`
Noted; as long as @ErinWeisbart is good with this, I am good with it
Asked her, she's good with it 👍
So we no longer have this filename information. Am I addressing your point? 🤔
Yes
Cool!
@timtreis thank you so much for your diligence!
Everything looks good to me.
@timtreis I'd recommend transferring just one source, `source_7`, for now, so that we can do some manual checks, and then do the rest. `source_7` is the smallest: https://github.com/jump-cellpainting/datasets/blob/main/stats/cpg0016_source_images_tiff_count.csv In fact, even one plate would be good enough for now.
Just a reminder that a small set would be good to start with.
Yes, going to cook dinner and then try that 👌🏻 Ankur provided me with a tutorial
From @ErinWeisbart
This is what I used to generate the sync commands since they needed to have batch added and be performed on a plate-by-plate basis:
import boto3

# in ~/.aws/config, section named [profile PROFILENAME];
# must have key, secret key, region, output
session = boto3.Session(profile_name='CPGnew')
s3 = session.client('s3')

# list the batch prefixes under the source's images/ folder
batches = s3.list_objects_v2(Bucket='cellpainting-gallery', Prefix='cpg0016-jump/source_8/images/', Delimiter='/')
batches = [x['Prefix'] for x in batches['CommonPrefixes']]

# map each batch to its plate folder names
batchdict = {}
for batch in batches:
    plates = s3.list_objects_v2(Bucket='cellpainting-gallery', Prefix=f"{batch}images/", Delimiter='/')
    plates = [x['Prefix'].rsplit('/', 2)[1] for x in plates['CommonPrefixes']]
    batchdict[batch] = plates

# emit one sync command per plate
for batch in batches:
    b = batch.rsplit('/', 2)[1]
    for plate in batchdict[batch]:
        print(f'aws s3 sync s3://staging-cellpainting-gallery/tim_test/{plate}.zarr/ s3://cellpainting-gallery/cpg0016-jump/source_8/workspace/segmentation/cellpose_202404/objects/{b}/{plate}/{plate}.zarr --profile CPGnew')
@timtreis asked: