broadinstitute / cellpainting-gallery

Cell Painting Gallery
https://broadinstitute.github.io/cellpainting-gallery/
MIT License

2023_05_19_BarcodeCalling (cpg0021) #47

Closed MarziehHaghighi closed 10 months ago

MarziehHaghighi commented 1 year ago

[Link to publication repo]
cellpainting-gallery identifier = cpg0021-periscope

Transfer to CellPainting Gallery:

If data is being published, prepare for publication:

Once published:

MarziehHaghighi commented 1 year ago

I want to transfer the benchmark dataset for the barcode calling project to the gallery. The images are unadjusted and currently reside in this folder; they take up 1.83 TB of space.

Also, because we used the cell segmentation masks for the cell-calling step, I was thinking that we should publish them along with the images. But we don't want to publish the whole analysis folder because the data there is not used in the paper. Let me know how to proceed.

shntnu commented 1 year ago

Thanks for getting this started @MarziehHaghighi

Let's handle images first, masks next.

Images:

Can you provide aws s3 sync instructions to copy them over to s3://cellpainting-gallery? The final folder structure should look like this https://github.com/broadinstitute/cellpainting-gallery/blob/main/folder_structure.md#pooled-cell-painting-experiments (just for the images alone)

MarziehHaghighi commented 1 year ago

We are just releasing one specific level of ISS images (which also doesn't exist at all in the usual workflow). Considering that, do we still want to follow that pattern?

I suggest this structure:

cellpainting-gallery
└── cpg0027-haghighi-barcodecalling
    └── broad
        ├── images
        │   └── 20210124_6W_CP228
        │       └── iss_images

Does that work? If so it would be this single command for transferring images:

aws s3 sync s3://pooled-cell-painting/projects/2018_11_20_Periscope_Calico/20210124_6W_CP228/images_aligned_stitched/images_corrected_cropped s3://cellpainting-gallery/cpg0027-haghighi-barcodecalling/broad/images/20210124_6W_CP228/iss_images

Please note that we are not releasing a pooled Cell Painting plate that would need to follow the standards there. We are releasing a set of illumination-corrected, stitched and cropped, but unadjusted In-Situ-Sequencing images. Very specific!

shntnu commented 1 year ago

Oh I see

@ErinWeisbart can you comment on what's good here?

Hopefully this is very simple

ErinWeisbart commented 1 year ago

Hmmm... given that we don't plan on providing any other context for this specific batch of images, I'm okay with not having the parent folder indicate the processing it has received and instead calling it ISS_images as Marzieh has suggested.

However, I do have a bigger picture question - we are not releasing corresponding Cell Painting images to go with the ISS images, correct? If that is the case, then does it belong in the cpg?

shntnu commented 1 year ago

However, I do have a bigger picture question - we are not releasing corresponding Cell Painting images to go with the ISS images, correct?

Correct

If that is the case, then does it belong in the cpg?

Hm, good point. We've been flexible with keeping Cell Painting-adjacent data in CPG, but this could be pushing it :) However, it is so tightly linked to Cell Painting that it does feel "right".

What if we place this in cpg0021-periscope? Would that make it more in line with the goal of this resource, @ErinWeisbart?

ErinWeisbart commented 1 year ago

Yes, I do think it fits within cpg0021-periscope. If we do that, then the images can go into s3://cellpainting-gallery/cpg0021-periscope/broad/images/20210124_6W_CP228/images_aligned_cropped/ and any paper-specific data or metadata that we want to keep with it would go into s3://cellpainting-gallery/cpg0021-periscope/broad/workspace/publication_data/2023_Haghighi

So the image sync command would be:

aws s3 sync s3://pooled-cell-painting/projects/2018_11_20_Periscope_Calico/20210124_6W_CP228/images_aligned_stitched/images_corrected_cropped s3://cellpainting-gallery/cpg0021-periscope/broad/images/20210124_6W_CP228/images_aligned_cropped/ 

I also think it's worth syncing the 10X stitches to be consistent with the other datasets (since we have them):

aws s3 sync s3://pooled-cell-painting/projects/2018_11_20_Periscope_Calico/20210124_6W_CP228/images_aligned_stitched/images_corrected_stitched_10X s3://cellpainting-gallery/cpg0021-periscope/broad/images/20210124_6W_CP228/images_aligned_stitched_10X

MarziehHaghighi commented 1 year ago

My only concern with doing that (with that specific naming suggestion) is that the description of images_aligned_cropped in the standard workflow document may not completely match what we have here for this batch, because we have unadjusted images instead of the adjusted ones of a regular case.

ErinWeisbart commented 1 year ago

We never generate images_aligned_cropped in the standard workflow, and it's not described in the standard workflow document. In the standard workflow we generate images_corrected_cropped from images_corrected. The naming here is consistent with that: we followed the same stitch/crop process to generate images_aligned_cropped from images_aligned.

MarziehHaghighi commented 1 year ago

Great! Makes sense! @shntnu here is a script containing the commands for transferring the overlay files (which we use for cell calling and which therefore need to be added to the resource)

overlay_transfer_commands.sh

shntnu commented 1 year ago

@ErinWeisbart possible to make these two prefixes public so I can sync?

s3://pooled-cell-painting/projects/2018_11_20_Periscope_Calico/20210124_6W_CP228/images_aligned_stitched/images_corrected_cropped
s3://pooled-cell-painting/projects/2018_11_20_Periscope_Calico/workspace/analysis/

ErinWeisbart commented 1 year ago

I suggest these two prefixes instead:

s3://pooled-cell-painting/projects/2018_11_20_Periscope_Calico/20210124_6W_CP228/images_aligned_stitched/
s3://pooled-cell-painting/projects/2018_11_20_Periscope_Calico/workspace/analysis/20210124_6W_CP228/

The first is edited so that you can sync both images_corrected_cropped and images_corrected_stitched_10X, as I suggested above. The second is narrowed so that it doesn't reveal analyses from other batches.

MarziehHaghighi commented 1 year ago

@ErinWeisbart Could you please let me know the difference between images_corrected_stitched_10X and the specific level of data we want to make public? I'm hesitant to provide any unnecessary level of data "for the context of this barcode calling benchmark".

ErinWeisbart commented 1 year ago

images_corrected_stitched_10X is a 10X downscale of the full-well stitch. It is not something we ever use in analysis for any of our workflows, but provides a nice overview of how well the stitching worked (which is a step that can generate major artifacts if not done well). I think we should include it because we already generated it, we are including it with all other PERISCOPE datasets that undergo stitching and cropping (all pooled datasets), and it is a simple but important QC checkmark.

However, if you think it is problematic to include anything extra, I defer to you.

MarziehHaghighi commented 1 year ago

OK, that is fine as long as those images are also "unadjusted", like the ones we use as input to all benchmarking experiments.

ErinWeisbart commented 1 year ago

Yes, these stitches coming from images_aligned_stitched/images_corrected_stitched_10X directly correspond to the images_aligned_stitched/images_corrected_cropped

To be totally clear: we do have images_corrected_stitched_10X and images_corrected_cropped folders directly within the 2018_11_20_Periscope_Calico/20210124_6W_CP228/ folder, and those are NOT what we are syncing. Those are the corrected images used for traditional PERISCOPE analysis. The images_aligned_stitched/images_corrected_stitched_10X and images_aligned_stitched/images_corrected_cropped folders hold the aligned but NOT corrected images. The images_corrected_stitched prefix was hardcoded into our workflow, so I nested the output in the images_aligned_stitched folder when I generated the images specific to Marzieh's project. Please note that the sync commands I suggested above simplify the odd naming/nesting induced by that hardcoding and make it consistent with the naming conventions of the project (while still indicating that this is a data level not usually produced).
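
The renaming from the nested source folders to the gallery folder names can be previewed before anything is copied. This is a minimal sketch, assuming the AWS CLI is installed and has credentials that can read the source and write to the gallery; `gallery_name` and the `RUN_DRYRUN` switch are hypothetical names introduced for illustration. `--dryrun` only prints the operations `aws s3 sync` would perform.

```shell
# Sketch: translate the nested source folder names into the gallery names.
SRC_ROOT="s3://pooled-cell-painting/projects/2018_11_20_Periscope_Calico/20210124_6W_CP228/images_aligned_stitched"
DST_ROOT="s3://cellpainting-gallery/cpg0021-periscope/broad/images/20210124_6W_CP228"

# Hypothetical helper: map a source folder name to its name in the gallery.
gallery_name() {
  case "$1" in
    images_corrected_cropped)      echo "images_aligned_cropped" ;;
    images_corrected_stitched_10X) echo "images_aligned_stitched_10X" ;;
    *) echo "unknown folder: $1" >&2; return 1 ;;
  esac
}

# Preview only when asked (RUN_DRYRUN=1), so the mapping can be checked offline.
if [ "${RUN_DRYRUN:-0}" = "1" ]; then
  for folder in images_corrected_cropped images_corrected_stitched_10X; do
    aws s3 sync "${SRC_ROOT}/${folder}" "${DST_ROOT}/$(gallery_name "$folder")" --dryrun
  done
fi
```

Keeping the mapping in one place makes it harder to confuse these nested, uncorrected folders with the same-named corrected folders one level up.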

shntnu commented 1 year ago

I suggest these two prefixes instead:

Thanks! Would you be able to make this update in the bucket policy, @ErinWeisbart? I don't have access, and if you don't either, then we will need to ping Beth.

ErinWeisbart commented 1 year ago

Would you be able to make this update in the bucket policy

done

shntnu commented 1 year ago

@MarziehHaghighi

There is some issue with the Overlay paths:

This path does not exist

aws s3 ls s3://pooled-cell-painting/projects/2018_11_20_Periscope_Calico/workspace/analysis/20210124_6W_CP228/CorrDNA_Site_12_Overlay.png

This is the listing one level up:

aws s3 ls s3://pooled-cell-painting/projects/2018_11_20_Periscope_Calico/workspace/analysis/20210124_6W_CP228/ | head
                           PRE CP228A-Well1-1/
                           PRE CP228A-Well1-10/
                           PRE CP228A-Well1-100/
                           PRE CP228A-Well1-11/
                           PRE CP228A-Well1-12/
                           PRE CP228A-Well1-13/
                           PRE CP228A-Well1-14/
                           PRE CP228A-Well1-15/
                           PRE CP228A-Well1-16/
                           PRE CP228A-Well1-17/

Can you fix the paths and ping me?

ErinWeisbart commented 1 year ago

That path you have listed should not exist. All analysis data is saved into per-site folders. I don't know what your sync command is currently set up as, but if you want to sync just overlays, you'll need the right include/exclude flags, like:


aws s3 sync s3://pooled-cell-painting/projects/2018_11_20_Periscope_Calico/workspace/analysis/20210124_6W_CP228/ DESTINATION --exclude "*" --include "*_Overlay.png"
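
The flag order matters in that command: `aws s3` includes every object by default, and when several filters match a key, the one that appears later on the command line wins. So `--exclude "*"` drops everything and `--include "*_Overlay.png"` then adds the overlays back. Below is a rough offline sketch of that rule, using shell `case` globbing as a stand-in for the CLI's matcher. `is_selected` is a hypothetical helper, and the sample keys are assumed examples of what sits in a per-site folder.

```shell
# Emulate "later filters take precedence" for a key relative to the source prefix.
# Usage: is_selected KEY exclude PATTERN include PATTERN ...
is_selected() {
  key="$1"; shift
  selected=1                      # everything is included by default
  while [ "$#" -ge 2 ]; do
    kind="$1"; pattern="$2"; shift 2
    # In a case pattern, * also matches "/", so it spans folder levels.
    case "$key" in
      $pattern)
        if [ "$kind" = "include" ]; then selected=1; else selected=0; fi ;;
    esac
  done
  [ "$selected" -eq 1 ]
}

# Assumed sample keys, relative to .../workspace/analysis/20210124_6W_CP228/
for key in \
  "CP228A-Well1-12/CorrDNA_Site_12_Overlay.png" \
  "CP228A-Well1-12/Nuclei.csv"; do
  if is_selected "$key" exclude "*" include "*_Overlay.png"; then
    echo "copy  $key"
  else
    echo "skip  $key"
  fi
done
```

Swapping the two flags would leave nothing selected, because the trailing `--exclude "*"` would override the include.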

MarziehHaghighi commented 1 year ago

@Erin Oh nice that is convenient!

@shntnu sorry about that; here is the corrected version:

overlay_transfer_commands_updated.txt

You may want to use Erin's suggestion instead though.

shntnu commented 1 year ago

You may want to use Erin's suggestion instead though.

@MarziehHaghighi So I do this, correct?

aws s3 sync \
  s3://pooled-cell-painting/projects/2018_11_20_Periscope_Calico/workspace/analysis/20210124_6W_CP228/ \
  s3://cellpainting-gallery/cpg0021-periscope/broad/workspace/analysis/20210124_6W_CP228/ \
  --exclude "*" \
  --include "*_Overlay.png"

MarziehHaghighi commented 1 year ago

@shntnu could you please transfer Nuclei.csv files from the analysis folder as well? Thanks.

aws s3 sync \
  s3://pooled-cell-painting/projects/2018_11_20_Periscope_Calico/workspace/analysis/20210124_6W_CP228/ \
  s3://cellpainting-gallery/cpg0021-periscope/broad/workspace/analysis/20210124_6W_CP228/ \
  --exclude "*" \
  --include "*/Nuclei.csv"

shntnu commented 1 year ago

@MarziehHaghighi bumping this back to you now that you are all set up to directly copy data over

MarziehHaghighi commented 11 months ago

Marzieh is working on the following:

aws s3 sync \
  s3://pooled-cell-painting/projects/2018_11_20_Periscope_Calico/20210124_6W_CP228/images_aligned_stitched/images_corrected_cropped \
  s3://cellpainting-gallery/cpg0021-periscope/broad/images/20210124_6W_CP228/images_aligned_cropped --profile gallery
aws s3 sync \
  s3://pooled-cell-painting/projects/2018_11_20_Periscope_Calico/workspace/analysis/20210124_6W_CP228/ \
  s3://cellpainting-gallery/cpg0021-periscope/broad/workspace/analysis/20210124_6W_CP228/ \
  --exclude "*" \
  --include "*/Nuclei.csv" --profile gallery
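
The leading `*/` in the Nuclei filter is doing real work. `aws s3` evaluates each pattern against the key relative to the source prefix, so a pattern without a wildcard can only match a file sitting directly under that prefix. A small offline check, with shell `case` globbing as an approximation of the CLI's matcher; `matches` is a hypothetical helper, and the key assumes one Nuclei.csv per site folder:

```shell
# Hypothetical helper: does KEY (relative to the source prefix) match PATTERN?
matches() {
  case "$1" in
    $2) return 0 ;;
    *)  return 1 ;;
  esac
}

key="CP228A-Well1-12/Nuclei.csv"   # assumed layout: one Nuclei.csv per site folder
for pattern in "Nuclei.csv" "*/Nuclei.csv"; do
  if matches "$key" "$pattern"; then
    echo "$pattern matches $key"
  else
    echo "$pattern does not match $key"
  fi
done
```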

Update: I confirm the transfers are complete now (August 9, 2023)
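
One way to sanity-check a finished transfer is to compare object counts between the source and destination prefixes. This is only a sketch and not necessarily how the transfer above was verified. It assumes the default credentials can list the source bucket and that the `gallery` profile can list the destination; `total_objects`, `count_prefix` and the `RUN_CHECK` switch are hypothetical names.

```shell
# Extract the object count from `aws s3 ls --recursive --summarize` output on stdin.
total_objects() {
  awk -F': *' '/Total Objects:/ {print $2}'
}

# Count objects under an S3 prefix; extra arguments (e.g. --profile) go to the CLI.
count_prefix() {
  uri="$1"; shift
  aws s3 ls "$uri" --recursive --summarize "$@" | total_objects
}

# Only contact S3 when asked (RUN_CHECK=1), so the parser can be tested offline.
if [ "${RUN_CHECK:-0}" = "1" ]; then
  src=$(count_prefix "s3://pooled-cell-painting/projects/2018_11_20_Periscope_Calico/20210124_6W_CP228/images_aligned_stitched/images_corrected_cropped/")
  dst=$(count_prefix "s3://cellpainting-gallery/cpg0021-periscope/broad/images/20210124_6W_CP228/images_aligned_cropped/" --profile gallery)
  if [ "$src" = "$dst" ]; then
    echo "object counts match: $src"
  else
    echo "mismatch: source=$src destination=$dst"
  fi
fi
```

Re-running the original `aws s3 sync` commands with `--dryrun` is a complementary check: if everything was copied, the dry run should list nothing left to transfer.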