Closed MarziehHaghighi closed 1 year ago
I want to transfer the benchmark dataset for the barcode calling project to the gallery. The images are unadjusted; they currently reside in this folder and take up 1.83 TB of space.
Also, because we used the cell segmentation masks for the cell calling step, I was thinking that we should publish them along with the images. But we don't want to publish the whole analysis folder because the rest of the data there is not used in the paper. Let me know how to proceed.
Thanks for getting this started @MarziehHaghighi
Let's handle images first, masks next.
Images:
Can you provide aws s3 sync instructions to copy them over to s3://cellpainting-gallery? The final folder structure should look like this: https://github.com/broadinstitute/cellpainting-gallery/blob/main/folder_structure.md#pooled-cell-painting-experiments (just for the images alone)
We are just releasing one specific level of ISS images (which also doesn't exist at all in the usual workflow). Considering that, do we still want to follow that pattern?
I suggest this structure:
cellpainting-gallery
└── cpg0027-haghighi-barcodecalling
    └── broad
        ├── images
        │   └── 20210124_6W_CP228
        │       └── iss_images
Does that work? If so it would be this single command for transferring images:
aws s3 sync s3://pooled-cell-painting/projects/2018_11_20_Periscope_Calico/20210124_6W_CP228/images_aligned_stitched/images_corrected_cropped s3://cellpainting-gallery/cpg0027-haghighi-barcodecalling/broad/images/20210124_6W_CP228/iss_images
Please note that we are not releasing a pooled Cell Painting plate that would follow the standards there. We are releasing a set of illumination-corrected, stitched, and cropped but unadjusted In-Situ-Sequencing images. Very specific!
Oh I see
@ErinWeisbart can you comment on what's good here?
Hopefully this is very simple
Hmmm... given that we don't plan on providing any other context for this specific batch of images, I'm okay with not having the parent folder indicate the processing it has received and instead calling it ISS_images as Marzieh has suggested.
However, I do have a bigger picture question - we are not releasing corresponding Cell Painting images to go with the ISS images, correct? If that is the case, then does it belong in the cpg?
> However, I do have a bigger picture question - we are not releasing corresponding Cell Painting images to go with the ISS images, correct?
Correct
> If that is the case, then does it belong in the cpg?
Hm, good point. We've been flexible with keeping Cell Painting-adjacent data in CPG but this could be pushing it :) However, it is so tightly linked to Cell Painting that it does feel "right".
What if we place this in cpg0021-periscope? Would that make it more in line with the goal of this resource, @ErinWeisbart?
Yes, I do think it fits within cpg0021-periscope. If we do that, then the images can go into s3://cellpainting-gallery/cpg0021-periscope/broad/images/20210124_6W_CP228/images_aligned_cropped/ and any paper-specific data or metadata that we want to keep with it would go into s3://cellpainting-gallery/cpg0021-periscope/broad/workspace/publication_data/2023_Haghighi
So the image sync command would be:
aws s3 sync s3://pooled-cell-painting/projects/2018_11_20_Periscope_Calico/20210124_6W_CP228/images_aligned_stitched/images_corrected_cropped s3://cellpainting-gallery/cpg0021-periscope/broad/images/20210124_6W_CP228/images_aligned_cropped/
I also think it's worth syncing the 10X stitches to be consistent with the other datasets (since we have them):
aws s3 sync s3://pooled-cell-painting/projects/2018_11_20_Periscope_Calico/20210124_6W_CP228/images_aligned_stitched/images_corrected_stitched_10X s3://cellpainting-gallery/cpg0021-periscope/broad/images/20210124_6W_CP228/images_aligned_stitched_10X
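As a hedged sketch of how the two image syncs above could be previewed before copying 1.83 TB: the helper below only prints the commands, with the AWS CLI's `--dryrun` flag appended so that running the printed command lists what would be copied without transferring anything. The `build_sync_cmd` helper and the `SRC_ROOT`/`DST_ROOT` variable names are hypothetical; the paths are copied from the commands in this thread.

```shell
#!/bin/sh
# Sketch: print the two image sync commands for review before running them.
# SRC_ROOT and DST_ROOT are taken from the commands proposed in this thread.
SRC_ROOT="s3://pooled-cell-painting/projects/2018_11_20_Periscope_Calico/20210124_6W_CP228/images_aligned_stitched"
DST_ROOT="s3://cellpainting-gallery/cpg0021-periscope/broad/images/20210124_6W_CP228"

# build_sync_cmd SRC_NAME DST_NAME
# Hypothetical helper (not part of the AWS CLI). Prints an `aws s3 sync`
# command that maps a source folder name to its renamed destination folder.
build_sync_cmd() {
    printf 'aws s3 sync %s/%s %s/%s --dryrun\n' "$SRC_ROOT" "$1" "$DST_ROOT" "$2"
}

# The source folders keep the hardcoded "images_corrected_*" names;
# the destination folders use the "images_aligned_*" names agreed on above.
build_sync_cmd images_corrected_cropped images_aligned_cropped
build_sync_cmd images_corrected_stitched_10X images_aligned_stitched_10X
```

Dropping `--dryrun` from a printed command would perform the actual copy.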
My only concern with doing that (with that specific naming suggestion) is that the description of images_aligned_cropped in the standard workflow document may not completely match what we have here for this batch, because we have unadjusted images instead of the adjusted ones produced in the regular case.
We never generate images_aligned_cropped in the standard workflow and it's not described in the standard workflow document. We generate images_corrected_cropped from the images_corrected, and so the naming is consistent in that here we followed the same stitch/crop process to generate images_aligned_cropped from the images_aligned.
Great! Makes sense! @shntnu here is a script containing the commands for transferring the overlay files (which we use for cell calling and which therefore need to be added to the resource)
@ErinWeisbart possible to make these two prefixes public so I can sync?
s3://pooled-cell-painting/projects/2018_11_20_Periscope_Calico/20210124_6W_CP228/images_aligned_stitched/images_corrected_cropped
s3://pooled-cell-painting/projects/2018_11_20_Periscope_Calico/workspace/analysis/
I suggest these two prefixes instead:
s3://pooled-cell-painting/projects/2018_11_20_Periscope_Calico/20210124_6W_CP228/images_aligned_stitched/
s3://pooled-cell-painting/projects/2018_11_20_Periscope_Calico/workspace/analysis/20210124_6W_CP228/
The first is edited so that you can sync both images_corrected_cropped and images_corrected_stitched_10X as I've suggested above.
The second so that it doesn't reveal analyses from other batches.
@ErinWeisbart Could you please let me know what the difference is between images_corrected_stitched_10X and the specific level of data we want to make public? I'm hesitant to provide any unnecessary level of data "for the context of this barcode calling benchmark".
images_corrected_stitched_10X is a 10X downscale of the full-well stitch. It is not something we ever use in analysis for any of our workflows, but it provides a nice overview of how well the stitching worked (which is a step that can generate major artifacts if not done well). I think we should include it because we already generated it, we are including it with all other PERISCOPE datasets that undergo stitching and cropping (all pooled datasets), and it is a simple but important QC checkmark.
However, if you think it is problematic to include anything extra, I defer to you.
Ok, that is fine as long as the images are also "unadjusted", like what we use as the input to all benchmarking experiments.
Yes, these stitches coming from images_aligned_stitched/images_corrected_stitched_10X directly correspond to the images_aligned_stitched/images_corrected_cropped.
(To be totally clear, we do have images_corrected_stitched_10X and images_corrected_cropped folders that are within the 2018_11_20_Periscope_Calico/20210124_6W_CP228/ folder and those are NOT what we are syncing. Those are the corrected images used for traditional PERISCOPE analysis. The images_aligned_stitched/images_corrected_stitched_10X and images_aligned_stitched/images_corrected_cropped folders are the aligned but NOT corrected images - the images_corrected_stitched prefix was hardcoded into our workflow so I nested it into the images_aligned_stitched folder when I generated the images specific to Marzieh's project. Please note that the sync commands that I suggested above simplify the odd naming/nesting induced by the hardcoding in our workflow and make it consistent with the naming conventions of the project, while still indicating that this is a data level not usually produced.)
> I suggest these two prefixes instead:
Thanks! Would you be able to make this update in the bucket policy @ErinWeisbart? I don't have access, and if you don't either, then we will need to ping Beth
> Would you be able to make this update in the bucket policy
done
@MarziehHaghighi
There is some issue with the Overlay paths:
This path does not exist
aws s3 ls s3://pooled-cell-painting/projects/2018_11_20_Periscope_Calico/workspace/analysis/20210124_6W_CP228/CorrDNA_Site_12_Overlay.png
This is the listing one level up:
aws s3 ls s3://pooled-cell-painting/projects/2018_11_20_Periscope_Calico/workspace/analysis/20210124_6W_CP228/|head
PRE CP228A-Well1-1/
PRE CP228A-Well1-10/
PRE CP228A-Well1-100/
PRE CP228A-Well1-11/
PRE CP228A-Well1-12/
PRE CP228A-Well1-13/
PRE CP228A-Well1-14/
PRE CP228A-Well1-15/
PRE CP228A-Well1-16/
PRE CP228A-Well1-17/
Can you fix the paths and ping me?
That path you have listed should not exist. All analysis data is saved into a per-site folder.
I don't know what your sync command is currently set up as, but if you want to sync just overlays, you'll need the right include/exclude flags like
aws s3 sync s3://pooled-cell-painting/projects/2018_11_20_Periscope_Calico/workspace/analysis/20210124_6W_CP228/ DESTINATION --exclude "*" --include "*_Overlay.png"
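For readers unfamiliar with how the flags above combine: the AWS CLI applies `--exclude`/`--include` filters in the order given, and a later filter overrides an earlier one, so `--exclude "*"` first drops everything and `--include "*_Overlay.png"` then adds the overlays back. The sketch below emulates that ordering with plain shell pattern matching so it can be tried without touching S3. The `is_synced` helper is hypothetical, and the file names inside the per-site folders are assumed for illustration.

```shell
#!/bin/sh
# is_synced KEY
# Hypothetical helper emulating: --exclude "*" --include "*_Overlay.png"
# Filters are checked in order and the last matching filter decides.
is_synced() {
    selected=yes                                      # no filters: everything is included
    case "$1" in *) selected=no ;; esac               # --exclude "*"
    case "$1" in *_Overlay.png) selected=yes ;; esac  # --include "*_Overlay.png"
    [ "$selected" = yes ]
}

# Keys are relative to the source prefix; the file names are assumptions.
for key in \
    "CP228A-Well1-12/CorrDNA_Site_12_Overlay.png" \
    "CP228A-Well1-12/Nuclei.csv"
do
    if is_synced "$key"; then
        echo "copy $key"
    else
        echo "skip $key"
    fi
done
```

Reversing the two flags would exclude everything, because the final `--exclude "*"` would win.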
@Erin Oh nice that is convenient!
@shntnu sorry about that, here is the corrected version:
overlay_transfer_commands_updated.txt
You may want to use Erin's suggestion instead though.
> You may want to use Erin's suggestion instead though.
@MarziehHaghighi So I do this, correct?
aws s3 sync \
s3://pooled-cell-painting/projects/2018_11_20_Periscope_Calico/workspace/analysis/20210124_6W_CP228/ \
s3://cellpainting-gallery/cpg0021-periscope/broad/workspace/analysis/20210124_6W_CP228/ \
--exclude "*" \
--include "*_Overlay.png"
@shntnu could you please transfer Nuclei.csv files from the analysis folder as well? Thanks.
aws s3 sync \
s3://pooled-cell-painting/projects/2018_11_20_Periscope_Calico/workspace/analysis/20210124_6W_CP228/ \
s3://cellpainting-gallery/cpg0021-periscope/broad/workspace/analysis/20210124_6W_CP228/ \
--exclude "*" \
--include "*/Nuclei.csv"
@MarziehHaghighi bumping this back to you now that you are all set up to directly copy data over
Marzieh is working on the following:
aws s3 sync \
s3://pooled-cell-painting/projects/2018_11_20_Periscope_Calico/20210124_6W_CP228/images_aligned_stitched/images_corrected_cropped \
s3://cellpainting-gallery/cpg0021-periscope/broad/images/20210124_6W_CP228/images_aligned_cropped --profile gallery
[x] Unarchive the analysis folder, as it is now archived (the request has been sent, so I can do the transfer tomorrow)
[x] transfer Nuclei.csv files
aws s3 sync \
s3://pooled-cell-painting/projects/2018_11_20_Periscope_Calico/workspace/analysis/20210124_6W_CP228/ \
s3://cellpainting-gallery/cpg0021-periscope/broad/workspace/analysis/20210124_6W_CP228/ \
--exclude "*" \
--include "*/Nuclei.csv" --profile gallery
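A note on the `*/Nuclei.csv` pattern: the AWS CLI matches filter patterns against each object's path relative to the source prefix, and `*` also matches `/`. Since every Nuclei.csv sits inside a per-site folder, a bare `Nuclei.csv` pattern would match nothing. The sketch below shows the difference using shell pattern matching, which treats `*` the same way inside `case`. The `matches` helper is hypothetical.

```shell
#!/bin/sh
# matches PATTERN KEY
# Hypothetical helper: succeeds if KEY matches the glob PATTERN.
# Inside `case`, "*" matches any characters, including "/".
matches() {
    case "$2" in
        $1) return 0 ;;
        *)  return 1 ;;
    esac
}

# Path relative to the source prefix, using a per-site folder from the listing above.
key="CP228A-Well1-1/Nuclei.csv"

for pattern in "Nuclei.csv" "*/Nuclei.csv"; do
    if matches "$pattern" "$key"; then
        echo "pattern '$pattern' matches $key"
    else
        echo "pattern '$pattern' does not match $key"
    fi
done
```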
Update: I confirm the transfers are complete now (August 9, 2023)
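One possible way to double-check a transfer like this (a sketch, not necessarily what was done here) is to compare object counts between source and destination. The hypothetical `count_suffix` helper below reads `aws s3 ls --recursive` output on stdin and counts the keys ending with a given suffix; the sample listing is made up for illustration.

```shell
#!/bin/sh
# count_suffix SUFFIX
# Hypothetical helper: reads an `aws s3 ls --recursive` listing on stdin
# and prints how many lines end with exactly SUFFIX.
count_suffix() {
    awk -v s="$1" '
        substr($0, length($0) - length(s) + 1) == s { n++ }
        END { print n + 0 }
    '
}

# Made-up listing in the date / time / size / key layout of `aws s3 ls --recursive`.
printf '%s\n' \
    '2023-08-09 10:00:00       1024 analysis/CP228A-Well1-1/Nuclei.csv' \
    '2023-08-09 10:00:01       2048 analysis/CP228A-Well1-1/CorrDNA_Site_1_Overlay.png' \
    '2023-08-09 10:00:02       1024 analysis/CP228A-Well1-10/Nuclei.csv' |
    count_suffix "/Nuclei.csv"

# Real usage would pipe actual listings, once per bucket, and compare the counts:
#   aws s3 ls --recursive SOURCE_PREFIX | count_suffix "/Nuclei.csv"
#   aws s3 ls --recursive DESTINATION_PREFIX --profile gallery | count_suffix "/Nuclei.csv"
```

Re-running the same `aws s3 sync` command with `--dryrun` is another option: if it prints nothing, there is nothing left to copy.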
[Link to publication repo]
cellpainting-gallery identifier = cpg0021-periscope