The process is the same as outlined here: https://github.com/broadinstitute/lincs-cell-painting/issues/2#issuecomment-588323507, except that you stop at step 4 ("Delete the files from EBS"); the compression-related comments are not relevant.
And for our internal (imaging platform) notes: I made some notes here about the storage policy related to this question.
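For anyone following along, the restore itself boils down to issuing a restore request per archived object and then waiting for it to thaw before downloading. A minimal sketch with the AWS CLI (the key and the Bulk tier here are illustrative assumptions; the backup scripts normally handle this for you):
aws s3api restore-object \
  --bucket imaging-platform-cold \
  --key "imaging_analysis/PROJECT/plates/PLATE_images_illum_analysis.tar.gz" \
  --restore-request '{"Days": 7, "GlacierJobParameters": {"Tier": "Bulk"}}'
# poll until the object's Restore field reports ongoing-request="false", then download it
aws s3api head-object \
  --bucket imaging-platform-cold \
  --key "imaging_analysis/PROJECT/plates/PLATE_images_illum_analysis.tar.gz"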
Thanks @shntnu - these instructions are only slightly different from the ones @hkhawar sent me on Slack. Using both sets of instructions and running the following command on one example plate, I receive the error reproduced below.
Is there something obvious that I'm doing wrong, or is there a quick fix? If not, I will keep digging.
(cell-health) ubuntu@ip-10-0-9-22:~/efs/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/workspace/software/imaging-backup-scripts$ parallel \
> --results restore \
> -a list_of_plates.txt \
> ./glacier_restore.sh \
> --project_name ${PROJECT_NAME} \
> --batch_id ${BATCH_ID} \
> --plate_id {1} \
> --get_images
Get images ...
Download:s3://imaging-platform-cold/imaging_analysis/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/plates/2015_07_01_Cell_Health_Vazquez_Cancer_Broad_CRISPR_PILOT_B1_SQ00014610_images_illum_analysis.tar.gz
An error occurred (NoSuchKey) when calling the RestoreObject operation: The specified key does not exist.
An error occurred (404) when calling the HeadObject operation: Not Found
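For future reference, a NoSuchKey from RestoreObject generally means the archive tarball was never created. A quick way to see what actually exists under that prefix (a sketch, using the bucket and key from the error above):
aws s3 ls s3://imaging-platform-cold/imaging_analysis/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/plates/
# or check the single key directly; a 404 here confirms it is absent
aws s3api head-object \
  --bucket imaging-platform-cold \
  --key imaging_analysis/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/plates/2015_07_01_Cell_Health_Vazquez_Cancer_Broad_CRISPR_PILOT_B1_SQ00014610_images_illum_analysis.tar.gz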
Ah – looks like Cell Health images were never archived, so I think you are all set!
s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/CRISPR_PILOT_B1/images
For our notes: this dataset cost ~$72/mo to store and so it went down in the priority list. So glad we have a new process in place now that doesn't rely on running this archival step!
~$ aws s3 ls --recursive s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/CRISPR_PILOT_B1/images|grep tiff > ~/Desktop/CRISPR_PILOT_B1.txt
~$ wc -l ~/Desktop/CRISPR_PILOT_B1.txt
155520 /Users/shsingh/Desktop/CRISPR_PILOT_B1.txt
155520 = 384 wells × 9 sites × 5 channels × 3 cell lines × 3 replicates, so this looks good
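A quick sketch of the same sanity check run directly against the listing (it assumes each S3 key contains the SQ-prefixed plate ID as a path component, which is an assumption about the raw image layout):
echo $((384 * 9 * 5 * 3 * 3))                      # 155520 expected images in total
awk '{print $4}' ~/Desktop/CRISPR_PILOT_B1.txt |   # 4th column of `aws s3 ls` output is the key
  grep -o 'SQ[0-9]\{8\}' | sort | uniq -c          # per-plate counts; expect 384 * 9 * 5 = 17280 each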
Thanks @shntnu ! This process is new to me so thanks for bearing with me :)
In chatting with @hkhawar about the file structure, is that number slightly concerning? i.e., do we have the illumination corrected images? The load_data CSV file seems to indicate illumination correction was performed.
It seems like the load_data_csv folder on S3 shows that the load_data_with_illum.csv file was created, but there is no illum folder containing illum files on S3.
When you inspect the paths of the illum files, you will find that they are nested inside the analysis folder – this was our previous standard (we later changed to storing illum functions with images). I bet you will find them at that location.
read_csv("../workspace/load_data_csv/CRISPR_PILOT_B1/SQ00014610/load_data_with_illum.csv", col_types = cols(.default = col_character())) %>% slice(1) %>% select(matches("^PathName_Illum")) %>% pivot_longer(cols = everything()) %>% knitr::kable()
name | value |
---|---|
PathName_IllumAGP | /home/ubuntu/bucket/projects/2015_07_01_KRAS_Vazquez_Cancer_Broad/workspace/analysis/CRISPR_PILOT_B1/SQ00014610/illum/ |
PathName_IllumDNA | /home/ubuntu/bucket/projects/2015_07_01_KRAS_Vazquez_Cancer_Broad/workspace/analysis/CRISPR_PILOT_B1/SQ00014610/illum/ |
PathName_IllumER | /home/ubuntu/bucket/projects/2015_07_01_KRAS_Vazquez_Cancer_Broad/workspace/analysis/CRISPR_PILOT_B1/SQ00014610/illum/ |
PathName_IllumMito | /home/ubuntu/bucket/projects/2015_07_01_KRAS_Vazquez_Cancer_Broad/workspace/analysis/CRISPR_PILOT_B1/SQ00014610/illum/ |
PathName_IllumRNA | /home/ubuntu/bucket/projects/2015_07_01_KRAS_Vazquez_Cancer_Broad/workspace/analysis/CRISPR_PILOT_B1/SQ00014610/illum/ |
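A quick way to confirm those illum files really are at that location is to list the corresponding S3 prefix. A sketch, assuming the /home/ubuntu/bucket mount maps to s3://imaging-platform the same way it does for the other paths in this thread:
aws s3 ls s3://imaging-platform/projects/2015_07_01_KRAS_Vazquez_Cancer_Broad/workspace/analysis/CRISPR_PILOT_B1/SQ00014610/illum/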
👋 shantanu
do we have the illumination corrected images?
@gwaygenomics note that only illumination correction functions are stored, not the corrected images themselves.
In https://idr.openmicroscopy.org/webclient/?show=screen-1751, we decided to additionally store the illumination corrected images, so you would need to generate those separately if you decide to do that here
@gwaygenomics Some history of why we have illum corrected images in https://idr.openmicroscopy.org/webclient/?show=screen-1751:
And some more email logs from the time we submitted https://idr.openmicroscopy.org/webclient/?show=screen-1751
Great! Thanks for providing this context @shntnu - I'd like to include both raw and illumination corrected images.
I see the illumination correction functions (.mat files), but I will need help applying them.
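For reference, applying the .mat functions is usually done by running a CellProfiler pipeline that pairs each image with its illum function via load_data_with_illum.csv (e.g. a CorrectIlluminationApply step), rather than by hand. A rough sketch of a headless invocation; the pipeline file name and output folder are hypothetical, and exact flags depend on the CellProfiler version:
cellprofiler -c -r \
  -p illum_apply.cppipe \
  --data-file load_data_with_illum.csv \
  -o /tmp/illum_corrected_output   # corrected images would land here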
Greg, if you want, I can help you out with getting the illumination corrected images.
@hkhawar - yes please! I will find a time on your calendar for a quick meeting
Sure
Hamdah reprocessed the illum corrected files that were corrupted and stored them in folders like this
s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/tmp/SQ00014610/illum_corrected/
I am now going to copy these to their corresponding original locations e.g. here
s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images/SQ00014610/Images/
using this command
origpath=s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images
temppath=s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/tmp
# copy all files (the ones missing from the temppath will fail)
parallel \
--header ".*\n" \
-C "," \
-a corrupted_image.csv \
aws s3 cp ${temppath}/{1}/illum_corrected/{2} ${origpath}/{1}/Images/{2}
corrupted_image.csv is available here.
This step revealed that some files were missing in the tmp folder:
parallel \
--header ".*\n" \
-C "," \
-a corrupted_image.csv \
"if ! aws s3 ls ${temppath}/{1}/illum_corrected/{2} > /dev/null; then echo Temp path - {1}/{2} missing; fi"
Temp path - SQ00014613/r07c21f05p01-ch2sk1fk1fl1.tiff missing
Temp path - SQ00014613/r06c04f05p01-ch5sk1fk1fl1.tiff missing
Temp path - SQ00014613/r10c08f05p01-ch1sk1fk1fl1.tiff missing
Temp path - SQ00014613/r08c19f04p01-ch1sk1fk1fl1.tiff missing
Temp path - SQ00014613/r02c08f08p01-ch1sk1fk1fl1.tiff missing
Temp path - SQ00014610/r02c13f02p01-ch1sk1fk1fl1.tiff missing
Temp path - SQ00014610/r16c19f02p01-ch3sk1fk1fl1.tiff missing
Temp path - SQ00014610/r07c07f03p01-ch2sk1fk1fl1.tiff missing
Temp path - SQ00014614/r09c07f01p01-ch5sk1fk1fl1.tiff missing
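A small sketch for turning that report into a plate,filename CSV that can drive the reprocessing (missing_report.txt is a hypothetical file holding the captured output above):
grep '^Temp path' missing_report.txt |
  sed 's/^Temp path - //; s/ missing$//' |
  awk -F/ 'BEGIN { OFS=","; print "plate,filename" } { print $1, $2 }' > files_to_recreate.csv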
thank you Shantanu ❤️ (and Hamdah too for the upfront processing)
Steps to perform once the missing files listed at the end of https://github.com/broadinstitute/cell-health/issues/106#issuecomment-663154084 are recreated
temppath=s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/tmp
parallel \
--header ".*\n" \
-C "," \
-a corrupted_image.csv \
"if ! aws s3 ls ${temppath}/{1}/illum_corrected/{2} > /dev/null; then echo Temp path - {1}/{2} missing; fi"
origpath=s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images
temppath=s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/tmp
# copy all files (the ones missing in the temppath will fail)
parallel \
--header ".*\n" \
-C "," \
-a corrupted_image.csv \
aws s3 cp ${temppath}/{1}/illum_corrected/{2} ${origpath}/{1}/Images/{2}
parallel \
mkdir -p illumcorrected_CRISPR_PILOT_B1/images/{1} ::: SQ00014610 SQ00014611 SQ00014612 SQ00014613 SQ00014614 SQ00014615 SQ00014616 SQ00014617 SQ00014618
origpath=s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images
parallel \
--header ".*\n" \
-C "," \
-a corrupted_image.csv \
aws s3 cp ${origpath}/{1}/Images/{2} illumcorrected_CRISPR_PILOT_B1/images/{1}/Images/{2}
brew install imagemagick to do a quick test of fidelity after downloading:
parallel \
--header ".*\n" \
-C "," \
-a corrupted_image.csv \
identify illumcorrected_CRISPR_PILOT_B1/images/{1}/Images/{2} | grep "Can not read TIFF"
aws s3 ls --recursive s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images |grep tiff > /tmp/image_files.txt
# get file sizes and counts
cat /tmp/image_files.txt |tr -s " "|cut -d" " -f3|sort -n|uniq -c
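To flag the outliers directly from that listing (a sketch; SUSPECT_SIZE is a placeholder for any size that shows up rarely in the counts):
awk '{print $3}' /tmp/image_files.txt | sort -n | uniq -c | sort -k1,1n | head   # rarest sizes first
awk -v s=SUSPECT_SIZE '$3 == s { print $4 }' /tmp/image_files.txt                # keys with that size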
Once you've confirmed everything works, you can have IDR run step 3 at their end.
Corrected images are in the separate tmp folder on S3, platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/tmp/. I didn't replace the ones in the original illumcorrected_CRISPR_PILOT_B1 folder.
On Thu, Jul 23, 2020 at 1:43 PM Shantanu Singh wrote:
Steps
- brew install imagemagick if you want to do a quick test of fidelity after downloading using identify
- Download copy_illumcorrected_CRISPR_PILOT_B1.sh.txt https://github.com/broadinstitute/cell-health/files/4968076/copy_illumcorrected_CRISPR_PILOT_B1.sh.txt and rename to .sh
- chmod +x copy_illumcorrected_CRISPR_PILOT_B1.sh
- run it ./copy_illumcorrected_CRISPR_PILOT_B1.sh
I noticed two issues:
- The first file does not exist i.e. SQ00014613/Images/r06c04f05p01-ch5sk1fk1fl1.tiff.
- I ran find illumcorrected_CRISPR_PILOT_B1 -name "*.tiff" -exec identify {} \; 2>&1 >/tmp/foo; grep "Can not read " /tmp/foo and found that some files are still corrupted but I think those were never recreated in the first place.
@gwaygenomics you'd want to repeat these steps yourself and then report back which illumination-corrected files will need to be recreated. The current list is:
Missing:
SQ00014613/Images/r06c04f05p01-ch5sk1fk1fl1.tiff
Still corrupted (maybe never recreated?):
SQ00014610/Images/r02c13f02p01-ch1sk1fk1fl1.tiff
SQ00014610/Images/r07c07f03p01-ch2sk1fk1fl1.tiff
SQ00014610/Images/r16c19f02p01-ch3sk1fk1fl1.tiff
SQ00014613/Images/r02c08f08p01-ch1sk1fk1fl1.tiff
SQ00014613/Images/r07c21f05p01-ch2sk1fk1fl1.tiff
SQ00014613/Images/r08c19f04p01-ch1sk1fk1fl1.tiff
SQ00014613/Images/r10c08f05p01-ch1sk1fk1fl1.tiff
I didn't replace the ones in the original illumcorrected_CRISPR_PILOT_B1 folder
Yep, but I did (see comments). I will update this thread once I've figured out the issue – might be something else driving this.
Steps to perform once the missing files listed at the end of #106 (comment) are recreated
For my understanding, is this the complete order of operations?
@hkhawar can you help with step 1 above?
Thanks again Shantanu and Hamdah!
@gwaygenomics Do I need to process only the following nine files?
Temp path - SQ00014613/r07c21f05p01-ch2sk1fk1fl1.tiff missing
Temp path - SQ00014613/r06c04f05p01-ch5sk1fk1fl1.tiff missing
Temp path - SQ00014613/r10c08f05p01-ch1sk1fk1fl1.tiff missing
Temp path - SQ00014613/r08c19f04p01-ch1sk1fk1fl1.tiff missing
Temp path - SQ00014613/r02c08f08p01-ch1sk1fk1fl1.tiff missing
Temp path - SQ00014610/r02c13f02p01-ch1sk1fk1fl1.tiff missing
Temp path - SQ00014610/r16c19f02p01-ch3sk1fk1fl1.tiff missing
Temp path - SQ00014610/r07c07f03p01-ch2sk1fk1fl1.tiff missing
Temp path - SQ00014614/r09c07f01p01-ch5sk1fk1fl1.tiff missing
I am also concerned that some of the files that IDR has not listed as corrupted are actually corrupted, e.g. this one:
2020-03-08 10:41:41 743346 projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images/SQ00014615/Images/r02c08f03p01-ch5sk1fk1fl1.tiff
I downloaded it like this
aws s3 cp s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images/SQ00014615/Images/r02c08f03p01-ch5sk1fk1fl1.tiff .
identify did not report issues:
identify ./r02c08f03p01-ch5sk1fk1fl1.tiff
./r02c08f03p01-ch5sk1fk1fl1.tiff TIFF 2160x2160 2160x2160+0+0 16-bit Grayscale Gray 743346B 0.000u 0:00.000
But I'm not able to open the file using Preview ("It may be damaged or use a file format that Preview doesn't recognize.").
My suspicion is that all the files with infrequent file sizes are actually corrupted files.
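A stricter command-line check than plain identify, for anyone repeating this (a sketch; -regard-warnings makes ImageMagick treat TIFF warnings such as truncated data as errors, so the exit code catches what plain identify misses):
identify -regard-warnings ./r02c08f03p01-ch5sk1fk1fl1.tiff > /dev/null || echo "suspect file"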
Welcome to the rabbit hole! :)
Get the file listing
aws s3 ls --recursive s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images |grep tiff > /tmp/image_files.txt
Now download the files whose file sizes are infrequent:
library(tidyverse)

# Parse the `aws s3 ls` listing and build a download command for each file
sizes <-
  read_delim("/tmp/image_files.txt",
             col_names = c("date", "time", "size", "path"),
             trim_ws = TRUE,
             delim = " ") %>%
  mutate(download = sprintf("aws s3 cp s3://imaging-platform/%s %s", path, path)) %>%
  mutate(dirpath = dirname(path))

# Recreate the directory structure locally so the downloads have somewhere to land
dirpaths <-
  sizes %>%
  distinct(dirpath)

dirpaths$dirpath %>%
  walk(function(dirpath) dir.create(dirpath, showWarnings = FALSE, recursive = TRUE))

# Tally how often each file size occurs
frac_sizes <-
  sizes %>%
  group_by(size) %>%
  tally() %>%
  arrange(desc(size)) %>%
  mutate(frac = n / sum(n))

frac_sizes %>%
  head() %>%
  knitr::kable()

# Download only the files whose size is rare (< 0.1% of all files)
frac_sizes %>%
  filter(frac < 0.001) %>%
  select(size) %>%
  inner_join(sizes) %>%
  magrittr::extract2("download") %>%
  walk(function(download) system(download))
I ran that and then did a random sampling of images by trying to open them using Preview, and found that all images in that random sample were corrupted. The full list of all files downloaded is below.
@gwaygenomics I gotta run but hopefully, you can take it from here and figure out the next steps. If not, ping me on this and I'll have a look once back from vacation
projects/
└── 2015_07_01_Cell_Health_Vazquez_Cancer_Broad
└── illumcorrected_CRISPR_PILOT_B1
└── images
├── SQ00014610
│ └── Images
│ ├── r01c18f01p01-ch4sk1fk1fl1.tiff
│ ├── r01c19f08p01-ch5sk1fk1fl1.tiff
│ ├── r02c07f06p01-ch2sk1fk1fl1.tiff
│ ├── r02c13f02p01-ch1sk1fk1fl1.tiff
│ ├── r04c01f01p01-ch5sk1fk1fl1.tiff
│ ├── r07c07f03p01-ch2sk1fk1fl1.tiff
│ ├── r10c12f05p01-ch2sk1fk1fl1.tiff
│ ├── r13c03f08p01-ch1sk1fk1fl1.tiff
│ ├── r13c09f01p01-ch2sk1fk1fl1.tiff
│ ├── r16c19f02p01-ch3sk1fk1fl1.tiff
│ └── r16c20f07p01-ch4sk1fk1fl1.tiff
├── SQ00014611
│ └── Images
│ ├── r02c18f03p01-ch1sk1fk1fl1.tiff
│ ├── r06c11f02p01-ch2sk1fk1fl1.tiff
│ └── r14c08f07p01-ch5sk1fk1fl1.tiff
├── SQ00014612
│ └── Images
│ ├── r03c08f01p01-ch4sk1fk1fl1.tiff
│ ├── r06c06f08p01-ch5sk1fk1fl1.tiff
│ ├── r10c15f07p01-ch1sk1fk1fl1.tiff
│ ├── r11c05f02p01-ch5sk1fk1fl1.tiff
│ └── r13c08f06p01-ch4sk1fk1fl1.tiff
├── SQ00014613
│ └── Images
│ ├── r02c08f08p01-ch1sk1fk1fl1.tiff
│ ├── r03c15f04p01-ch4sk1fk1fl1.tiff
│ ├── r07c05f02p01-ch1sk1fk1fl1.tiff
│ ├── r07c21f05p01-ch2sk1fk1fl1.tiff
│ ├── r08c19f04p01-ch1sk1fk1fl1.tiff
│ ├── r10c08f05p01-ch1sk1fk1fl1.tiff
│ └── r11c18f08p01-ch2sk1fk1fl1.tiff
├── SQ00014614
│ └── Images
│ ├── r03c04f01p01-ch4sk1fk1fl1.tiff
│ ├── r03c07f05p01-ch5sk1fk1fl1.tiff
│ ├── r05c09f08p01-ch1sk1fk1fl1.tiff
│ ├── r09c07f01p01-ch5sk1fk1fl1.tiff
│ └── r15c02f03p01-ch4sk1fk1fl1.tiff
├── SQ00014615
│ └── Images
│ ├── r02c08f03p01-ch5sk1fk1fl1.tiff
│ ├── r02c14f04p01-ch1sk1fk1fl1.tiff
│ ├── r08c07f01p01-ch3sk1fk1fl1.tiff
│ ├── r08c07f07p01-ch5sk1fk1fl1.tiff
│ ├── r08c14f07p01-ch1sk1fk1fl1.tiff
│ ├── r09c07f03p01-ch1sk1fk1fl1.tiff
│ ├── r10c09f08p01-ch5sk1fk1fl1.tiff
│ ├── r10c18f03p01-ch2sk1fk1fl1.tiff
│ ├── r13c21f07p01-ch2sk1fk1fl1.tiff
│ ├── r15c15f08p01-ch4sk1fk1fl1.tiff
│ └── r16c21f05p01-ch1sk1fk1fl1.tiff
├── SQ00014616
│ └── Images
│ ├── r01c17f07p01-ch5sk1fk1fl1.tiff
│ ├── r02c21f01p01-ch1sk1fk1fl1.tiff
│ ├── r03c19f02p01-ch5sk1fk1fl1.tiff
│ ├── r07c04f03p01-ch1sk1fk1fl1.tiff
│ └── r14c17f03p01-ch2sk1fk1fl1.tiff
├── SQ00014617
│ └── Images
│ ├── r02c23f05p01-ch1sk1fk1fl1.tiff
│ ├── r03c06f02p01-ch4sk1fk1fl1.tiff
│ ├── r06c01f08p01-ch4sk1fk1fl1.tiff
│ ├── r06c16f02p01-ch2sk1fk1fl1.tiff
│ ├── r08c16f07p01-ch4sk1fk1fl1.tiff
│ ├── r11c14f07p01-ch2sk1fk1fl1.tiff
│ ├── r12c04f02p01-ch3sk1fk1fl1.tiff
│ ├── r12c08f04p01-ch4sk1fk1fl1.tiff
│ ├── r12c10f04p01-ch5sk1fk1fl1.tiff
│ ├── r13c09f07p01-ch4sk1fk1fl1.tiff
│ └── r15c14f04p01-ch2sk1fk1fl1.tiff
└── SQ00014618
└── Images
├── r01c14f08p01-ch1sk1fk1fl1.tiff
├── r03c09f07p01-ch1sk1fk1fl1.tiff
├── r03c09f07p01-ch5sk1fk1fl1.tiff
├── r03c12f06p01-ch4sk1fk1fl1.tiff
├── r05c10f08p01-ch5sk1fk1fl1.tiff
├── r06c01f07p01-ch1sk1fk1fl1.tiff
├── r07c09f07p01-ch1sk1fk1fl1.tiff
├── r13c05f04p01-ch5sk1fk1fl1.tiff
├── r14c10f02p01-ch3sk1fk1fl1.tiff
└── r16c23f01p01-ch2sk1fk1fl1.tiff
@gwaygenomics I just saw https://github.com/broadinstitute/cell-health/issues/106#issuecomment-663217930
Yes, that's the right order of operations.
But @hkhawar, unfortunately, you will also need to reprocess those files listed at the end of https://github.com/broadinstitute/cell-health/issues/106#issuecomment-663224408 because my random sampling revealed that those are also corrupted. I have no clue why so many files are getting corrupted but hopefully you will figure that out.
@hkhawar Thanks very much for helping out!
@hkhawar one more thing – could you please briefly describe the setup you are using to reprocess these images? Are you mounting the S3 bucket on your computer and running it on your computer by any chance? If so, I think that could be the issue because S3 mounts suck with heavy I/O.
@shntnu I ran this experiment on AWS. I am not sure why we have gotten so many corrupted images. My guess is that something happened while running DCP: instead of ending up in the dead-message queue for unfinished jobs, they somehow created an image file with 0 bytes.
@hkhawar Thanks for clarifying. Very strange! And note that the issue is that some output files are actually pretty large, e.g. ~8 MB, but are still corrupted. Worth checking in with Beth on this via Slack.
@gwaygenomics Could you please do the same thing that you did before, sorting the other channels for these images?
@shntnu Sure I will check with Beth on this tomorrow
Here's an example: r06c11f02p01-ch2sk1fk1fl1.tiff.zip located at projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images/SQ00014611/Images/r06c11f02p01-ch2sk1fk1fl1.tiff
It doesn't open using Preview.
It does open in Fiji, but the bottom pixels are missing.
I've posted this internally https://broadinstitute.slack.com/archives/G3QFDHXC4/p1595538827014000
@gwaygenomics if you can sort the other channels for the corrupted files for me as you did last time, then I will reprocess them today.
Sure - what folder do you want them in? Also, do you think reprocessing them the same way as before is a good idea? (are you going to do anything different?)
I am doing it locally. Just make a tmp2 folder on S3 and dump the new set of images for each plate there? Later we can delete these tmp folders from S3.
For our notes, could you pen down why they need to be in a new folder (vs. creating a load_data file pointing to the original locations)? It will be useful to know when we need to reprocess small batches.
I was avoiding using load_data.csv; I wanted to download the images locally and use CellProfiler locally to reprocess the files. This is how I typically do it for a small set of images.
Occasionally, CellProfiler just stochastically seems to do this: any operation, even a write or sync, will sometimes just go ker-flop, and when we're working on 10K/100K/1M/10M images, the likelihood it will happen >=1 times becomes significant. Since each plate has ~21K images, based on the list above, the failure rate here looks like roughly 1 in the low thousands.
If there's a problem with the source image, obviously that's one thing; if the problem is truly stochastic (aka when you run the same image again the output file comes out fine), there isn't a ton to do (though if these were done <60 days ago it's worth checking the logs for the known bad sites since that's easy while the logs are still in CloudWatch). If we think the file is being written correctly, but not synced correctly, we could always institute a 30 or 60 second pause after the CellProfiler pipeline is done before syncing.
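A minimal sketch of that pause-before-sync idea (not DCP's actual code; the pipeline name and paths are placeholders):
cellprofiler -c -r -p analysis.cppipe -o /tmp/output   # however the pipeline is normally invoked
sleep 60                                               # give pending writes time to settle
aws s3 sync /tmp/output s3://BUCKET/PREFIX/            # then push results to S3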
It's worth noting we can very easily handle the ones where files are small (obviously corrupted) using the MIN_FILE_SIZE option I added to DCP, by just resubmitting the whole batch with CHECK_IF_DONE set to TRUE and MIN_FILE_SIZE set small: anything with the right number of files above a certain size will just get skipped, and it will re-process just the ones where 1+ file is tiny. If either the uncorrupted OR the corrupted files have a stereotyped size, which Shantanu your methodology seems to imply, you could imagine other similar checks we could add; essentially either:
if filesize in accepted_file_sizes:
    goodfile_count += 1
if goodfile_count >= N:
    reprocess = False

or

if filesize not in known_bad_file_sizes:
    goodfile_count += 1
if goodfile_count >= N:
    reprocess = False
@hkhawar the corrupted files are ready to go! They are located at /home/ubuntu/bucket/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/tmp_version_two. Thanks!
@bethac07 Logs are not available now. My guess is that the problem happened during syncing of the files. On reprocessing those images again, they just worked fine. @gwaygenomics thanks, I am going to work on it.
Thanks for clarifying @bethac07 🥇.
@hkhawar details are below, but tl;dr: we could have gone with a fixed file size because these are uncompressed TIFFs, so I think they should all be the same size. But there's one aberration (below). So instead let's go with CHECK_IF_DONE=TRUE and MIN_FILE_SIZE = 9348718.
Details
I dug into this a bit for our future reference with this kind of issue.
frac_sizes %>% head() %>% knitr::kable()
From this table, it looks like 9348786 is the value to go with. But I don't know what's happening with 9348718 – why are there 1240 instances of that? No clue. Also, files with size 9348718 open fine with Preview.
size | n | frac |
---|---|---|
9348786 | 136926 | 0.9905378 |
9348718 | 1240 | 0.0089703 |
9210546 | 1 | 0.0000072 |
8795826 | 1 | 0.0000072 |
8683506 | 1 | 0.0000072 |
8631666 | 1 | 0.0000072 |
All other sizes have only 1-2 occurrences (except 8, which occurs 8 times).
frac_sizes %>% filter(size < 9348718) %>% count(n) %>% knitr::kable()
n | nn |
---|---|
1 | 56 |
2 | 2 |
8 | 1 |
9348718 is certainly special because if any one channel of a site has that value, then all channels of that site have that value:
sizes %>%
  filter(size == 9348718) %>%
  mutate(site = basename(path), plate = str_match(dirpath, "SQ[0-9]{8}")) %>%
  separate(site, c("site", "channel"), sep = "-") %>%
  group_by(site, plate) %>%
  tally() %>%
  ungroup() %>%
  arrange(site) %>%
  count(n)
n | nn |
---|---|
5 | 248 |
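As a quick cross-check of that MIN_FILE_SIZE threshold against the /tmp/image_files.txt listing from earlier (a sketch; anything below the threshold is what DCP would re-process):
awk '$4 ~ /\.tiff$/ && $3 < 9348718 { print $3, $4 }' /tmp/image_files.txt | sort -n   # the files that would be redone
awk '$4 ~ /\.tiff$/ && $3 < 9348718' /tmp/image_files.txt | wc -l                      # how many that is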
@gwaygenomics I have reprocessed the illum corrected images and they are available in the same folder:
s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/tmp_version_two/
Note: I haven't synced them to s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/; I guess you can do that.
@shntnu I have no idea why we get images of this 9348718 size. Did you try opening an image of this size in Fiji?
Download integrity confirmed! This is the output of the R code in https://github.com/broadinstitute/cell-health/issues/106#issuecomment-663224408:
size | n | frac |
---|---|---|
9348786 | 136926 | 0.9905306 |
9348718 | 1309 | 0.0094694 |
🎉
All that remains is to send IDR the S3 links
One potentially interesting observation is that all of the corrupted files that we needed to fix ended up having the smaller file size listed above.
next hurdle incoming!
IDR has all non-illumination corrected images, but they are missing 1,925 illumination corrected images.
The folks at IDR are working towards verifying the submission. A couple of points that either @hkhawar or @shntnu might know the answer to right away.
- Images with f09 in their name are missing from the illumination corrected set (there are 1,920 of these).
- There are 5 additional images missing in the illumination corrected set, all from plate SQ00014610.

Issue 1 - Missing f09
Here are example images:
r16c24f09p01-ch2sk1fk1fl1.tiff
r16c24f09p01-ch3sk1fk1fl1.tiff
r16c24f09p01-ch4sk1fk1fl1.tiff
r16c24f09p01-ch5sk1fk1fl1.tiff

Issue 2 - Five more
r16c24f01p01-ch1sk1fk1fl1.tiff
r16c24f01p01-ch2sk1fk1fl1.tiff
r16c24f01p01-ch3sk1fk1fl1.tiff
r16c24f01p01-ch4sk1fk1fl1.tiff
r16c24f01p01-ch5sk1fk1fl1.tiff
@gwaygenomics it's a huge pain. Again, I think it is related to the same problem of not transferring them to S3 properly, which produced corrupted and missing image files. If they provide us a list of the missing illum images then I have to redo it again.
Argh! Is there something that I can do to ease the pain? Transfer files into a new folder again? It seems like this is an AWS transfer issue?
Yup, that would be a great help. Let me know once they are done. I will work on it.
Turns out that we actually have 17,285 illum corrected files missing:
- 1,920 "f09" files missing per plate × 9 plates
- 5 "f01" files missing only in plate SQ00014610
- 1,920 × 9 + 5 = 17,285
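One way to enumerate exactly which illum corrected files are still missing is to diff the raw and corrected listings on plate/filename. A sketch; it assumes every key contains the SQ-prefixed plate ID, which holds for the corrected tree and is an assumption for the raw tree:
raw=s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/CRISPR_PILOT_B1/images
corr=s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images

key_ids() {  # reduce each key to "<plate>/<filename>" so the two trees can be compared
  aws s3 ls --recursive "$1" | awk '{print $4}' | grep '\.tiff$' |
    sed -n 's|.*\(SQ[0-9]\{8\}\).*/\([^/]*\.tiff\)$|\1/\2|p' | sort -u
}

comm -23 <(key_ids "$raw") <(key_ids "$corr") > still_missing.txt   # raw images with no corrected copy
wc -l < still_missing.txt                                           # should reach 0 once reprocessing is done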
Transfer files into a new folder again?
Yup that would be a great help. Let me know once they are done.
I have confirmed that all of these files are now in a separate folder. The folder is /home/ubuntu/bucket/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/tmp_version_three.
Note that the subfolder is tmp_version_three.
@hkhawar all set for the next (and hopefully final!) iteration of the illum correction pipeline. Thanks again
Uploading the images into the public domain is a very important part of the research process. I will upload image files to the Image Data Resource and add URL and metadata information to the Broad Bioimage Benchmark Collection.
We will use this issue to outline the required steps. First, I will need to restore the image files from AWS Glacier storage. @shntnu - can you link the most recent resources?
Edit: data now available at https://idr.openmicroscopy.org/webclient/?show=screen-2701