broadinstitute / cell-health

Predicting Cell Health with Morphological Profiles
MIT License

Uploading Image Files to IDR and BBBC #106

Closed gwaybio closed 3 years ago

gwaybio commented 4 years ago

Uploading the images into the public domain is a very important part of the research process. I will upload image files to the Image Data Resource and add URL and metadata information to the Broad Bioimage Benchmark Collection.

We will use this issue to outline the required steps. First, I will need to restore the image files from AWS Glacier storage. @shntnu - can you link the most recent resources?


edit

Data now available: https://idr.openmicroscopy.org/webclient/?show=screen-2701

shntnu commented 4 years ago

The process is the same as outlined here https://github.com/broadinstitute/lincs-cell-painting/issues/2#issuecomment-588323507 except that you stop at step 4 ("Delete the files from EBS"; the compression related comments are not relevant)

shntnu commented 4 years ago

and for our internal (imaging platform) notes: I made some notes here about storage policy related to this question

gwaybio commented 4 years ago

Thanks @shntnu - these instructions are only slightly different from the ones @hkhawar sent me on Slack. Using both sets of instructions, and running the following command on 1 example plate, I receive the error reproduced below.

Is there something obvious that I'm doing wrong, or, is there a quick fix? If not, I will keep digging.

Command

(cell-health) ubuntu@ip-10-0-9-22:~/efs/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/workspace/software/imaging-backup-scripts$ parallel \
>   --results restore \
>   -a list_of_plates.txt \
>   ./glacier_restore.sh \
>   --project_name ${PROJECT_NAME} \
>   --batch_id ${BATCH_ID} \
>   --plate_id {1} \
>   --get_images

Error

Get images ...
Download:s3://imaging-platform-cold/imaging_analysis/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/plates/2015_07_01_Cell_Health_Vazquez_Cancer_Broad_CRISPR_PILOT_B1_SQ00014610_images_illum_analysis.tar.gz

An error occurred (NoSuchKey) when calling the RestoreObject operation: The specified key does not exist.

An error occurred (404) when calling the HeadObject operation: Not Found
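
For anyone debugging a similar `NoSuchKey`, it can help to rebuild the archive URL by hand and check it with `aws s3 ls` before attempting a restore. The Python sketch below reproduces the URL printed in the error output; the naming pattern is inferred from that single example, and the helper `archive_key` is hypothetical (it is not part of `glacier_restore.sh`).

```python
def archive_key(project, batch, plate, bucket="imaging-platform-cold"):
    """Build the archive URL for one plate.

    Assumption: the pattern below is inferred from the one URL shown in the
    error output; it is not confirmed against the restore script itself.
    """
    name = f"{project}_{batch}_{plate}_images_illum_analysis.tar.gz"
    return f"s3://{bucket}/imaging_analysis/{project}/plates/{name}"


if __name__ == "__main__":
    # Values taken from the command and error output in this comment.
    print(
        archive_key(
            "2015_07_01_Cell_Health_Vazquez_Cancer_Broad",
            "CRISPR_PILOT_B1",
            "SQ00014610",
        )
    )
```

If the printed URL does not show up in a bucket listing, the archive was never created, which is what turned out to be the case here.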
shntnu commented 4 years ago

Ah – looks like Cell Health images were never archived, so I think you are all set!

s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/CRISPR_PILOT_B1/images

For our notes: this dataset cost ~$72/mo to store and so it went down in the priority list. So glad we have a new process in place now that doesn't rely on running this archival step!

shntnu commented 4 years ago
~$ aws s3 ls --recursive s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/CRISPR_PILOT_B1/images|grep tiff  > ~/Desktop/CRISPR_PILOT_B1.txt
~$ wc -l ~/Desktop/CRISPR_PILOT_B1.txt
  155520 /Users/shsingh/Desktop/CRISPR_PILOT_B1.txt

155520 = 384 wells × 9 sites × 5 channels × 3 cell lines × 3 replicates, so this looks good
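
The same check can be scripted. This is a minimal Python sketch: the layout factors are the ones quoted above, while the helper names and the idea of counting listing lines that contain `tiff` (mirroring the `grep tiff` step) are illustrative assumptions.

```python
from math import prod

# Plate layout factors quoted in the comment above.
FACTORS = {
    "wells": 384,
    "sites": 9,
    "channels": 5,
    "cell_lines": 3,
    "replicates": 3,
}


def expected_image_count(factors):
    """Multiply the layout factors to get the expected number of TIFF files."""
    return prod(factors.values())


def count_tiff_lines(listing_lines):
    """Count listing lines that mention 'tiff' (same filter as `grep tiff`)."""
    return sum(1 for line in listing_lines if "tiff" in line)


if __name__ == "__main__":
    print(expected_image_count(FACTORS))  # 155520
```

Comparing `count_tiff_lines(open(path))` on the saved listing against `expected_image_count(FACTORS)` flags missing or extra files in one step.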

gwaybio commented 4 years ago

Thanks @shntnu ! This process is new to me so thanks for bearing with me :)

In chatting with @hkhawar about the file structure, is that number slightly concerning? i.e. do we have the illumination corrected images? The load_csv file seems to indicate illumination correction was performed.

hkhawar commented 4 years ago

It seems the load_data_csv folder on S3 shows that the load_data_with_illum.csv file was created, but there is no illum folder containing illum files on S3.


shntnu commented 4 years ago

When you inspect the path of illum files, you will find that they are nested inside the analysis folder – this was our previous standard (we later changed to storing illum functions with images)

I bet you will find them at that location

read_csv("../workspace/load_data_csv/CRISPR_PILOT_B1/SQ00014610/load_data_with_illum.csv", col_types = cols(.default = col_character())) %>% slice(1) %>% select(matches("^PathName_Illum")) %>% pivot_longer(cols = everything()) %>% knitr::kable()

| name | value |
|:--|:--|
| PathName_IllumAGP | /home/ubuntu/bucket/projects/2015_07_01_KRAS_Vazquez_Cancer_Broad/workspace/analysis/CRISPR_PILOT_B1/SQ00014610/illum/ |
| PathName_IllumDNA | /home/ubuntu/bucket/projects/2015_07_01_KRAS_Vazquez_Cancer_Broad/workspace/analysis/CRISPR_PILOT_B1/SQ00014610/illum/ |
| PathName_IllumER | /home/ubuntu/bucket/projects/2015_07_01_KRAS_Vazquez_Cancer_Broad/workspace/analysis/CRISPR_PILOT_B1/SQ00014610/illum/ |
| PathName_IllumMito | /home/ubuntu/bucket/projects/2015_07_01_KRAS_Vazquez_Cancer_Broad/workspace/analysis/CRISPR_PILOT_B1/SQ00014610/illum/ |
| PathName_IllumRNA | /home/ubuntu/bucket/projects/2015_07_01_KRAS_Vazquez_Cancer_Broad/workspace/analysis/CRISPR_PILOT_B1/SQ00014610/illum/ |

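
For those not working in R, roughly the same lookup can be done with Python's standard library. This is a sketch rather than the workflow actually used here: the helper name `illum_paths` is made up, and the demo uses a tiny in-memory CSV with placeholder values instead of the real `load_data_with_illum.csv`.

```python
import csv
import io


def illum_paths(csv_file):
    """Return {column: value} for the PathName_Illum* columns of the first data row."""
    reader = csv.DictReader(csv_file)
    first_row = next(reader)
    return {
        name: value
        for name, value in first_row.items()
        if name.startswith("PathName_Illum")
    }


if __name__ == "__main__":
    # In-memory stand-in for load_data_with_illum.csv (values are placeholders).
    demo = io.StringIO(
        "Metadata_Well,PathName_IllumDNA,PathName_IllumER\n"
        "A01,/example/illum/,/example/illum/\n"
    )
    for name, value in illum_paths(demo).items():
        print(name, value)
```

To run it on a real file, pass `open(path, newline="")` instead of the `StringIO` object.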
hkhawar commented 4 years ago

👋 shantanu


shntnu commented 4 years ago

> do we have the illumination corrected images?

@gwaygenomics note that only illumination correction functions are stored, not the corrected images themselves.

In https://idr.openmicroscopy.org/webclient/?show=screen-1751, we decided to additionally store the illumination corrected images, so you would need to generate those separately if you decide to do that here

shntnu commented 4 years ago

@gwaygenomics Some history of why we have illum corrected images in https://idr.openmicroscopy.org/webclient/?show=screen-1751:

```
Forwarded Conversation
Subject: Depositing images to Image Data Resource
------------------------

From: Anne Carpenter
Date: Wed, Dec 7, 2016 at 12:28 PM
To: Shantanu Singh, Mohammad Hossein Rohban, Juan C Caicedo

Mohammad will soon be receiving a hard drive to transfer images to IDR.

I want to discuss whether to upload illumination corrected images in addition to or instead of the raw images. And, outlines of segmented objects.

I ask this because I clicked on the Wawer screen link http://idr-demo.openmicroscopy.org/webclient/?show=screen-1251 and clicked a few wells and some stains are very dramatically dimmer across the top. For deep learning purposes and/or for simple viewing it would be much nicer to have the corrected ones.

I guess it comes down to whether IDR's tools make it easy to ignore the corrected and/or non-corrected ones if you want, and whether IDR is not enthusiastic about doubling the size of the data for something simple like correction/not.

Secondly, I think we should routinely submit the nucleus outlines and cell outlines. These should be very small (binary) images so I don't think IDR would object to the size. The question for them is whether it's better to submit two binary images (one cell outlines and one nucleus outlines). The 2nd Q is whether they have any tools built in that can overlay these onto the intensity images.

Unless someone has profound insights/discussion on this I would suggest the next step is for a volunteer to contact IDR and ask about all this.

Anne

----------
From: Shantanu Singh
Date: Wed, Dec 7, 2016 at 1:17 PM
To: Anne Carpenter
Cc: Mohammad Hossein Rohban, Juan C Caicedo

On Wed, Dec 7, 2016 at 12:27 PM, Anne Carpenter wrote:
> Unless someone has profound insights/discussion on this I would suggest the
> next step is for a volunteer to contact IDR and ask about all this.

I agree

Mohammad – given that you are our link to IDR at the moment, would you mind asking them this question? i.e.

- We would like to upload both, raw images, and images corrected for illumination inhomogeneity
- We would also like to upload nucleus outlines and cell outlines for each image
- Let us know how to go about this, and if you have any concerns about these extra image data
```

```
Forwarded Conversation
Subject: Re: Hard-drive for Target Accelerator ORF data to go in IDR
------------------------

From: Mohammad Hossein Rohban
Date: Thu, Dec 8, 2016 at 10:39 AM
To: Eleanor Williams
Cc: Anne Carpenter, Shantanu Singh, Juan Caicedo

Dear Eleanor,

Thanks for sending us the hard drive. We were also wondering if we could also include illumination corrected images and images containing nucleus and cell outlines in addition to the original images (size will still be less than 1 TB). Please let us know how to go about this and if you have any concerns about these extra image data.

Best,
Mohammad

----------
From: Anne Carpenter
Date: Thu, Dec 8, 2016 at 10:42 AM
To: Mohammad Hossein Rohban
Cc: Eleanor Williams, Shantanu Singh, Juan Caicedo

In addition to just knowing whether it's acceptable to receive these extra images, we wondered whether the tools in IDR would allow for choosing which of 10 channels to display (5 original channels, 5 illumination corrected channels, 2 outline 'channels') and whether the outlines could be overlaid on intensity images.

Anne

----------
From: Eleanor Williams
Date: Fri, Dec 9, 2016 at 4:19 AM
To: Anne Carpenter, Mohammad Hossein Rohban
Cc: Shantanu Singh, Juan Caicedo, idr-submission@openmicroscopy.org.uk

Hi Anne and Mohammad

Please do add the illumination corrected images and images containing nucleus and cell outlines. The IDR is set up so that it is possible to choose which of the channels to display. I'm less sure about the overlaying of the outlines but we can certainly look into coming up with a solution to this.

Best regards
Eleanor

----------
From: Mohammad Hossein Rohban
Date: Mon, Dec 12, 2016 at 5:48 PM
To: Eleanor Williams
Cc: Anne Carpenter, Shantanu Singh, Juan Caicedo, idr-submission@openmicroscopy.org.uk

Hi Eleanor,

We are about to submit our images. The raw images are generally in the range of 0-2000 in an image file format that spans 0-4095. Does IDR have viewing options that allow the viewer to contrast-stretch the images if desired so they do not appear too dim? Or do you recommend we adjust the images on our side prior to submission (e.g. find the 99.9th percentile of maximum pixel value in all images and contrast-stretch all images to that value)?

Thanks,
Mohammad

----------
From: Eleanor Williams
Date: Tue, Dec 13, 2016 at 5:24 AM
To: Mohammad Hossein Rohban
Cc: Anne Carpenter, Shantanu Singh, Juan Caicedo, idr-submission@openmicroscopy.org.uk

Hi Mohammad

Yes we can apply rendering settings in IDR. We normally do this using a config file. The rendering setting are applied at the whole dataset level but it should be possible to apply at more specific levels too. Here is an example of a config file for a screen https://github.com/IDR/idr-metadata/blob/master/idr0001-graml-sysgro/screenA/idr0001-screenA-renderdef.yml. If you could include a table of the settings you'd like to apply in each channel will create the config files.

Best regards
Eleanor

----------
From: Anne Carpenter
Date: Tue, Dec 13, 2016 at 7:57 AM
To: Eleanor Williams
Cc: Mohammad Hossein Rohban, Shantanu Singh, Juan Caicedo, idr-submission@openmicroscopy.org.uk

That is excellent. In that case we will send the raw images and the illumination corrected images, and allow the config file to adjust contrast if needed.

Anne
```
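
For context, the percentile-based contrast stretch Mohammad asks about in the email log above could look roughly like the Python sketch below. It is an illustration only: the function names are made up, images are plain nested lists, the nearest-rank percentile is one of several possible definitions, and the 0-4095 output range is taken from the file format mentioned in the email. In the end the team left contrast to IDR's rendering config instead.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, with q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def stretch_limit(images, q=99.9):
    """The q-th percentile of the per-image maximum pixel values."""
    maxima = [max(max(row) for row in image) for image in images]
    return percentile(maxima, q)


def contrast_stretch(image, limit, out_max=4095):
    """Linearly map 0..limit onto 0..out_max, clipping brighter pixels."""
    return [[min(out_max, p * out_max // limit) for p in row] for row in image]


if __name__ == "__main__":
    images = [[[0, 1000], [500, 2000]], [[10, 1500]]]
    limit = stretch_limit(images)
    print(limit, contrast_stretch(images[0], limit))
```

Using one shared limit for all images keeps intensities comparable across plates, which a per-image stretch would not.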
shntnu commented 4 years ago

And some more email logs from the time we submitted https://idr.openmicroscopy.org/webclient/?show=screen-1751

Click to expand ``` Forwarded Conversation Subject: Re: [idr-submission] Hard-drive for Target Accelerator ORF data to go in IDR ------------------------ From: Anne Carpenter Date: Mon, Mar 6, 2017 at 9:22 AM To: Mohammad Hossein Rohban Cc: Shantanu Singh Great! SS or MB would be best suited to answer about mapping and scaling of channels. It's an important Q and will ease our future work if we get the right answer. On Mon, Mar 6, 2017 at 9:20 AM, Mohammad Hossein Rohban wrote: Hi Anne A DOI has been assigned for the TA ORF dataset at IDR, so we can now include it in eLife submission. They also asked (see forwarded email) for the rendering setting. Do we want to change assigned colors of the channels? —Mohammad Begin forwarded message: From: Eleanor Williams Subject: Re: [idr-submission] Hard-drive for Target Accelerator ORF data to go in IDR Date: March 3, 2017 at 6:14:30 PM EST To: Mohammad Hossein Rohban , "idr-submission@openmicroscopy.org.uk" Reply-To: , I forgot that I wanted to ask you about rendering settings.  At the moment there is a green channel, a red channel and 3 blue channels, in both the raw image and illumination corrected images (screenshots attached).   Would you like to change the color of some of the channels and is there a particular max and min value you'd like applied across all plates (or all raw and all illumination corrected plates)?Best regardsEleanor  On 03/03/2017 22:52, Eleanor Williams wrote: Hi MohammadThe data DOI for your dataset will be http://dx.doi.org/10.17867/10000105 and this can be put in your publication. The sentence should be along the lines of 'Image files are available in the Image Data Resource under DOI http://dx.doi.org/10.17867/10000105'.I have also attached the depositor agreement for the University of Dundee, which one of the authors should sign and then ideally scan and email back to us. 
We have now been able to test load a few plates and they look fine so we'll go ahead and get them all into private version of IDR ready for the next data release.   I am looking at the annotations now and will let you know if I have any questions. Best regardsEleanor On 01/03/2017 15:48, Eleanor Williams wrote: Great, thanks.  I'll ask for the DOI to be generated and email it to you when we get it. Best regardsEleanor On 01/03/2017 15:39, Mohammad Hossein Rohban wrote: Thanks! Indeed we have changed the title to “Systematic morphological profiling of human gene and allele function via Cell Painting”. Everything else is precise in the attached excel file. Known ORCIDs of the authors are :Mohammad H. Rohban : 0000-0001-6589-850XAnne E. Carpenter : 0000-0003-1555-8261 Best,Mohammad On Mar 1, 2017, at 7:20 AM, Eleanor Williams wrote: Hi MohammadThe data is still in the process of being added to the IDR as we've had a few infrastructure changes in the last month which has prevented data loading.  But we can create a place holder and get a data DOI for you to put in the paper in the next couple of days. To get the data DOI minted I need to submit basic details including the dataset creators (paper authors) and license information.  Could you check over the attached spreadsheet to make sure the details are correct?    If you know the ORCID IDs (https://orcid.org/) of any of the authors it would be useful to have them.  If you want to add any subject keywords they can be added on the 'subject' line.   The default license is CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/).  After minting the data DOI, the University of Dundee will create a depositor agreement which I can email you for signing.  It can then be scanned and returned.   We anticipate that the raw and corrected images with annotations would be live in IDR by 15th March.  The segmented images might take a little longer.  I hope this will with your publication time frame.  
Best regardsEleanor On 28/02/2017 22:14, Mohammad Hossein Rohban wrote: Hi Eleanor, Hope all is well!We have just submitted a revision of our paper related to this dataset and the editor asked us to provide details of the image dataset we are using (actually an access URL). I was wondering if the data has been processed at your end and if there is any access URL available for it? Thanks,Mohammad  On Feb 7, 2017, at 9:10 AM, Eleanor Williams wrote: The hard drive has arrived safely and I'll try and look at it as soon as I can.   Best regardsEleanor On 02/02/2017 21:44, Jeanelle Ackerman wrote: Hi, I just dropped this hard-drive in the mail today so it should be back to you shortly.  Please let me know when it's been received, thanks.  Best,  Jeanelle On Tue, Dec 6, 2016 at 5:23 AM, Eleanor Williams  wrote: Dear Anne, Mohammad and Jeanelle A 1Tb hard-drive is on its way to you (addressed to Anne) by Fedex.  If you could put the image files and any metadata files you have on there and send it back to me that I would be great.  I have included a return address label in the package. Best regards Eleanor ---------- From: Shantanu Singh Date: Tue, Mar 7, 2017 at 7:59 PM To: Anne Carpenter Cc: Mohammad Hossein Rohban On Mon, Mar 6, 2017 at 9:22 AM, Anne Carpenter wrote: Great! SS or MB would be best suited to answer about mapping and scaling of channels. It's an important Q and will ease our future work if we get the right answer. (1) I don't have a quick answer about mapping – we had discussed this before in the group and there was no satisfactory conclusion. If we need an immediate (<2 weeks) answer then I'd suggest going with their defaults. (I'll message on Slack to see if anyone recollects how that fancy image for the Science article was made; I think we thought about color mapping for that) (2)For scaling, we have a better sense:  for illumination corrected images – rescale 0-65535 to 0-255 when displaying. 
This is ok because the images are already illum corrected so we know that plate-to-plate variations are already adjusted for. Mohammad – worth checking whether this indeed works out ok by testing for a few images. for raw images – go with their defaults because we don't really have an easy way of setting scales  ---------- From: Shantanu Singh Date: Tue, Mar 7, 2017 at 8:02 PM To: Anne Carpenter Cc: Mohammad Hossein Rohban On Tue, Mar 7, 2017 at 7:59 PM, Shantanu Singh wrote: > (I'll message on Slack to see if anyone recollects how that fancy image for > the Science article was made; I think we thought about color mapping for > that) https://broadinstitute.slack.com/archives/ip-general/p1488934898000419 ---------- From: Shantanu Singh Date: Thu, Mar 23, 2017 at 7:10 AM To: Anne Carpenter Cc: Mohammad Hossein Rohban On Tue, Mar 7, 2017 at 8:02 PM, Shantanu Singh wrote: >> (I'll message on Slack to see if anyone recollects how that fancy image for >> the Science article was made; I think we thought about color mapping for >> that) > > https://broadinstitute.slack.com/archives/ip-general/p1488934898000419 PS – I didn't end up asking w Mark/David; please go ahead and ask them ---------- From: Shantanu Singh Date: Thu, Mar 23, 2017 at 7:14 AM To: Anne Carpenter Cc: Mohammad Hossein Rohban Also, the page says 12 plates whereas we have only 6 https://twitter.com/jrswedlow/status/844626088512933888 May be because they are counting the illumination corrected ones too? In any case, please let them know ---------- From: Shantanu Singh Date: Thu, Mar 23, 2017 at 7:19 AM To: Anne Carpenter Cc: Mohammad Hossein Rohban Also worth checking to make sure they have got the rest of the metadata correct: https://github.com/IDR/idr-metadata/tree/master/idr0033-rohban-pathways e.g. they are using the old channel names (PhGolgi, etc.) 
On Thu, Mar 23, 2017 at 7:14 AM, Shantanu Singh ---------- From: Anne Carpenter Date: Fri, Mar 24, 2017 at 8:28 AM To: Shantanu Singh Cc: Mohammad Hossein Rohban Ach! Amazing you noticed this. Mohammad will you be able to track this down and help them fix what's needed? Anne E. Carpenter, Ph.D.Director, Imaging PlatformBroad Institute of Harvard and MIT415 Main Street, Cambridge MA 02142 phone: (617) 714-7750anne@broadinstitute.orghttp://www.broadinstitute.org/~anne ---------- From: Mohammad Hossein Rohban Date: Tue, Mar 28, 2017 at 10:28 AM To: Anne Carpenter Cc: Shantanu Singh The metadata is good, but I am going to email them about the number of plates.  —Mohammad ---------- From: Anne Carpenter Date: Mon, Apr 3, 2017 at 10:33 AM To: Mohammad Hossein Rohban , Shantanu Singh Shantanu, you can skip reading below, the question is: both the illum-corrected and raw images will be available for *download* from IDR (once they set up download functionality) so our decision right now is which one do we want available for browsing. I vote for illum corrected personally. Anne On Mon, Apr 3, 2017 at 8:40 AM, Mohammad Hossein Rohban wrote: Hi Anne Eleanor asked me about some visualization settings of TA ORF at IDR and whether we want to illumination corrected plates just for download (2 last emails). I was not sure about them. Do you have any preferences about them? —Mohammad  Begin forwarded message: From: Eleanor Williams Subject: Re: [idr-submission] Hard-drive for Target Accelerator ORF data to go in IDR Date: April 3, 2017 at 6:55:25 AM EDT To: Mohammad Hossein Rohban Cc: "Eleanor Williams (Staff)" , "idr-submission@openmicroscopy.org.uk" Hi MohammadJust checking about whether you want just the 6 raw image plates in IDR for now?Best regardsEleanor On 29/03/2017 17:32, Eleanor Williams wrote: Hi MohammadOk, I will update the annotations with the transcripts listed in idr0033-screenA-library_new.txt and add to this the Quality Control columns.  
Would you rather that in the IDR website we just list the 6 raw image plates and then perhaps the illumination corrected versions can be just available for download (when we get the download facility ready)?  I can easily delete the 6 illumination corrected plates from the IDR website. Have you had any thoughts about the rendering settings (a max and min intensity we could use for all raw images) and the channel colours?Best regardsEleanor On 28/03/2017 17:12, Mohammad Hossein Rohban wrote: Hi Eleanor Sorry for replying to this with delay. In case of the discrepancy, we would like to use idr0033-screenA-library_new.txt which has been obtained directly from GPP. And I agree the idea of adding the Quality control column to the library file makes it clearer. Thanks for doing this. I also wanted to mention that the number of Plate Count was set to 12 in the IDR website, while we only have 6 (and the other 6 were just illumination corrected versions of the same plates). Would you please set it to 6 instead? Thanks,Mohammad On Mar 15, 2017, at 5:29 PM, Eleanor Williams wrote: Hi MohammadDue to technical problems our release hasn't happened yet today, but will hopefully happen very soon. I'll let you and Anne know when it has.  In this release I haven't included the Transcript identifiers because time was running short and the identifiers you listed in idr0033-screenA-library_new.txt are not the same as those in Supplemental Table 1 and I wanted to double check this. e.g. RICTOR_WT/ccsbBroad304_13449 has a target transcript of  BC029608.1 in the table and XM_006714463.3 in your file.  Happy to use the ones in your file, or not include this if column if its a complicated issue.   
I was also not expecting  transcript identifiers when there is the comment of "ORF did not map to any transcripts" but I see now that in the study file there is a note that  "The ORF sequence is compared against the target transcript; ORFs with matching percentage of less than 99% in either nucleotide or protein sequences are filtered out".  Maybe we should add a Quality Control column to the library file and enter 'Fail' for these rows and put a  "Quality Control Comment" of "ORF matched transcript with percentage less than 99% in either nucleotide or protein sequences"?  Example of what I mean is attached. If you'd rather just leave it as it is then that is fine but I thought this might make it clearer that these images were not used in the analyses. Best regardsEleanor On 14/03/2017 12:12, Mohammad Hossein Rohban wrote: Hi Eleanor  That’s great! Thanks for handling this. Here are the answers: 1. Yes.2. Unfortunately they do not directly map to them. One has to use the actual ORF sequence to map them. 3. I think that would be great. Unfortunately, BRDN samples are only shown only in the internal GPP portal (I think because of the privacy reasons). 4. I included the Transcript as a column and you can find the new library file attached.5. We prefer to use 'small asymmetric cells’. 6. Note that both of the two sql files should be imported in the same database. It appears that you are importing them in separate databases.7. For each treatment, there are 5 different replicates. We obtain correlations between all pairs of profiles corresponding to these replicates. This gives us 10 correlation values. Median replicate correlation is then defined as the median of these 10 numbers. 
—Mohammad On Mar 10, 2017, at 10:07 AM, Eleanor Williams wrote: Hi MohammadI've been through all the annotation files you sent and slightly reformatted them but no major changes except that I added in the _illum_corrected plates to the library file and added the phenotypes to the processed file.  I've attached all the files to this email for you to check over.   If there are no major issues then we will get these annotations added to the images on Monday or Tuesday next week. If necessary we can update them again after that.  I had a few minor questions for you:1. Does EMPTY mean untreated cells? 2. Do you know if the clone identifiers you have listed map directly to identifiers in the human ORFeome resource (http://horfdb.dfci.harvard.edu/)? The reason I ask is that we have another high content screen using clones from the human ORFeome resource and I wondered if we could link between any of them.  3. Would it be useful for us to link out to the ORF sequence alignments in the gpp portal?  I found that the 'ccsb' type clone IDs linked to this. But not the 'BRDN*` type ones - is information about them held anywhere?4. I also wondered if it would be useful for us to add the Target Transcripts from Suppl Table 1 to the library file.  If you have this information accessible would you be able to add it?5. I noticed a slight difference in terms used for one of the phenotypes - 'small cells (condensed)' in paper vs 'small asymmetric cells' in the file I was sent.  Which would you rather use? 6. I tried importing the two sql files of feature data into mysql databases. TargetAccelerator.sql worked fine but i got the following error with the other onemysql -u root -p idr0033_Per_Object_View < Per_Object_View.sql ERROR 1146 (42S02) at line 1: Table 'idr0033_per_object_view.sigma2_pilot_2013_10_11_analysis_per_nuclei' doesn't exist7. Could you give me a short description of what the 'Median Replicate Correlation' is? I think that's all.  
For now I have not mapped any of your phenotypes to ontology terms as we don't have good ways of expressing enriched or de-enriched phenotypes but I'd like work on this in future.  If you could let me know if the files I have attached are OK as soon as possible that would be great. I can add any other extra bits of information later. Also we need the depositor agreement returned. Have a good weekend.  Best regardsEleanor On 07/03/2017 18:36, Mohammad Hossein Rohban wrote: Hi Eleanor, Thanks! We will soon let you know about the rendering setting. —Mohammad On Mar 3, 2017, at 6:14 PM, Eleanor Williams wrote: I forgot that I wanted to ask you about rendering settings.  At the moment there is a green channel, a red channel and 3 blue channels, in both the raw image and illumination corrected images (screenshots attached).   Would you like to change the color of some of the channels and is there a particular max and min value you'd like applied across all plates (or all raw and all illumination corrected plates)?Best regardsEleanor   On 03/03/2017 22:52, Eleanor Williams wrote: Hi MohammadThe data DOI for your dataset will be http://dx.doi.org/10.17867/10000105 and this can be put in your publication. The sentence should be along the lines of 'Image files are available in the Image Data Resource under DOI http://dx.doi.org/10.17867/10000105'.I have also attached the depositor agreement for the University of Dundee, which one of the authors should sign and then ideally scan and email back to us.  We have now been able to test load a few plates and they look fine so we'll go ahead and get them all into private version of IDR ready for the next data release.   I am looking at the annotations now and will let you know if I have any questions.  Best regardsEleanor On 01/03/2017 15:48, Eleanor Williams wrote: Great, thanks.  I'll ask for the DOI to be generated and email it to you when we get it.  Best regardsEleanor On 01/03/2017 15:39, Mohammad Hossein Rohban wrote: Thanks! 
Indeed we have changed the title to “Systematic morphological profiling of human gene and allele function via Cell Painting”. Everything else is precise in the attached excel file.

----------
From: Shantanu Singh
Date: Mon, Apr 3, 2017 at 1:28 PM
To: Anne Carpenter
Cc: Mohammad Hossein Rohban

On Mon, Apr 3, 2017 at 10:33 AM, Anne Carpenter wrote:
> I vote for illum corrected personally.

I agree

----------
From: Anne Carpenter
Date: Mon, Apr 3, 2017 at 1:42 PM
To: Shantanu Singh
Cc: Mohammad Hossein Rohban

Right now it appears both copies are visible at IDR and it's clear which are corrected and which not - Mohammad, why not just keep it as is?

----------
From: Mohammad Hossein Rohban
Date: Mon, Apr 3, 2017 at 1:43 PM
To: Anne Carpenter
Cc: Shantanu Singh

Apparently if we keep it as is, the number of plates would be automatically shown as 12.

----------
From: Anne Carpenter
Date: Mon, Apr 3, 2017 at 1:44 PM
To: Mohammad Hossein Rohban
Cc: Shantanu Singh

If that is how that number arises and it can't be changed except by deleting data, I think that is ok to leave as is. Now we are aware that # plates = # plates of data uploaded rather than # of plates tested in the experiment, I can live with it as is.
```
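
The display scaling suggested in the email log above for illumination corrected images (rescale 0-65535 to 0-255) amounts to a fixed linear mapping. A minimal Python sketch, assuming integer pixel values in nested lists and a hypothetical helper name:

```python
def to_display_8bit(image, in_max=65535, out_max=255):
    """Map pixel values in 0..in_max linearly onto 0..out_max.

    Values outside 0..in_max are clipped first. Integer division is used,
    so results are rounded down.
    """
    return [
        [min(max(p, 0), in_max) * out_max // in_max for p in row]
        for row in image
    ]


if __name__ == "__main__":
    print(to_display_8bit([[0, 257, 32768, 65535]]))  # [[0, 1, 127, 255]]
```

Because the range is fixed rather than derived from the data, the same raw value maps to the same display value on every plate.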
gwaybio commented 4 years ago

Great! Thanks for providing this context @shntnu - I'd like to include both raw and illumination corrected images.

I see the illumination correction functions (.mat files), but I will need help applying them.

hkhawar commented 4 years ago

Greg if you want, I can help you out in getting the illumination corrected images

gwaybio commented 4 years ago

@hkhawar - yes please! I will find a time on your calendar for a quick meeting

hkhawar commented 4 years ago

Sure

shntnu commented 4 years ago

Hamdah reprocessed some illum corrected files that were corrupted and stored them in folders like this

s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/tmp/SQ00014610/illum_corrected/

I am now going to copy these to their corresponding original locations e.g. here

s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images/SQ00014610/Images/

using this command

origpath=s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images

temppath=s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/tmp

# copy all files (the ones missing in the temppath will fail)
parallel \
    --header ".*\n" \
    -C "," \
    -a corrupted_image.csv \
    aws s3 cp ${temppath}/{1}/illum_corrected/{2} ${origpath}/{1}/Images/{2}

corrupted_image.csv is available here
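
To make the column substitution in the `parallel` call explicit: `--header ".*\n"` treats the first line of the CSV as a header, `-C ","` splits each remaining line on commas, and `{1}` and `{2}` are the first and second columns. Below is a minimal Python sketch of the same mapping. It assumes the first column is the plate ID and the second is the image file name; the column names `plate` and `filename` in the example and the helper `copy_pairs` are hypothetical.

```python
import csv
import io

TEMP = "s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/tmp"
ORIG = "s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images"


def copy_pairs(csv_text, temppath=TEMP, origpath=ORIG):
    """Return (source, destination) S3 URIs, one pair per CSV data row."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header line, like --header ".*\n"
    pairs = []
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        plate, filename = row[0].strip(), row[1].strip()
        src = f"{temppath}/{plate}/illum_corrected/{filename}"
        dst = f"{origpath}/{plate}/Images/{filename}"
        pairs.append((src, dst))
    return pairs


# Hypothetical two-line CSV; the real corrupted_image.csv may use other column names.
example = "plate,filename\nSQ00014610,r02c13f02p01-ch1sk1fk1fl1.tiff\n"
for src, dst in copy_pairs(example):
    print("aws s3 cp", src, dst)
```

Printing the commands first, rather than running them, gives a cheap dry run before anything in the original location is overwritten.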

This step revealed that some files were missing in the tmp folder:

parallel \
    --header ".*\n" \
    -C "," \
    -a corrupted_image.csv \
    "if ! aws s3 ls ${temppath}/{1}/illum_corrected/{2} > /dev/null; then echo Temp path - {1}/{2} missing; fi"
Temp path - SQ00014613/r07c21f05p01-ch2sk1fk1fl1.tiff missing
Temp path - SQ00014613/r06c04f05p01-ch5sk1fk1fl1.tiff missing
Temp path - SQ00014613/r10c08f05p01-ch1sk1fk1fl1.tiff missing
Temp path - SQ00014613/r08c19f04p01-ch1sk1fk1fl1.tiff missing
Temp path - SQ00014613/r02c08f08p01-ch1sk1fk1fl1.tiff missing
Temp path - SQ00014610/r02c13f02p01-ch1sk1fk1fl1.tiff missing
Temp path - SQ00014610/r16c19f02p01-ch3sk1fk1fl1.tiff missing
Temp path - SQ00014610/r07c07f03p01-ch2sk1fk1fl1.tiff missing
Temp path - SQ00014614/r09c07f01p01-ch5sk1fk1fl1.tiff missing
gwaybio commented 4 years ago

thank you Shantanu ❤️ (and Hamdah too for the upfront processing)

shntnu commented 4 years ago

Steps to perform once the missing files listed at the end of https://github.com/broadinstitute/cell-health/issues/106#issuecomment-663154084 are recreated

  1. Make sure all the files are present
temppath=s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/tmp

parallel \
    --header ".*\n" \
    -C "," \
    -a corrupted_image.csv \
    "if ! aws s3 ls ${temppath}/{1}/illum_corrected/{2} > /dev/null; then echo Temp path - {1}/{2} missing; fi"
  2. Copy files to the original location; make sure there are no errors
origpath=s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images

temppath=s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/tmp

# copy all files (the ones missing in the temppath will fail)
parallel \
    --header ".*\n" \
    -C "," \
    -a corrupted_image.csv \
    aws s3 cp ${temppath}/{1}/illum_corrected/{2} ${origpath}/{1}/Images/{2}
  3. Download files
parallel \
    mkdir -p illumcorrected_CRISPR_PILOT_B1/images/{1} ::: SQ00014610 SQ00014611 SQ00014612 SQ00014613 SQ00014614 SQ00014615 SQ00014616 SQ00014617 SQ00014618

origpath=s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images

parallel \
    --header ".*\n" \
    -C "," \
    -a corrupted_image.csv \
    aws s3 cp ${origpath}/{1}/Images/{2} illumcorrected_CRISPR_PILOT_B1/images/{1}/Images/{2}
  4. brew install imagemagick to do a quick test of fidelity after downloading
# merge stderr into stdout so that grep also sees any error messages
parallel \
    --header ".*\n" \
    -C "," \
    -a corrupted_image.csv \
    identify illumcorrected_CRISPR_PILOT_B1/images/{1}/Images/{2} 2>&1 | grep "Can not read TIFF"
  5. Check file sizes. Files that are unusually small may be corrupted
aws s3 ls --recursive s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images |grep tiff > /tmp/image_files.txt

# get file sizes and counts
cat /tmp/image_files.txt |tr -s " "|cut -d" " -f3|sort -n|uniq -c

Once you've confirmed everything works, you can have IDR run step 3 at their end.

hkhawar commented 4 years ago

The corrected images are in a separate tmp folder on S3: platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/tmp/. I didn't replace the ones in the original illumcorrected_CRISPR_PILOT_B1 folder.

On Thu, Jul 23, 2020 at 1:43 PM Shantanu Singh notifications@github.com wrote:

Steps

  1. brew install imagemagick if you want to do a quick test of fidelity after downloading using identify
  2. Download copy_illumcorrected_CRISPR_PILOT_B1.sh.txt https://github.com/broadinstitute/cell-health/files/4968076/copy_illumcorrected_CRISPR_PILOT_B1.sh.txt and rename to .sh
  3. chmod +x copy_illumcorrected_CRISPR_PILOT_B1.sh
  4. run it ./copy_illumcorrected_CRISPR_PILOT_B1.sh

I noticed two issues:

  1. The first file does not exist i.e. SQ00014613/Images/r06c04f05p01-ch5sk1fk1fl1.tiff.
  2. I ran find illumcorrected_CRISPR_PILOT_B1 -name "*.tiff" -exec identify {} \; 2>&1 >/tmp/foo; grep "Can not read " /tmp/foo and found that some files are still corrupted but I think those were never recreated in the first place.

@gwaygenomics https://github.com/gwaygenomics you'd want to repeat these steps yourself and then report back which illumination-corrected files will need to be recreated. The current list is

Missing:

SQ00014613/Images/r06c04f05p01-ch5sk1fk1fl1.tiff

Still corrupted (maybe never recreated?):

SQ00014610/Images/r02c13f02p01-ch1sk1fk1fl1.tiff
SQ00014610/Images/r07c07f03p01-ch2sk1fk1fl1.tiff
SQ00014610/Images/r16c19f02p01-ch3sk1fk1fl1.tiff
SQ00014613/Images/r02c08f08p01-ch1sk1fk1fl1.tiff
SQ00014613/Images/r07c21f05p01-ch2sk1fk1fl1.tiff
SQ00014613/Images/r08c19f04p01-ch1sk1fk1fl1.tiff
SQ00014613/Images/r10c08f05p01-ch1sk1fk1fl1.tiff


shntnu commented 4 years ago

I didn't replace the ones in the original illumcorrected_CRISPR_PILOT_B1

Yep, but I did (see comments). I will update this thread once I've figured out the issue – might be something else driving this.

gwaybio commented 4 years ago

Steps to perform once the missing files listed at the end of #106 (comment) are recreated

For my understanding, is this the complete order of operations?

  1. we first need to reprocess these 9 files in https://github.com/broadinstitute/cell-health/issues/106#issuecomment-663154084
  2. Make sure they are in the right folders
  3. Then I perform the 5 steps in https://github.com/broadinstitute/cell-health/issues/106#issuecomment-663170068
  4. Then I confirm the download integrity
  5. Then I give step 3 to IDR

@hkhawar can you help with step 1 above?

Thanks again Shantanu and Hamdah!

hkhawar commented 4 years ago

@gwaygenomics Do I need to process only the following nine files?

Temp path - SQ00014613/r07c21f05p01-ch2sk1fk1fl1.tiff missing
Temp path - SQ00014613/r06c04f05p01-ch5sk1fk1fl1.tiff missing
Temp path - SQ00014613/r10c08f05p01-ch1sk1fk1fl1.tiff missing
Temp path - SQ00014613/r08c19f04p01-ch1sk1fk1fl1.tiff missing
Temp path - SQ00014613/r02c08f08p01-ch1sk1fk1fl1.tiff missing
Temp path - SQ00014610/r02c13f02p01-ch1sk1fk1fl1.tiff missing
Temp path - SQ00014610/r16c19f02p01-ch3sk1fk1fl1.tiff missing
Temp path - SQ00014610/r07c07f03p01-ch2sk1fk1fl1.tiff missing
Temp path - SQ00014614/r09c07f01p01-ch5sk1fk1fl1.tiff missing

shntnu commented 4 years ago

I am also concerned some of the files that IDR has not listed as corrupted are actually corrupted. E.g. this one

2020-03-08 10:41:41     743346 projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images/SQ00014615/Images/r02c08f03p01-ch5sk1fk1fl1.tiff

I downloaded it like this

aws s3 cp s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images/SQ00014615/Images/r02c08f03p01-ch5sk1fk1fl1.tiff .

identify did not report issues

identify ./r02c08f03p01-ch5sk1fk1fl1.tiff
./r02c08f03p01-ch5sk1fk1fl1.tiff TIFF 2160x2160 2160x2160+0+0 16-bit Grayscale Gray 743346B 0.000u 0:00.000

But I'm not able to open the file using Preview ("It may be damaged or use a file format that Preview doesn’t recognize.")

My suspicion is that all the files with infrequent file sizes are actually corrupted files.

Welcome to the rabbit hole! :)

Get the file listing

aws s3 ls --recursive s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images |grep tiff > /tmp/image_files.txt

Now download the files whose file sizes are infrequent:

library(tidyverse)

sizes <- 
  read_delim("/tmp/image_files.txt", 
             col_names = c("date", "time", "size", "path"), 
             trim_ws = TRUE, 
             delim = " ") %>%
  mutate(download = sprintf("aws s3 cp s3://imaging-platform/%s %s", path, path)) %>%
  mutate(dirpath = dirname(path))

dirpaths <- 
  sizes %>% 
  distinct(dirpath)

dirpaths$dirpath %>% 
  walk(function(dirpath) dir.create(dirpath, showWarnings = FALSE, recursive = TRUE))

frac_sizes <- 
  sizes %>% 
  group_by(size) %>% 
  tally() %>% 
  arrange(desc(size)) %>% 
  mutate(frac = n / sum(n))

frac_sizes %>%
  head() %>%
  knitr::kable()

frac_sizes %>% 
  filter(frac < 0.001) %>%
  select(size) %>%
  inner_join(sizes) %>%
  magrittr::extract2("download") %>%
  walk(function(download) system(download))

I ran that and then did a random sampling of images by trying to open using Preview and found that all in that random sample were corrupted. This is the full list of all files downloaded (below).
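
For readers who would rather not depend on R, the same rare-size filter can be sketched in Python. This assumes each line of `/tmp/image_files.txt` has the four whitespace-separated fields that `aws s3 ls --recursive` prints (date, time, size, key) and reuses the 0.001 cutoff from the R code. The function names are hypothetical and the listing in the example is synthetic.

```python
from collections import Counter


def parse_listing(lines):
    """Yield (size_in_bytes, key) from `aws s3 ls --recursive` output lines."""
    for line in lines:
        parts = line.split(None, 3)  # date, time, size, key
        if len(parts) == 4:
            yield int(parts[2]), parts[3].strip()


def rare_size_keys(lines, max_frac=0.001):
    """Return keys whose file size accounts for less than max_frac of all files."""
    entries = list(parse_listing(lines))
    counts = Counter(size for size, _ in entries)
    total = len(entries)
    return [key for size, key in entries if counts[size] / total < max_frac]


# Synthetic listing: 2000 files of the common size plus one odd one.
listing = [
    f"2020-03-08 10:41:41    9348786 projects/demo/img{i}.tiff" for i in range(2000)
]
listing.append("2020-03-08 10:41:41     743346 projects/demo/odd.tiff")

for key in rare_size_keys(listing):
    print(f"aws s3 cp s3://imaging-platform/{key} {key}")
```

As with the R version, a rare size only flags a file as suspicious; opening a sample of the flagged files is still needed to confirm corruption.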

@gwaygenomics I gotta run but hopefully, you can take it from here and figure out the next steps. If not, ping me on this and I'll have a look once back from vacation

projects/
└── 2015_07_01_Cell_Health_Vazquez_Cancer_Broad
    └── illumcorrected_CRISPR_PILOT_B1
        └── images
            ├── SQ00014610
            │   └── Images
            │       ├── r01c18f01p01-ch4sk1fk1fl1.tiff
            │       ├── r01c19f08p01-ch5sk1fk1fl1.tiff
            │       ├── r02c07f06p01-ch2sk1fk1fl1.tiff
            │       ├── r02c13f02p01-ch1sk1fk1fl1.tiff
            │       ├── r04c01f01p01-ch5sk1fk1fl1.tiff
            │       ├── r07c07f03p01-ch2sk1fk1fl1.tiff
            │       ├── r10c12f05p01-ch2sk1fk1fl1.tiff
            │       ├── r13c03f08p01-ch1sk1fk1fl1.tiff
            │       ├── r13c09f01p01-ch2sk1fk1fl1.tiff
            │       ├── r16c19f02p01-ch3sk1fk1fl1.tiff
            │       └── r16c20f07p01-ch4sk1fk1fl1.tiff
            ├── SQ00014611
            │   └── Images
            │       ├── r02c18f03p01-ch1sk1fk1fl1.tiff
            │       ├── r06c11f02p01-ch2sk1fk1fl1.tiff
            │       └── r14c08f07p01-ch5sk1fk1fl1.tiff
            ├── SQ00014612
            │   └── Images
            │       ├── r03c08f01p01-ch4sk1fk1fl1.tiff
            │       ├── r06c06f08p01-ch5sk1fk1fl1.tiff
            │       ├── r10c15f07p01-ch1sk1fk1fl1.tiff
            │       ├── r11c05f02p01-ch5sk1fk1fl1.tiff
            │       └── r13c08f06p01-ch4sk1fk1fl1.tiff
            ├── SQ00014613
            │   └── Images
            │       ├── r02c08f08p01-ch1sk1fk1fl1.tiff
            │       ├── r03c15f04p01-ch4sk1fk1fl1.tiff
            │       ├── r07c05f02p01-ch1sk1fk1fl1.tiff
            │       ├── r07c21f05p01-ch2sk1fk1fl1.tiff
            │       ├── r08c19f04p01-ch1sk1fk1fl1.tiff
            │       ├── r10c08f05p01-ch1sk1fk1fl1.tiff
            │       └── r11c18f08p01-ch2sk1fk1fl1.tiff
            ├── SQ00014614
            │   └── Images
            │       ├── r03c04f01p01-ch4sk1fk1fl1.tiff
            │       ├── r03c07f05p01-ch5sk1fk1fl1.tiff
            │       ├── r05c09f08p01-ch1sk1fk1fl1.tiff
            │       ├── r09c07f01p01-ch5sk1fk1fl1.tiff
            │       └── r15c02f03p01-ch4sk1fk1fl1.tiff
            ├── SQ00014615
            │   └── Images
            │       ├── r02c08f03p01-ch5sk1fk1fl1.tiff
            │       ├── r02c14f04p01-ch1sk1fk1fl1.tiff
            │       ├── r08c07f01p01-ch3sk1fk1fl1.tiff
            │       ├── r08c07f07p01-ch5sk1fk1fl1.tiff
            │       ├── r08c14f07p01-ch1sk1fk1fl1.tiff
            │       ├── r09c07f03p01-ch1sk1fk1fl1.tiff
            │       ├── r10c09f08p01-ch5sk1fk1fl1.tiff
            │       ├── r10c18f03p01-ch2sk1fk1fl1.tiff
            │       ├── r13c21f07p01-ch2sk1fk1fl1.tiff
            │       ├── r15c15f08p01-ch4sk1fk1fl1.tiff
            │       └── r16c21f05p01-ch1sk1fk1fl1.tiff
            ├── SQ00014616
            │   └── Images
            │       ├── r01c17f07p01-ch5sk1fk1fl1.tiff
            │       ├── r02c21f01p01-ch1sk1fk1fl1.tiff
            │       ├── r03c19f02p01-ch5sk1fk1fl1.tiff
            │       ├── r07c04f03p01-ch1sk1fk1fl1.tiff
            │       └── r14c17f03p01-ch2sk1fk1fl1.tiff
            ├── SQ00014617
            │   └── Images
            │       ├── r02c23f05p01-ch1sk1fk1fl1.tiff
            │       ├── r03c06f02p01-ch4sk1fk1fl1.tiff
            │       ├── r06c01f08p01-ch4sk1fk1fl1.tiff
            │       ├── r06c16f02p01-ch2sk1fk1fl1.tiff
            │       ├── r08c16f07p01-ch4sk1fk1fl1.tiff
            │       ├── r11c14f07p01-ch2sk1fk1fl1.tiff
            │       ├── r12c04f02p01-ch3sk1fk1fl1.tiff
            │       ├── r12c08f04p01-ch4sk1fk1fl1.tiff
            │       ├── r12c10f04p01-ch5sk1fk1fl1.tiff
            │       ├── r13c09f07p01-ch4sk1fk1fl1.tiff
            │       └── r15c14f04p01-ch2sk1fk1fl1.tiff
            └── SQ00014618
                └── Images
                    ├── r01c14f08p01-ch1sk1fk1fl1.tiff
                    ├── r03c09f07p01-ch1sk1fk1fl1.tiff
                    ├── r03c09f07p01-ch5sk1fk1fl1.tiff
                    ├── r03c12f06p01-ch4sk1fk1fl1.tiff
                    ├── r05c10f08p01-ch5sk1fk1fl1.tiff
                    ├── r06c01f07p01-ch1sk1fk1fl1.tiff
                    ├── r07c09f07p01-ch1sk1fk1fl1.tiff
                    ├── r13c05f04p01-ch5sk1fk1fl1.tiff
                    ├── r14c10f02p01-ch3sk1fk1fl1.tiff
                    └── r16c23f01p01-ch2sk1fk1fl1.tiff
shntnu commented 4 years ago

@gwaygenomics I just saw https://github.com/broadinstitute/cell-health/issues/106#issuecomment-663217930

Yes, that's the right order of operations.

But @hkhawar, unfortunately, you will also need to reprocess those files listed at the end of https://github.com/broadinstitute/cell-health/issues/106#issuecomment-663224408 because my random sampling revealed that those are also corrupted. I have no clue why so many files are getting corrupted but hopefully you will figure that out.

@hkhawar Thanks very much for helping out!

shntnu commented 4 years ago

@hkhawar one more thing – could you please briefly describe the setup you are using to reprocess these images? Are you mounting the S3 bucket on your computer and running it on your computer by any chance? If so, I think that could be the issue because S3 mounts suck with heavy I/O.

hkhawar commented 4 years ago

@shntnu I ran this experiment on AWS. I am not sure why we got so many corrupted images. My guess is that something happened while running DCP: instead of ending up in the dead message queues, the unfinished jobs somehow created image files of 0 bytes.

shntnu commented 4 years ago

@hkhawar Thanks for clarifying. Very strange! And note that the issue is that some output files are actually pretty large, e.g. 8 MB, but are still corrupted. Worth checking in with Beth on this via Slack.

hkhawar commented 4 years ago

@gwaygenomics Could you please do the same thing you did before, sorting the other channels for these images?

hkhawar commented 4 years ago

@shntnu Sure I will check with Beth on this tomorrow

shntnu commented 4 years ago

Here's an example: r06c11f02p01-ch2sk1fk1fl1.tiff.zip located at projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/images/SQ00014611/Images/r06c11f02p01-ch2sk1fk1fl1.tiff

It doesn't open using Preview:

image

It does open in Fiji, but the bottom pixels are missing

image

shntnu commented 4 years ago

I've posted this internally https://broadinstitute.slack.com/archives/G3QFDHXC4/p1595538827014000

hkhawar commented 4 years ago

@gwaygenomics if you can sort the other channels for the corrupted files for me as you did last time, then I will reprocess them today.

gwaybio commented 4 years ago

@gwaygenomics if you can sort the other channels for the corrupted files for me as you did last time, then I will reprocess them today.

Sure - what folder do you want them in? Also, do you think reprocessing them the same way as before is a good idea? (are you going to do anything different?)

hkhawar commented 4 years ago

I am doing it locally. Just make a tmp2 folder on S3 and dump new set of images for each plate? Later we delete these tmp folders from S3

shntnu commented 4 years ago

I am doing it locally. Just make a tmp2 folder on S3 and dump new set of images for each plate? Later we delete these tmp folders from S3

For our notes, could you pen down why they need to be in a new folder (vs creating a loaddata file pointing to the original locations?) Will be useful to know when we need to reprocess small batches

hkhawar commented 4 years ago

I wanted to avoid using load_data.csv; instead I download the images and run CellProfiler locally to reprocess the files. This is what I typically do for a small set of images.

bethac07 commented 4 years ago

Occasionally, CellProfiler just seems to do this stochastically: any operation, even a write or sync, will sometimes go ker-flop, and when we're working on 10K/100K/1M/10M images, the likelihood that it happens at least once becomes significant. Since each plate has ~21K images, based on the list above, the per-image likelihood is roughly 1 in the low thousands.
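
To put a rough number on this: if each image fails independently with probability p, the chance of at least one failure among n images is 1 - (1 - p)^n. The sketch below uses a per-image failure rate of 1 in 2,000 purely as an illustrative assumption, not a measured value, and the function name is hypothetical.

```python
def prob_at_least_one_failure(p, n):
    """Probability that at least one of n independent writes fails, each with probability p."""
    return 1.0 - (1.0 - p) ** n


p = 1 / 2000  # assumed per-image failure rate (illustrative only)
for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"n={n:>9,}  P(at least one failure)={prob_at_least_one_failure(p, n):.4f}")
```

Under that assumption a single plate's worth of images is already very likely to contain at least one bad file, which is why a post-hoc check is worth automating.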

If there's a problem with the source image, obviously that's one thing; if the problem is truly stochastic (aka when you run the same image again the output file comes out fine), there isn't a ton to do (though if these were done <60 days ago it's worth checking the logs for the known bad sites since that's easy while the logs are still in CloudWatch). If we think the file is being written correctly, but not synced correctly, we could always institute a 30 or 60 second pause after the CellProfiler pipeline is done before syncing.

It's worth noting we can very easily handle the ones where files are small (obviously corrupted) using the MIN_FILE_SIZE option I added to DCP by just resubmitting the whole batch with CHECK_IF_DONE set to TRUE and MIN_FILE_SIZE set small- anything with the right number of files > a certain size will just get skipped, and it will re-process just the ones where 1+ file is tiny. If either the uncorrupted OR corrupted files have a stereotyped size, which Shantanu your methodology seems to imply, you could imagine other similar checks we could add; essentially either

if filesize in accepted_file_sizes:
    goodfile_count +=1
if goodfile_count >= N:
    reprocess = False

or

if filesize not in known_bad_file_sizes:
    goodfile_count +=1
if goodfile_count >= N:
    reprocess = False
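
The first of the two checks above can be written as a small runnable function. This is only a sketch of the idea and not DCP's actual implementation: the name `needs_reprocessing` and its parameters are hypothetical, and the accepted sizes used in the example are the two common sizes reported later in this thread.

```python
def needs_reprocessing(file_sizes, accepted_sizes, expected_count):
    """Return True unless at least expected_count files have an accepted size.

    file_sizes: sizes in bytes of the output files found for one job.
    accepted_sizes: set of sizes considered healthy.
    expected_count: number of healthy files a finished job must have (N above).
    """
    good = sum(1 for size in file_sizes if size in accepted_sizes)
    return good < expected_count


ACCEPTED = {9348786, 9348718}  # sizes taken from the table later in this thread

# One site with five channels, one of which has an unexpected size.
site_sizes = [9348786, 9348786, 9348786, 9348786, 8795826]
print(needs_reprocessing(site_sizes, ACCEPTED, expected_count=5))
```

The second variant is the same function with the membership test inverted against a set of known bad sizes.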
gwaybio commented 4 years ago

@hkhawar

the corrupted files are ready to go! located at /home/ubuntu/bucket/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/tmp_version_two thanks!

hkhawar commented 4 years ago

@bethac07 The logs are no longer available. My guess is that the problem happened while syncing the files; reprocessing those images again worked fine. @gwaygenomics thanks, I am going to work on it.

shntnu commented 4 years ago

It's worth noting we can very easily handle the ones where files are small (obviously corrupted) using the MIN_FILE_SIZE option I added to DCP by just resubmitting the whole batch with CHECK_IF_DONE set to TRUE and MIN_FILE_SIZE set small- anything with the right number of files > a certain size will just get skipped, and it will re-process just the ones where 1+ file is tiny. If either the uncorrupted OR corrupted files have a stereotyped size, which Shantanu your methodology seems to imply, you could imagine other similar checks we could add; essentially either

Thanks for clarifying @bethac07 🥇.

@hkhawar details are below but tl;dr: we could have gone with fixed file size because these are uncompressed TIFFS so I think they should all be the same file size. But there's one aberration (below). So instead let's go with CHECK_IF_DONE=TRUE and MIN_FILE_SIZE = 9348718.

Details

I dug into this a bit for our future reference with this kind of issue.

frac_sizes %>% head() %>% knitr::kable()

From this table, looks like 9348786 is the value to go with. But I don't know what's happening with 9348718 – why are there 1240 instances of that? No clue. Also, files with size 9348718 open fine with Preview.

size n frac
9348786 136926 0.9905378
9348718 1240 0.0089703
9210546 1 0.0000072
8795826 1 0.0000072
8683506 1 0.0000072
8631666 1 0.0000072

All other sizes have only 1-2 occurrences (except 8 which occurs 8 times).
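
One way to sanity-check the size argument: `identify` reported these images as 2160x2160, 16-bit grayscale, so an uncompressed file must hold at least 2160 × 2160 × 2 bytes of pixel data. The sketch below assumes one image plane per file and no compression. It computes that floor and compares it with a few sizes from the table; the helper name is hypothetical.

```python
def min_uncompressed_bytes(width, height, bits_per_sample, samples_per_pixel=1):
    """Smallest possible pixel payload of an uncompressed image, in bytes."""
    return width * height * samples_per_pixel * bits_per_sample // 8


floor = min_uncompressed_bytes(2160, 2160, 16)

for size in (9348786, 9348718, 9210546, 8795826):
    verdict = "could be complete" if size >= floor else "too small to be complete"
    print(f"{size}: {size - floor:+d} bytes relative to the pixel payload, {verdict}")
```

Under those assumptions both common sizes leave room for the full pixel payload plus metadata, while the rare smaller sizes cannot contain a complete image.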

frac_sizes %>% filter(size < 9348718) %>% count(n) %>% knitr::kable()
n nn
1 56
2 2
8 1

9348718 is certainly special because if any one channel of a site has that value, then all channels have that value

sizes %>% filter(size == 9348718)  %>% mutate(site = basename(path), plate = str_match(dirpath, "SQ[0-9]{8}"))  %>% separate(site, c("site", "channel"), sep = "-") %>% group_by(site, plate) %>% tally() %>% ungroup() %>% arrange(site) %>% count(n)
n nn
5 248
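
The grouping in that one-liner can be sketched in Python as well: split each file name at the hyphen into site and channel, then count how many channels of each (plate, site) pair appear in the filtered list. The function name and the sample keys are hypothetical.

```python
import re
from collections import Counter
from pathlib import PurePosixPath


def channels_per_site(keys):
    """Count files per (plate, site) for keys like .../SQ00014610/Images/r01c01f01p01-ch1sk1fk1fl1.tiff."""
    counts = Counter()
    for key in keys:
        plate = re.search(r"SQ[0-9]{8}", key)
        site, sep, _channel = PurePosixPath(key).name.partition("-")
        if plate and sep:
            counts[(plate.group(0), site)] += 1
    return counts


# Hypothetical input: the five channels of one site, all with the unusual size.
keys = [
    f"images/SQ00014610/Images/r01c01f01p01-ch{c}sk1fk1fl1.tiff" for c in range(1, 6)
]
per_site = channels_per_site(keys)
print(Counter(per_site.values()))  # how many sites have 1, 2, ... 5 affected channels
```

A result in which every site has a count of 5 corresponds to the `5 248` row above: all five channels of each affected site share the size.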
hkhawar commented 4 years ago

@gwaygenomics I have reprocessed illum corrected images and they are available in the same folder

s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/tmp_version_two/

Note: I haven't synced them to s3://imaging-platform/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/illumcorrected_CRISPR_PILOT_B1/

I guess you can do it

@shntnu I have no idea why we get images of size 9348718. Did you try opening an image of this size in Fiji?

gwaybio commented 4 years ago

Progress

gwaybio commented 4 years ago

Download integrity confirmed! This is the output of the R code in https://github.com/broadinstitute/cell-health/issues/106#issuecomment-663224408:

size n frac
9348786 136926 0.9905306
9348718 1309 0.0094694

🎉

All that remains is to send IDR the S3 links

gwaybio commented 4 years ago

One potentially interesting observation is that all of the corrupted files that we needed to fix ended up having the smaller file size listed above.

gwaybio commented 4 years ago

next hurdle incoming!

Summary

IDR has all non-illumination corrected images, but they are missing 1,925 illumination corrected images.

Specifics

The folks at IDR are working towards verifying the submission. A couple of points that either @hkhawar or @shntnu might know the answer to right away.

  1. Images with f09 in their name are missing from the illumination corrected set (there are 1920 of these).
  2. There are 5 additional images missing in the illumination corrected set all from plate SQ00014610

Issue 1 - Missing f09

Here are example images:

r16c24f09p01-ch2sk1fk1fl1.tiff
r16c24f09p01-ch3sk1fk1fl1.tiff
r16c24f09p01-ch4sk1fk1fl1.tiff
r16c24f09p01-ch5sk1fk1fl1.tiff

Issue 2 - Five more

r16c24f01p01-ch1sk1fk1fl1.tiff
r16c24f01p01-ch2sk1fk1fl1.tiff
r16c24f01p01-ch3sk1fk1fl1.tiff
r16c24f01p01-ch4sk1fk1fl1.tiff
r16c24f01p01-ch5sk1fk1fl1.tiff
hkhawar commented 4 years ago

@Gregory Way it's a huge pain. Again, I think it is related to the same problem: the files were not transferred to S3 properly, which produced corrupted and missing image files. If they provide us with a list of the missing illum images, then I will have to redo them.

gwaybio commented 4 years ago

Argh! Is there something that I can do to ease the pain? Transfer files into a new folder again? It seems like this is an AWS transfer issue?

hkhawar commented 4 years ago

Yup that would be a great help. Let me know once they are done. I will work on it.

gwaybio commented 4 years ago

turns out that we actually have 17,285 illum corrected files missing.

1,920 "f09" files missing per plate
9 plates
5 "f01" files missing only in plate SQ00014610
1,920 * 9 + 5 = 17,285
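
The per-plate figure follows from the plate layout implied by the file names: 16 rows × 24 columns = 384 wells, each with 5 channels, gives 1,920 files per field. Below is a short sketch that enumerates the expected names for one plate and groups the missing ones by field. The layout constants (16 rows, 24 columns, 9 fields, 5 channels) are inferred from the file names and counts in this thread rather than confirmed elsewhere, and the function names are hypothetical.

```python
from collections import Counter

ROWS, COLS, FIELDS, CHANNELS = 16, 24, 9, 5  # assumed plate layout


def expected_names():
    """All image file names expected for one plate under the assumed layout."""
    return {
        f"r{r:02d}c{c:02d}f{f:02d}p01-ch{ch}sk1fk1fl1.tiff"
        for r in range(1, ROWS + 1)
        for c in range(1, COLS + 1)
        for f in range(1, FIELDS + 1)
        for ch in range(1, CHANNELS + 1)
    }


def missing_by_field(present):
    """Count missing file names per field; characters 6-8 of a name hold the field, e.g. 'f09'."""
    missing = expected_names() - set(present)
    return Counter(name[6:9] for name in missing)


# Simulate a plate where every f09 image is absent.
present = {name for name in expected_names() if name[6:9] != "f09"}
print(missing_by_field(present))
print("total across 9 plates plus 5 extra:", 1920 * 9 + 5)
```

Running the same comparison against the real listing of each plate would show whether anything other than f09 (and the five f01 files in SQ00014610) is absent.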

Transfer files into a new folder again?

Yup that would be a great help. Let me know once they are done.

I have confirmed that all of these files are now in a separate folder. The folder is /home/ubuntu/bucket/projects/2015_07_01_Cell_Health_Vazquez_Cancer_Broad/tmp_version_three.

Note that the subfolder is tmp_version_three.

@hkhawar all set for the next (and hopefully final!) iteration of the illum correction pipeline. Thanks again