broadinstitute / cellpainting-gallery

Cell Painting Gallery
https://broadinstitute.github.io/cellpainting-gallery/

2022_07_13_CDRP (cpg0012) #13

Closed: shntnu closed this issue 1 year ago

shntnu commented 2 years ago

- Segmentation/feature extraction is being performed by Cimini lab (Erin)
- Profile creation is being performed by Carpenter-Singh lab (Niranj)
- Data can be public in RODA immediately

Update as generated:
- Link to profile repo: https://github.com/broadinstitute/2015_Bray_GigaScience-data
- Link to publication: https://doi.org/10.1093/gigascience/giw014
- cellpainting-gallery identifier: cpg0012-wawer-bioactivecompoundprofiling

Transfer to CellPainting Gallery:

If data is being published, prepare for publication:

- [ ] ~~Run Distributed-BioFormats2Raw to create .ome.zarr files~~
- [ ] ~~Upload (meta)data to IDR (images remain hosted in cellpainting-gallery).~~

Data is already available at http://idr.openmicroscopy.org/webclient/?show=screen-1251

Once published:


Note: This is a reprocessing of an existing dataset

shntnu commented 2 years ago

Some notes from the Slack message:

- Notes from reprocessing the data back in 2017: https://broadinstitute.atlassian.net/wiki/spaces/IP/pages/114638720/2017-04-19+CDP2+data+show+decent+quality+both+for+bioactive+and+DOS+compounds
- The images now live in s3://cellpainting-gallery/cpg0012-wawer-bioactivecompoundprofiling
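For reference, a minimal sketch (not from the thread) of browsing that prefix anonymously with boto3; the gallery bucket allows unsigned reads, and whatever sub-structure the listing returns is not confirmed here:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous client: the cellpainting-gallery bucket is publicly readable
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

resp = s3.list_objects_v2(
    Bucket="cellpainting-gallery",
    Prefix="cpg0012-wawer-bioactivecompoundprofiling/",
    Delimiter="/",
)
# Top-level "folders" under the dataset prefix
for p in resp.get("CommonPrefixes", []):
    print(p["Prefix"])
```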

ErinWeisbart commented 2 years ago

I'm assuming you'd like to have the original analysis data preserved along with the re-analysis? If so, what should the naming system be?

shntnu commented 2 years ago

Good point, although in this case, we only have the images uploaded, and the only folders in the workspace are the gigascience* folders, which are related to https://github.com/gigascience/paper-bray2017

Therefore, no renaming is needed; we can proceed as we do for any other dataset.

shntnu commented 2 years ago

From https://github.com/broadinstitute/imaging-bbbc/issues/61#issuecomment-565597927

Here's the number of plates per batch:

library(readr)   # read_tsv
library(dplyr)   # count, mutate, %>%

read_tsv("~/Downloads/cdrp_runs.txt") %>%
  count(run_id, date_started) %>%
  mutate(i = seq_along(n), .before = "run_id") %>%
  knitr::kable()
| i | run_id | date_started | n |
|---|---|---|---|
| 1 | 2113-01-W01-01-01 | 2011-03-16 | 28 |
| 2 | 2113-01-W01-01-02 | 2011-03-16 | 32 |
| 3 | 2113-01-W01-01-03 | 2011-03-21 | 32 |
| 4 | 2113-01-W01-01-04 | 2011-03-23 | 32 |
| 5 | 2113-01-W01-01-06 | 2011-04-11 | 32 |
| 6 | 2113-01-W01-01-08 | 2011-04-19 | 32 |
| 7 | 2113-01-W01-01-09 | 2011-04-21 | 32 |
| 8 | 2113-01-W01-01-11 | 2011-04-29 | 32 |
| 9 | 2113-01-W01-01-12 | 2011-05-24 | 32 |
| 10 | 2113-01-W01-01-13 | 2011-06-13 | 32 |
| 11 | 2113-01-W01-01-14 | 2011-06-21 | 40 |
| 12 | 2113-01-W01-01-15 | 2011-08-09 | 48 |

Nikita had done some analysis here https://github.com/broadinstitute/2018_01_09_PUMA_CBTS/issues/9#issuecomment-810403225 that might help us figure out batches (based on clustering DMSOs)

But I think it's fine to proceed with these batches (n=12) and create a pipeline per batch, assuming that's not too onerous.

ErinWeisbart commented 2 years ago

Using cdrp_runs.txt and finding the plate names with plateid_barcode.txt, I am able to generate the following lists of plates per batch.

Note that plateid_barcode.txt matches the list of plates in cellpainting-gallery EXCEPT that plateid_barcode.txt is missing 24512. I have manually added 24512 into the batch that includes the plates around it numerically. Following Shantanu's comments (but NOT the code below), I have put plate 24512 in Batch 12.

Code
import pandas as pd
all_plates = ['24277','24278','24279','24280','24293','24294','24295','24296','24297','24300','24301','24302','24303','24304','24305','24306','24307','24308','24309','24310','24311','24312','24313','24319','24320','24321','24352','24357','24507','24508','24509','24512','24514','24515','24516','24517','24518','24523','24525','24560','24562','24563','24564','24565','24566','24583','24584','24585','24586','24588','24590','24591','24592','24593','24594','24595','24596','24602','24604','24605','24609','24611','24617','24618','24619','24623','24624','24625','24631','24633','24634','24635','24636','24637','24638','24639','24640','24641','24642','24643','24644','24645','24646','24647','24648','24651','24652','24653','24654','24655','24656','24657','24661','24663','24664','24666','24667','24683','24684','24685','24687','24688','24726','24731','24732','24733','24734','24735','24736','24739','24740','24750','24751','24752','24753','24754','24755','24756','24758','24759','24772','24773','24774','24775','24783','24785','24789','24792','24793','24795','24796','24797','25372','25374','25376','25378','25380','25382','25387','25391','25392','25403','25406','25408','25410','25414','25416','25418','25420','25422','25424','25426','25428','25430','25432','25434','25435','25436','25438','25485','25488','25490','25492','25503','25553','25564','25565','25566','25567','25569','25570','25571','25572','25573','25575','25576','25578','25579','25580','25581','25583','25584','25585','25587','25588','25590','25591','25592','25593','25594','25598','25599','25605','25638','25639','25641','25642','25643','25663','25664','25665','25667','25674','25675','25676','25677','25678','25679','25680','25681','25683','25686','25688','25689','25690','25692','25694','25695','25700','25704','25707','25708','25724','25725','25726','25732','25738','25739','25740','25741','25742','25847','25848','25849','25852','25853','25854','25855','25856','25857','25858','25859','25862','25885','25890','25891','25892','25903','25904','25908','25909','25911','25912','25913','25914','25915','25918','25923','25925','25929','25931','25935','25937','25938','25939','25943','25944','25945','25949','25955','25962','25965','25966','25967','25968','25983','25984','25985','25986','25987','25988','25989','25990','25991','25992','25993','25994','25997','26002','26006','26007','26008','26009','26058','26060','26061','26071','26081','26092','26107','26110','26115','26118','26124','26126','26128','26133','26135','26138','26140','26159','26166','26174','26181','26202','26203','26204','26205','26207','26216','26224','26232','26239','26247','26271','26521','26531','26542','26544','26545','26562','26563','26564','26569','26572','26574','26575','26576','26577','26578','26579','26580','26588','26592','26595','26596','26598','26600','26601','26607','26608','26611','26612','26622','26623','26625','26626','26640','26641','26642','26643','26644','26662','26663','26664','26666','26668','26669','26670','26671','26672','26673','26674','26675','26677','26678','26679','26680','26681','26682','26683','26684','26685','26688','26695','26702','26703','26705','26724','26730','26739','26744','26745','26748','26752','26753','26765','26767','26768','26771','26772','26785','26786','26794','26795']
# Load the PlateID <-> barcode mapping; cast to str so IDs join cleanly
id_bc = pd.read_csv('/Users/eweisbar/Documents/projects/30k/plateid_barcode.txt', sep='\t')
id_bc = id_bc.astype(str)
# Report any gallery plates that are absent from the mapping file
for item in all_plates:
    if item not in list(id_bc['PlateID']):
        print(f'{item} missing from plateid_barcode.txt')
# Join run (batch) info onto the plate IDs via the assay plate barcode
cdrp_runs = pd.read_csv('/Users/eweisbar/Documents/projects/30k/cdrp_runs.txt', sep='\t')
df2 = cdrp_runs.merge(id_bc, how='outer', left_on='barcode', right_on='ASSAY_PLATE_BARCODE')
# Manually add the plate missing from plateid_barcode.txt (see note above)
df2 = df2.append({'run_id': '2113-01-W01-01-02', 'date_started': '2011-03-16', 'PlateID': '24512'}, ignore_index=True)
df2.fillna('Unknown', inplace=True)
df2 = df2.astype(str)
# Keep only plates that are actually in the gallery
df2 = df2.loc[df2['PlateID'].isin(all_plates)]
print('batch, date_started, number of plates:')
for batch in df2['run_id'].unique():
    print(batch, (df2.loc[df2['run_id'] == batch, 'date_started']).unique(), len(df2.loc[df2['run_id'] == batch]))
df2.to_csv('/Users/eweisbar/Documents/projects/30k/plate_date.csv')

Lists of plates in batches

**2113-01-W01-01-01**
['24305','24304','24303','24279','24302','24301','24295','24280','24294','24297','24300','24293','24307','24321','24320','24319','24278','24352','24309','24306','24357','24313','24312','24311','24310','24308','24277','24296']

**2113-01-W01-01-02**
['24507','24525','24518','24523','24517','24516','24515','24514','24509','24508']

**2113-01-W01-01-03**
['24560','24562','24563','24564','24565','24566','24583','24584','24585','24586','24588','24590','24591','24592','24593','24594','24618','24619','24623','24624','24625','24631','24633','24595','24596','24602','24604','24605','24611','24609','24617']

**2113-01-W01-01-04**
['24641','24640','24639','24638','24637','24636','24635','24634','24651','24648','24647','24646','24645','24644','24643','24642','24663','24661','24657','24656','24655','24654','24653','24652','24688','24687','24685','24684','24683','24667','24666','24664']

**2113-01-W01-01-06**
['25890','25909','25847','25908','25885','25862','25859','25857','25856','25855','25854','25853','25852','25849','25848','25931','25929','25925','25935','25918','25923','25915','25914','25913','25912','25904','25911','25903','25892','25891','25858']

**2113-01-W01-01-08**
['26625','26623','26622','26612','26611','26608','26607','26601','26563','26562','26545','26544','26542','26531','26521','26626','26600','26598','26596','26595','26592','26588','26580','26579','26578','26577','26576','26575','26574','26572','26569','26564']

**2113-01-W01-01-09**
['25949','25945','25944','25943','25939','25938','25937','25984','25983','25968','25967','25966','25965','25962','25955','25992','25991','25990','25989','25988','25987','25986','25985','26009','26008','26007','26006','26002','25997','25994','25993']

**2113-01-W01-01-11**
['26247','26239','26232','26224','26216','26207','26205','26204','26203','26202','26181','26174','26140','26271','26138','26166','26092','26061','26060','26058','26081','26110','26071','26107','26159','26135','26133','26128','26126','26124','26118','26115']

**2113-01-W01-01-12**
['25432','25430','25428','25406','25408','25410','25414','25416','25403','25392','25391','25387','25382','25380','25378','25376','25374','25372','25418','25420','25426','25424','25422','25434','25435','25485','25436','25488','25438','25490','25492','24512']

**2113-01-W01-01-13**
['25588','25567','25566','25565','25564','25553','25503','25599','25598','25594','25593','25592','25591','25590','25587','25585','25584','25605','25583','25581','25580','25579','25578','25576','25575','25573','25572','25571','25570','25569']

**2113-01-W01-01-14**
['25665','25664','25663','25643','25642','25641','25639','25638','25680','25679','25678','25677','25676','25675','25674','25667','25692','25724','25690','25689','25688','25686','25683','25681','25742','25741','25740','25739','25738','25732','25726','25725','25708','25707','25704','25700','25695','25694']

**2113-01-W01-01-15**
['26795','26794','26678','26677','26675','26674','26673','26672','26683','26745','26744','26771','26768','26669','26786','26682','26772','26681','26680','26671','26670','26668','26666','26664','26663','26662','26644','26643','26642','26688','26685','26684','26705','26695','26641','26640','26765','26752','26753','26785','26767','26679','26730','26724','26748','26703','26702','26739']

**16**
['24726','24731','24732','24733','24734','24735','24736','24739','24740','24750','24751','24752','24753','24754','24755','24756','24758','24759','24772','24773','24774','24775','24783','24785','24789','24792','24793','24795','24796','24797']
This leaves me the following number of plates parsed into batches:

| i | run_id | date_started | n expected | n identified plates |
|---|---|---|---|---|
| 1 | 2113-01-W01-01-01 | 2011-03-16 | 28 | 28 |
| 2 | 2113-01-W01-01-02 | 2011-03-16 | 32 | 11 |
| 3 | 2113-01-W01-01-03 | 2011-03-21 | 32 | 31 |
| 4 | 2113-01-W01-01-04 | 2011-03-23 | 32 | 32 |
| 5 | 2113-01-W01-01-06 | 2011-04-11 | 32 | 31 |
| 6 | 2113-01-W01-01-08 | 2011-04-19 | 32 | 32 |
| 7 | 2113-01-W01-01-09 | 2011-04-21 | 32 | 31 |
| 8 | 2113-01-W01-01-11 | 2011-04-29 | 32 | 32 |
| 9 | 2113-01-W01-01-12 | 2011-05-24 | 32 | 31 |
| 10 | 2113-01-W01-01-13 | 2011-06-13 | 32 | 30 |
| 11 | 2113-01-W01-01-14 | 2011-06-21 | 40 | 38 |
| 12 | 2113-01-W01-01-15 | 2011-08-09 | 48 | 48 |
| x | 16 | unknown | 0 | 30 |

30 plates are not parsed into a batch. All of them are next to each other sequentially, and they do not obviously fit into any of the above batches. Visually, they are similar to each other but not to the other plates, so I am giving them their own label: Batch 16.

ErinWeisbart commented 2 years ago

I get the same results as my previous comment with platemap_barcode_plateid_cleanedup.csv

Reading through notes in the repo, it looks like plate 24512 was accidentally processed as 25412. 25412 is in the metadata and has barcode AU00024335, which puts it in batch 2113-01-W01-01-12 (2011-05-24). However, I'm guessing there was originally a 25412 acquired, as numerically it would fit in that batch. It is NOT in the final list of plates included in the dataset, and I think 24512 should belong in the chronological batch where I placed it above.

For reprocessing, I plan on keeping 24512 labeled as it is, but I want to confirm with @shntnu that this is okay, as this will mean the reprocessed data won't match the original data for this plate.

shntnu commented 2 years ago

@ErinWeisbart I remember breaking my head over this typo :D

> Reading through notes in the repo, it looks like plate 24512 was accidentally processed as 25412.

My guess is the opposite. Based on this note

https://github.com/broadinstitute/2015_Bray_GigaScience/blob/6d5e5e1b3119d6b42e9fac7cd1d58b0729276bd4/resolve_metadata.Rmd#L373-L374

and this one

https://github.com/broadinstitute/2015_Bray_GigaScience/blame/6d5e5e1b3119d6b42e9fac7cd1d58b0729276bd4/README.md#L36

my guess is that the "correct" id was 25412, but was mistakenly captured as 24512.

But for convenience, as noted in the README.md above, I used 24512 everywhere to avoid rework. To make things consistent, I created this file

https://github.com/broadinstitute/2015_Bray_GigaScience/blob/6d5e5e1b3119d6b42e9fac7cd1d58b0729276bd4/platemap_barcode_plateid_cleanedup_24512.csv#L52

in which I manually replaced 25412 with 24512.

This file assigns 24512 the barcode AU00024335, which, as you note, places it in 2113-01-W01-01-12.

Because there was apparently never actually a 24512, it doesn't make sense to put it in the batch that seems numerically more likely (2113-01-W01-01-02, the batch you have placed it in).

So for clarity, I think we should call this plate 24512 everywhere, assign it barcode AU00024335, and place it in 2113-01-W01-01-12.
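In code terms, a hypothetical sketch of that fix applied to the df2 dataframe from Erin's script above (not something that was actually run in this thread):

```python
# Hypothetical fix, reusing df2 from the batching script above:
# keep the name 24512, attach barcode AU00024335, and move the plate
# into run 2113-01-W01-01-12 (started 2011-05-24).
fix = df2["PlateID"] == "24512"
df2.loc[fix, "ASSAY_PLATE_BARCODE"] = "AU00024335"
df2.loc[fix, "run_id"] = "2113-01-W01-01-12"
df2.loc[fix, "date_started"] = "2011-05-24"
```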

Phew

ErinWeisbart commented 2 years ago

Phew is right! Thanks for helping figure this out. I will keep the plate labelled 24512 as it is labelled in the gallery, but will assign it AU00024335 and place it in 2113-01-W01-01-12 (as it was actually supposed to be plate 25412).

I've noticed that the actual .tif names have different prefixes across plates and thought that perhaps that could validate the batching above and aid in assigning the 30 unparsed plates. However, if I check the first and last plates (numerically) in each batch (as assigned above), they don't always match:

- Batch 1 = cdp2bioactives
- Batch 2 = cdp2bioactives
- Batch 3 = cdp2bioactives, cdp2w3
- Batch 4 = cdp2w3
- Batch 6 = cdp2w6x2, cdp2w8x2
- Batch 8 = cdp2w7x2
- Batch 9 = cdp2w8x2
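A hedged sketch of automating that prefix check against the gallery bucket; the images/ key layout below is an assumption for illustration, not confirmed in this thread:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
# Assumed key layout; adjust to the actual image paths in the gallery
ROOT = "cpg0012-wawer-bioactivecompoundprofiling/broad/images/CDRP/images"

def tif_prefix(plate):
    """Return the filename prefix of the first .tif found for a plate."""
    resp = s3.list_objects_v2(
        Bucket="cellpainting-gallery", Prefix=f"{ROOT}/{plate}/", MaxKeys=100
    )
    for obj in resp.get("Contents", []):
        name = obj["Key"].rsplit("/", 1)[-1]
        if name.endswith(".tif"):
            return name.split("_", 1)[0]  # e.g. "cdp2bioactives"
    return None

# First and last plates (numerically) of Batch 1, per the lists above
for plate in ["24277", "24357"]:
    print(plate, tif_prefix(plate))
```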

Going to just start looking at images rather than spend more time on the metadata!

shntnu commented 2 years ago

Going to just start looking at images rather than spend more time on the metadata!

💯

Thanks for trying to get to the bottom of it!

ErinWeisbart commented 2 years ago

Visually, plates in the unparsed set look like they belong together in a separate batch so I'm going to name them Batch 16 and have edited my comment above to reflect this.

For segmentation parameters, I have divided the plates into 4 groups. Final analysis pipelines will be labeled according to the lower threshold on their ID Secondary:

- JUMP_analysis_30k_003 = Batches 1, 2, 12, 13, 14
- JUMP_analysis_30k_00275 = Batches 3, 4, 6, 9
- JUMP_analysis_30k_0025 = Batches 8, 11, 15
- JUMP_analysis_30k_002 = Batch 16

Note that I have cleaned up the workspace folders so that the montages of the final parameters used, despite being initially selected across many batches of test segmentations, are now all contained in workspace/segment/.

Plates in JUMP_analysis_30k_003 (138 plates) ['24305','24304','24303','24279','24302','24301','24295','24280','24294','24297','24300','24293','24307','24321','24320','24319','24278','24352','24309','24306','24357','24313','24312','24311','24310','24308','24277','24296','24507','24525','24518','24523','24517','24516','24515','24514','24509','24508','25432','25430','25428','25406','25408','25410','25414','25416','25403','25392','25391','25387','25382','25380','25378','25376','25374','25372','25418','25420','25426','25424','25422','25434','25435','25485','25436','25488','25438','25490','25492','24512','25588','25567','25566','25565','25564','25553','25503','25599','25598','25594','25593','25592','25591','25590','25587','25585','25584','25605','25583','25581','25580','25579','25578','25576','25575','25573','25572','25571','25570','25569','25665','25664','25663','25643','25642','25641','25639','25638','25680','25679','25678','25677','25676','25675','25674','25667','25692','25724','25690','25689','25688','25686','25683','25681','25742','25741','25740','25739','25738','25732','25726','25725','25708','25707','25704','25700','25695','25694']
Plates in JUMP_analysis_30k_00275 (125 plates) ['24560','24562','24563','24564','24565','24566','24583','24584','24585','24586','24588','24590','24591','24592','24593','24594','24618','24619','24623','24624','24625','24631','24633','24595','24596','24602','24604','24605','24611','24609','24617','24641','24640','24639','24638','24637','24636','24635','24634','24651','24648','24647','24646','24645','24644','24643','24642','24663','24661','24657','24656','24655','24654','24653','24652','24688','24687','24685','24684','24683','24667','24666','24664','25890','25909','25847','25908','25885','25862','25859','25857','25856','25855','25854','25853','25852','25849','25848','25931','25929','25925','25935','25918','25923','25915','25914','25913','25912','25904','25911','25903','25892','25891','25858','25949','25945','25944','25943','25939','25938','25937','25984','25983','25968','25967','25966','25965','25962','25955','25992','25991','25990','25989','25988','25987','25986','25985','26009','26008','26007','26006','26002','25997','25994','25993']
Plates in JUMP_analysis_30k_0025 (112 plates) ['26625','26623','26622','26612','26611','26608','26607','26601','26563','26562','26545','26544','26542','26531','26521','26626','26600','26598','26596','26595','26592','26588','26580','26579','26578','26577','26576','26575','26574','26572','26569','26564','26247','26239','26232','26224','26216','26207','26205','26204','26203','26202','26181','26174','26140','26271','26138','26166','26092','26061','26060','26058','26081','26110','26071','26107','26159','26135','26133','26128','26126','26124','26118','26115','26795','26794','26678','26677','26675','26674','26673','26672','26683','26745','26744','26771','26768','26669','26786','26682','26772','26681','26680','26671','26670','26668','26666','26664','26663','26662','26644','26643','26642','26688','26685','26684','26705','26695','26641','26640','26765','26752','26753','26785','26767','26679','26730','26724','26748','26703','26702','26739']
Plates in JUMP_analysis_30k_002 (30 plates) ['24726','24731','24732','24733','24734','24735','24736','24739','24740','24750','24751','24752','24753','24754','24755','24756','24758','24759','24772','24773','24774','24775','24783','24785','24789','24792','24793','24795','24796','24797']
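Assuming the four lists above are bound to Python variables (hypothetical names below), a minimal sketch of turning these groups into a plate-to-pipeline lookup:

```python
# plates_003, plates_00275, plates_0025, plates_002 are the four lists above
pipeline_plates = {
    "JUMP_analysis_30k_003": plates_003,      # 138 plates
    "JUMP_analysis_30k_00275": plates_00275,  # 125 plates
    "JUMP_analysis_30k_0025": plates_0025,    # 112 plates
    "JUMP_analysis_30k_002": plates_002,      # 30 plates
}
plate_to_pipeline = {
    plate: pipeline
    for pipeline, plates in pipeline_plates.items()
    for plate in plates
}
# 138 + 125 + 112 + 30 = 405, matching the plate count in the gallery
assert len(plate_to_pipeline) == 405
```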

ErinWeisbart commented 1 year ago

Analysis done. Backends made. Do I hand it off now? (I had been handing off JUMP Production at this point to Niranj) Or are you wanting me to make profiles? @shntnu @annecarpenter

shntnu commented 1 year ago

We discussed in Slack that @niranjchandrasekaran would take it from here, i.e., create the profiles and run the validation script.

@ErinWeisbart I don't know if there's already a process for completing this step:

> Metadata completely filled out in Project Profiler Database (Imaging Platform internal use only)

I know that Rebecca was working on creating a process; could you check and let us know what's needed?

ErinWeisbart commented 1 year ago

@shntnu IP Project Profiler is now a lovely Airtable thanks to Rebecca. I don't think we have detailed usage instructions for it, so I'll add that to my list to make sure we get it documented better.

This is already in the IPPP database as 2008 12 04 Imaging CDRP for MLPCN (Imaging Platform) aka p_08cdrp. I can get the database updated to include all the information we have about this project. I suggest that I enter this re-analysis into the database as a new batch "CDRP re-analysis" - there is already "CDRP" and "CDRP Pilot" and this will allow us to keep pointers to the original analysis and the new analysis separate.

This raises a new question: the old data has a non-conforming structure (gigascience_profiles and gigascience_upload_targz are the workspace folders). Should I unzip the old files and arrange them in folders that fit our gallery structure, or are we okay with the old data being kept as-is?

shntnu commented 1 year ago

Thanks Erin

Entering the reanalysis as a new batch makes perfect sense.

I don’t think we care much about formatting the old version of the data to suit, so let’s leave that as is. In general, we want to avoid deviations from the canonical folder structure, but this is data we aren’t going to access again; it’s kept around just in case we ever want to go back to the old data (rare). Thanks for bringing this up!

niranjchandrasekaran commented 1 year ago

As I am getting ready to generate the profiles, I have a few questions (I may have other questions later).

shntnu commented 1 year ago
> - @shntnu Is there going to be a data repo for this dataset? If so, in which GitHub org should it be?

Yes, let's use broadinstitute because it fits better there than in jump-cellpainting (it wasn't created through JUMP).

> - @shntnu Profiles will be versioned using .dvc, right?

Yes

> The reason I ask is that all the production data is currently not versioned, so I wanted to check what I should do with this dataset.

Correct, https://github.com/jump-cellpainting/jump-orf-data/ is for Broad data, but we don't have a repo for the Compound data. This is because each partner processed their data separately. We (Broad) will create a single data repo for all the profiles in jump-cellpainting.

ErinWeisbart commented 1 year ago

> I am looking for all the following mapping information. Is the project profiler database the place to find them?

No, our current process means that the project profiler database is not where we will upload such information. For this project, it's all available (hopefully) in https://github.com/broadinstitute/2015_Bray_GigaScience as this is just a re-running of a previously published dataset.

> The backend/ folder contains all the plates, and they don't seem to be separated by batches. When I create profiles, is that how I should also structure the data?

Yes. I organized them into my best guess of batches for the purposes of determining segmentation parameters, but I don't actually know that they map to actual acquisition batches (as we couldn't figure that out), so it's best to keep them all in 1 batch.

niranjchandrasekaran commented 1 year ago

> @shntnu Since there is only one batch, when I create batch-level feature-selected profiles, it will assume that all plates came from the same batch. Just to be on the safer side, shall I create only the whole-experiment-level feature-selected profiles, and then, if the batch information is figured out, someone can create batch-level feature-selected profiles later? (Note: in this case, both types of feature selection will generate the same output, but the profiling-recipe will give them different names. This could avoid confusion in case this dataset does have batches.)

Perfect

> @shntnu Is 2022_07_13_CDRP a good name for the repo?

2015_Bray_GigaScience-data, so that there is some link with the other repo (2015_Bray_GigaScience) and so that we follow the de facto nomenclature (*-data) for data repos.
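Since the profiles in that repo will be DVC-tracked, a hedged sketch of how a later reader might fetch one programmatically with dvc.api (the in-repo path here is a guess; check the repo for the real layout):

```python
import dvc.api
import pandas as pd

# Hypothetical in-repo path; the actual layout of
# 2015_Bray_GigaScience-data may differ.
with dvc.api.open(
    "profiles/CDRP/24277/24277_normalized_feature_select_all.csv.gz",
    repo="https://github.com/broadinstitute/2015_Bray_GigaScience-data",
    mode="rb",
) as f:
    profiles = pd.read_csv(f, compression="gzip")
print(profiles.shape)
```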

niranjchandrasekaran commented 1 year ago

@ErinWeisbart There are 405 folders in the backend folder, but this file in the Gigascience repo seems to have 406 plates. Plate 25568 seems to be missing. I looked at your script in https://github.com/broadinstitute/cellpainting-gallery/issues/13#issuecomment-1224839584 and it also seems to be missing this plate. Do you know anything about this plate?

ErinWeisbart commented 1 year ago

I don't know anything about why it is missing, but it is not in the /images folder. (There are 405 plates in /images.)
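A hedged sketch of catching such gaps mechanically by diffing an expected plate list against the per-plate folders actually on S3 (the images path is assumed, as in the earlier sketch; plates_all.txt is the plate list file from the Gigascience repo, read here from a hypothetical local copy):

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
# Assumed key layout; adjust to the actual image paths in the gallery
ROOT = "cpg0012-wawer-bioactivecompoundprofiling/broad/images/CDRP/images/"

def plate_folders():
    """Collect per-plate 'folder' names under the images prefix, with paging."""
    plates, token = set(), None
    while True:
        kwargs = dict(Bucket="cellpainting-gallery", Prefix=ROOT, Delimiter="/")
        if token:
            kwargs["ContinuationToken"] = token
        resp = s3.list_objects_v2(**kwargs)
        plates |= {p["Prefix"].rstrip("/").rsplit("/", 1)[-1]
                   for p in resp.get("CommonPrefixes", [])}
        token = resp.get("NextContinuationToken")
        if not token:
            return plates

# Hypothetical local copy of the expected plate list from the Gigascience repo
expected = set(open("plates_all.txt").read().split())
print(sorted(expected - plate_folders()))  # plates with metadata but no images
```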

niranjchandrasekaran commented 1 year ago

@shntnu Do you happen to know anything about this plate? If not, is it ok to ignore it?

AnneCarpenter commented 1 year ago

A search of the whole repo (and the Bray GigaScience PDF!) for the plate number didn't yield anything other than the metadata for the plate.

If we can't find the images, we can definitely skip it. The images generally seem to be available, i.e., here for this plate: http://www.cellimagelibrary.org/images/45862, so in theory we could grab them and process them, but I think that increases the hassle significantly.

niranjchandrasekaran commented 1 year ago

Thanks Anne!

Based on various files in the Gigascience repo, these are the number of plates in this dataset

| File/Folder | Number of plates |
|---|---|
| cdrp_metadata_from_gigadb_100200 | 413 |
| barcode_platemap_25412.csv | 413 |
| platemap_barcode_plateid_cleanedup_24512.csv | 406 |
| plates_all.txt | 413 |
| backend and images on S3 | 405 |

This file suggests that 7 plates were removed from GigaScience, which brings the number of plates down from 413 to 406. I cannot find whether 25568 was also removed along with the other 7 plates.

Given that the images are also missing, for the purposes of this exercise, I am going to assume that there are only 405 plates and begin processing the profiles. I will also make a note of this in the new repo that will be created.

shntnu commented 1 year ago

> Given that the images are also missing, for the purposes of this exercise, I am going to assume that there are only 405 plates and begin processing the profiles

Good plan

@niranjchandrasekaran FYI this note is a good source of historical info about this dataset https://broadinstitute.atlassian.net/wiki/spaces/IP/pages/114638720/2017-04-19+CDP2+data+show+decent+quality+both+for+bioactive+and+DOS+compounds (but it does not explain the missing plate)

niranjchandrasekaran commented 1 year ago

I have processed the profiles and uploaded the csv.gz files to s3://cellpainting-gallery/cpg0012-wawer-bioactivecompoundprofiling. The profiles are also in https://github.com/broadinstitute/2015_Bray_GigaScience-data, and the .dvc files have been synced to S3. I ran John's validation script; the following is a summary of the files output by his script.

outputs.zip

**image_counts.csv**

All plates have 11520 images except for the following:

    dataset_id batch_id  plate_id  num_images
65       broad     CDRP     24623        7240
154      broad     CDRP     25432       11515
225      broad     CDRP     25732        2365
325      broad     CDRP     26521       11195
326      broad     CDRP     26531       11375
329      broad     CDRP     26545       11345
332      broad     CDRP     26564        2250
333      broad     CDRP     26569         500
334      broad     CDRP     26572        1750
335      broad     CDRP     26574        2750
336      broad     CDRP     26575        2710
337      broad     CDRP     26576        3270
338      broad     CDRP     26577        7020
339      broad     CDRP     26578        8520
340      broad     CDRP     26579       10160
341      broad     CDRP     26580       11270
343      broad     CDRP     26592       11270
344      broad     CDRP     26595       11480
345      broad     CDRP     26596       10955
346      broad     CDRP     26598       11065
347      broad     CDRP     26600       11315
348      broad     CDRP     26601       11515
356      broad     CDRP     26626       11330

@shntnu @ErinWeisbart Is this consistent with what you know about this dataset?
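For context, 11520 is consistent with 384 wells × 6 sites × 5 channels per plate (an inference from the count, not stated in this thread). A small sketch that would flag the short plates from image_counts.csv:

```python
import pandas as pd

# 384 wells x 6 sites x 5 channels = 11520 (inferred from the full-plate count)
EXPECTED = 384 * 6 * 5

counts = pd.read_csv("image_counts.csv")
short = counts.loc[counts["num_images"] < EXPECTED, ["plate_id", "num_images"]]
print(short.to_string(index=False))
```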

**batch.csv**

John, I am getting the following message even though the metadata and platemap files are there. Is there a reason why this may be happening?

object_id,batch_id,date,platemaps,barcode_platemap,message_00,message_01
CDRP,CDRP,CDRP,[],,Batch CDRP without barcode_platemap,Batch CDRP without platemaps

**plate.csv**

All plates generate the following errors:

- IllumBrightfield channel not in the illumination correction
- IllumBrightfield_H channel not in the illumination correction

@ErinWeisbart Does this make sense?

They also generate the following errors:

- normalized_feature_select_batch not in profiles
- normalized_feature_select_negcon_batch not in profiles

These are fine, as I only generated normalized_feature_select_all and normalized_feature_select_negcon_all profiles, and the code doesn't look for those files.

**unknown_object.csv**

The script complains about the presence of all the following types of files:

- *DVC/ folder
- *_all.csv.gz files
- pipelines/ folder
- gigascience_upload_targz/ folder
- gigascience_profiles/ folder
- segment/ folder

I am confident that all these folders and files should exist, and we don't have to remove any of them.

Should we run the feature validation script on this dataset? If so, can you run it, John?

@shntnu Can you add John to this repo and tag him so that he sees these questions?

ErinWeisbart commented 1 year ago

It is correct that not all plates had a full complement of images in S3 for me to re-process (I did not go back and match to the original acquisition/processing).

There are no brightfield images in this dataset, so it is fine that there are no IllumBrightfield images.

shntnu commented 1 year ago

> @shntnu Can you add John to this repo and tag him so that he sees these questions?

@johnarevalo 🎉

johnarevalo commented 1 year ago

@niranjchandrasekaran I re-ran the validation scripts; I'm getting the same output you got, but without the batch.csv file. Perhaps some change I made in the last commits fixed it.

The feature validation just checks that feature names match across partners. Should this dataset also have the same feature names as the JUMP partners?

ErinWeisbart commented 1 year ago

@johnarevalo it is the same pipeline (other than removing brightfield channels), so the feature names should match.

niranjchandrasekaran commented 1 year ago

Thanks for confirming, Erin. In that case, we won't have to do feature validation.

> I'm getting the same output you got, but without the batch.csv file. Perhaps some change I made in the last commits fixed it.

Thanks for checking, John. It was a user error. I had previously run the validation scripts when the metadata files were not on S3, and I forgot to delete the file before re-running the validation script.

We can now mark “Profiling complete” and “Run validation script to ensure completion” as complete (I don't have permission to mark them as complete). cc @shntnu

shntnu commented 1 year ago

> We can now mark “Profiling complete” and “Run validation script to ensure completion” as complete (I don't have permission to mark them as complete). cc @shntnu

Done

bethac07 commented 1 year ago

We are not certain exactly which commit of pycytominer was used for aggregation, but it was either 0a0ff15 or f3206da. Changes that were made locally when running collate for CDRP:

ubuntu@ip-172-31-44-207:~/ebs_tmp/cpg0012-wawer-bioactivecompoundprofiling/workspace/software/pycytominer$ git diff
diff --git a/pycytominer/cyto_utils/collate.py b/pycytominer/cyto_utils/collate.py
index 6fa3ada..3f2fa4c 100644
--- a/pycytominer/cyto_utils/collate.py
+++ b/pycytominer/cyto_utils/collate.py
@@ -99,7 +99,7 @@ def collate(
-            sync_cmd = f'aws s3 sync --exclude "*" --include "*/Cells.csv" --include "*/Nuclei.csv" --include "*/Cytoplasm.csv" --include "*/Image.csv" {remote_input_dir} {input_dir}'
+            sync_cmd = f'aws s3 sync --exclude "*" --include "*/Cells.csv" --include "*/Nuclei.csv" --include "*/Cytoplasm.csv" --include "*/Image.csv" {remote_input_dir} {input_dir} --no-sign-request'
@@ -161,7 +161,7 @@ def collate(
-            cp_cmd = ["aws", "s3", "cp", cache_backend_file, remote_backend_file]
+            cp_cmd = ["aws", "s3", "cp", cache_backend_file, remote_backend_file,"--profile","jump-cp-role-jump-cellpainting","--acl","bucket-owner-full-control","--metadata-directive","REPLACE"]
@@ -184,7 +184,7 @@ def collate(
-        cp_cmd = ["aws", "s3", "cp", remote_backend_file, backend_file]
+        cp_cmd = ["aws", "s3", "cp", remote_backend_file, backend_file,"--profile","jump-cp-role-jump-cellpainting","--acl","bucket-owner-full-control","--metadata-directive","REPLACE"]
@@ -210,7 +210,7 @@ def collate(
-        csv_cp_cmd = ["aws", "s3", "cp", aggregated_file, remote_aggregated_file]
+        csv_cp_cmd = ["aws", "s3", "cp", aggregated_file, remote_aggregated_file,"--profile","jump-cp-role-jump-cellpainting","--acl","bucket-owner-full-control","--metadata-directive","REPLACE"]

To run:

parallel --max-procs ${MAXPROCS} --ungroup --eta --joblog ../../log/${BATCH_ID}/collate.log --results ../../log/${BATCH_ID}/collate --files --keep-order python3 pycytominer/cyto_utils/collate_cmd.py ${BATCH_ID}  pycytominer/cyto_utils/database_config/ingest_config.ini {1} --tmp-dir ~/ebs_tmp --aws-remote=s3://${BUCKET}/${PROJECT_NAME}/broad/workspace :::: ${PLATES}

Check backends:

aws s3 ls s3://cellpainting-gallery/cpg0012-wawer-bioactivecompoundprofiling/broad/workspace/backend/CDRP/

shntnu commented 1 year ago

> Changes that were made locally when running collate for CDRP:

Thanks, Beth

I've noted in the top post the repo that Niranj used to make/store all the steps downstream of aggregate: https://github.com/broadinstitute/2015_Bray_GigaScience-data. So everything downstream of the aggregate step is fully reproducible via that repo.

Everything upstream of collate/aggregate is fully reproducible using the CellProfiler pipelines, stored in s3://cellpainting-gallery/cpg0012-wawer-bioactivecompoundprofiling/broad/workspace/pipeline, and also captured in the Experiment.csv files present in s3://cellpainting-gallery/cpg0012-wawer-bioactivecompoundprofiling/broad/workspace/analysis.

The collate/aggregate step is documented in https://github.com/broadinstitute/cellpainting-gallery/issues/13#issuecomment-1283028066, so I think that covers everything.

shntnu commented 1 year ago

As documented in https://github.com/jump-cellpainting/datasets/issues/33, the LoadData CSV files had an error (AGP = Mito), and this may have trickled downstream. @ErinWeisbart, does that sound right? Perhaps a bug when creating the LoadData CSVs? Feel free to use this or the other issue to discuss and track.
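A hedged sketch of the kind of check that would catch a channel swap like this in the LoadData CSVs; the FileName_Orig&lt;Channel&gt; column names follow common CellProfiler LoadData conventions and are an assumption here:

```python
import pandas as pd

def find_channel_collisions(load_data_csv):
    """Flag rows where the AGP and Mito columns point at the same file."""
    df = pd.read_csv(load_data_csv)
    # Assumed column names; adjust to the actual LoadData header
    dupes = df[df["FileName_OrigAGP"] == df["FileName_OrigMito"]]
    if len(dupes):
        print(f"{load_data_csv}: {len(dupes)} rows with AGP == Mito")
    return dupes
```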

ErinWeisbart commented 1 year ago

Trying to get a paper out by EOD today, but will make digging into this my top priority tomorrow morning. At a glance, yes, it looks like I had a mistake in CSV creation which, to my horror, has serious downstream repercussions.

ErinWeisbart commented 1 year ago

As of today (2023_02_07), this experiment has been fixed in the Cell Painting Gallery: new load_data.csvs, analysis, backends, and profiles.