carpenter-singh-lab / 2024_vanDijk_PLoS_CytoSummaryNet


Deciding on datasets to use #2

Open shntnu opened 2 years ago

shntnu commented 2 years ago

@EchteRobert asked

Last week I believe you mentioned that we should perhaps pick a different dataset to start with than the Stain5(?) dataset. Do you remember which one(s) you had in mind instead?

@niranjchandrasekaran had said:

Here are some options:

- Stain2 and Stain3 - lots of different experimental conditions; 1 plate per condition (no replicates). These two could be good datasets to burn through while Robert is coming up with his methods.
- Stain4, Plate1, Reagent1, and Stain5 - lots of different conditions; 3 or 4 replicate plates per condition. I guess Stain5 is the best dataset, so it could perhaps be used as a holdout set.

Based on this, I vote for starting with Stain2, @EchteRobert

EchteRobert commented 2 years ago

LGTM. I see that Percent Strong is used. Could you remind me what that considers as replicates? I can't find it in the issue https://github.com/jump-cellpainting/pilot-analysis/issues/15

shntnu commented 2 years ago

Here's the analysis https://github.com/jump-cellpainting/pilot-analysis/blob/master/1.cpjump-stain2/0.analyze-cpjump-stain2.ipynb

Does that help? Otherwise, might need @niranjchandrasekaran to clarify

niranjchandrasekaran commented 2 years ago

@EchteRobert Percent Strong in that analysis is the same as Percent Replicating that you have been using so far. You can use Metadata_broad_sample as the grouping feature for grouping replicates.
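For reference, Percent Replicating is typically computed as the fraction of replicate groups whose median pairwise correlation exceeds a null threshold (commonly the 95th percentile of correlations between non-replicate profiles). The sketch below is an illustration of that idea on synthetic data, grouping by `Metadata_broad_sample` - it is not the lab's actual implementation.

```python
import numpy as np
import pandas as pd

def percent_replicating(df, group_col="Metadata_broad_sample",
                        n_null=1000, seed=0):
    """Fraction of replicate groups whose median pairwise Pearson
    correlation exceeds the 95th percentile of a null built from
    random non-replicate profile pairs. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    features = df.drop(columns=[group_col]).to_numpy(float)
    groups = df[group_col].to_numpy()

    def median_pairwise(mat):
        # Median of the upper triangle of the replicate correlation matrix
        corr = np.corrcoef(mat)
        iu = np.triu_indices_from(corr, k=1)
        return np.median(corr[iu])

    rep_scores = [median_pairwise(features[groups == g])
                  for g in np.unique(groups)]

    # Null distribution: correlations of random non-replicate pairs
    null = []
    while len(null) < n_null:
        i, j = rng.integers(len(df), size=2)
        if groups[i] != groups[j]:
            null.append(np.corrcoef(features[i], features[j])[0, 1])

    threshold = np.percentile(null, 95)
    return float(np.mean([s > threshold for s in rep_scores]))
```

With tight replicates this approaches 1.0; shuffled labels should drive it toward roughly 0.05 by construction of the 95th-percentile threshold.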

EchteRobert commented 2 years ago

Great! Thank you both!

EchteRobert commented 2 years ago

@niranjchandrasekaran I can't find the platemap or metadata for Stain2 in the pilot-analysis GitHub repo. The notebook refers to one of those pages, but it seems to have been removed. Do you know if I can find it somewhere else?

niranjchandrasekaran commented 2 years ago

@EchteRobert Here are the platemap and metadata files - https://github.com/jump-cellpainting/JUMP-MOA
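Annotating profiles with such a platemap is typically a join on the well and sample columns. A toy pandas sketch - the column names here are illustrative assumptions, not the actual JUMP-MOA schema:

```python
import pandas as pd

# Toy platemap and per-well profiles; check the JUMP-MOA repo for
# the real column names before relying on this.
platemap = pd.DataFrame({
    "well_position": ["A01", "A02"],
    "broad_sample": ["BRD-A", "BRD-B"],
})
profiles = pd.DataFrame({
    "Metadata_Well": ["A01", "A02"],
    "Cells_AreaShape_Area": [1200.0, 980.0],
})

# Left-join so every profile row keeps its metadata annotation
annotated = profiles.merge(
    platemap, left_on="Metadata_Well", right_on="well_position", how="left"
)
```

A left join keeps all profile rows even if a well is missing from the platemap, which makes unannotated wells easy to spot as NaNs.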

EchteRobert commented 2 years ago

@shntnu @niranjchandrasekaran This is the list I made of the feature columns in the profiles available in the `s3://cellpainting-gallery/jump-pilot/source_4/workspace/backend/` bucket (fetched with `aws s3 cp`):

```
BR00112197binned.csv   - 4295 columns - 4293 features
BR00112200.csv         - 3530 columns - 3528 features
BR00112203.csv         - 4293 features
BR00112199.csv         - 4293 features
BR00113818.csv         - 4293 features
BR00113819.csv         - 4293 features
BR00113820.csv         - 4293 features
BR00113821.csv         - 4293 features
BR00112197repeat.csv   - 4293 features
BR00112197standard.csv - 4293 features
BR00112198.csv         - 4293 features
BR00112201.csv         - 4293 features
BR00112202.csv         - 4293 features
BR00112204.csv         - 4293 features
```

Should I just remove the BR00112200 plate from my data pool and, moving forward, expect that the other Stain experiments will have the same 4293 features? Or do you think that these features will change?

niranjchandrasekaran commented 2 years ago

According to this issue, plate BR00112200 only has 4 channels, which explains the lower feature count.

@EchteRobert I believe that, between Stain4 and Stain5, we updated the feature extraction pipeline, and therefore the number of features changes in Stain5 (5794 columns; it is likely that CPJUMP1 also has the same number of columns). Hence Stain2-4 will likely have 4293 features, but if it is not too difficult, I would suggest that you quickly check all the plates in Stain3 and Stain4 before proceeding further.

shntnu commented 2 years ago

Thanks @niranjchandrasekaran

@EchteRobert here's a way to do it quickly

This command

```shell
aws s3 ls --recursive s3://cellpainting-gallery/jump-pilot/source_4/workspace/backend/ \
  | grep backend | grep csv | grep Stain \
  | tr -s " " | cut -d" " -f4 \
  | parallel --keep-order "echo -n {1}; aws s3 cp s3://cellpainting-gallery/{1} - | csvcut -n | wc -l" \
  | grep -v "download failed" \
  | tr -s " " | tr " " "," \
  | csvcut -c 2,1 | sort -n > ~/Desktop/stain.csv
```

produces stain.csv; counting then shows that all Stain5 plates have 5794 columns, as Niranj said, and the remaining plates indeed all have 4295 columns (i.e. 4293 features + 2 metadata columns):

```shell
$ cat ~/Desktop/stain.csv | csvcut -c1 | sort | uniq -c
   1 3530
  60 4295
  60 5794
```

EchteRobert commented 2 years ago

Whoa! I didn't know such magic was possible with aws! Thank you, that saves me some time, as I was going to do it manually... I'll divide the data into those two groups then.
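Assuming stain.csv keeps the `column_count,s3_key` layout produced by the command above, the split into per-width groups can itself be one awk line (a sketch; the `plates_*cols.txt` output names are made up):

```shell
# Write each S3 key into a file named after its column count,
# e.g. plates_4295cols.txt and plates_5794cols.txt
awk -F, '{ print $2 > ("plates_" $1 "cols.txt") }' stain.csv
```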

shntnu commented 2 years ago

All that magic is bash, not AWS :) (except the `aws s3 cp <object> -` bit, which cats the `<object>`)