Open shntnu opened 2 years ago
LGTM. I see that Percent Strong is used. Could you remind me what that considers as replicates? I can't find it in the issue https://github.com/jump-cellpainting/pilot-analysis/issues/15
Here's the analysis https://github.com/jump-cellpainting/pilot-analysis/blob/master/1.cpjump-stain2/0.analyze-cpjump-stain2.ipynb
Does that help? Otherwise, might need @niranjchandrasekaran to clarify
@EchteRobert Percent Strong in that analysis is the same as the Percent Replicating you have been using so far. You can use Metadata_broad_sample as the feature for grouping replicates.
Great! Thank you both!
@niranjchandrasekaran I can't find the platemap or metadata for Stain2 in the pilot-analysis GitHub repository. They seem to have been removed, since the notebook refers to one of those pages. Do you know where else I can find them?
@EchteRobert Here are the platemap and metadata files - https://github.com/jump-cellpainting/JUMP-MOA
@shntnu @niranjchandrasekaran Here is the list I made of the number of feature columns in each profile available under s3://cellpainting-gallery/jump-pilot/source_4/workspace/backend/:
BR00112197binned.csv - 4295 columns - 4293 features
BR00112200.csv - 3530 columns - 3528 features
BR00112203.csv - 4293 features
BR00112199.csv - 4293 features
BR00113818.csv - 4293 features
BR00113819.csv - 4293 features
BR00113820.csv - 4293 features
BR00113821.csv - 4293 features
BR00112197repeat.csv - 4293 features
BR00112197standard.csv - 4293 features
BR00112198.csv - 4293 features
BR00112201.csv - 4293 features
BR00112202.csv - 4293 features
BR00112204.csv - 4293 features
Should I just remove the BR00112200 plate from my data pool and, moving forward, expect the other Stain experiments to have the same 4293 features? Or do you think these features will change?
According to this issue, plate BR00112200 only has 4 channels, which explains the number of features.
@EchteRobert I believe we updated the feature extraction pipeline between Stain4 and Stain5, so the number of features changes in Stain5 (5794 columns; it is likely that CPJUMP1 also has the same number of columns). Hence Stain2-4 will likely have 4293 features, but if it is not too difficult, I would suggest that you quickly check all the plates in Stain3 and Stain4 before proceeding further.
Thanks @niranjchandrasekaran
@EchteRobert here's a way to do it quickly
This command
aws s3 ls --recursive s3://cellpainting-gallery/jump-pilot/source_4/workspace/backend/|grep backend|grep csv|grep Stain|tr -s " "|cut -d" " -f4|parallel --keep-order "echo -n {1}; aws s3 cp s3://cellpainting-gallery/{1} -|csvcut -n|wc -l"|grep -v "download failed"|tr -s " "|tr " " ","|csvcut -c 2,1|sort -n > ~/Desktop/stain.csv
produces stain.csv. Counting then shows that all Stain5 plates have 5794 columns, as Niranj said, and the rest have 4295 columns (i.e. 4293 features + 2 metadata columns), apart from the single 3530-column plate.
cat ~/Desktop/stain.csv |csvcut -c1|sort|uniq -c
1 3530
60 4295
60 5794
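To make the "4293 features + 2 metadata columns" split easy to check on a single file, here is a minimal sketch that counts metadata and feature columns in a CSV header. It assumes metadata columns are prefixed with "Metadata_" (as in Metadata_broad_sample) and that no column name contains a comma. The function name count_columns, the sample file and its column names are made up for illustration.

```shell
#!/usr/bin/env bash
# Count metadata vs. feature columns in the header of a CSV file.
# Assumptions: metadata columns start with "Metadata_", and no column
# name contains a comma (the header is split naively on commas).
count_columns() {
  head -n 1 "$1" | tr ',' '\n' | awk '
    /^Metadata_/ { meta++ }
    END { printf "%d columns: %d metadata, %d features\n", NR, meta, NR - meta }'
}

# Tiny made-up profile file, for illustration only.
sample="$(mktemp)"
printf 'Metadata_Plate,Metadata_Well,Cells_AreaShape_Area,Nuclei_AreaShape_Area\nP1,A01,1.0,2.0\n' > "$sample"

count_columns "$sample"
```

For a downloaded profile, the same function would be called as count_columns BR00112197binned.csv.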
Whoa! I didn't know such magic was possible with aws! Thank you, that saves me some time, as I was going to do it manually... I'll divide the data into those two groups then.
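One possible way to do that split, sketched under the assumption that stain.csv has one "column_count,path" row per plate (which is what the command above writes). The sample rows and the plates_<count>_columns.txt file names are placeholders invented for this example, not real bucket keys.

```shell
#!/usr/bin/env bash
# Split a "column_count,path" listing into one file per column count.
workdir="$(mktemp -d)"

# Made-up stand-in for stain.csv; the paths are placeholders.
cat > "$workdir/stain.csv" <<'EOF'
3530,example/Stain2/plate_a.csv
4295,example/Stain2/plate_b.csv
4295,example/Stain3/plate_c.csv
5794,example/Stain5/plate_d.csv
EOF

# Write each path to plates_<count>_columns.txt inside workdir.
awk -F, -v out="$workdir" \
  '{ print $2 > (out "/plates_" $1 "_columns.txt") }' "$workdir/stain.csv"

# Show how many plates ended up in each group.
wc -l "$workdir"/plates_*_columns.txt
```

Each resulting file lists the plates that share a column count, so the 4295-column and 5794-column groups can be processed separately and the 3530-column plate set aside.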
@EchteRobert asked
@niranjchandrasekaran had said:
Based on this, I vote for starting with Stain2, @EchteRobert