broadinstitute / cmQTL

High-dimensional phenotyping to define the genetic basis of cellular morphology
BSD 3-Clause "New" or "Revised" License

Process new batch 2020_07_22_Batch7 #48

Closed mtegtmey closed 3 years ago

mtegtmey commented 4 years ago

@shntnu @gwaygenomics @jatinarora-upmc Images are being transferred now! Sorry for the delay.

mtegtmey commented 4 years ago

Data transfer is complete.

Data is at /imaging/analysis/2018_06_05_cmQTL/2020_03_05_Batch6

jatinarora-upmc commented 4 years ago

@mtegtmey where is this location?

shntnu commented 4 years ago

@jatinarora-upmc I'd need to process it and upload it to AWS S3 before you can access the profiles.

jatinarora-upmc commented 4 years ago

@mtegtmey @shntnu yeah, i don't have access to Broad. So, i would wait for AWS. Thanks so much guys !

jatinarora-upmc commented 4 years ago

@shntnu hi Shantanu, i am not sure if this github update is for me, and also i am not very familiar with github processes, so could you please update me when the upload for the data for plate 7 is done?

shntnu commented 4 years ago

> @shntnu hi Shantanu, i am not sure if this github update is for me, and also i am not very familiar with github processes, so could you please update me when the upload for the data for plate 7 is done?

Sure thing.

FYI: the reason you got the notification is that you had previously participated in this thread, so by default you get notified each time there is activity in it. The most recent activity is the draft pull request (PR) I created, #49. When merged, this PR will bring the processed data (and some documentation) into the repo. It is a draft PR, meaning that it isn't ready yet. Hope this helps :)

Regarding the data itself – I noticed that the last few steps didn't complete because the data size was unusually large (94.9 GB vs 22.8 GB from the last run of this plate). I haven't looked into the images so I don't know why that's the case, but for now, I have increased the disk size and rerun. I'll then tag Beth, requesting her to peek in to see if there's an obvious explanation.

shntnu commented 4 years ago

@jatinarora-upmc All set. The profiles are in the usual locations

shntnu commented 4 years ago

This plate has a lot of cells, but it doesn't seem very different from plates we have seen in the past. [2 images attached]

See https://github.com/broadinstitute/cmQTL/blob/master/1.profile-cell-lines/5.inspect-all-profiles.md

jatinarora-upmc commented 4 years ago

Thanks so much Shantanu @shntnu, all seems pretty good for now. I am QC-ing plate 7 and will let you know the outcome.

jatinarora-upmc commented 4 years ago

@shntnu @mtegtmey guys, I am doing QC for plate 7 and noticed that 11% of cells have NA in ~2000 features. Is that plausible? I am concerned because not even 1% of cells had NA across so many features in the previous 6 plates.

shntnu commented 4 years ago

@jatinarora-upmc one would need to repeat the analysis of https://github.com/broadinstitute/cmQTL/blob/master/1.profile-cell-lines/8.inspect-plate-7.md to figure out what's going on. I'll try getting to it before our Tuesday meeting.

shntnu commented 4 years ago

@jatinarora-upmc Of the 4910 cells I sampled, 72 cells had NA's in all 1875 features. Once these are excluded, you have very few if any features with NAs, other than the correlation features.

jatinarora-upmc commented 4 years ago

@shntnu , yep that's right. I find it strange compared to previous plates. Overall, 11% of such cells with NA in 1875 features -- compared to almost no cells with NA across even 10 features on previous plates. I am concerned why it is like this, and am not sure if I should include this plate 7 for now. What do you think? Have you already seen this sort of thing with any other project?

shntnu commented 4 years ago

I noticed only 72/4910 = ~1% of cells had NA's in 1875 features. Once you exclude this 1%, you are left with only 1 cell (in my sample) that had NA in any feature. Details are here https://github.com/broadinstitute/cmQTL/blob/master/1.profile-cell-lines/9.inspect-new-plate-7.md

Are you sure the number is 11%?

jatinarora-upmc commented 4 years ago

@shntnu I saw that 1,151,570 out of 1,289,440 cells survived my usual QC, which is a loss of ~11% of cells.
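The two percentages being compared in this thread can be recomputed from the counts quoted above. This is only a quick arithmetic check; the variable names are illustrative and do not come from the repo.

```python
# Counts quoted in the thread (whole plate, before and after QC).
total_cells = 1_289_440
cells_after_qc = 1_151_570
cells_lost = total_cells - cells_after_qc
loss_fraction = cells_lost / total_cells

# Counts from the sampled inspection (cells with NA in all 1875 features).
sampled_cells = 4910
all_na_cells = 72
all_na_fraction = all_na_cells / sampled_cells

print(f"lost to QC: {cells_lost} cells ({loss_fraction:.1%})")   # 137870 cells (10.7%)
print(f"all-NA cells in sample: {all_na_fraction:.1%}")          # 1.5%
```

The gap between roughly 10.7% and roughly 1.5% is what the rest of the discussion tries to explain; the two numbers need not measure the same thing.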

shntnu commented 4 years ago

Is it behaving differently from what you saw here https://github.com/broadinstitute/cmQTL/issues/30#issuecomment-622193208?

jatinarora-upmc commented 4 years ago

@shntnu yep, it behaves very differently in that there are way more cells now (11%) with NA across ~2k features. Overall, post-QC, there are enough cells (~1.1M), but given these NAs, I am not sure whether the cell feature measurements are correct.

shntnu commented 4 years ago

Really strange that our analyses are not matching up. To clarify, in the analysis here https://github.com/broadinstitute/cmQTL/blob/master/1.profile-cell-lines/9.inspect-new-plate-7.md, if I drop the "bad" cells (~1%) i.e. cells with 1875 NA-valued features, I am left with exactly 1 cell that has an NA in any feature.

But if I understand correctly, you are saying that you have 11% of the cells that have NA in 1875 features. And perhaps many more that have NA in any feature. Is that right?

jatinarora-upmc commented 4 years ago

yep, there are 11% of cells with NA in one or more of those 1875 features. I am concerned only because this is so different from what I saw for previous plates (only a few hundred bad cells).

I sent you the number of NAs for the first 10k cells; maybe that can help.

shntnu commented 4 years ago

I think I'm confused :D But we might be getting close to figuring this out 👍

Do 11% of cells have NA in

shntnu commented 4 years ago

Also, inspecting your sampling, here's what I see

plate7_na_features %>% count(number_of_na) %>% knitr::kable()
| number_of_na | n |
|---:|---:|
| 0 | 1867 |
| 8 | 1 |
| 18 | 1 |
| 33 | 2420 |
| 39 | 1 |
| 46 | 2 |
| 51 | 1 |
| 56 | 1 |

which suggests there may be just 33 "bad" cells in your sample
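For readers who don't use R, the tabulation above can be sketched in plain Python. This is a toy illustration under assumptions: it reads the table as per-feature NA counts (each row says how many features share a given number of NA cells), and the function name and data layout are hypothetical.

```python
from collections import Counter

def na_count_table(cells):
    """cells: list of dicts mapping feature name -> value (None means NA).

    Returns a Counter mapping number_of_na -> number of features
    that have exactly that many NA cells.
    """
    features = cells[0].keys()
    na_per_feature = {
        f: sum(1 for cell in cells if cell[f] is None) for f in features
    }
    return Counter(na_per_feature.values())

# Toy data: 5 cells, 3 features; two cells are NA in both f1 and f2.
cells = [
    {"f1": 1.0, "f2": 2.0, "f3": 0.1},
    {"f1": None, "f2": None, "f3": 0.2},
    {"f1": 1.5, "f2": 2.5, "f3": 0.3},
    {"f1": None, "f2": None, "f3": 0.4},
    {"f1": 0.9, "f2": 1.9, "f3": 0.5},
]
table = na_count_table(cells)
for number_of_na, n in sorted(table.items()):
    print(number_of_na, n)
```

When a large group of features shares the same NA count, as with the 2420 features at 33, it suggests (but does not prove) that the same small set of cells is NA across all of them.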

shntnu commented 4 years ago

Let's figure out the rest over a call :) Pretty sure the answer is just around the corner.

jatinarora-upmc commented 4 years ago

Sorry for confusing you. These 11% of cells have NA in one or more of the 1875 features. You are also right that fewer cells have NA across all 1875 features. My only question is whether there is anything specific to plate 7, because I did not see so many cells with NA in one or more features on previous plates.
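The distinction that caused the confusion, cells with NA in any feature versus cells with NA in all features, can be made concrete with a small sketch. The data layout and function name are assumptions for illustration only.

```python
def na_summary(cells, features):
    """Fraction of cells with NA (None) in any feature vs. in all features."""
    n_cells = len(cells)
    any_na = sum(1 for c in cells if any(c[f] is None for f in features))
    all_na = sum(1 for c in cells if all(c[f] is None for f in features))
    return {
        "any_na_fraction": any_na / n_cells,
        "all_na_fraction": all_na / n_cells,
    }

# Toy data: 7 complete cells, 2 partially missing, 1 fully missing.
cells = (
    [{"f1": 1.0, "f2": 2.0} for _ in range(7)]
    + [{"f1": None, "f2": 2.0}, {"f1": 1.0, "f2": None}]
    + [{"f1": None, "f2": None}]
)
summary = na_summary(cells, ["f1", "f2"])
print(summary)
```

In this toy case the "any" fraction is three times the "all" fraction, so two analyses of the same data can report very different percentages depending on which definition they use.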

mtegtmey commented 4 years ago

On my end, there was no change to any of the cell culture work prior to imaging.


shntnu commented 4 years ago

> Am sorry for confusing you. These 11% of cells have NA in one or more of 1875 features. You are also right that, there are fewer cells with NA across all 1875 features. My only doubt is that is there any thing specific to this plate 7, because i did not see so many cells having NA in one or more features on previous plates?

Cool! We can rapidly figure out the rest over a call tomorrow

shntnu commented 4 years ago

> i did not see so many cells having NA in one or more features on previous plates?

Also note this https://github.com/broadinstitute/cmQTL/issues/30#issuecomment-611036214

i.e. in that plate, you did see 23% of cells being flagged by your analysis, but we later figured out how to address it. I suspect it's the same (solvable) issue here, i.e. removing a handful of cells + removing some correlation features. More when we chat.

jatinarora-upmc commented 4 years ago

[2 images attached] @shntnu here is the distribution of NAs in 25k randomly sampled cells.

I think we should cross-check against the images of plate 7. Perhaps there is some problem there, such as the light exposure you suggested.

shntnu commented 4 years ago

@jatinarora-upmc Can you attach the files with the exact counts for these two plots? So that's

shntnu commented 4 years ago

For my notes, @bethac07 said we should check whether the cells with NA are localized in a few images.
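One way to run that check is sketched below, assuming each cell record carries an image identifier. The `image_id` key, the toy values, and the choice to flag a cell when any feature is NA are all assumptions, not the repo's actual schema.

```python
from collections import Counter

def na_cells_by_image(cells, features, image_key="image_id"):
    """Return (image, n_na_cells, n_cells) tuples, highest NA fraction first."""
    total = Counter()
    flagged = Counter()
    for cell in cells:
        image = cell[image_key]
        total[image] += 1
        # Assumption: a cell counts as "with NA" if any feature is missing.
        if any(cell[f] is None for f in features):
            flagged[image] += 1
    rows = [(image, flagged[image], total[image]) for image in total]
    return sorted(rows, key=lambda r: r[1] / r[2], reverse=True)

# Toy data: all NA cells come from a single image.
cells = [
    {"image_id": "A01_s1", "f1": 1.0, "f2": 2.0},
    {"image_id": "A01_s1", "f1": 1.1, "f2": 2.1},
    {"image_id": "B07_s3", "f1": None, "f2": None},
    {"image_id": "B07_s3", "f1": None, "f2": 0.4},
    {"image_id": "B07_s3", "f1": 0.8, "f2": 1.9},
]
ranking = na_cells_by_image(cells, ["f1", "f2"])
for image, n_na, n_cells in ranking:
    print(f"{image}: {n_na}/{n_cells} cells with NA")
```

If a handful of images account for most of the flagged cells, that points to an imaging problem in those fields rather than a plate-wide effect.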

jatinarora-upmc commented 4 years ago

In the summary stats I posted here yesterday, and for the above plots, I did not exclude Costes and correlation features. I will post summary stats again after excluding them.
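A rough sketch of that re-count: drop features whose names mention correlation or Costes, then recompute the fraction of cells with any NA. The feature names below are only illustrative, and matching on substrings is an assumption about how such features are named.

```python
def drop_features(feature_names, patterns=("correlation", "costes")):
    """Keep only features whose names contain none of the patterns."""
    return [
        name for name in feature_names
        if not any(p in name.lower() for p in patterns)
    ]

def fraction_with_any_na(cells, features):
    """Fraction of cells with NA (None) in at least one of the features."""
    flagged = sum(1 for c in cells if any(c[f] is None for f in features))
    return flagged / len(cells)

# Toy data with hypothetical feature names.
cells = [
    {"Cells_AreaShape_Area": 310.0, "Cells_Correlation_Costes_DNA_RNA": None},
    {"Cells_AreaShape_Area": 295.0, "Cells_Correlation_Costes_DNA_RNA": 0.7},
    {"Cells_AreaShape_Area": None, "Cells_Correlation_Costes_DNA_RNA": None},
    {"Cells_AreaShape_Area": 402.0, "Cells_Correlation_Costes_DNA_RNA": None},
]
all_features = list(cells[0].keys())
kept = drop_features(all_features)

before = fraction_with_any_na(cells, all_features)
after = fraction_with_any_na(cells, kept)
print(f"any-NA fraction before: {before:.2f}, after: {after:.2f}")
```

In the toy data the any-NA fraction falls from 0.75 to 0.25 once the correlation feature is excluded, which is the kind of shift the re-posted summary would reveal if those features drive most of the NAs.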

shntnu commented 4 years ago

I haven't been able to get to this yet

@jatinarora-upmc said

I checked the distribution of features on plate 7, they looked fine, and so I included it in my analysis. However, I think we should still figure out why we had a larger proportion of bad cells compared to previous plates.

so at least this is not blocking right now.

We can decide after our next call whether it is worth the effort to revisit this.

shntnu commented 3 years ago

> We can decide after our next call whether it is worth the effort to revisit this.

I don't think we decided one way or the other. But I'll go ahead and close this now and reopen if we decide to probe further