FredHutch / gimap

Genetic Interaction MAPping for dual target CRISPR screens
https://fredhutch.github.io/gimap/
0 stars 0 forks source link

Annotation schema #25

Closed cansavvy closed 3 months ago

cansavvy commented 5 months ago
# Description Starting to dig into the code and address #5 So far there are three files that are needed from depmap. The package itself is being quite glitchy and we don't really need it because we can just download the files directly ourselves https://www.bioconductor.org/packages/release/data/experiment/manuals/depmap/man/depmap.pdf All files are downloaded from: https://figshare.com/articles/dataset/DepMap_22Q2_Public/19700056 From Depmap we download: - Sample metadata (where we connect what cell line is being called -- this is its own argument) - Control genes list - TPM data - CN data And we also have the Berger Lab PGRNA file downloaded from: https://media.addgene.org/cms/filer_public/a9/9a/a99a9328-324b-42ff-8ccc-30c544b899e4/pgrna_library.xlsx Still working on figuring out some of the details and in a different time on a different PR we'll have to figure out how to make this more flexible for if people are not using the Berger lab pgRNAid library. ## Review request 1. Obviously just seeing if this makes sense and does what we need it to do as compared to the original is great. 2. Testing it manually using the example code in the documentation is also good. 3. Any recommendations for making the code cleaner, clearer, or just overall better is also welcome of course!
kweav commented 3 months ago

Thank you for doing all of the simplifications! Given there are so many more steps in the original than this, what's the best way to know if this simplified product recapitulates the main steps of the original analysis? I think I found the RDS file that was saved from the original analysis and can try to compare to that, but wanted to check if you have anything else in mind?

kweav commented 3 months ago

And when I tried to run the code as specified in the header (but without the filter function), I got this error

Error in `dplyr::left_join()`:
! Join columns in `x` must be present in
  the data.
✖ Problem with `gene1_symbol`.
---
Backtrace:
    ▆
 1. ├─gimap_dataset %>% gimap_annotate()
 2. ├─global gimap_annotate(.)
 3. │ └─... %>% ...
 4. ├─dplyr::left_join(...)
 5. ├─dplyr::left_join(., depmap_cn, by = c(gene1_symbol = "genes"))
 6. └─dplyr:::left_join.data.frame(., depmap_cn, by = c(gene1_symbol = "genes"))
Run rlang::last_trace(drop = FALSE) to see 4 hidden frames.
kweav commented 3 months ago

And when I tried to run the code as specified in the header (but without the filter function), I got this error

Error in `dplyr::left_join()`:
! Join columns in `x` must be present in
  the data.
✖ Problem with `gene1_symbol`.
---
Backtrace:
    ▆
 1. ├─gimap_dataset %>% gimap_annotate()
 2. ├─global gimap_annotate(.)
 3. │ └─... %>% ...
 4. ├─dplyr::left_join(...)
 5. ├─dplyr::left_join(., depmap_cn, by = c(gene1_symbol = "genes"))
 6. └─dplyr:::left_join.data.frame(., depmap_cn, by = c(gene1_symbol = "genes"))
Run rlang::last_trace(drop = FALSE) to see 4 hidden frames.

Ok, this is because the annotation_df is empty for me. Even though I'm using the updated get_example_data() function and have the text file it's loading (and it's non-empty). So I'll look into this

kweav commented 3 months ago

And when I tried to run the code as specified in the header (but without the filter function), I got this error

Error in `dplyr::left_join()`:
! Join columns in `x` must be present in
  the data.
✖ Problem with `gene1_symbol`.
---
Backtrace:
    ▆
 1. ├─gimap_dataset %>% gimap_annotate()
 2. ├─global gimap_annotate(.)
 3. │ └─... %>% ...
 4. ├─dplyr::left_join(...)
 5. ├─dplyr::left_join(., depmap_cn, by = c(gene1_symbol = "genes"))
 6. └─dplyr:::left_join.data.frame(., depmap_cn, by = c(gene1_symbol = "genes"))
Run rlang::last_trace(drop = FALSE) to see 4 hidden frames.

Ok, this is because the annotation_df is empty for me. Even though I'm using the updated get_example_data() function and have the text file it's loading (and it's non-empty). So I'll look into this

Once I manually loaded the annotation_df and reran the rest of the steps piecemeal, it worked

kweav commented 3 months ago

Just from looking at the tables, I'm observing slight mismatches in the TPM and CN values. I'll try to make some scatter plots to try to get a better global view of this.

kweav commented 3 months ago

Just from looking at the tables, I'm observing slight mismatches in the TPM and CN values. I'll try to make some scatter plots to try to get a better global view of this.

gene1_copy_number_comparison gene1_TPM_comparison gene2_copy_number_comparison gene2_TPM_comparison

kweav commented 3 months ago

Also compared which pgRNAs had missing values for the TPM and CN annotations

Which comparison gimap # NA GI Mapping # NA # overlap based on pgRNA ID (using intersect)
Gene1 log2 tpm 906 8769 625
Gene 2 log2 tpm 874 750 532
Gene 1 log2 CN 906 8697 601
Gene 2 log2 CN 874 564 500
kweav commented 3 months ago

Hope this was helpful -- Let me know what I can do to help follow-up.

cansavvy commented 3 months ago

Thank you for doing all of the simplifications! Given there are so many more steps in the original than this, what's the best way to know if this simplified product recapitulates the main steps of the original analysis? I think I found the RDS file that was saved from the original analysis and can try to compare to that, but wanted to check if you have anything else in mind?

I think comparing the results is a good strategy. And included in that investigating the differences to see if we think the differences in results are better, same, or worse. If results are same or "better" then the differences are okay.

cansavvy commented 3 months ago

Can you give me the exact code you ran locally? For example I'm running the code in the examples so:

devtools::load_all() 

gimap_dataset <- get_example_data("gimap") %>%
  gimap_filter() %>% 
  gimap_annotate()

Removing the filter step or adding it in doesn't make a difference at this point because its more or less an empty function at this point.

cansavvy commented 3 months ago

Just from looking at the tables, I'm observing slight mismatches in the TPM and CN values. I'll try to make some scatter plots to try to get a better global view of this.

This was really great you ran through this comparison! Thanks for doing this!

At this point in time however we're not applying a filter so these results will be affected by that.

However the CN and TPM values are from the public repository DepMap so those values should be the same. Can you share the code you used to make these plots so I can dig into that?

kweav commented 3 months ago

Can you give me the exact code you ran locally? For example I'm running the code in the examples so:

devtools::load_all() 

gimap_dataset <- get_example_data("gimap") %>%
  gimap_filter() %>% 
  gimap_annotate()

Removing the filter step or adding it in doesn't make a difference at this point because its more or less an empty function at this point.

I've uploaded an .Rmd and its rendered .html in a stacked branch on top of this in PR #32

kweav commented 3 months ago

Just from looking at the tables, I'm observing slight mismatches in the TPM and CN values. I'll try to make some scatter plots to try to get a better global view of this.

This was really great you ran through this comparison! Thanks for doing this!

At this point in time however we're not applying a filter so these results will be affected by that.

However the CN and TPM values are from the public repository DepMap so those values should be the same. Can you share the code you used to make these plots so I can dig into that?

Also in the .Rmd and its rendered .html in a stacked branch on top of this in PR #32

cansavvy commented 3 months ago

Also compared which pgRNAs had missing values for the TPM and CN annotations

Which comparison gimap # NA GI Mapping # NA # overlap based on pgRNA ID (using intersect) Gene1 log2 tpm 906 8769 625 Gene 2 log2 tpm 874 750 532 Gene 1 log2 CN 906 8697 601 Gene 2 log2 CN 874 564 500

These differences are almost certainly because we are using a more recent release of DepMap data than the original code did. The R package used by the original code was downloading 19Q3 whereas now we download DepMap 22Q2 from https://figshare.com/articles/dataset/DepMap_22Q2_Public/19700056

We should probably double check that there are no differences between these datasets that would make us for sure want to return to 19Q3.

cansavvy commented 3 months ago

Going to merge this and we can continue to iterate this in a new PR.