FredHutch / gimap

Genetic Interaction MAPping for dual target CRISPR screens
https://fredhutch.github.io/gimap/

Filtering step runs and stores result #40

Closed: kweav closed this 3 months ago

kweav commented 3 months ago

⚠️ Please look at PR #41 first, since this PR is based on it. ⚠️

As of this PR, the filtering steps can be run all together and the results of filtering are stored. The report also includes another df showing the effect of using multiple filters.

To make these changes, I've:

(a) Added a list for gimap_dataset$filtered_data within R/00-setup_data.R so that we store the filtered data, metadata, etc. In this section, I've included:

- a boolean, since the filtering step is optional, so later steps can see whether the filtering step was run and whether they should use that data
- the transformed log2 CPM values (filtered)
- the pgRNA IDs metadata (filtered)
- the pgRNA general metadata (filtered). Note: I noticed that this value was NULL, and I'm wondering whether this variable is needed at all: it isn't really referenced in the code before this, it's NULL in the example gimap_dataset we load from gimap, and the annotation step seems to grab its own metadata.

(b) Fixed a bug in R/01-qc.R where I passed the wrong parameter names to the .Rmd template.

(c) Renamed the output in the list from the qc_filter_plasmid() function to match the other filter function (filter and reportdf) and changed how I call them in R/02-filter.R and the template QC .Rmd

(d) Reverified/added target column selection parameters within the QC .Rmd as necessary

(e) Included finding a consensus filter of the possible filters (that works no matter how many filters the user wants)

(f) Stored the results following the consensus filter in the appropriate gimap_dataset$filtered_data list items (referenced in (a))

(g) Changed how the filter report df's are shown in the QC template .Rmd, because I wanted to be able to report the number of constructs that would be filtered if a combination of filters were used. So I stored the full filtering output for each filter and called the reportdf immediately to display it, then combined the potential filters themselves later on.

(h) Included a table with inline code to report the number and percent of overlap of pgRNA constructs that would be filtered given various combinations of the filters (to better help the user make informed decisions on which filters they want to use)

(i) Finally, added the filter step to the vignette and rendered it locally to make sure that the steps ran without error and that I had the expected number of pgRNA constructs left after filtering.
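
For anyone reviewing (e), (f), and (h), here is a minimal sketch on toy data of the logic those items describe. It is not the code in this PR: the filter names, the list element names, and the reading of "consensus" as "flagged by any selected filter" are all assumptions made for illustration.

```r
# Toy data: 5 pgRNA constructs, 2 replicates. All names are illustrative.
log2_cpm <- matrix(
  c(0.1, 0.2, 2.5, 0.3, 3.1,
    0.2, 0.1, 2.7, 0.4, 3.0),
  nrow = 5,
  dimnames = list(paste0("pg", 1:5), c("rep1", "rep2"))
)

# Each filter is a logical vector over constructs (TRUE = would be filtered out).
possible_filters <- list(
  zero_count      = c(TRUE, FALSE, FALSE, TRUE, FALSE),
  low_plasmid_cpm = c(TRUE, TRUE, FALSE, FALSE, FALSE)
)

# (e) Consensus over however many filters were selected.
# Assumption: a construct is dropped if ANY filter flags it; use `&` for ALL.
consensus_filter <- Reduce(`|`, possible_filters)

# (f) Store the filtered results plus a flag that the optional step was run.
filtered_data <- list(
  filter_step_run = TRUE,
  transformed_log2_cpm = log2_cpm[!consensus_filter, , drop = FALSE]
)

# (h) Number and percent of constructs flagged by each filter and by both.
n_flagged <- c(
  vapply(possible_filters, sum, integer(1)),
  both = sum(Reduce(`&`, possible_filters))
)
overlap_report <- data.frame(
  filter = names(n_flagged),
  n_filtered = unname(n_flagged),
  percent = round(100 * unname(n_flagged) / nrow(log2_cpm), 1)
)
print(overlap_report)
```

Because Reduce() folds over the list, the same line works whether the user picks one filter or several, which is the property item (e) is after.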

Requested Review:

The requested review for this is:

(1) A verification that it works for you too.

(2) Is the documentation sufficient outside of the vignette? I know documentation needs to be added to the vignette, but I want to wait until we finalize this code/these steps.

(3) Does the setup/returned product work with next steps? (e.g., do we need other data to be filtered/stored, like count_norm, or just log2_cpm? I looked at the old code and it seems like log2_cpm is the main workhorse.)

(4) Where/when should I output a report of pgRNA constructs which have a raw count of 0 for all replicates? I'm thinking that's good information to have whether or not those constructs are filtered, so I could output it to another file when the bar plot is plotted in the QC template.

(5) Would a good next step be to output a list of the pgRNA IDs that are filtered out? If so, where should that file live?

(6) Are there any other spots that I haven't noted throughout these stacked PRs that could use validation of user input?
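
On questions (4) and (5), one possible shape for those outputs, sketched on toy counts. The count matrix, the example filter, the output directory, and the file names are placeholders made up for illustration; where the files should actually live is still the open question.

```r
# Toy raw counts: 4 pgRNA constructs x 3 replicates (illustrative values).
raw_counts <- matrix(
  c(0, 0, 0,
    5, 0, 2,
    0, 0, 0,
    3, 4, 1),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("pg", 1:4), paste0("rep", 1:3))
)

# (4) Constructs whose raw count is 0 in every replicate.
all_zero <- rowSums(raw_counts == 0) == ncol(raw_counts)
zero_count_ids <- rownames(raw_counts)[all_zero]

# (5) IDs removed by a filter (made-up logical vector, TRUE = filtered out).
example_filter <- c(TRUE, TRUE, FALSE, FALSE)
filtered_out_ids <- rownames(raw_counts)[example_filter]

# Write both lists to placeholder files in a temporary directory.
out_dir <- tempdir()
zero_file <- file.path(out_dir, "zero_count_pgRNAs.csv")
filtered_file <- file.path(out_dir, "filtered_out_pgRNAs.csv")
write.csv(data.frame(pgRNA_id = zero_count_ids), zero_file, row.names = FALSE)
write.csv(data.frame(pgRNA_id = filtered_out_ids), filtered_file, row.names = FALSE)
```

Keeping the zero-count report separate from the filtered-out list means it stays available even when the user skips the filtering step.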

Thank you! And please let me know if you have any questions or if I can make something clearer. I noticed in a PR that Howard left inline comments explaining changes, and that was really helpful for me, so I'll do that too, connecting them to my lettered list above.