Ind samples update

Purpose/implementation Section

Update and simplify the independent sample selection

What scientific question is your analysis addressing?

Generate WGS-only, WGS-preferred and WXS-preferred lists.

What was your approach?

The idea implemented here is simple:

After randomizing the histology file, subset to tumor samples:

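# Keep solid-tissue and bone-marrow tumor samples from DNA-based assays
# (assumes the %>% pipe is attached, e.g. via library(dplyr))
tumor_samples <- histology_df %>%
  dplyr::filter(sample_type == "Tumor",
                composition %in% c("Solid Tissue", "Bone Marrow"),
                experimental_strategy %in% c("WGS", "WXS", "Targeted Sequencing"))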
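The randomization step mentioned above is not shown in this description. A minimal sketch of one way to do it, assuming a fixed seed and a plain row shuffle; the seed value, the toy `histology_df`, and the `Kids_First_Biospecimen_ID` column are illustrative assumptions, not taken from the module:

```r
# Toy stand-in for the histology file (hypothetical IDs and values)
histology_df <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_1", "BS_2", "BS_3", "BS_4"),
  Kids_First_Participant_ID = c("PT_A", "PT_A", "PT_B", "PT_C"),
  sample_type = c("Tumor", "Tumor", "Tumor", "Normal"),
  composition = c("Solid Tissue", "Solid Tissue", "Bone Marrow", "Solid Tissue"),
  experimental_strategy = c("WGS", "WXS", "Targeted Sequencing", "WGS")
)

# Fix the seed so the shuffle, and therefore which biospecimen is later
# kept per participant, is reproducible between runs
set.seed(2021)  # assumed value, for illustration only

# Shuffle the row order; no rows are added or dropped
shuffled_df <- histology_df[sample(nrow(histology_df)), ]
```

Shuffling before selection means that when a participant has several eligible biospecimens of the same type, the one kept is an arbitrary but reproducible choice rather than an artifact of file order.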

For WGS-preferred lists:

1. We first subset the tumor samples to WGS samples only and generate WGS-specific lists. These lists contain a single occurrence of each Kids_First_Participant_ID associated with experimental_strategy = WGS.
2. Next, we subset the tumor samples to WXS and Targeted Sequencing to generate WXS/Panel-specific lists. These lists contain a single occurrence of each Kids_First_Participant_ID associated with either experimental_strategy = WXS or experimental_strategy = Targeted Sequencing.
3. Then we merge the two lists, placing the WGS-specific list first and the WXS/Panel-specific list second, and call dplyr::distinct to keep only the first occurrence of each Kids_First_Participant_ID.

Because the WGS-specific list comes first when calling dplyr::distinct, WGS-associated biospecimens are preferred over other biospecimens whenever a Kids_First_Participant_ID occurs more than once.
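The WGS-preferred selection described above can be sketched roughly as follows. This is an illustration on toy data, not the module's actual code: the biospecimen and participant IDs are made up, and the `Kids_First_Biospecimen_ID` column and the variable names are assumptions.

```r
library(dplyr)

# Toy tumor table (hypothetical IDs); PT_A has both a WXS and a WGS biospecimen
tumor_samples <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_1", "BS_2", "BS_3", "BS_4"),
  Kids_First_Participant_ID = c("PT_A", "PT_A", "PT_B", "PT_C"),
  experimental_strategy = c("WXS", "WGS", "Targeted Sequencing", "WGS")
)

# WGS-specific list: one row per participant among WGS samples
wgs_list <- tumor_samples %>%
  dplyr::filter(experimental_strategy == "WGS") %>%
  dplyr::distinct(Kids_First_Participant_ID, .keep_all = TRUE)

# WXS/Panel-specific list: one row per participant among WXS and panel samples
wxs_panel_list <- tumor_samples %>%
  dplyr::filter(experimental_strategy %in% c("WXS", "Targeted Sequencing")) %>%
  dplyr::distinct(Kids_First_Participant_ID, .keep_all = TRUE)

# Stack WGS first; distinct() keeps the first row it sees per participant,
# so a WGS biospecimen wins whenever a participant appears in both lists
wgs_preferred <- dplyr::bind_rows(wgs_list, wxs_panel_list) %>%
  dplyr::distinct(Kids_First_Participant_ID, .keep_all = TRUE)
```

In this toy example PT_A is represented by the WGS biospecimen BS_2, while PT_B, who has no WGS sample, is still covered by the panel biospecimen BS_3. Under the same assumptions, swapping the two arguments to `bind_rows()` would give the WXS/Panel samples priority instead.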

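For WXS-preferred lists: Same approach as above, but with the order reversed: the WXS/Panel-specific list is placed first when merging, so WXS/Panel-associated biospecimens are preferred for participants that occur more than once.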

What GitHub issue does your pull request address?

https://github.com/PediatricOpenTargets/ticket-tracker/issues/193

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

[ ] The 03-qc-independent-samples.nb.html now contains the total number of rows and the number of unique Kids_First_Participant_ID values in each file for each analysis type (WGS-only, WGS-preferred and WXS-preferred) for easy lookup and comparison.
[ ] Wasn't sure if we needed to refactor the RNA-seq lists.
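For the QC notebook item above, the check amounts to comparing row counts with unique participant counts per list. A minimal sketch of that comparison, assuming the lists are available as data frames; the list contents and the names `independent_lists` and `qc_summary` are hypothetical and not taken from the notebook:

```r
library(dplyr)

# Hypothetical independent-sample lists, one data frame per analysis type
independent_lists <- list(
  "WGS-only" = data.frame(
    Kids_First_Participant_ID = c("PT_A", "PT_C"),
    Kids_First_Biospecimen_ID = c("BS_2", "BS_4")
  ),
  "WGS-preferred" = data.frame(
    Kids_First_Participant_ID = c("PT_A", "PT_C", "PT_B"),
    Kids_First_Biospecimen_ID = c("BS_2", "BS_4", "BS_3")
  ),
  "WXS-preferred" = data.frame(
    Kids_First_Participant_ID = c("PT_A", "PT_B", "PT_C"),
    Kids_First_Biospecimen_ID = c("BS_1", "BS_3", "BS_4")
  )
)

# One summary row per list: total rows vs. unique participants
qc_summary <- dplyr::bind_rows(lapply(names(independent_lists), function(nm) {
  df <- independent_lists[[nm]]
  data.frame(
    analysis_type = nm,
    total_rows = nrow(df),
    unique_participants = dplyr::n_distinct(df$Kids_First_Participant_ID)
  )
}))

# In a valid independent list every participant appears exactly once
stopifnot(all(qc_summary$total_rows == qc_summary$unique_participants))
```

If the two counts differ for any list, at least one participant contributes more than one biospecimen and the list is not independent.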

Is there anything that you want to discuss further?

Feasibility of the approach

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes

Results

What types of results are included (e.g., table, figure)?

Each-cohort and all-cohorts tables for each of the analysis types mentioned above

What is your summary of the results?

TBD

Reproducibility Checklist

[ ] The dependencies required to run the code in this pull request have been added to the project Dockerfile.
[ ] This analysis has been added to continuous integration.

Documentation Checklist

[ ] This analysis module has a README and it is up to date.
[ ] This analysis is recorded in the table in analyses/README.md and the entry is up to date.
[ ] The analytical code is documented and contains comments.

AlexsLemonade / OpenPBTA-analysis

Ind samples update #1181