AlexsLemonade / OpenPBTA-analysis

The analysis repository for the Open Pediatric Brain Tumor Atlas Project

Ind samples update #1181

Closed komalsrathi closed 3 years ago

komalsrathi commented 3 years ago

Purpose/implementation Section

Update and simplify the independent sample selection

What scientific question is your analysis addressing?

Generate WGS-only, WGS-preferred, and WXS-preferred independent sample lists.

What was your approach?

The idea implemented here is simple:

After randomizing the histology file, subset to tumor samples:

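# Keep tumor biospecimens from solid tissue or bone marrow that were
# sequenced by WGS, WXS, or a targeted panel
tumor_samples <- histology_df %>%
  dplyr::filter(sample_type == "Tumor",
                composition %in% c("Solid Tissue", "Bone Marrow"),
                experimental_strategy %in% c("WGS", "WXS", "Targeted Sequencing"))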
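The randomization that precedes this filter is not shown in the description. A minimal sketch of what it could look like, assuming a fixed seed and a base R row shuffle (the seed value and the toy rows are hypothetical, not taken from the repository):

```r
# Hypothetical seed; any fixed value makes the shuffle repeatable
set.seed(2021)

# Toy stand-in for the histology file (made-up IDs)
histology_df <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_1", "BS_2", "BS_3", "BS_4"),
  Kids_First_Participant_ID = c("PT_A", "PT_A", "PT_B", "PT_B"),
  stringsAsFactors = FALSE
)

# Shuffle the row order; every row is kept exactly once
randomized_df <- histology_df[sample(nrow(histology_df)), ]
```

Because the later dplyr::distinct calls keep the first row seen for each participant, the shuffled order decides which biospecimen is retained when a participant has several candidates.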

For WGS-preferred lists:

1. Subset the tumor samples to WGS samples only and generate WGS-specific lists. These lists contain a single occurrence of each Kids_First_Participant_ID associated with experimental_strategy = WGS.
2. Subset the tumor samples to WXS and Targeted Sequencing to generate WXS/Panel-specific lists. These lists contain a single occurrence of each Kids_First_Participant_ID associated with either experimental_strategy = WXS or experimental_strategy = Targeted Sequencing.
3. Merge the two lists, with the WGS-specific list first and the WXS/Panel-specific list second, and call dplyr::distinct to keep only the first occurrence of each Kids_First_Participant_ID.

Because the WGS-specific list comes first when dplyr::distinct is called, the WGS-associated biospecimens are preferred over other biospecimens whenever a Kids_First_Participant_ID occurs more than once.
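The WGS-preferred procedure can be sketched roughly as follows. This is an illustration on toy data, not the code in the pull request; the Kids_First_Biospecimen_ID column name and all IDs are assumptions made for the example.

```r
library(dplyr)

# Toy stand-in for tumor_samples (made-up IDs; column name
# Kids_First_Biospecimen_ID is assumed for illustration)
tumor_samples <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_1", "BS_2", "BS_3", "BS_4", "BS_5"),
  Kids_First_Participant_ID = c("PT_A", "PT_A", "PT_B", "PT_C", "PT_C"),
  experimental_strategy = c("WXS", "WGS", "Targeted Sequencing", "WGS", "WGS"),
  stringsAsFactors = FALSE
)

# WGS-specific list: one row per participant among WGS samples
wgs_list <- tumor_samples %>%
  filter(experimental_strategy == "WGS") %>%
  distinct(Kids_First_Participant_ID, .keep_all = TRUE)

# WXS/Panel-specific list: one row per participant among WXS and panel samples
wxs_panel_list <- tumor_samples %>%
  filter(experimental_strategy %in% c("WXS", "Targeted Sequencing")) %>%
  distinct(Kids_First_Participant_ID, .keep_all = TRUE)

# WGS rows come first, so distinct() keeps them when a participant has both
wgs_preferred <- bind_rows(wgs_list, wxs_panel_list) %>%
  distinct(Kids_First_Participant_ID, .keep_all = TRUE)
```

In this toy input, PT_A has both a WXS and a WGS biospecimen and the WGS one (BS_2) is retained, while PT_B, who has only a panel sample, still appears in the final list.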

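For WXS-preferred lists: the same approach as above, except that the WXS/Panel-specific list is placed first and the WGS-specific list second in the merge, so that WXS/Panel-associated biospecimens are preferred.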
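Since both preferred lists follow the same pattern, the logic could be factored into one helper that takes the preferred and fallback strategies as arguments. The sketch below assumes that WXS-preferred simply reverses the merge order; select_independent is a hypothetical name, and the Kids_First_Biospecimen_ID column and toy IDs are made up for illustration.

```r
library(dplyr)

# Hypothetical helper: one biospecimen per participant, taking rows whose
# strategy is in `preferred` ahead of rows whose strategy is in `fallback`
select_independent <- function(samples, preferred, fallback) {
  preferred_list <- samples %>%
    filter(experimental_strategy %in% preferred) %>%
    distinct(Kids_First_Participant_ID, .keep_all = TRUE)
  fallback_list <- samples %>%
    filter(experimental_strategy %in% fallback) %>%
    distinct(Kids_First_Participant_ID, .keep_all = TRUE)
  # Preferred rows come first, so distinct() keeps them on ties
  bind_rows(preferred_list, fallback_list) %>%
    distinct(Kids_First_Participant_ID, .keep_all = TRUE)
}

# Toy input (made-up IDs)
toy_samples <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_1", "BS_2", "BS_3", "BS_4"),
  Kids_First_Participant_ID = c("PT_A", "PT_A", "PT_B", "PT_C"),
  experimental_strategy = c("WXS", "WGS", "Targeted Sequencing", "WGS"),
  stringsAsFactors = FALSE
)

wxs_preferred <- select_independent(
  toy_samples,
  preferred = c("WXS", "Targeted Sequencing"),
  fallback = "WGS"
)
```

Here PT_A keeps the WXS biospecimen (BS_1) rather than the WGS one, and PT_C, who has only WGS data, is still included. Swapping the two arguments gives the WGS-preferred list from the same function.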

What GitHub issue does your pull request address?

https://github.com/PediatricOpenTargets/ticket-tracker/issues/193

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Is there anything that you want to discuss further?

Feasibility of the approach

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes

Results

What types of results are included (e.g., table, figure)?

Tables for each cohort and for all cohorts combined, for each of the list types mentioned above

What is your summary of the results?

TBD

Reproducibility Checklist

Documentation Checklist