Purpose/implementation Section
Update and simplify the independent sample selection
What scientific question is your analysis addressing?
Generate WGS-only, WGS-preferred and WXS-preferred lists.
What was your approach?
The idea implemented here is simple:
After randomizing the histology file, subset to tumor samples:
For WGS-preferred lists: we first subset the tumor samples to WGS samples only and generate WGS-specific lists. Each of these lists contains a single occurrence of each Kids_First_Participant_ID with experimental_strategy = WGS. Next, we subset the tumor samples to WXS and Targeted Sequencing samples to generate WXS/Panel-specific lists. Each of these lists contains a single occurrence of each Kids_First_Participant_ID with either experimental_strategy = WXS or experimental_strategy = Targeted Sequencing. We then merge the two lists, keeping the WGS-specific list first and the WXS/Panel-specific list second, and call dplyr::distinct to keep only the first occurrence of each Kids_First_Participant_ID. Because the WGS-specific list comes first when dplyr::distinct is called, the WGS-associated biospecimens are preferred over other biospecimens whenever a Kids_First_Participant_ID occurs more than once.
For WXS-preferred lists: same approach as above, with the WXS/Panel-associated biospecimens preferred.
What GitHub issue does your pull request address?
https://github.com/PediatricOpenTargets/ticket-tracker/issues/193
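For reviewers who want to see the selection logic in isolation, the approach described above can be sketched roughly as follows. This is a minimal illustration on a made-up table, not the module's actual code: the data values, the Kids_First_Biospecimen_ID column, the object names, the seed, and the use of dplyr::bind_rows for the merge step are all assumptions.

```r
library(dplyr)

# Toy stand-in for the tumor rows of the histology file (values are made up).
histology_df <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_1", "BS_2", "BS_3", "BS_4", "BS_5"),
  Kids_First_Participant_ID = c("PT_A", "PT_A", "PT_B", "PT_C", "PT_C"),
  experimental_strategy = c("WXS", "WGS", "Targeted Sequencing", "WGS", "WGS"),
  stringsAsFactors = FALSE
)

# Randomize row order so the specimen kept per participant does not depend on file order.
set.seed(2021)  # arbitrary seed, for illustration only
tumor_samples <- histology_df[sample(nrow(histology_df)), ]

# WGS-specific list: one row per participant among WGS samples.
wgs_list <- tumor_samples %>%
  filter(experimental_strategy == "WGS") %>%
  distinct(Kids_First_Participant_ID, .keep_all = TRUE)

# WXS/Panel-specific list: one row per participant among WXS and Targeted Sequencing samples.
wxs_panel_list <- tumor_samples %>%
  filter(experimental_strategy %in% c("WXS", "Targeted Sequencing")) %>%
  distinct(Kids_First_Participant_ID, .keep_all = TRUE)

# WGS list first: distinct() keeps the first row per participant, so WGS wins on ties.
wgs_preferred <- bind_rows(wgs_list, wxs_panel_list) %>%
  distinct(Kids_First_Participant_ID, .keep_all = TRUE)

# Reversing the order prefers WXS/Panel biospecimens instead.
wxs_preferred <- bind_rows(wxs_panel_list, wgs_list) %>%
  distinct(Kids_First_Participant_ID, .keep_all = TRUE)
```

In this toy table, participant PT_A has both a WGS and a WXS biospecimen, so the two preferred lists differ only in which of those rows is kept for PT_A.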
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
[ ] The 03-qc-independent-samples.nb.html now contains the total number of rows and the number of unique Kids_First_Participant_ID values in each file for each analysis type (WGS-only, WGS-preferred and WXS-preferred) for easy lookup and comparison.
[ ] Wasn't sure if we needed to refactor the RNA-seq lists.
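The kind of check the QC notebook reports could look roughly like the sketch below. It is only an illustration: the list names and toy data frames are hypothetical stand-ins for the generated files, and the notebook's actual code may differ.

```r
library(dplyr)

# Hypothetical stand-ins for the generated list files (real files would be read from disk).
independent_lists <- list(
  wgs_only = data.frame(Kids_First_Participant_ID = c("PT_A", "PT_C")),
  wgs_preferred = data.frame(Kids_First_Participant_ID = c("PT_A", "PT_B", "PT_C")),
  wxs_preferred = data.frame(Kids_First_Participant_ID = c("PT_A", "PT_B", "PT_C"))
)

# Total rows and unique participants per list; the two counts should match
# when every participant appears exactly once in a list.
qc_summary <- bind_rows(lapply(names(independent_lists), function(list_name) {
  df <- independent_lists[[list_name]]
  data.frame(
    list_name = list_name,
    total_rows = nrow(df),
    unique_participants = n_distinct(df$Kids_First_Participant_ID)
  )
}))
```

A mismatch between total_rows and unique_participants for any list would indicate that a participant is represented more than once.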
Is there anything that you want to discuss further?
Feasibility of the approach
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Yes
Results
What types of results are included (e.g., table, figure)?
Each-cohort and all-cohorts tables for each of the analysis types mentioned above.
What is your summary of the results?
TBD
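Reproducibility Checklist
[ ] The dependencies required to run the code in this pull request have been added to the project Dockerfile.
[ ] This analysis has been added to continuous integration.
Documentation Checklist
[ ] This analysis module has a README and it is up to date.
[ ] This analysis is recorded in the table in analyses/README.md and the entry is up to date.
[ ] The analytical code is documented and contains comments.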