Closed jharenza closed 1 year ago
@jharenza I checked the genomics file manifests, below are the checking results. Do you agree with me re-merging the results using the files found so far?
- 2/16 have found genomics files.
- 0/14 have found genomics files.
- 0/48 have found genomics files.
- 0/14 have found genomics files.
- 60/74 have found genomics files records, but all are empty files. Example: ea8a9c77-d2f9-4063-ac8d-fa7e4527c8cb.consensus_somatic.norm.annot.public.maf
cc @zhangb1
And I will re-merge fusion files based on the histologies-base.tsv from the v13 folder.
missing_maf_ids: 60/74 have found genomics files records, but all are empty files.
For this, I see 14 which are WGS, and it seems that since we have empty MAF files as an allowable value, this should be a case of all 74 samples having files. Maybe those 14 need to be added to the master genomics file manifest. Can you add those first, and make sure we have all 74 files in there? cc @yuankunzhu
For the others, do we know which are allowed to have empty files vs which do not produce any files, so we can know whether we have all of them?
@jharenza I didn't find MAF files for those 14 samples. I checked the data service, and the only genomic files registered for them are CRAM files. Example: https://kf-api-dataservice.kidsfirstdrc.org/genomic-files?biospecimen_id=BS_0SYMPQXP
It seems the somatic analysis was not run for them, so we don't have records and there won't be an empty file either.
For the others, do we know which are allowed to have empty files vs which do not produce any files, so we can know whether we have all of them?
As far as I know, we have no such records.
@yuankunzhu can your team create a list of which algorithms are expected to have empty files vs no files?
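The distinction being asked about (a file with records, an empty file, or no file at all) can be made explicit per sample. Below is a minimal Python sketch under stated assumptions: the function name, the three status labels, and the `BS_EXAMPLE*` IDs are hypothetical and are not taken from the QC code or the manifests.

```python
from enum import Enum


class FileStatus(Enum):
    HAS_RECORDS = "file with variant records"
    EMPTY_FILE = "file exists but is empty"
    NO_FILE = "no file registered"


def classify_samples(expected_ids, manifest_ids, nonempty_ids):
    """Assign each expected biospecimen one of three file states.

    expected_ids: biospecimens that should appear in the merged file
    manifest_ids: biospecimens with a file in the genomics file manifest
    nonempty_ids: biospecimens whose file contains at least one record
    """
    manifest_ids = set(manifest_ids)
    nonempty_ids = set(nonempty_ids)
    statuses = {}
    for bs_id in expected_ids:
        if bs_id not in manifest_ids:
            statuses[bs_id] = FileStatus.NO_FILE
        elif bs_id in nonempty_ids:
            statuses[bs_id] = FileStatus.HAS_RECORDS
        else:
            statuses[bs_id] = FileStatus.EMPTY_FILE
    return statuses


# Hypothetical biospecimen IDs, for illustration only.
expected = ["BS_EXAMPLE1", "BS_EXAMPLE2", "BS_EXAMPLE3"]
in_manifest = ["BS_EXAMPLE1", "BS_EXAMPLE2"]
with_records = ["BS_EXAMPLE1"]

statuses = classify_samples(expected, in_manifest, with_records)
for bs_id, status in statuses.items():
    print(bs_id, status.value)
```

Only samples in the `NO_FILE` state would need follow-up, provided empty files are an allowed outcome for the caller in question.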
I don't think that there is a 1:1 expectation of a file being in the genomics file manifest and being registered in the DRC - maybe this is what we are moving towards. I think you may have to iterate through every CAVATICA project and obtain these files, add them to the genomics file manifest, then you can add them to the merge. cc @yuankunzhu if you can inform better. Do you have an SOP on this?
@jharenza I think those 14 samples are TUMOR only samples... we don't have somatic mafs generated for those
Then I will re-merge:
histologies-base.tsv.
@jharenza Do you agree with it?
I am seeing 10 patients with 16 normals:
> normals <- adapt_2 %>%
+ filter(Kids_First_Participant_ID %in% need_maf$Kids_First_Participant_ID,
+ experimental_strategy == "WGS",
+ sample_type == "Normal") %>%
+ select(Kids_First_Participant_ID, Kids_First_Biospecimen_ID, experimental_strategy, sample_type) %>%
+ arrange(Kids_First_Participant_ID) %>%
+ print(n = 50)
# A tibble: 16 × 4
Kids_First_Participant_ID Kids_First_Biospecimen_ID experimental_strategy sample_type
<chr> <chr> <chr> <chr>
1 PT_4W0BP7F3 BS_43B4VH73 WGS Normal
2 PT_5Q52M9W8 BS_5NXD2WTC WGS Normal
3 PT_5Q52M9W8 BS_CP2JB80M WGS Normal
4 PT_DTP4MMRA BS_CEZVJC67 WGS Normal
5 PT_E0QNEXZ8 BS_0NWDAGKQ WGS Normal
6 PT_E0QNEXZ8 BS_MS5T612F WGS Normal
7 PT_JSFBMK5V BS_H4KBJDGN WGS Normal
8 PT_T58EGJRX BS_SX0GY5NN WGS Normal
9 PT_T58EGJRX BS_T81BRRVM WGS Normal
10 PT_WKYPN77B BS_4SYCAGCF WGS Normal
11 PT_X0GD01P1 BS_DKV2THEC WGS Normal
12 PT_X0GD01P1 BS_XATN0RZD WGS Normal
13 PT_XWYNQBTK BS_5Y6EK1XC WGS Normal
14 PT_XWYNQBTK BS_882Q6W23 WGS Normal
15 PT_XYBDYQDP BS_4BWBYF3C WGS Normal
16 PT_XYBDYQDP BS_864PZH47 WGS Normal
Seems like only PT_JRA29R0N does not have a normal.
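The complement of that query (which participants needing a MAF have no WGS normal at all) can be computed directly rather than read off the table. This is a rough Python sketch, assuming the histologies rows are available as dictionaries keyed by the column names shown in the tibble; the function name and the `PT_EXAMPLE_*` / `BS_EXAMPLE_*` values are hypothetical.

```python
def participants_without_normal(rows, participant_ids, strategy="WGS"):
    """Return participants that have no Normal sample for the given strategy."""
    with_normal = {
        row["Kids_First_Participant_ID"]
        for row in rows
        if row["experimental_strategy"] == strategy
        and row["sample_type"] == "Normal"
    }
    return sorted(set(participant_ids) - with_normal)


# Hypothetical rows, for illustration only.
rows = [
    {"Kids_First_Participant_ID": "PT_EXAMPLE_A",
     "Kids_First_Biospecimen_ID": "BS_EXAMPLE_A1",
     "experimental_strategy": "WGS", "sample_type": "Normal"},
    {"Kids_First_Participant_ID": "PT_EXAMPLE_A",
     "Kids_First_Biospecimen_ID": "BS_EXAMPLE_A2",
     "experimental_strategy": "WGS", "sample_type": "Tumor"},
    {"Kids_First_Participant_ID": "PT_EXAMPLE_B",
     "Kids_First_Biospecimen_ID": "BS_EXAMPLE_B1",
     "experimental_strategy": "WGS", "sample_type": "Tumor"},
    {"Kids_First_Participant_ID": "PT_EXAMPLE_B",
     "Kids_First_Biospecimen_ID": "BS_EXAMPLE_B2",
     "experimental_strategy": "RNA-Seq", "sample_type": "Normal"},
]

need_maf = ["PT_EXAMPLE_A", "PT_EXAMPLE_B"]
no_normal = participants_without_normal(rows, need_maf)
print(no_normal)
```

In this toy input, the RNA-Seq normal for `PT_EXAMPLE_B` does not count, because the check is restricted to the WGS strategy.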
What about the WXS sample, which is mioncoseq: BS_RES3E00R - is that missing variants or is the MAF file missing?
cc @yuankunzhu
@HuangXiaoyan0106 I think doing
- fusion: only based on histologies-base.tsv.
right now will be ok
We might want to pause on the CNV since we might have some more WGS harmonization upcoming.
@jharenza Okay, sorry, I found the issue: those 14 pairs are Unmatched in our NGScheck run. Please see the details in the X01 matches 3.
ok thanks, good to know. Do we know if they have any matches to CBTN?
Also, did we check the RNA for mismatch as well since the RNA and DNA are co-extracted from the same tissue in most cases?
@HuangXiaoyan0106 I think you can move forward with all merges as proposed now. I will remove these 14 mismatches from histologies.
8337bb797d664e15c40b346ef696a592 cnv-cnvkit.seg.gz
c1e91dc8235b67280230d4e8110db51f cnv-gatk.seg.gz
8fe3d574679def5becc6f216c780b46f fusion-annoFuse.tsv.gz
b2f30598c875ca90c8ec3070d7a8ca0a fusion-arriba.tsv.gz
86ab7f2ab2761148a28cfe548347e815 fusion-starfusion.tsv.gz
new merged files are in the v13 folder now. md5 file is updated too.
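For anyone downloading the re-merged files, the checksums can be verified against the md5 file. The sketch below is one way to do it in Python with the standard library only, assuming the md5 file uses the usual `md5sum` layout of one `<hash> <filename>` pair per line; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5_manifest(manifest_path, data_dir):
    """Return {filename: "ok" | "mismatch" | "missing"} for each manifest line."""
    results = {}
    for line in Path(manifest_path).read_text().splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        expected, name = parts
        name = name.strip().lstrip("*")  # md5sum marks binary mode with "*"
        target = Path(data_dir) / name
        if not target.is_file():
            results[name] = "missing"
        elif file_md5(target) == expected.lower():
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

A call such as `verify_md5_manifest("md5sum.txt", ".")` in the download directory returns one status per listed file. On the command line, `md5sum -c md5sum.txt` performs the same check.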
thanks @zhangb1 !
hey @zhangb1 it looks like these are good, except that annoFuse has samples in it which are not in histologies: https://github.com/d3b-center/OpenPedCan-analysis/blob/672955532d0f60b31c75664ef6b7f3b29106e7b6/analyses/data-pre-release-qc/results/fusion-annoFuse-samples-missing-in-histologies.tsv
Can you re-merge that one using only those in histologies?
@jharenza I just checked, and those are in the histology file (the v13 one). Can you check again?
example:
TARGET-50-CAAAAR-11A-01R 11 11A CAAAAR RNA-Seq Normal Solid Tissue NA Kidney Female NA Not Reported 314 NA NA poly-A stranded 3542 LIVING NA TARGET NA BCCAGSC NA NA NA TARGET-50-CAAAAR NA NA NA NA NA NA NA NA Female NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Thanks @zhangb1 - seems maybe the code is wrong? @zzgeng can you check why those samples come up as missing in histologies?
All the samples that show up in annoFuse but not in histologies have sample_type == "normal", and those are filtered out in the script (https://github.com/d3b-center/OpenPedCan-analysis/blob/dev/analyses/data-pre-release-qc/01-data-harmonization-qc.Rmd#L680C1-L684C41). Should I change the script? @jharenza
Ahh, I see - yes you can change the script to assess any RNA-Seq
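The effect of that change can be illustrated with a small comparison: the set of known samples is built from every RNA-Seq row in histologies, without filtering on sample_type. This is a hedged Python sketch, not the code in `01-data-harmonization-qc.Rmd`; the function name and the example IDs are hypothetical, and the column names follow those shown earlier in this thread.

```python
def samples_missing_from_histologies(merged_ids, histology_rows, strategy="RNA-Seq"):
    """Return merged-file sample IDs with no histologies row for the strategy.

    No sample_type filter is applied, so normal samples count as present.
    """
    known = {
        row["Kids_First_Biospecimen_ID"]
        for row in histology_rows
        if row["experimental_strategy"] == strategy
    }
    return sorted(set(merged_ids) - known)


# Hypothetical histologies rows and fusion sample IDs, for illustration only.
histologies = [
    {"Kids_First_Biospecimen_ID": "BS_EXAMPLE_T1",
     "experimental_strategy": "RNA-Seq", "sample_type": "Tumor"},
    {"Kids_First_Biospecimen_ID": "BS_EXAMPLE_N1",
     "experimental_strategy": "RNA-Seq", "sample_type": "Normal"},
    {"Kids_First_Biospecimen_ID": "BS_EXAMPLE_W1",
     "experimental_strategy": "WGS", "sample_type": "Tumor"},
]
fusion_ids = ["BS_EXAMPLE_T1", "BS_EXAMPLE_N1", "BS_EXAMPLE_X1"]

missing = samples_missing_from_histologies(fusion_ids, histologies)
print(missing)
```

With a tumor-only filter, `BS_EXAMPLE_N1` would have been reported as missing even though it has a histologies row, which is the false positive described above.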
These seem to look good at the moment, so I will close this.
What data file(s) does this issue pertain to?
Put your question or report your issue here.
Per the QC code run here, we will need a re-merge of several of the files below. Below are the print statements from the HTML output of the QC code, with some explanation.
Please:
s3://d3b-openaccess-us-east-1-prd-pbta/open-targets/v13/
md5sum.txt
and put in v13 folder s3://d3b-openaccess-us-east-1-prd-pbta/open-targets/v13/
pre-release_QC
branch, rerun QC code and commit to branch.
What release are you using?
v13 pre-release files
cc @yuankunzhu @zzgeng