v13 merge files updates and QC code rerun

jharenza commented 1 year ago

What data file(s) does this issue pertain to?

Put your question or report your issue here.

Per the QC code run here, several of the files below need to be re-merged. The print statements from the QC code's HTML output are listed below, each with a short explanation.

Please:

review
re-merge where required
put in the v13 folder s3://d3b-openaccess-us-east-1-prd-pbta/open-targets/v13/
update md5sum.txt and put in v13 folder s3://d3b-openaccess-us-east-1-prd-pbta/open-targets/v13/
Clone repo, download data, pull pre-release_QC branch, rerun QC code and commit to branch.

[1] "DNA biospecimen in histolgies missing in cnv-cnvkit.seg.gz =  16" --> check whether this is expected (i.e. results are empty), else re-merge
[1] "DNA biospecimen in histolgies missing in cnv-controlfreec.tsv.gz =  14" --> check whether this is expected (i.e. results are empty), else re-merge
[1] "DNA biospecimen in histolgies missing in cnv-gatk.seg.gz =  48" --> check whether this is expected (e.g. Hope samples did not have this run because no PON could be generated, which may explain it)
[1] "DNA biospecimen in cnv-gatk.seg.gz missing in histolgies =  105" --> these are WXS, but we do not run GATK CNV on WXS. Please 1) check the pipeline, 2) confirm that these are not in the merge matrices, and 3) re-merge using only the WGS samples in the histologies file
[1] "DNA biospecimen in histolgies missing in snv-consensus-plus-hotspots.maf.tsv.gz =  74" --> likely missing a lot of these in the SNV consensus and will need re-merge.
[1] "DNA biospecimen in histolgies missing in sv-manta.tsv.gz =  14" --> check whether these have 0 SVs and if this is expected
[1] "RNA biospecimen in fusion-annoFuse.tsv.gz missing in histolgies =  16" --> these seem to have been missed in v12 - there should be no samples in any merge matrix which are not in the histologies file. Need a re-merge using the histologies file
[1] "RNA biospecimen in fusion-arriba.tsv.gz missing in histolgies =  16"  --> these seem to have been missed in v12 - there should be no samples in any merge matrix which are not in the histologies file. Need a re-merge using the histologies file
[1] "RNA biospecimen in fusion-starfusion.tsv.gz missing in histolgies =  6"  --> these seem to have been missed in v12 - there should be no samples in any merge matrix which are not in the histologies file. Need a re-merge using the histologies file
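The counts above come from comparing biospecimen IDs in the histologies file against the IDs in each merged file, in both directions. A minimal Python sketch of that comparison follows; the project's QC code is an R notebook, so this is only an illustration, and the function name and toy IDs are invented here.

```python
def compare_biospecimens(histology_ids, merged_ids):
    """Return the IDs missing on each side, as sorted lists."""
    histology_ids = set(histology_ids)
    merged_ids = set(merged_ids)
    return {
        # Expected by histologies but absent from the merged file.
        "in_histologies_missing_in_merge": sorted(histology_ids - merged_ids),
        # Present in the merged file but unknown to histologies.
        "in_merge_missing_in_histologies": sorted(merged_ids - histology_ids),
    }


# Toy IDs, invented for illustration only.
histologies = ["BS_A", "BS_B", "BS_C"]
merged = ["BS_B", "BS_C", "BS_D"]
report = compare_biospecimens(histologies, merged)
for direction, ids in report.items():
    print(direction, "=", len(ids), ids)
```

The second direction should always be empty for a release, since no merged file should contain samples that are absent from histologies.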

What release are you using?

v13 pre-release files

cc @yuankunzhu @zzgeng

HuangXiaoyan0106 commented 1 year ago

@jharenza I checked the genomics file manifests; the results are below. Do you agree that I should re-merge using the files found so far?

check_details.xls

missing_cnvkit_ids: genomics files were found for 2/16.
missing_controlfreec_ids: genomics files were found for 0/14.
missing_gatk_ids: genomics files were found for 0/48.
missing_manta_ids: genomics files were found for 0/14.
missing_maf_ids: genomics file records were found for 60/74, but all of them are empty files. Example: ea8a9c77-d2f9-4063-ac8d-fa7e4527c8cb.consensus_somatic.norm.annot.public.maf
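One way to keep the three outcomes apart (no file record, a record with an empty file, a record with data that the merge dropped) is sketched below in Python. The manifest structure, a mapping from biospecimen ID to the number of data rows in its result file, is a hypothetical simplification and not the format of the actual genomics file manifest.

```python
from collections import Counter


def classify_missing(missing_ids, data_rows_by_id):
    """Label each missing ID by what the manifest says about it.

    data_rows_by_id is a hypothetical mapping of biospecimen ID to the
    number of data rows in its result file; IDs without a file are absent.
    """
    status = {}
    for bs_id in missing_ids:
        if bs_id not in data_rows_by_id:
            status[bs_id] = "no_file_record"
        elif data_rows_by_id[bs_id] == 0:
            status[bs_id] = "empty_file"
        else:
            # A file with data exists, so the sample was dropped in the merge.
            status[bs_id] = "needs_remerge"
    return status


# Toy data, invented for illustration only.
missing = ["BS_A", "BS_B", "BS_C"]
manifest = {"BS_A": 0, "BS_B": 12}
status = classify_missing(missing, manifest)
summary = Counter(status.values())
print(status)
print(dict(summary))
```

Only the `needs_remerge` group calls for a re-merge; the other two groups need a decision on whether the absence is expected for that algorithm.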

cc @zhangb1

I will also re-merge the fusion files based on histologies-base.tsv from the v13 folder.

jharenza commented 1 year ago

missing_maf_ids: 60/74 have found genomics files records, but all are empty files.

For this, I see 14 that are WGS. Since an empty MAF file is an allowable result, all 74 samples should have files. Maybe those 14 need to be added to the master genomics file manifest. Can you add those first, and make sure we have all 74 files in there? cc @yuankunzhu

For the others, do we know which are allowed to have empty files vs which do not produce any files, so we can know whether we have all of them?

HuangXiaoyan0106 commented 1 year ago

@jharenza I didn't find MAF files for those 14 samples. I checked the data service, and the only genomic files there are CRAM files. Example: https://kf-api-dataservice.kidsfirstdrc.org/genomic-files?biospecimen_id=BS_0SYMPQXP

Seems like they didn't run the somatic analysis, so we don't have records and won't have an empty file.

HuangXiaoyan0106 commented 1 year ago

For the others, do we know which are allowed to have empty files vs which do not produce any files, so we can know whether we have all of them?

As far as I know, we have no such records.

jharenza commented 1 year ago

For the others, do we know which are allowed to have empty files vs which do not produce any files, so we can know whether we have all of them?

As far as I know, we have no such records.

@yuankunzhu can your team create a list of which algorithms are expected to have empty files vs no files?

jharenza commented 1 year ago

@jharenza I didn't find MAF files for those 14 samples. I checked the data service, and the only genomic files there are CRAM files. Example: https://kf-api-dataservice.kidsfirstdrc.org/genomic-files?biospecimen_id=BS_0SYMPQXP

Seems like they didn't run the somatic analysis, so we don't have records and won't have an empty file.

I don't think that there is a 1:1 expectation of a file being in the genomics file manifest and being registered in the DRC - maybe this is what we are moving towards. I think you may have to iterate through every CAVATICA project and obtain these files, add them to the genomics file manifest, then you can add them to the merge. cc @yuankunzhu if you can inform better. Do you have an SOP on this?

zhangb1 commented 1 year ago

@jharenza I think those 14 samples are TUMOR only samples... we don't have somatic mafs generated for those

HuangXiaoyan0106 commented 1 year ago

@jharenza I think those 14 samples are TUMOR only samples... we don't have somatic mafs generated for those

Then I will re-merge:

fusion: based only on histologies-base.tsv.
cnv-gatk: WGS samples only.
cnvkit: add the 2 missing samples.

@jharenza Do you agree?
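The "WGS samples only" re-merge amounts to keeping just the rows whose biospecimen ID is a WGS sample in histologies. Here is a minimal Python sketch of that filter. The column names `Kids_First_Biospecimen_ID` and `experimental_strategy` appear in this thread; the `ID` column name for the seg file and the toy rows are assumptions made for illustration.

```python
import csv
import io


def allowed_ids(histologies_tsv, strategy="WGS"):
    """Collect biospecimen IDs in histologies with the given strategy."""
    reader = csv.DictReader(histologies_tsv, delimiter="\t")
    return {
        row["Kids_First_Biospecimen_ID"]
        for row in reader
        if row["experimental_strategy"] == strategy
    }


def filter_rows(merged_tsv, keep_ids, id_column="ID"):
    """Keep merged rows whose ID is allowed (id_column is an assumption)."""
    reader = csv.DictReader(merged_tsv, delimiter="\t")
    return [row for row in reader if row[id_column] in keep_ids]


# Toy inputs, invented for illustration only.
hist = "Kids_First_Biospecimen_ID\texperimental_strategy\nBS_1\tWGS\nBS_2\tWXS\n"
seg = "ID\tchrom\tloc.start\tloc.end\nBS_1\tchr1\t100\t200\nBS_2\tchr1\t100\t200\nBS_3\tchr2\t5\t9\n"

kept = filter_rows(io.StringIO(seg), allowed_ids(io.StringIO(hist)))
print([row["ID"] for row in kept])
```

The same filter also drops any sample that is absent from histologies (`BS_3` in the toy data), which is the rule proposed for the fusion files.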

jharenza commented 1 year ago

@jharenza I think those 14 samples are TUMOR only samples... we don't have somatic mafs generated for those

I am seeing 10 patients with 16 normals:

> normals <- adapt_2 %>%
+   filter(Kids_First_Participant_ID %in% need_maf$Kids_First_Participant_ID,
+          experimental_strategy == "WGS",
+          sample_type == "Normal") %>%
+   select(Kids_First_Participant_ID, Kids_First_Biospecimen_ID, experimental_strategy, sample_type) %>%
+   arrange(Kids_First_Participant_ID) %>%
+   print(n = 50)
# A tibble: 16 × 4
   Kids_First_Participant_ID Kids_First_Biospecimen_ID experimental_strategy sample_type
   <chr>                     <chr>                     <chr>                 <chr>      
 1 PT_4W0BP7F3               BS_43B4VH73               WGS                   Normal     
 2 PT_5Q52M9W8               BS_5NXD2WTC               WGS                   Normal     
 3 PT_5Q52M9W8               BS_CP2JB80M               WGS                   Normal     
 4 PT_DTP4MMRA               BS_CEZVJC67               WGS                   Normal     
 5 PT_E0QNEXZ8               BS_0NWDAGKQ               WGS                   Normal     
 6 PT_E0QNEXZ8               BS_MS5T612F               WGS                   Normal     
 7 PT_JSFBMK5V               BS_H4KBJDGN               WGS                   Normal     
 8 PT_T58EGJRX               BS_SX0GY5NN               WGS                   Normal     
 9 PT_T58EGJRX               BS_T81BRRVM               WGS                   Normal     
10 PT_WKYPN77B               BS_4SYCAGCF               WGS                   Normal     
11 PT_X0GD01P1               BS_DKV2THEC               WGS                   Normal     
12 PT_X0GD01P1               BS_XATN0RZD               WGS                   Normal     
13 PT_XWYNQBTK               BS_5Y6EK1XC               WGS                   Normal     
14 PT_XWYNQBTK               BS_882Q6W23               WGS                   Normal     
15 PT_XYBDYQDP               BS_4BWBYF3C               WGS                   Normal     
16 PT_XYBDYQDP               BS_864PZH47               WGS                   Normal

Seems like only PT_JRA29R0N does not have a normal.

What about the WXS sample, which is mioncoseq: BS_RES3E00R - is that missing variants or is the maf file missing?

cc @yuankunzhu

zhangb1 commented 1 year ago

BS_RES3E00R has an empty output MAF: https://cavatica.sbgenomics.com/u/d3b-bixu-ops/sd-bhjxbdqk-mioncoseq-somatic-mutations-hotspot-rerun/tasks/b51b3aff-619c-4fff-9a95-f7ce24522164/

jharenza commented 1 year ago

@HuangXiaoyan0106 I think doing

fusion: only based on histologies-base.tsv.

right now will be ok

We might want to pause on the CNV since we might have some more WGS harmonization upcoming.

zhangb1 commented 1 year ago

@jharenza Okay, sorry I found the issue, those 14 pairs are Unmatched in our NGScheck run, please see the details in the X01 matches 3.

https://chop365-my.sharepoint.com/:x:/r/personal/zhangb1_chop_edu/_layouts/15/Doc.aspx?sourcedoc=%7B9B7B36C4-493D-4193-BA66-F3BDC85677B0%7D&file=CBTN%20X01%20Tumor-Normal%20pairing.xlsx&action=default&mobileredirect=true

jharenza commented 1 year ago

@jharenza Okay, sorry I found the issue, those 14 pairs are Unmatched in our NGScheck run, please see the details in the X01 matches 3.

https://chop365-my.sharepoint.com/:x:/r/personal/zhangb1_chop_edu/_layouts/15/Doc.aspx?sourcedoc=%7B9B7B36C4-493D-4193-BA66-F3BDC85677B0%7D&file=CBTN%20X01%20Tumor-Normal%20pairing.xlsx&action=default&mobileredirect=true

ok thanks, good to know. Do we know if they have any matches to CBTN?

Also, did we check the RNA for mismatch as well since the RNA and DNA are co-extracted from the same tissue in most cases?

jharenza commented 1 year ago

@HuangXiaoyan0106 I think you can move forward with all merges as proposed now. I will remove these 14 mismatches from histologies.

zhangb1 commented 1 year ago

8337bb797d664e15c40b346ef696a592  cnv-cnvkit.seg.gz
c1e91dc8235b67280230d4e8110db51f  cnv-gatk.seg.gz
8fe3d574679def5becc6f216c780b46f  fusion-annoFuse.tsv.gz
b2f30598c875ca90c8ec3070d7a8ca0a  fusion-arriba.tsv.gz
86ab7f2ab2761148a28cfe548347e815  fusion-starfusion.tsv.gz

new merged files are in the v13 folder now. md5 file is updated too.
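For anyone regenerating md5sum.txt after a re-merge, a minimal Python sketch that produces lines in the same `<md5>  <filename>` layout as above is shown here. The helper names and the demo file are invented for illustration; this is not the script the team used.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large merged files are not read at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5sum_lines(paths):
    """Format one '<md5>  <basename>' line per file, sorted by path."""
    return [f"{md5_of_file(p)}  {os.path.basename(p)}" for p in sorted(paths)]


# Demo on a throwaway file, not a real release file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.tsv.gz")
    with open(demo_path, "wb") as handle:
        handle.write(b"hello")
    demo_lines = md5sum_lines([demo_path])
print("\n".join(demo_lines))
```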

jharenza commented 1 year ago

thanks @zhangb1 !

jharenza commented 1 year ago

hey @zhangb1 it looks like these are good, except that annoFuse has samples in it which are not in histologies: https://github.com/d3b-center/OpenPedCan-analysis/blob/672955532d0f60b31c75664ef6b7f3b29106e7b6/analyses/data-pre-release-qc/results/fusion-annoFuse-samples-missing-in-histologies.tsv

Can you re-merge that one using only those in histologies?

zhangb1 commented 1 year ago

@jharenza I just checked, and those are in the histologies file (the v13 one). Can you check again?

example:

TARGET-50-CAAAAR-11A-01R    11  11A CAAAAR  RNA-Seq Normal  Solid Tissue    NA  Kidney  Female  NA  Not Reported    314 NA  NA  poly-A stranded 3542    LIVING  NA  TARGET  NA  BCCAGSC NA  NA  NA  TARGET-50-CAAAAR    NA  NA  NA  NA  NA  NA  NA  NA  Female  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA

jharenza commented 1 year ago

Thanks @zhangb1 - it seems the code may be wrong. @zzgeng can you check why those samples come up as missing in histologies?

zzgeng commented 1 year ago

All of the samples that appear in annoFuse but not in histologies have sample_type == "normal", and the script filters those out (https://github.com/d3b-center/OpenPedCan-analysis/blob/dev/analyses/data-pre-release-qc/01-data-harmonization-qc.Rmd#L680C1-L684C41). Should I change the script? @jharenza

jharenza commented 1 year ago

Ahh, I see - yes you can change the script to assess any RNA-Seq
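To illustrate why the filter produced false positives, here is a small Python sketch. It is a simplified stand-in for the check, not the code in the Rmd, and the IDs, the "Tumor" label, and the function name are invented for illustration. When the reference set is restricted to tumor RNA-Seq samples, any normal sample in the fusion file is reported as missing in histologies; using every RNA-Seq sample removes the false report.

```python
def rna_histology_ids(rows, tumor_only):
    """Collect RNA-Seq biospecimen IDs, optionally restricted to tumors."""
    ids = set()
    for row in rows:
        if row["experimental_strategy"] != "RNA-Seq":
            continue
        if tumor_only and row["sample_type"] != "Tumor":
            continue
        ids.add(row["Kids_First_Biospecimen_ID"])
    return ids


# Toy histologies rows, invented for illustration only.
histologies = [
    {"Kids_First_Biospecimen_ID": "BS_T1", "experimental_strategy": "RNA-Seq", "sample_type": "Tumor"},
    {"Kids_First_Biospecimen_ID": "BS_N1", "experimental_strategy": "RNA-Seq", "sample_type": "Normal"},
    {"Kids_First_Biospecimen_ID": "BS_D1", "experimental_strategy": "WGS", "sample_type": "Tumor"},
]
fusion_ids = {"BS_T1", "BS_N1"}

# Tumor-only reference set: the normal sample is wrongly flagged.
flagged_before = sorted(fusion_ids - rna_histology_ids(histologies, tumor_only=True))
# Any RNA-Seq sample: nothing is flagged.
flagged_after = sorted(fusion_ids - rna_histology_ids(histologies, tumor_only=False))
print(flagged_before, flagged_after)
```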

jharenza commented 1 year ago

These seem to look good at the moment, so I will close this.
