Closed ewafula closed 1 year ago
The short background is that I normally try to use cBio IDs I have already made to make openpedcan comparable to existing studies on cBio. A set of IDs briefly gave me naming trouble, likely because on data service that are marked as visible false
. This is what I found:
in OPC visible false
@jharenza most are DGD samples, but two are PBTA. I am not sure how to handle. I can ignore and handle the remaining sample naming issues I have (just the two PBTA ones) manually, or if anyone knows of why they should/should not be hidden, that'd be nice.
Also I should mention that I know that the histologies file isn't 100% ready yet, but using sort of a "pre-release" version to head off possible blockers, which seems to be paying off already
The short background is that I normally try to use cBio IDs I have already made to make openpedcan comparable to existing studies on cBio. A set of IDs briefly gave me naming trouble, likely because on data service that are marked as visible
false
. This is what I found: in OPC visible false @jharenza most are DGD samples, but two are PBTA. I am not sure how to handle. I can ignore and handle the remaining sample naming issues I have (just the two PBTA ones) manually, or if anyone knows of why they should/should not be hidden, that'd be nice.
I have no idea why these are hidden - I see some slack from 2 years ago where the PBTA samples might have been hidden just bc they didn't have an external_aliquot_id
, but we had them in V11 and I don't see any reason to hide. cc @youngnm @baileyckelly
For DGD, I also have no idea why some DGD would be hidden - @nicholasvk would you have insight into this? As far as I know, we should retain.
Just making a note of an edge case - sample 7316-6779_DGD_45
is in the histologies file for having a maf and fusion, but there are no fusions in the fusion-dgd.tsv.gz
, and the two variant calls are synonymous/intronic. For now, I will remove from the gene panel file, as a variant calls gas to exist in the maf file, and try to code a way to deal with it later.
just to clarify - FNL identified some artifacts w/ the DGD SNV frequencies tables during the pre-release QC, so for v12 the decision was to exclude the DGD from the SNV tables (thus no link to cbioportal in that table for this data set). The DGD data are being used for fusion and molecular subtyping. @ewafula @jharenza
Ok other issues to be resolved: There are repeat breakpoints in the fusion file. This is not allowed in cBio. For example:
BS_00FD2KMP 6:158818082 6:4431475 EZR--ENSG00000285424 frameshift EZR NA ENSG00000285424 NA CosmicCensus NA NA NA NA ARRIBA 1 NA NA FALSE PT_SW4Q1HZP [INTRACHROMOSOMAL[chr6:154.18Mb]], deletion Genic
BS_00FD2KMP 6:158818082 6:4431475 EZR--ENSG00000285424 other EZR NA ENSG00000285424 NA CosmicCensus NA NA NA NA STARFUSION 1 NA NA FALSE PT_SW4Q1HZP [INTRACHROMOSOMAL[chr6:154.18Mb]] Genic
BS_028YFYJ6 3:53846648 3:39169937 IL17RB--ENSG00000284669 frameshift IL17RB NA ENSG00000284669 NA Oncogene NA NA NA NA ARRIBA 1 NA NA FALSE PT_394ZA6P7 [INTRACHROMOSOMAL[chr3:14.67Mb]], duplication Genic
BS_028YFYJ6 3:53846648 3:39169937 IL17RB--ENSG00000284669 other IL17RB NA ENSG00000284669 NA Oncogene NA NA NA NA STARFUSION 1 NA NA FALSE PT_394ZA6P7 [INTRACHROMOSOMAL[chr3:14.67Mb]] Genic
DGD maf has NA
values for some of the numeric fields, like t_ref_count
, n_alt_count
. It seems also that Mutation_Status
must have some kind of controlled vocab, as an error is thrown for NA
values. I think instead of NA
, blanks are preferred.
DGD fusion just needs a lot of review - I think wither the format changed from v11 to v12, or the test data I was given was outdated, so now my ETL doesn't work for it.
Looking into a number of the DGD samples and the ones I looked into are all in the event link table and have been approved by DGD GCs for transfer. So it doesn't look like a consent issue. I think Nick will have to see why they ended up getting marked as not visible.
Ok, thanks! That helps. It's kinda weird that those NA
values are in the openpedcan maf, but seemingly not in the individual mafs I used to load for pbta_all
, which is in the bix_genomics_file."dgd_maf-genomics_file_manifest"
table. Why is that @luederm ? Was it a like a pandas merge thing? Would it be possible to regen with using blank strings when a field is empty, or does OpenPedCan require these NA
values?
@chinwallaa I guess for the sake of time, unless someone can correct this, I want to reiterate this situation in the fusion-putative-oncogenic.tsv
:
BS_00FD2KMP 6:158818082 6:4431475 EZR--ENSG00000285424 frameshift EZR NA ENSG00000285424 NA CosmicCensus NA NA NA NA ARRIBA 1 NA NA FALSE PT_SW4Q1HZP [INTRACHROMOSOMAL[chr6:154.18Mb]], deletion Genic
BS_00FD2KMP 6:158818082 6:4431475 EZR--ENSG00000285424 other EZR NA ENSG00000285424 NA CosmicCensus NA NA NA NA STARFUSION 1 NA NA FALSE PT_SW4Q1HZP [INTRACHROMOSOMAL[chr6:154.18Mb]] Genic
BS_028YFYJ6 3:53846648 3:39169937 IL17RB--ENSG00000284669 frameshift IL17RB NA ENSG00000284669 NA Oncogene NA NA NA NA ARRIBA 1 NA NA FALSE PT_394ZA6P7 [INTRACHROMOSOMAL[chr3:14.67Mb]], duplication Genic
BS_028YFYJ6 3:53846648 3:39169937 IL17RB--ENSG00000284669 other IL17RB NA ENSG00000284669 NA Oncogene NA NA NA NA STARFUSION 1 NA NA FALSE PT_394ZA6P7 [INTRACHROMOSOMAL[chr3:14.67Mb]] Genic
Note that these are the exact same fusions, exact same breakpoints. Only difference is the Fusion_Type
column differs, but seems more accurate in the ARRIBA
call. Therefore, they should have been merged into one line and caller.count
should be 2
, know what I mean? If there is a way to merge this properly, that would free up my time to work out the DGD fusion format issue I encountered. Otherwise, me addressing both issues may make loading this project "ASAP" more difficult.
Also, this is just 2 examples, the validator says this happens 851 times
Thanks @migbro , is there a reason for these 851 we dont just use the ARIBA calls vs the STAR FUSION calls ? @jharenza @ewafula what is the consenus logic for fusion calls - use ARIBA if exists, and if not use STARFUSION - did the logic change from v11 to v12 ?
@migbro I think you had collapsed these fusions in the latest pbta_all load- can you just do the same here?
Well, I collapsed using junction read counts. Also, it seems you already attempt to do this:
BS_00FD2KMP 3:160356013 14:67809289 IFT80--ZFYVE26 in-frame IFT80 NA ZFYVE26 NA NA Oncogene NA NA NA STARFUSION, ARRIBA 2 NA NA FALSE PT_SW4Q1HZP [INTERCHROMOSOMAL[chr3--chr14]], translocation Genic
For example there, your group collapsed. All I am saying is you're kind of asking me to collapse again to fix this. Which I can do, it just slows things down and feels like yet another band-aid.
@chinwallaa there was no change on our end, this is a new update to pedcbio which now requires / includes breakpoints causing some issues - @migbro if you can outline exactly what changes need to happen to putative oncogenic to get it into pedcbio load shape in a new ticket, someone can work on it.
Sure, I can make a ticket as the current method seems to fall short on collapsing on common breakpoints as your deliverable. It'll probably benefit downstream uses anyway as I'd think repeat breakpoints would require users to account for that to prevent double-counting, etc.
@chinwallaa @jharenza created this ticket here: https://github.com/PediatricOpenTargets/ticket-tracker/issues/533 Please assign whomever you believe is best suited
A positive update, I was able to resolve:
DGD fusion just needs a lot of review - I think wither the format changed from v11 to v12, or the test data I was given was outdated, so now my ETL doesn't work for it.
For this:
DGD maf has NA values for some of the numeric fields, like t_ref_count, n_alt_count. It seems also that Mutation_Status must have some kind of controlled vocab, as an error is thrown for NA values. I think instead of NA, blanks are preferred.
@chinwallaa I can't tell if your comment:
just to clarify - FNL identified some artifacts w/ the DGD SNV frequencies tables during the pre-release QC, so for v12 the decision was to exclude the DGD from the SNV tables (thus no link to cbioportal in that table for this data set). The DGD data are being used for fusion and molecular subtyping.
Was to encourage dropping DGD snvs for this load...otherwise either I replace NA
in those fields with blank strings, or @luederm recreates the maf and does as such
Ok, thanks! That helps. It's kinda weird that those
NA
values are in the openpedcan maf, but seemingly not in the individual mafs I used to load forpbta_all
, which is in thebix_genomics_file."dgd_maf-genomics_file_manifest"
table. Why is that @luederm ? Was it a like a pandas merge thing? Would it be possible to regen with using blank strings when a field is empty, or does OpenPedCan require theseNA
values?
The MAFs from the dgd_maf-genomics_file_manifest were generated from VCF files by Bo. I am not sure how the OpenPedCan was generated. The last version of the MAF I provided is here: https://github.com/d3b-center/bixu-tracker/issues/1694#issuecomment-1397204620. I can remove the NA values. Just n_ref_count and n_alt_count?
@luederm :
Mutation_Status
n_ref_count
n_alt_count
t_ref_count
Thanks!
Actually the current maf is in s3, v12 folder - there was an issue with the maf bo and team generated which had non matching column classes and we had to remove more artifacts. Please use what's in the v12 bucket
@luederm @migbro any further analysis on DGD won't be included in v12 - we need a process for after v12 which may be running it from scratch through D3b workflows as discussed in slack with @aadamk
Okay, so no need to remove the NAs for now?
Just took a look - it's the same maf, so please remove NA
ok, @luederm if you can remove NA from https://d3b-openaccess-us-east-1-prd-pbta.s3.amazonaws.com/open-targets/v12/snv-dgd.maf.tsv.gz for miguel, that is the latest maf
Also, a quick question hopefully @luederm , from this merged maf, aside from the usual synonymous/intergenic, etc filtering, should I use any of the following added info to remove variants from loading on to cBio (numbers are just the instance count of each value):
AIQC
240 Indel
300 Indel_v2
28932 Not available
500 OK
3 OK_screener
575 OK_v2
117 Uncertain
2 Uncertain_feature_missing
1553 Uncertain_v2
Variant_Tier
423 2
1546 3
265 4
29988 NA
Pipeline_version
2951 Early 2016
2521 SomaticPanel-v0.1.0
686 SomaticPanel-v1.0.0
1485 SomaticPanel-v1.0.2
681 SomaticPanel-v3.0.2
71 SomaticPanel-v3.1.0
2362 SomaticPanel-v3.2.0
208 SomaticPanel-v3.3.0
842 SomaticPanel-v3.3.1
115 SomaticPanel-v3.3.2
161 SomaticPanel-v3.3.3
44 SomaticPanel-v3.4.0
1305 SomaticPanel-v3.4.1
291 SomaticPanel-v3.4.2
693 SomaticPanel-v3.4.3
122 SomaticPanel-v3.4.4
376 canto-1.0
4358 canto-2.0
2247 canto-2.1
1735 canto-2.2
2970 canto-2.3
3488 canto-2.4
2510 canto-2.5
If we want to limit artifacts, we could filter out AIQC='Not available'. This will remove a large majority of the artifacts, but will completely remove older samples and will remove variants that were filtered before AIQC (variants with high MAF and synonymous variants with no HGMD/COSMIC class).
If we want to limit artifacts, we could filter out AIQC='Not available'. This will remove a large majority of the artifacts, but will completely remove older samples and will remove variants that were filtered before AIQC (variants with high MAF and synonymous variants with no HGMD/COSMIC class).
@luederm hmmm, could it be something like, remove AIQC='Not available' unless pipeline is x? Would that help balance things?
Almost there - there is an extra index column it seems in this file @luederm
Here it is with the index removed snv-dgd.na_removed.maf.tsv.gz
@luederm hmmm, could it be something like, remove AIQC='Not available' unless pipeline is x? Would that help balance things?
We could however most the artifacts are from the older samples so if we did it this way to avoid removing older samples, most of the artifacts would still exist in the dataset.
@jharenza Do you know what happened to the tier 1 variants? The original MAF I provided included 188 tier 1 variants but this data set has none. These are the most important variants to include because they are variants with strong clinical significance.
@jharenza Do you know what happened to the tier 1 variants? The original MAF I provided included 188 tier 1 variants but this data set has none. These are the most important variants to include because they are variants with strong clinical significance.
🤯 @zhangb1 can you clarify exactly the steps from @luederm providing a MAF and VCF, to your team annotating VCF with VEP 105, to adding the new MAF columns, to AIQC filtering, to germline filtering? Then, our team @ewafula had updated column classes since they did not match the KF-generated consensus MAF and per @kelseykeith and @luederm we removed another artifact for KMT2C. I am beginning to think we need to drop DGD SNV from this release in total. Is there a workflow somewhere that can be code reviewed?
I did the filter following the step here : https://github.com/d3b-center/bixu-tracker/issues/1694
run our gencode v39 annotation using the filter
gnomad_3_1_1_AF > 0.001
GNOMAD_AF_HIGH
then
From GENCODE v39 annotated MAF, add columns:
Called_in_Normal?
AIQC
Variant_Tier
Pipeline_Version
Remove variants which are Called_in_Normal? == True
and move these to a dgd-germline.maf.gz
For remaining variants, only keep columns:
AIQC
Variant_Tier
Pipeline_Version
Remove variants which fall into the following:
new_maf <- maf %>%
filter(Classification == NA,
AIQC %in% c("Artifact_v2", "Not_reportable," "Pseudo_region"))
I sent you both protected and public mafs.... not sure which one you were putting in the v12 bucket
Looks like some of them are getting moved to protected based on MAF but others are actually in https://d3b-openaccess-us-east-1-prd-pbta.s3.amazonaws.com/open-targets/v12/snv-dgd.maf.tsv.gz but have tier listed as NA. I think there was an issue mapping the variant tier to the maf. Not sure how it was done but it may be due to the extra letter in tier 1 classifications. All tier ones are either '1A' or '1B' so thinking it possible these were dropped if this column was read in as an int
@zhangb1 is it possible to update the code for this to fix the tiering information? I think there was also another step @zoomzoom1011 performed where we had to alter the germline masking filter... I forget exactly. Is this in a workflow somewhere you can easily update?
@jharenza I was checking my results I sent to you https://d3b.slack.com/archives/DHC31US86/p1674587441054659
in my results did including the '1A' or '1B' in the tiering, I don't know why it get missed.... did you do any filter by yourself or Eric did?
Loaded on to prod, closing!!!
What data file(s) does this issue pertain to?
In the v12 release, we have added new samples listed in the v12 histologies file, and hence we need to update case IDs on PedcbioPortal for new URL links required for updating the MTP snv-frequencies table.
What release are you using?
NOTE: please exclude DGD DNA from case id lists