Bix Dev PedcbioPortal Support: OpenPedCan v12 on pedcbio

ewafula commented 1 year ago

What data file(s) does this issue pertain to?

In the v12 release, we have added new samples listed in the v12 histologies file, and hence we need to update case IDs on PedcbioPortal for new URL links required for updating the MTP snv-frequencies table.

What release are you using?

v12
Please find the relevant files in this S3 bucket - s3://d3b-openaccess-us-east-1-prd-pbta/open-targets/v12/

NOTE: please exclude DGD DNA from case id lists

migbro commented 1 year ago

The short background is that I normally try to use cBio IDs I have already made to make openpedcan comparable to existing studies on cBio. A set of IDs briefly gave me naming trouble, likely because on data service that are marked as visible false. This is what I found: in OPC visible false @jharenza most are DGD samples, but two are PBTA. I am not sure how to handle. I can ignore and handle the remaining sample naming issues I have (just the two PBTA ones) manually, or if anyone knows of why they should/should not be hidden, that'd be nice.

migbro commented 1 year ago

Also I should mention that I know that the histologies file isn't 100% ready yet, but using sort of a "pre-release" version to head off possible blockers, which seems to be paying off already

jharenza commented 1 year ago

The short background is that I normally try to use cBio IDs I have already made to make openpedcan comparable to existing studies on cBio. A set of IDs briefly gave me naming trouble, likely because on data service that are marked as visible false. This is what I found: in OPC visible false @jharenza most are DGD samples, but two are PBTA. I am not sure how to handle. I can ignore and handle the remaining sample naming issues I have (just the two PBTA ones) manually, or if anyone knows of why they should/should not be hidden, that'd be nice.

I have no idea why these are hidden - I see some slack from 2 years ago where the PBTA samples might have been hidden just bc they didn't have an external_aliquot_id, but we had them in V11 and I don't see any reason to hide. cc @youngnm @baileyckelly

For DGD, I also have no idea why some DGD would be hidden - @nicholasvk would you have insight into this? As far as I know, we should retain.

migbro commented 1 year ago

Just making a note of an edge case - sample 7316-6779_DGD_45 is in the histologies file for having a maf and fusion, but there are no fusions in the fusion-dgd.tsv.gz, and the two variant calls are synonymous/intronic. For now, I will remove from the gene panel file, as a variant calls gas to exist in the maf file, and try to code a way to deal with it later.

chinwallaa commented 1 year ago

just to clarify - FNL identified some artifacts w/ the DGD SNV frequencies tables during the pre-release QC, so for v12 the decision was to exclude the DGD from the SNV tables (thus no link to cbioportal in that table for this data set). The DGD data are being used for fusion and molecular subtyping. @ewafula @jharenza

migbro commented 1 year ago

Ok other issues to be resolved: There are repeat breakpoints in the fusion file. This is not allowed in cBio. For example:

BS_00FD2KMP     6:158818082     6:4431475       EZR--ENSG00000285424    frameshift      EZR     NA      ENSG00000285424 NA      CosmicCensus    NA      NA      NA      NA      ARRIBA  1       NA      NA      FALSE   PT_SW4Q1HZP     [INTRACHROMOSOMAL[chr6:154.18Mb]], deletion     Genic
BS_00FD2KMP     6:158818082     6:4431475       EZR--ENSG00000285424    other   EZR     NA      ENSG00000285424 NA      CosmicCensus    NA      NA      NA      NA      STARFUSION      1       NA      NA      FALSE   PT_SW4Q1HZP     [INTRACHROMOSOMAL[chr6:154.18Mb]]       Genic
BS_028YFYJ6     3:53846648      3:39169937      IL17RB--ENSG00000284669 frameshift      IL17RB  NA      ENSG00000284669 NA      Oncogene        NA      NA      NA      NA      ARRIBA  1       NA      NA      FALSE   PT_394ZA6P7     [INTRACHROMOSOMAL[chr3:14.67Mb]], duplication   Genic
BS_028YFYJ6     3:53846648      3:39169937      IL17RB--ENSG00000284669 other   IL17RB  NA      ENSG00000284669 NA      Oncogene        NA      NA      NA      NA      STARFUSION      1       NA      NA      FALSE   PT_394ZA6P7     [INTRACHROMOSOMAL[chr3:14.67Mb]]        Genic

DGD maf has NA values for some of the numeric fields, like t_ref_count, n_alt_count. It seems also that Mutation_Status must have some kind of controlled vocab, as an error is thrown for NA values. I think instead of NA, blanks are preferred. DGD fusion just needs a lot of review - I think wither the format changed from v11 to v12, or the test data I was given was outdated, so now my ETL doesn't work for it.

luederm commented 1 year ago

Looking into a number of the DGD samples and the ones I looked into are all in the event link table and have been approved by DGD GCs for transfer. So it doesn't look like a consent issue. I think Nick will have to see why they ended up getting marked as not visible.

migbro commented 1 year ago

Ok, thanks! That helps. It's kinda weird that those NA values are in the openpedcan maf, but seemingly not in the individual mafs I used to load for pbta_all, which is in the bix_genomics_file."dgd_maf-genomics_file_manifest" table. Why is that @luederm ? Was it a like a pandas merge thing? Would it be possible to regen with using blank strings when a field is empty, or does OpenPedCan require these NA values?

migbro commented 1 year ago

@chinwallaa I guess for the sake of time, unless someone can correct this, I want to reiterate this situation in the fusion-putative-oncogenic.tsv:

BS_00FD2KMP     6:158818082     6:4431475       EZR--ENSG00000285424    frameshift      EZR     NA      ENSG00000285424 NA      CosmicCensus    NA      NA      NA      NA      ARRIBA  1       NA      NA      FALSE   PT_SW4Q1HZP     [INTRACHROMOSOMAL[chr6:154.18Mb]], deletion     Genic
BS_00FD2KMP     6:158818082     6:4431475       EZR--ENSG00000285424    other   EZR     NA      ENSG00000285424 NA      CosmicCensus    NA      NA      NA      NA      STARFUSION      1       NA      NA      FALSE   PT_SW4Q1HZP     [INTRACHROMOSOMAL[chr6:154.18Mb]]       Genic
BS_028YFYJ6     3:53846648      3:39169937      IL17RB--ENSG00000284669 frameshift      IL17RB  NA      ENSG00000284669 NA      Oncogene        NA      NA      NA      NA      ARRIBA  1       NA      NA      FALSE   PT_394ZA6P7     [INTRACHROMOSOMAL[chr3:14.67Mb]], duplication   Genic
BS_028YFYJ6     3:53846648      3:39169937      IL17RB--ENSG00000284669 other   IL17RB  NA      ENSG00000284669 NA      Oncogene        NA      NA      NA      NA      STARFUSION      1       NA      NA      FALSE   PT_394ZA6P7     [INTRACHROMOSOMAL[chr3:14.67Mb]]        Genic

Note that these are the exact same fusions, exact same breakpoints. Only difference is the Fusion_Type column differs, but seems more accurate in the ARRIBA call. Therefore, they should have been merged into one line and caller.count should be 2, know what I mean? If there is a way to merge this properly, that would free up my time to work out the DGD fusion format issue I encountered. Otherwise, me addressing both issues may make loading this project "ASAP" more difficult.

migbro commented 1 year ago

Also, this is just 2 examples, the validator says this happens 851 times

chinwallaa commented 1 year ago

Thanks @migbro , is there a reason for these 851 we dont just use the ARIBA calls vs the STAR FUSION calls ? @jharenza @ewafula what is the consenus logic for fusion calls - use ARIBA if exists, and if not use STARFUSION - did the logic change from v11 to v12 ?

jharenza commented 1 year ago

@migbro I think you had collapsed these fusions in the latest pbta_all load- can you just do the same here?

migbro commented 1 year ago

Well, I collapsed using junction read counts. Also, it seems you already attempt to do this: BS_00FD2KMP 3:160356013 14:67809289 IFT80--ZFYVE26 in-frame IFT80 NA ZFYVE26 NA NA Oncogene NA NA NA STARFUSION, ARRIBA 2 NA NA FALSE PT_SW4Q1HZP [INTERCHROMOSOMAL[chr3--chr14]], translocation Genic For example there, your group collapsed. All I am saying is you're kind of asking me to collapse again to fix this. Which I can do, it just slows things down and feels like yet another band-aid.

jharenza commented 1 year ago

@chinwallaa there was no change on our end, this is a new update to pedcbio which now requires / includes breakpoints causing some issues - @migbro if you can outline exactly what changes need to happen to putative oncogenic to get it into pedcbio load shape in a new ticket, someone can work on it.

migbro commented 1 year ago

Sure, I can make a ticket as the current method seems to fall short on collapsing on common breakpoints as your deliverable. It'll probably benefit downstream uses anyway as I'd think repeat breakpoints would require users to account for that to prevent double-counting, etc.

migbro commented 1 year ago

@chinwallaa @jharenza created this ticket here: https://github.com/PediatricOpenTargets/ticket-tracker/issues/533 Please assign whomever you believe is best suited

migbro commented 1 year ago

A positive update, I was able to resolve:

DGD fusion just needs a lot of review - I think wither the format changed from v11 to v12, or the test data I was given was outdated, so now my ETL doesn't work for it.

For this:

DGD maf has NA values for some of the numeric fields, like t_ref_count, n_alt_count. It seems also that Mutation_Status must have some kind of controlled vocab, as an error is thrown for NA values. I think instead of NA, blanks are preferred.

@chinwallaa I can't tell if your comment:

just to clarify - FNL identified some artifacts w/ the DGD SNV frequencies tables during the pre-release QC, so for v12 the decision was to exclude the DGD from the SNV tables (thus no link to cbioportal in that table for this data set). The DGD data are being used for fusion and molecular subtyping.

Was to encourage dropping DGD snvs for this load...otherwise either I replace NA in those fields with blank strings, or @luederm recreates the maf and does as such

luederm commented 1 year ago

Ok, thanks! That helps. It's kinda weird that those NA values are in the openpedcan maf, but seemingly not in the individual mafs I used to load for pbta_all, which is in the bix_genomics_file."dgd_maf-genomics_file_manifest" table. Why is that @luederm ? Was it a like a pandas merge thing? Would it be possible to regen with using blank strings when a field is empty, or does OpenPedCan require these NA values?

The MAFs from the dgd_maf-genomics_file_manifest were generated from VCF files by Bo. I am not sure how the OpenPedCan was generated. The last version of the MAF I provided is here: https://github.com/d3b-center/bixu-tracker/issues/1694#issuecomment-1397204620. I can remove the NA values. Just n_ref_count and n_alt_count?

migbro commented 1 year ago

@luederm :

Mutation_Status
n_ref_count
n_alt_count
t_ref_count

Thanks!

jharenza commented 1 year ago

Actually the current maf is in s3, v12 folder - there was an issue with the maf bo and team generated which had non matching column classes and we had to remove more artifacts. Please use what's in the v12 bucket

jharenza commented 1 year ago

@luederm @migbro any further analysis on DGD won't be included in v12 - we need a process for after v12 which may be running it from scratch through D3b workflows as discussed in slack with @aadamk

luederm commented 1 year ago

Okay, so no need to remove the NAs for now?

migbro commented 1 year ago

Just took a look - it's the same maf, so please remove NA

jharenza commented 1 year ago

ok, @luederm if you can remove NA from https://d3b-openaccess-us-east-1-prd-pbta.s3.amazonaws.com/open-targets/v12/snv-dgd.maf.tsv.gz for miguel, that is the latest maf

migbro commented 1 year ago

Also, a quick question hopefully @luederm , from this merged maf, aside from the usual synonymous/intergenic, etc filtering, should I use any of the following added info to remove variants from loading on to cBio (numbers are just the instance count of each value):

AIQC

    240 Indel
    300 Indel_v2
  28932 Not available
    500 OK
      3 OK_screener
    575 OK_v2
    117 Uncertain
      2 Uncertain_feature_missing
   1553 Uncertain_v2

Variant_Tier

Pipeline_version

2951 Early 2016
   2521 SomaticPanel-v0.1.0
    686 SomaticPanel-v1.0.0
   1485 SomaticPanel-v1.0.2
    681 SomaticPanel-v3.0.2
     71 SomaticPanel-v3.1.0
   2362 SomaticPanel-v3.2.0
    208 SomaticPanel-v3.3.0
    842 SomaticPanel-v3.3.1
    115 SomaticPanel-v3.3.2
    161 SomaticPanel-v3.3.3
     44 SomaticPanel-v3.4.0
   1305 SomaticPanel-v3.4.1
    291 SomaticPanel-v3.4.2
    693 SomaticPanel-v3.4.3
    122 SomaticPanel-v3.4.4
    376 canto-1.0
   4358 canto-2.0
   2247 canto-2.1
   1735 canto-2.2
   2970 canto-2.3
   3488 canto-2.4
   2510 canto-2.5

luederm commented 1 year ago

If we want to limit artifacts, we could filter out AIQC='Not available'. This will remove a large majority of the artifacts, but will completely remove older samples and will remove variants that were filtered before AIQC (variants with high MAF and synonymous variants with no HGMD/COSMIC class).

luederm commented 1 year ago

snv-dgd.na_removed.maf.tsv.gz

migbro commented 1 year ago

If we want to limit artifacts, we could filter out AIQC='Not available'. This will remove a large majority of the artifacts, but will completely remove older samples and will remove variants that were filtered before AIQC (variants with high MAF and synonymous variants with no HGMD/COSMIC class).

@luederm hmmm, could it be something like, remove AIQC='Not available' unless pipeline is x? Would that help balance things?

migbro commented 1 year ago

snv-dgd.na_removed.maf.tsv.gz

Almost there - there is an extra index column it seems in this file @luederm

luederm commented 1 year ago

Here it is with the index removed snv-dgd.na_removed.maf.tsv.gz

@luederm hmmm, could it be something like, remove AIQC='Not available' unless pipeline is x? Would that help balance things?

We could however most the artifacts are from the older samples so if we did it this way to avoid removing older samples, most of the artifacts would still exist in the dataset.

@jharenza Do you know what happened to the tier 1 variants? The original MAF I provided included 188 tier 1 variants but this data set has none. These are the most important variants to include because they are variants with strong clinical significance.

jharenza commented 1 year ago

@jharenza Do you know what happened to the tier 1 variants? The original MAF I provided included 188 tier 1 variants but this data set has none. These are the most important variants to include because they are variants with strong clinical significance.

🤯 @zhangb1 can you clarify exactly the steps from @luederm providing a MAF and VCF, to your team annotating VCF with VEP 105, to adding the new MAF columns, to AIQC filtering, to germline filtering? Then, our team @ewafula had updated column classes since they did not match the KF-generated consensus MAF and per @kelseykeith and @luederm we removed another artifact for KMT2C. I am beginning to think we need to drop DGD SNV from this release in total. Is there a workflow somewhere that can be code reviewed?

zhangb1 commented 1 year ago

I did the filter following the step here : https://github.com/d3b-center/bixu-tracker/issues/1694

run our gencode v39 annotation using the filter

gnomad_3_1_1_AF > 0.001
GNOMAD_AF_HIGH

then

From GENCODE v39 annotated MAF, add columns:
```
Called_in_Normal?
AIQC
Variant_Tier
Pipeline_Version
```
Remove variants which are Called_in_Normal? == True and move these to a dgd-germline.maf.gz
For remaining variants, only keep columns:
```
AIQC
Variant_Tier
Pipeline_Version
```

Remove variants which fall into the following:

new_maf <- maf %>%
 filter(Classification == NA,
     AIQC %in% c("Artifact_v2", "Not_reportable," "Pseudo_region"))

I sent you both protected and public mafs.... not sure which one you were putting in the v12 bucket

luederm commented 1 year ago

Looks like some of them are getting moved to protected based on MAF but others are actually in https://d3b-openaccess-us-east-1-prd-pbta.s3.amazonaws.com/open-targets/v12/snv-dgd.maf.tsv.gz but have tier listed as NA. I think there was an issue mapping the variant tier to the maf. Not sure how it was done but it may be due to the extra letter in tier 1 classifications. All tier ones are either '1A' or '1B' so thinking it possible these were dropped if this column was read in as an int

jharenza commented 1 year ago

@zhangb1 is it possible to update the code for this to fix the tiering information? I think there was also another step @zoomzoom1011 performed where we had to alter the germline masking filter... I forget exactly. Is this in a workflow somewhere you can easily update?

zhangb1 commented 1 year ago

@jharenza I was checking my results I sent to you https://d3b.slack.com/archives/DHC31US86/p1674587441054659

in my results did including the '1A' or '1B' in the tiering, I don't know why it get missed.... did you do any filter by yourself or Eric did?

migbro commented 1 year ago

Loaded on to prod, closing!!!

d3b-center / ticket-tracker-OPC

Bix Dev PedcbioPortal Support: OpenPedCan v12 on pedcbio #509

What data file(s) does this issue pertain to?

What release are you using?