Adding TopMed supporting data

ACEnglish commented 1 year ago

Hello,

Thank you for building this tool. It is very useful and I expect it will be widely adopted as more people learn about it.

My team is putting together a publication for TopMed about an SV callset over 138,134 samples and their 376,478 variants. The preprint can be found here. We're currently working through the first round of reviews on the manuscript and the question was asked about how users could leverage these results. Our idea was to use SVAFotate for analysis of TopMed and then to add its variants and allele frequencies to SVAFotate's supporting data.

This pull request makes two changes:

Editing setup.py to allow the package to be installable via python3 -m pip install .
Adding TopMed variants to the supporting data.

We couldn't provide population/sex specific information in the supporting data as permissions to phenotype information across the cohorts which comprise this sample set is heterogeneous.

Hopefully, I was able to incorporate the new calls correctly. I have run this updated supporting data file on another cohort's ~17k samples' SVs successfully, but I don't believe that's qualifies as proper functional testing and you may want to test further.

Also, we can imagine this change could drive-up users which might increase the number of issues. Therefore, I'll be monitoring the tickets periodically for ways I can continue to help. If you ever encounter something that you think needs my input, do not hesitate to ask. The last thing I want is to add any undue burden to you. I am more than happy to take responsibility for helping with the continued maintenance of SVAFotate.

Please let me know if you have any questions or anything else you need from me.

Have a great day, ~/Adam English

fakedrtom commented 1 year ago

Adam, this is terrific! I had seen the TOPMED preprint and planned to add these AFs to the SVAFotate when the data became available. I appreciate that you like SVAFotate and thought it would be a good way to help others use some of the TOPMED output. And thanks for taking this initiative! Give me a little bit to review this, but I can't imagine there will be any problems in incorporating your request. Thanks!

fakedrtom commented 1 year ago

Hi Adam, I've been reviewing this and I think this will work. I have used the new BED file and it appears to work as expected. I count 376,478 total TopMed variants (241,631 DELs, 95,311 DUPs, and 39,536 INVs). There are a couple of things that I noticed that might make the TopMed entries more consistent with the other SVs from other datasets. Primarily, it looks like there is a lot of "missing" data for TopMed when we get into the populations and sex differences. This is fine, but it might be best to convert these to 'NA' or 'none' to avoid confusion resulting from annotations indicating a value of 0. For example, currently if someone were to annotate with the option to include AFR populations, a lot of annotations would suggest there are 0 hom_refs, 0 hets, 0 hom_alts, etc. which isn't exactly true. Admittedly, this is probably pretty minor as the main AF is what people will likely be most interested in. I also noticed that in some cases the PopMax_AF exceeds all individual population AFs. Take a look at DEL_1:821037-821907. I see the following population specific AFs listed: 0.023956, 0.0190883, 0.00264151, 0.0416374, and 0.0442616. However, the PopMax_AF is listed as 0.0655488 which is greater than all the individual population AFs. Perhaps this isn't an error and it is using additional samples from other populations that are not listed? Not sure how common this is, but saw it and wondered if there might be an explanation for cases like this. Lastly, there is a final field called InPop that tracks how many populations report the SV. This appears to be 0 for all TopMed SVs and could be corrected. This is very minor and I am not sure even I make use of this field, but it would probably be a pretty easy fix too (either to the count or to 'NA'). I think I am happy to merge the pull and then make these changes myself, but thought I would check with you first in case you wanted to do this considering it is your data?

Thanks again for taking this initiative! I am really excited to add TopMed into SVAFotate!

ACEnglish commented 1 year ago

I asked around and apparently I was wrong about population/sex summary AFs. Give me a few days to calculate these fields, update my vcf->svafotate conversion script, and then redo the conversion. After I push my changes I'll update you via this pull request.

ACEnglish commented 1 year ago

Hello,

I've updated the supporting file. The main change is that it is just TOPMed variants. I will leave it to you on deciding if/how to consolidate with the other sources. You'll also notice some variants have InPop/PopMax of 0. This is because we didn't include summary statistics for variants which only occur in populations with fewer than 100 samples. In hindsight I probably should have consolidated them into the 'OTH', but I chose OTH as simply a substitution to the middle eastern ancestry. For documentation, here's a summary of steps used to create the data.

Added group AF annotations using truvari anno grpaf
Converted VCFs to pandas DataFrames with truvari vcf2df
Used custom script below to fill in missing columns and reformat to SVAFotate's structure

Have a great day, ~/Adam English

import glob
import joblib
import pandas as pd

#SVAFotate header
header = "CHROM START END SVLEN SVTYPE SOURCE SV_ID AF HomRef Het HomAlt Male_AF Male_HomRef Male_Het Male_HomAlt Male_HemiAlt Male_HemiAF Female_AF Female_HomRef Female_Het Female_HomAlt AFR_AF AFR_HomRef AFR_Het AFR_HomAlt AFR_Male_AF AFR_Male_HomRef AFR_Male_Het AFR_Male_HomAlt AFR_Male_HemiAlt AFR_Male_HemiAF AFR_Female_AF AFR_Female_HomRef AFR_Female_Het AFR_Female_HomAlt AMR_AF AMR_HomRef AMR_Het AMR_HomAlt AMR_Male_AF AMR_Male_HomRef AMR_Male_Het AMR_Male_HomAlt AMR_Male_HemiAlt AMR_Male_HemiAF AMR_Female_AF AMR_Female_HomRef AMR_Female_Het AMR_Female_HomAlt EAS_AF EAS_HomRef EAS_Het EAS_HomAlt EAS_Male_AF EAS_Male_HomRef EAS_Male_Het EAS_Male_HomAlt EAS_Male_HemiAlt EAS_Male_HemiAF EAS_Female_AF EAS_Female_HomRef EAS_Female_Het EAS_Female_HomAlt EUR_AF EUR_HomRef EUR_Het EUR_HomAlt EUR_Male_AF EUR_Male_HomRef EUR_Male_Het EUR_Male_HomAlt EUR_Male_HemiAlt EUR_Male_HemiAF EUR_Female_AF EUR_Female_HomRef EUR_Female_Het EUR_Female_HomAlt OTH_AF OTH_HomRefOTH_Het OTH_HomAlt OTH_Male_AF OTH_Male_HomRef OTH_Male_Het OTH_Male_HomAlt OTH_Male_HemiAlt OTH_Male_HemiAF OTH_Female_AF OTH_Female_HomRef OTH_Female_Het OTH_Female_HomAlt SAS_AF SAS_HomRef SAS_Het SAS_HomAlt SAS_Male_AF SAS_Male_HomRef SAS_Male_Het SAS_Male_HomAlt SAS_Male_HemiAlt SAS_Male_HemiAF SAS_Female_AF SAS_Female_HomRef SAS_Female_Het SAS_Female_HomAlt PopMax_AF InPop"
header = header.split(' ')

#VCF INFO to SVAFotate column translation
col_map = {}
with open("column_map.txt", 'r') as fh:
    for line in fh:
        d = line.strip().split('\t')
        col_map[d[0]] = d[1]

#Load dataframes, dedup/rename columns
files = glob.glob("/Users/english/Downloads/dframes/*.jl")
data = (pd.concat([joblib.load(_) for _ in files])
            .drop(columns=["END", "SVTYPE", "SVLEN"])
            .rename(columns=col_map))

#PASS only
data['filter'] = data['filter'].apply(lambda x: x[0])
data = data[data['filter'] == "PASS"].copy()

#Fix chromosome
data['CHROM'] = data["CHROM"].str.replace("chr", "")

#Pull out counts (except X which has nan)
data["HomRef"] = data["GTCNT"].apply(lambda x: x[1] if not isinstance(x, float) else x)
data["Het"] = data["GTCNT"].apply(lambda x: x[2] if not isinstance(x, float) else x)
data["HomAlt"] = data["GTCNT"].apply(lambda x: x[3] if not isinstance(x, float) else x)

#Fix tuple types
for i in [_ for _ in data.columns if "AC" in _] + ["AF"]:
     data[i] = data[i].apply(lambda x: x[0] if isinstance(x, tuple) else x)

#Calculate new columns
data["SOURCE"] = "TOPMed"
data["PopMax_AF"] = data[["AFR_AF", "AMR_AF", "EAS_AF", "EUR_AF", "OTH_AF", "SAS_AF"]].max(axis=1)
data["InPop"] = (data[["AFR_AF", "AMR_AF", "EAS_AF", "EUR_AF", "OTH_AF", "SAS_AF"]] != 0).sum(axis=1)

#Hemi only applies to some of chrX and is just AC_Male/AF_Male
missing_cols = ['Male_HemiAlt', 'AFR_Male_HemiAlt', 'AMR_Male_HemiAlt', 'EAS_Male_HemiAlt', 'EUR_Male_HemiAlt', 
                'OTH_Male_HemiAlt', 'SAS_Male_HemiAlt', 'Male_HemiAF', 'AFR_Male_HemiAF', 'AMR_Male_HemiAF', 
                'EAS_Male_HemiAF', 'EUR_Male_HomRef', 'EUR_Male_HemiAF', 'OTH_Male_HemiAF', 'SAS_Male_HemiAF']
data[missing_cols] = 0
alter_subset = (data["CHROM"] == "X") & (data["N_HEMI_Male"] != 0)
for s_key, d_key in [("", ""), ("AFR_", "_AFR"), ("EAS_", "_EAS"), ("EUR_", "_EUR"), ("OTH_", "_MES"), ("SAS_", "_SAS")]:
    data.loc[alter_subset, s_key + "Male_HemiAlt"] = data[alter_subset]["AC_Male" + d_key]
    data.loc[alter_subset, s_key + "Male_HemiAF"] = data[alter_subset][s_key + "Male_AF"]

#Have to fill in chrX's GT counts
for col in ["HomRef", "Het", "HomAlt"]:
    sub = data[col].isna()
    data.loc[sub, col] = data.loc[sub, "Male_" + col] + data.loc[sub, "Female_" + col]
    # Counts
    data[col] = data[col].astype(int)

#Save
(data[header].sort_values(["CHROM", "START", "END"])
     .rename(columns={"CHROM":"#CHROM"})
     .to_csv("SVAFotate_core_TopMed_SV_popAFs.GRCh38.bed", sep='\t', index=False))

fakedrtom commented 1 year ago

This is great! I have a number of things coming up this week so it may take me a little while to get to this, but I am exited to add this to SVAFotate! Thanks!

fakedrtom commented 1 year ago

Adam, sorry for my delay. Wanted to let you know that I have merged this and am combining the TOPMed SVs with the rest of the core GRCh38 file. I noticed that there are a handful of TOPMed SVs called on the X chromosome that appear to be duplicates of one another so I merged those. This resulted in 376,300 total SVs instead of 376,479. If I missed something and some of those SVs were intentionally duplicated for some reason, please let me know. I plan to push this combined "core" file momentarily and it will replace the existing SVAFotate_core_TopMed_SV_popAFs.GRCh38.bed.gz file with the new, combined one. I'll then update the README's accordingly. Thanks again!

fakedrtom commented 1 year ago

Actually, the combined file looks to be a bit too large, so I may just upload the TOPMed calls themselves and encourage interested people to combine them into their pipelines as they see fit.

fakedrtom commented 1 year ago

Hey @ACEnglish, I think there is a slight error in the file you put together. In your header there is OTH_HomRefOTH_Het instead of those two being tab delimited. I think this is resulting in missing column throughout the file:

zcat SVAFotate_core_TopMed_SV_popAFs.GRCh38.bed.gz | cut -f 78-80 | head
OTH_AF  OTH_HomRefOTH_Het   OTH_HomAlt
0.0 342 0
0.0 340 0
0.0 287 0
0.0 339 0
0.0 321 0
0.0 332 0
0.0 288 0
0.0 311 0
0.0655488   285 0

I am not sure if that second column is the HomRef or the Het or maybe the two have been joined? I suspect it is the Het count and the HomRef is missing. I think if you correct the OTH_HomRefOTH_Het in the first header definition at the start of your code, the problem should be resolved. Could you please let me know?

fakedrtom / SVAFotate

Adding TopMed supporting data #10