getzlab / deTiN

DeTiN is designed to measure tumor-in-normal contamination and improve somatic variant detection sensitivity when using a contaminated matched control.
BSD 3-Clause "New" or "Revised" License

ExAc pickle file for hg38 #14

Closed erleholgersen closed 6 years ago

erleholgersen commented 6 years ago

Hello,

I've managed to generate all the input files for my samples, and the final thing I'm missing to run deTiN is an ExAC pickle file for hg38. I assume the version you have provided in the example_data folder is for hg19. What's the easiest way to generate such a file for hg38?

I assume I can write a Python script to iterate over the lines in the ExAC VCF, populate a dictionary, and save it as a pickle. I'm just wondering whether there are any easier ways or pre-written scripts that do this exact thing (I'm not very familiar with tools for processing VCFs in Python, sorry!).

Thanks for any help!
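
A minimal sketch of the approach described above (iterate over the VCF, populate a dictionary, save it as a pickle), for readers who want to see its shape. The function names `open_vcf`, `vcf_to_site_dict` and `save_pickle`, and the dictionary layout, are assumptions made for illustration; they are not part of deTiN, and the layout deTiN actually expects may differ.

```python
import gzip
import pickle


def open_vcf(path):
    """Open a plain or gzip-compressed VCF in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def vcf_to_site_dict(path):
    """Map (contig, position) to a dict holding ref, alt and the raw AF string.

    The layout is hypothetical and chosen only for illustration.
    """
    sites = {}
    with open_vcf(path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and header lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 8:
                continue  # not a complete VCF record
            contig, pos, _id, ref, alt = fields[:5]
            # INFO is a ';'-separated list; keep only key=value entries.
            info = dict(
                item.split("=", 1) for item in fields[7].split(";") if "=" in item
            )
            sites[(contig, int(pos))] = {"ref": ref, "alt": alt, "AF": info.get("AF")}
    return sites


def save_pickle(obj, out_path):
    """Serialize obj to out_path with pickle."""
    with open(out_path, "wb") as out:
        pickle.dump(obj, out)
```

Keeping the whole dictionary in memory is simple but can be heavy for a full population VCF, so filtering the VCF down to common sites first keeps the pickle small.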

amarotaylor commented 6 years ago

Hey Erle, I included this function in the deTiN utilities package. I just noticed this instruction is in the wrong section of the wiki! I'll move it, but I have quoted it here:

Call the deTiN function deTiN_utilities.build_exac_pickle(vcf) with your vcf file.

MagdalenaZZ commented 6 years ago

Which VCF file is "your vcf file"? Perhaps it is my somatic variant calls, annotated with ExAC "AF=" in the INFO field?

amarotaylor commented 6 years ago

Hi Magdalena,

Sorry for the lack of clarity. The VCF should contain germline SNPs, such as the VCF generated by ExAC, not somatic calls.

MagdalenaZZ commented 6 years ago

Okay, let me try again. The VCF I should run the function on is an ExAC file, such as ExAC.r0.3.1.sites.vep.vcf.gz (from ftp://ftp.broadinstitute.org/pub/ExAC_release/current). That will give me estimates of how common each variant is in the general population, so I can later identify "germline" variants which are unusual in the general population, i.e. potential TiN variants. Is that correct?

MariusGheorghe commented 4 years ago

Hi, are there any answers here? Your wiki entry for the ExAC file is a single line, which is not explicit enough for anyone. Can you please update the wiki and elaborate on the sources for ExAC and the required input/output? The input for deTiN already seems overly demanding and highly specific with respect to variant callers. At this point I am not sure it is worth the effort. Thanks

amarotaylor commented 4 years ago

Hey Marius,

You can find VCFs with high-frequency germline events here: https://gnomad.broadinstitute.org/downloads

The ExAC file is just a VCF with high-frequency germline events, used to filter out variants. For more on VCFs you can read the documentation. The VCF is most useful when TiN levels are high (>20%) and germline and somatic events are more difficult to distinguish based on DNA read counts.

The input of that function is a single file (the VCF) and the output is a pickle which contains a list of sites to filter out. The numerous inputs to deTiN are required to build an accurate model, and unfortunately the variant callers in the field are numerous, with no standard format for their outputs/inputs. I worked with what is commonly used by Gaddy's lab.

As a side bar - being frustrated in GitHub issue threads is not productive and is against the community guidelines.

MariusGheorghe commented 4 years ago

Hi Amaro,

Thanks for your reply and the links. I'm afraid it is still unclear whether I should use the file that @MagdalenaZZ pointed to, or whether I can use that ExAC file directly as input to deTiN. Can you please clarify? Which VCF file should be passed to deTiN_utilities.build_exac_pickle: the one at ftp://ftp.broadinstitute.org/pub/ExAC_release/current or one from the link you provided? Or can I directly use a file from the link you provided instead of the ExAC file? Thank you in advance for your answers.

As a reply to your side bar: frustration can be avoided if proper documentation of the tools is provided. I think that is in line with the community guidelines.

Marius

amarotaylor commented 4 years ago

Hey Marius

Sure. What genome build are you using? Is it HG38? I can't clarify without knowing the details of your set up.

MariusGheorghe commented 4 years ago

Hi Amaro,

Yes. It is hg38.

Here is the list of files I have prepared so far for the input, so please let me know if that would work:

Thank you for your help.

Marius

amarotaylor commented 4 years ago

Hey Marius,

--mutation_data_path: from Strelka (runStats.tsv or runStats.xml? I assume the .tsv file)

Yup, use the TSV file (though I'm not familiar with Strelka outputs, so I don't know what runStats.tsv is). I'm not sure what headers Strelka outputs, but they should be easy enough to match up. The required column names are listed here.

cn_data_path

The input should be a seg file. I haven't kept up with GATK ACNV since I'm no longer at the Broad; they used to output a file with .acs.seg as a suffix. I'm not 100% sure which of those files would be the right one. I think they still generate a file similar to this; maybe this post would be helpful?

tumor_het_data and normal_het_data

I'm not familiar with using VarScan2 as a germline caller, but I would convert these VCFs to TSVs with the following headers: CONTIG, POS, REF_COUNT and ALT_COUNT. There are no specific tags we're looking for there.
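
A rough sketch of that VCF-to-TSV conversion, for illustration only. It assumes the read counts live in the per-sample FORMAT fields, either as separate `RD` and `AD` values (an assumed VarScan2-style layout) or as a single `AD` value of the form `ref,alt` (an assumed GATK-style layout); check the header of your own VCF before relying on either. The function names `allele_counts` and `vcf_to_het_tsv` are hypothetical and not part of deTiN.

```python
import csv


def allele_counts(format_keys, sample_values):
    """Return (ref_count, alt_count) for one sample column, or None if unavailable."""
    sample = dict(zip(format_keys.split(":"), sample_values.split(":")))
    try:
        if "RD" in sample:
            # Assumed layout: RD = reference depth, AD = alternate depth.
            return int(sample["RD"]), int(sample["AD"])
        # Assumed layout: AD = "ref,alt" (extra alleles, if any, are ignored).
        ref, alt = sample["AD"].split(",")[:2]
        return int(ref), int(alt)
    except (KeyError, ValueError):
        return None


def vcf_to_het_tsv(vcf_lines, out_handle, sample_index=0):
    """Write CONTIG, POS, REF_COUNT, ALT_COUNT rows for one sample of a VCF."""
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["CONTIG", "POS", "REF_COUNT", "ALT_COUNT"])
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10 + sample_index:
            continue  # no FORMAT/sample columns for the requested sample
        counts = allele_counts(fields[8], fields[9 + sample_index])
        if counts is None:
            continue  # counts missing or not parseable for this record
        writer.writerow([fields[0], fields[1], counts[0], counts[1]])
```

`vcf_lines` can be any iterable of text lines, such as an open file handle, and `out_handle` any writable text handle. Records without usable counts are skipped silently, which is convenient for a sketch but worth logging in real use.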

--exac_data_path: missing

This file is not strictly required. If you want to generate your own I would use this VCF and filter for variants with allele fraction > 1%. VCFtools will allow you to do this easily. Once you have done that run the function as described in the wiki.
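
For readers who prefer to stay in Python rather than use VCFtools, the allele-fraction filter can be sketched as below. This is a swapped-in alternative, not the VCFtools route recommended above. It assumes the allele frequency is stored in the INFO column under the key `AF` (comma-separated for multi-allelic sites) and keeps a record if any listed allele exceeds the threshold. The function names `max_allele_frequency` and `filter_common_sites` are hypothetical.

```python
def max_allele_frequency(info_field):
    """Return the largest AF value in a VCF INFO string, or None if absent or unparseable."""
    for item in info_field.split(";"):
        if item.startswith("AF="):
            values = []
            for raw in item[3:].split(","):
                try:
                    values.append(float(raw))
                except ValueError:
                    continue  # e.g. "." for a missing value
            return max(values) if values else None
    return None


def filter_common_sites(vcf_lines, min_af=0.01):
    """Yield header lines plus records whose maximum AF is greater than min_af."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # keep the header so the output is still a VCF
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        af = max_allele_frequency(fields[7])
        if af is not None and af > min_af:
            yield line
```

Because `filter_common_sites` is a generator, it can stream a large VCF line by line: pass it an open file handle and write each yielded line to the output file.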

MariusGheorghe commented 4 years ago

Hi Amaro,

Thank you for all the details.

I will have a look at that post regarding the CNV file. I think the information you provided should be enough to give it another try. So then the --exac_data_path is not a mandatory argument. OK, good to know.

Thank you once again. Maybe this would be helpful for others too if present in the README or Wiki page.

Marius

amarotaylor commented 4 years ago

Hey Marius,

Good point. I will add that to the wiki page regarding the ExAC file and some of the additional details from our discussion.

Thanks Amaro