IgDAWG / BIGDAWG


Phasing with Bigdawg? #2

Open rwanwork opened 2 years ago

rwanwork commented 2 years ago

Hi,

This is just a question and not an "issue"... I was wondering if Bigdawg is able to do phasing for subsequent imputation using other programs.

I noticed in the 2016 Human Immunology paper that Bigdawg accepts "unphased alleles for each locus". So, I was wondering if phasing is one of its functions. I looked through the manuscript as well as the PDF manual for the R software and could not find anything. So, I guess not but I just want to make sure.

Thank you!

IgDAWG commented 2 years ago

Hi @rwanwork

Yes. BIGDAWG will generate phased data (i.e., haplotypes) when Run.Tests="H". If Loci.Set is not specified, phasing will be attempted for all loci in the unphased data set. In that case it is important to set the Missing parameter, as large amounts of missing data will result in long run times.

BIGDAWG will report two phased haplotypes for each subject, assigned via posterior probabilities, in the haplotype_HapsbySubject.txt file, for each locus set.

-- Steve Mack

rwanwork commented 2 years ago

Hi Steve,

Thanks a lot for your prompt reply! Now I see where I was wrong -- I was busy looking for the word "phased" in the documentation. Thank you for the clarification!

I tried the sample data and command from the documentation:

BIGDAWG(Data="HLA_data", Run.Tests="H", Missing=0, Loci.Set=list(c("DRB1","DQB1")))

Looking at both the input and the generated output in haplotype_HapsbySubject.txt, I get the feeling that BIGDAWG is meant to phase (generate haplotypes) after HLA alleles have been imputed using another program. That is, it takes HLA types at each locus as input and generates haplotypes for me.

If I were to phase at the SNP-level prior to imputing the HLA region, and then generate haplotypes using BIGDAWG, would that be a bad idea? This feels like phasing twice (i.e., before and after imputation). Would you consider this overkill or perfectly fine and even recommended? (Or would it be better to skip the first phasing step and just use BIGDAWG after?)

Thank you for your advice and sorry for such basic questions. I've still got a lot to learn...

Ray

IgDAWG commented 2 years ago

Hi Ray,

So for background, we initially developed BIGDAWG for analyzing HLA genotype data generated using molecular genotyping methods. The haplotype estimation that BIGDAWG does is a standard EM approach for identifying multi-locus haplotypes. You can give BIGDAWG SNP data, and it will identify SNP haplotypes, but it will not impute HLA allele names from SNP haplotypes. You can use BIGDAWG on mixed HLA and SNP genotypes to e.g. analyze larger sections of the HLA region.
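To make the "standard EM approach" concrete, here is a minimal Python sketch of EM haplotype frequency estimation for two loci. It is an illustration only and not BIGDAWG's R implementation; the function names (phasings, em_haplotypes, best_phasing) and the toy DRB1/DQB1 genotypes are made up for this example, and missing data and more than two loci are not handled.

```python
from collections import defaultdict

def phasings(geno):
    """Return the distinct haplotype pairs compatible with one unphased
    two-locus genotype, given as ((a1, a2), (b1, b2))."""
    (a1, a2), (b1, b2) = geno
    cis = tuple(sorted([(a1, b1), (a2, b2)]))
    trans = tuple(sorted([(a1, b2), (a2, b1)]))
    # The two phasings coincide unless the subject is heterozygous at both loci.
    return [cis] if cis == trans else [cis, trans]

def em_haplotypes(genotypes, n_iter=50):
    """Estimate haplotype frequencies from unphased genotypes by EM."""
    haps = sorted({h for g in genotypes for pair in phasings(g) for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform starting point
    for _ in range(n_iter):
        counts = defaultdict(float)
        for g in genotypes:
            options = phasings(g)
            # E-step: weight each phasing by the product of its haplotype frequencies.
            weights = [freq[h1] * freq[h2] for h1, h2 in options]
            total = sum(weights)
            if total == 0:  # guard against numerical underflow
                weights = [1.0] * len(options)
                total = float(len(options))
            for (h1, h2), w in zip(options, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts divided by the number of chromosomes.
        freq = {h: counts[h] / (2 * len(genotypes)) for h in haps}
    return freq

def best_phasing(geno, freq):
    """Pick the phasing with the highest probability under the estimated frequencies."""
    return max(phasings(geno), key=lambda p: freq[p[0]] * freq[p[1]])

# Toy data (invented): three homozygotes of each type plus one double heterozygote.
toy_genotypes = (
    [(("15:01", "15:01"), ("06:02", "06:02"))] * 3
    + [(("03:01", "03:01"), ("02:01", "02:01"))] * 3
    + [(("15:01", "03:01"), ("06:02", "02:01"))]
)
freqs = em_haplotypes(toy_genotypes)
resolved = best_phasing(toy_genotypes[-1], freqs)
```

In this toy set the only ambiguous subject is the double heterozygote, and the homozygous subjects pull the estimate toward the phasing that reuses their haplotypes. This is the sense in which the haplotypes are assigned by probability rather than observed directly.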

So, given that, I would say that it is fine for you to impute your HLA alleles from HLA-region SNPs (using something like HIBAG), and then use BIGDAWG to estimate multi-locus HLA haplotypes.

The caveat with HLA imputation is that molecular genotyping is generally more accurate than imputation (albeit more expensive than SNP genotyping), especially for less common alleles and in less well-studied populations. For example, it is difficult to impute 3- or 4-field HLA allele names with high accuracy.

The caveat with EM haplotype estimation is that EM haplotypes with low counts (generally 3 or fewer) are generally unreliable: if you run the estimation multiple times, you will find that many of those rare haplotypes are not repeated. So you want to avoid interpretations based on rare haplotypes.
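One simple way to act on this is to tally the haplotypes assigned to subjects and set aside those with low counts before interpreting anything. The Python sketch below assumes the per-subject assignments have already been read into a list of haplotype pairs; the function name flag_rare_haplotypes, the "~"-joined haplotype labels, and the toy data are hypothetical, and no particular layout of BIGDAWG's output file is assumed.

```python
from collections import Counter

def flag_rare_haplotypes(subject_haplotypes, max_count=3):
    """Split haplotypes into (common, rare) by how often they were assigned.

    subject_haplotypes: list of (hap1, hap2) pairs, one pair per subject.
    max_count: haplotypes seen this many times or fewer are treated as rare.
    """
    counts = Counter(h for pair in subject_haplotypes for h in pair)
    common = {h: n for h, n in counts.items() if n > max_count}
    rare = {h: n for h, n in counts.items() if n <= max_count}
    return common, rare

# Toy assignments (invented) for four subjects.
assigned = [
    ("15:01~06:02", "15:01~06:02"),
    ("15:01~06:02", "03:01~02:01"),
    ("15:01~06:02", "15:01~06:02"),
    ("15:01~06:02", "04:01~03:02"),
]
common, rare = flag_rare_haplotypes(assigned)
```

The default threshold of 3 follows the rule of thumb above; it is a judgment call and may need adjusting for the sample size.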

So given all of that, I would recommend that you evaluate multi-locus HLA haplotypes generated by BIGDAWG for SNP-imputed HLA alleles against published multi-locus HLA haplotypes derived via molecular methods for that population (or a related one). This isn't a recommendation specific to BIGDAWG; I would recommend this for haplotypes generated from SNP-imputed HLA alleles by any means. The tool I would recommend for that evaluation is HLAHapV, which is fairly comprehensive in flagging unusual haplotypes. For full disclosure, I played a small role in developing HLAHapV.

-- Steve

rwanwork commented 2 years ago

Hi Steve,

Thank you for your detailed reply! Indeed, HIBAG was what I had in mind since it takes unphased data as input. After seeing the output, I was wondering what I should do next (or if I should have done something before). And this led me to Bigdawg. But I wasn't sure how the two fit together.

Your explanation was very helpful!

Thanks a lot for your time and your suggestion about HLAHapV. I will take a look at that software as well!

Things are much clearer now and I don't have further questions. Thank you!

Ray