DKMS / Hapl-o-Mat

A software for haplotype inference

Handling of differing missing loci not clear #17

Open dondiage opened 3 weeks ago

dondiage commented 3 weeks ago

Hello,

I have HLA genotype data in the form of (dummy example):

A      A      B      B      DRB1   DRB1
02:01  02:02  ?      ?      03:02  15:01
03:01  02:02  04:01  06:01  02:01  03:03
02:01  02:02  07:02  04:04  ?      ?
08:03  ?      01:01  06:01  15:01  02:02

...

As you can see, some rows have missing data at various loci, while others are complete.

From the documentation, I see that there is the option RESOLVE_MISSING_GENOTYPES and that I need to use the parameterGLS file. However, I don't really understand the required format, nor the obligatory AlleleList.txt file. Needless to say, I want to retain the rows with partially missing data, and the EM algorithm should simply assign equal weight to each possibility when resolving those missing genotypes.

Could you please guide me here?

Best, Igor

sauter commented 1 week ago

Hi Igor,

Thank you for your interest in Hapl-o-mat. Yes, I can see that this is not as convenient as one could hope. In general, Hapl-o-mat can only deal with missing loci in the GLS input format (glid + pull files). As a first step, you would need to translate your input from its current "MAC" format into this format. We are currently working on a new release of Hapl-o-mat that will include a script to do so.

To get the AlleleList.txt file, use the script "BuildAlleleList.py" in the prepareData folder. It takes the glid file of your input and creates "AlleleList.txt", which you can then move into the data folder. This allows Hapl-o-mat to use all alleles that occur in the input file as candidates for the missing data. We will also add an option that allows you to use any allele here.
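To illustrate the idea of using the alleles seen in the input as candidates for a missing locus, here is a minimal Python sketch based on the dummy table from the question. It is not the logic of BuildAlleleList.py and does not read or write the glid or AlleleList.txt formats; the helper names `observed_alleles` and `candidate_genotypes` are made up for this example.

```python
from itertools import combinations_with_replacement

# Dummy data from the question; "?" marks a missing allele call.
LOCI = ["A", "B", "DRB1"]
ROWS = [
    ["02:01", "02:02", "?", "?", "03:02", "15:01"],
    ["03:01", "02:02", "04:01", "06:01", "02:01", "03:03"],
    ["02:01", "02:02", "07:02", "04:04", "?", "?"],
    ["08:03", "?", "01:01", "06:01", "15:01", "02:02"],
]


def observed_alleles(rows, loci):
    """Collect, per locus, every allele that occurs anywhere in the input."""
    alleles = {locus: set() for locus in loci}
    for row in rows:
        for i, locus in enumerate(loci):
            # Each locus occupies two adjacent columns (one per chromosome).
            for call in row[2 * i:2 * i + 2]:
                if call != "?":
                    alleles[locus].add(call)
    return {locus: sorted(found) for locus, found in alleles.items()}


def candidate_genotypes(alleles):
    """All unordered allele pairs (homozygous pairs included) for one locus."""
    return list(combinations_with_replacement(alleles, 2))


allele_list = observed_alleles(ROWS, LOCI)
b_candidates = candidate_genotypes(allele_list["B"])
print(allele_list["B"])
print(len(b_candidates), "candidate genotypes for a missing B locus")
```

With the five distinct B alleles in the dummy table, a sample that lacks the B locus would get 15 candidate genotypes at that locus.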

Please be aware that missing loci will in many cases result in a high number of potential genotypes per sample, potentially making each of them so infrequent that the sample is removed by the "MINIMAL_FREQUENCY_GENOTYPES" parameter anyway.
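A rough back-of-the-envelope sketch of this effect in Python: with n candidate alleles at a locus there are n(n+1)/2 unordered genotypes, and if each gets equal weight (as requested in the question), the weight per genotype shrinks accordingly. The threshold value `MIN_FREQ` below is purely hypothetical, and how Hapl-o-mat applies MINIMAL_FREQUENCY_GENOTYPES internally is not modelled here.

```python
def genotypes_per_locus(n_alleles):
    """Number of unordered allele pairs, homozygous pairs included."""
    return n_alleles * (n_alleles + 1) // 2


def equal_weight(n_alleles_per_missing_locus):
    """Total candidate genotypes and the equal weight each one would receive.

    Takes one allele count per missing locus; counts multiply across loci.
    """
    total = 1
    for n_alleles in n_alleles_per_missing_locus:
        total *= genotypes_per_locus(n_alleles)
    return total, 1.0 / total


MIN_FREQ = 1e-5  # hypothetical threshold, chosen only for illustration

for n_alleles in (5, 50, 500):
    total, weight = equal_weight([n_alleles])
    verdict = "below" if weight < MIN_FREQ else "above"
    print(f"{n_alleles} alleles: {total} genotypes, "
          f"weight {weight:.2e} ({verdict} threshold)")
```

With one missing locus and 500 candidate alleles there are already 125250 candidate genotypes, so each equal weight falls below the illustrative threshold; two missing loci multiply the counts.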

Also, Hapl-o-mat cannot in its current version handle missing alleles, only missing loci. We will address this issue in the upcoming release as well.
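To see which of your rows are affected by this, a small Python check along the following lines may help. It uses the dummy table from the question, treats "?" as missing, and the function name `classify` is made up for this example; it is not part of Hapl-o-mat.

```python
# Dummy data from the question; each locus occupies two adjacent columns.
ROWS = [
    ["02:01", "02:02", "?", "?", "03:02", "15:01"],
    ["03:01", "02:02", "04:01", "06:01", "02:01", "03:03"],
    ["02:01", "02:02", "07:02", "04:04", "?", "?"],
    ["08:03", "?", "01:01", "06:01", "15:01", "02:02"],
]


def classify(row):
    """Return 'complete', 'missing_locus', or 'missing_allele' for one sample."""
    status = "complete"
    for i in range(0, len(row), 2):
        missing = [row[i] == "?", row[i + 1] == "?"]
        if all(missing):
            # Both calls absent: the whole locus is missing.
            status = "missing_locus"
        elif any(missing):
            # Only one call absent: a single missing allele.
            return "missing_allele"
    return status


statuses = [classify(row) for row in ROWS]
for number, status in enumerate(statuses, start=1):
    print(f"row {number}: {status}")
```

In the dummy table, rows 1 and 3 have a whole locus missing, row 2 is complete, and row 4 has a single missing allele at locus A, which is the case not covered by the current version.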

Best, ~J

dondiage commented 1 week ago

Hello Jürgen,

Thank you for your detailed answer! I will try to convert it to GLS then, using the mentioned "BuildAlleleList.py" script. Besides, I'll wait for the next release with curiosity!

Best, Igor