AlphaGenes / AlphaPeel

AlphaPeel: calling, phasing, and imputing genotype and sequence data in pedigrees
MIT License
2 stars 11 forks source link

Inconsitent documentation of output files #8

Closed Npaffen closed 1 year ago

Npaffen commented 1 year ago

image

As one can see in the picture the documentation is contradictory in 4.1 Phase file. Since the explanation and the example contradict each other. Could you please state whether the decscription or the example shows the correct haplotype order?

gregorgorjanc commented 1 year ago

Good point @Npaffen!

Can you have a look in the code where the phase file is written out (or before that) so we can both figure out / check and fix the docs?

Npaffen commented 1 year ago

Tbh I can try to dig into the code but I did not dive in that deeply yet. Are you sure it isn't faster if you try first and I will give it a try later when I'm back home?

Npaffen commented 1 year ago

From InputOutput.py

def writeIdnIndexedMatrix(pedigree, matrix, outputFile):
    np.set_printoptions(suppress=True)
    print("Writing to ", outputFile)
    with open(outputFile, 'w+') as f:

        for idx, ind in **pedigree.writeOrder():**
            if len(matrix.shape) == 2 :
                tmp = np.around(matrix[ind.idn, :], decimals = 4)
                f.write(' '.join(map(str, tmp)))
                # f.write('\n')
            if len(matrix.shape) == 3 :
                for i in range(matrix.shape[1]) :
                    f.write(idx + " ")
                    tmp2 = map("{:.4f}".format, matrix[ind.idn,i, :].tolist())
                    tmp3 = ' '.join(tmp2)
                    f.write(tmp3)
                    f.write('\n')
                # f.write('\n')
def writeFamIndexedMatrix(pedigree, matrix, outputFile):
    np.set_printoptions(suppress=True)
    print("Writing to ", outputFile)
    with open(outputFile, 'w+') as f:

        for fam in **pedigree.getFamilies():**
            if len(matrix.shape) == 2 :
                tmp = np.around(matrix[fam.idn, :], decimals = 4)
                f.write(' '.join(map(str, tmp)))
                # f.write('\n')
            if len(matrix.shape) == 3 :
                for i in range(matrix.shape[1]) :
                    f.write(str(fam.idn) + " ")
                    tmp2 = map("{:.4f}".format, matrix[fam.idn,i, :].tolist())
                    tmp3 = ' '.join(tmp2)
                    f.write(tmp3)
                    f.write('\n')
                # f.write('\n')

If I get this correctly this points us to the Pedigree.py file From Pedigree.py :

def writeOrder(self):
        if self.writeOrderList is None:
            inds = [ind for ind in self if (not ind.dummy) and (self.args.writekey in ind.fileIndex)]
            self.writeOrderList = sorted_nicely(inds, key = lambda ind: ind.fileIndex[self.args.writekey])

            if not self.args.onlykeyed:
                indsNoDummyNoFileIndex = [ind for ind in self if (not ind.dummy) and (not self.args.writekey in ind.fileIndex)]
                self.writeOrderList.extend(sorted_nicely(indsNoDummyNoFileIndex, key = lambda ind: ind.idx))

                dummys = [ind for ind in self if ind.dummy]
                self.writeOrderList.extend(sorted_nicely(dummys, key = lambda ind: ind.idx))

        for ind in self.writeOrderList :
            yield (ind.idx, ind)

Looking into _sortednicely. Getting closer I guess:

# Not sure of the code source: https://blog.codinghorror.com/sorting-for-humans-natural-sort-order/
# Slightly modified.
import re 
def sorted_nicely( l , key): 
    """ Sort the given iterable in the way that humans expect.""" 
    convert = lambda text: int(text) if text.isdigit() else text 
    alphanum_key = lambda k: [ convert(c) for c in re.split('([0-9]+)', str(key(k))) ] 
    return sorted(l, key = alphanum_key)

Did not really help imho. Let's try the getFamilies()

def getFamilies(self, rev = False) :
        if self.generations is None:
            self.setUpGenerations()
        if self.families is None:
            self.setupFamilies()

        gens = range(0, len(self.generations))
        if rev: gens = reversed(gens)

        for i in gens:
            for family in self.generations[i].families:
                yield family

Looks good, lets see setupFamilies() :

#This is really sloppy, but probably not important.
    def setupFamilies(self) :

        if self.generations is None:
            self.setUpGenerations()

        self.families = dict()
        for ind in self:
            if not ind.isFounder():
               parents = (ind.sire.idx, ind.dam.idx)
                if parents in self.families :
                    self.families[parents].addChild(ind)
                else:
                    self.families[parents] = Family(self.maxFam, ind.sire, ind.dam, [ind])
                    self.maxFam += 1

        for family in self.families.values():
            self.generations[family.generation].add_family(family)

Jackpot! Given that we are interested in the imputation/phasing of the offspring we shoud expect first the sire (father of an animal) and then the dam (mother of an animal). But tbh the comment before the function worries me a little bit :D

Feel free to comment on my logic. This order makes also sense w.r.t. the pedigree input since here the father is also mentioned first. So natural iteration procedure should start with the fathers data.

gregorgorjanc commented 1 year ago

@Npaffen this looks about right and matches all other inputs and outputs! I have fixed the typo hence closing this issue. Let us know if you spot any other inconsistencies! Thanks!