PacificBiosciences / kineticsTools

Tools for detecting DNA modifications from single molecule, real-time sequencing data
19 stars 21 forks source link

Is there a mistake in script MixtureEstimationMethods.py? #81

Open xyw1 opened 3 years ago

xyw1 commented 3 years ago

Hi authors, I was interested in how ipdSummary gives out the frac value, so I read the source code and I was confused about this function:

image

Why estimateSingleFraction is optimized to estimate mu1 ( which I guess is the mean IPD value if methylated bases)?

obanerjee commented 3 years ago

Hello, Thanks for your interest and for closely looking at ipdSummary.py. I would like to try to help answer your question.

The objective function in the optimization is mixModelFn, defined on line 81 of MixtureEstimationMethods.py:

Return value of mixture model log likelihood function

def mixModelFn(self, p, a0, a1): tmp = (1 - p) a0 + p a1 return -np.log(tmp[np.nonzero(tmp)]).sum()

return -np.ma.log( tmp ).sum()

The input arguments are: p = methylated fraction a0 = value of Gaussian pdf for un-methylated IPDs: N( mu0, sd0 ) a1 = value of Gaussian pdf for methylated IPDs: N( mu1, sd1 )

When fminbound is used to do single-variable optimization for a function like mixModelFn, the optimization is done only over the first input argument – in this case, p. The remaining arguments (a0 and a1) are passed to fminbound as fixed values.

Now, here is estimateSingleFraction, defined in line 96 of MixtureEstimationMethods.py:

The input arguments are: mu1 = mean of log IPDs for methylated case data = observed log IPDs mu0 = mean of log IPDs for un-methylated case L = number of observations in the input data optProp = if set to True, returns optimal mixing proportion.

  1. First, the observed log IPDs and the fixed values of mu1 and mu0 are used to calculate the values of the Gaussian pdf N( mu0, sd0 ) and N( mu1, sd1 )
  2. Now, single-variable optimization is done using the objective function defined in mixModelFn. The values of a0 and a1 are fixed, and the variable is the mixing proportion.

Return optimum argument (mixing proportion) of mixture model log likelihood function.

def estimateSingleFraction(self, mu1, data, mu0, L, optProp=True):

    # NOTE: ignoring the warnings here is sloppy, should be looked at later.
    with np.errstate(all="ignore"):

        a0 = self.replaceScipyNormPdf(data, mu0)
       a1 = self.replaceScipyNormPdf(data, mu1)

        # if f'(0) < 0 (equ. a1/a0 < L), then f'(1) < 0 as well and
        # solution p-hat <= 0
        if np.divide(a1, a0).sum() <= L:
            res = 0.0
        # if f'(1) > 0 (equ. a0/a1 < L), then f'(0) > 0 as well and
        # solution p-hat >= 1
        elif np.divide(a0, a1).sum() <= L:
            res = 1.0
        else:
            # unconstrained minimization of convex, single-variable function
            res = fminbound(self.mixModelFn, 0.01, 0.99,
                            args=(a0, a1), xtol=1e-02)
        if optProp:
            # return the optimal proportion
            return res
        else:
            # return the corresponding log likelihood function value
            return self.mixModelFn(res, a0, a1)

Does this help at all? Please feel free to write back if not.

Thanks again for your interest in this,

From: xuyw notifications@github.com Sent: Monday, November 30, 2020 2:49 AM To: PacificBiosciences/kineticsTools kineticsTools@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: [PacificBiosciences/kineticsTools] Is there a mistake in script MixtureEstimationMethods.py? (#81)

[EXTERNAL MESSAGE] Be mindful when clicking links or attachments

Hi authors, I was interested in how ipdSummary gives out the frac value, so I read the source code and I was confused about this function:

[Image removed by sender. image]https://user-images.githubusercontent.com/28685819/100600251-02070180-333c-11eb-81d2-43c4f378619b.png

Why estimateSingleFraction is optimized to estimate mu1 ( which I guess is the mean IPD value if methylated bases)?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/PacificBiosciences/kineticsTools/issues/81, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AANKTE7I4BD62YGDL6J6HN3SSN2DJANCNFSM4UHMSQSA.