GAMES-UChile / Gaussian-Process-Deconvolution


Code is not maintainable, not scalable, slow, and not pythonic #1

Open MothNik opened 6 months ago

MothNik commented 6 months ago

Hello,

I'm a co-developer of the package chemotools and for this I wanted to implement a deconvolution of spectra blurred by a lineshape function. I stumbled over your publication Gaussian Process Deconvolution. Since I'm not a native mathematician, I wanted to refer to the code provided for the publication to gain some further understanding. When looking at the code, I've found some possible improvements that could be made:

  1. type hinting: Python allows for type hints that indicate to the user which types a function/method expects. While these hints are not enforced (unlike in typed languages such as C, Fortran, or Rust), they make it much easier to work with the code.
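
    For illustration, the signature of a kernel function such as Spec_Mix (see the next point) could be annotated like this - the hints below are my guess from how the function is called, not taken from the repository:

    import numpy as np

    def Spec_Mix(x: np.ndarray, y: np.ndarray, gamma: float, theta: float, sigma: float) -> np.ndarray:
        ...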

  2. comments and docstrings: the code has almost no comments on why something is done, e.g.,

    Gram = Spec_Mix(self.times_y,self.times_y,self.gamma,self.theta,self.sigma) + 1e-8*np.eye(self.Ny)

    where a comment explaining that the 1e-8*np.eye(self.Ny) term improves the numerical stability of the subsequent solve would be really helpful. Docstrings for classes, functions, and methods serve a similar purpose.
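
    For instance, the line above could carry such a comment (same code as before, only annotated):

    # the 1e-8 "jitter" added to the main diagonal improves the conditioning of the
    # Gram matrix so that the subsequent linear solve stays numerically stable
    Gram = Spec_Mix(self.times_y, self.times_y, self.gamma, self.theta, self.sigma) + 1e-8*np.eye(self.Ny)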

  3. adhering to Python conventions and using descriptive names: some functions are written with capital letters, e.g., Spec_Mix, SE, which is not consistent with the PEP 8 convention of lowercase snake_case for function names; class names, in turn, should be written in CapWords. On top of that, the naming is often not very descriptive. What does SE do? It computes the squared exponential kernel, so why not name it something like squared_exponential?
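
    As a sketch of what I mean - the body below is only a generic SE kernel written by me for illustration, not necessarily the exact implementation in this repository:

    import numpy as np

    def squared_exponential(x: np.ndarray, y: np.ndarray, gamma: float, sigma: float) -> np.ndarray:
        """Squared exponential (SE) kernel evaluated between all pairs of 1D inputs x and y."""
        return sigma**2 * np.exp(-gamma * (x[:, np.newaxis] - y[np.newaxis, :]) ** 2)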

  4. removing dead code: a lot of code is commented out and not used anymore.

  5. being explicit: the line

    Gram = self.kernel_sum(self.times_y, self.times_y, params[:-1], weights, self.weight_step)

    could also be written as

    gram = self.kernel_sum(
        x=self.times_y,
        y=self.times_y,
        params=params[:-1],
        weights=weights,
        weight_step=self.weight_step,
    )

    Here, the only - but very helpful - difference is that the reader no longer needs to know the implementation of self.kernel_sum, and in particular the order of its input arguments, by heart. Everything is explicit, and because the values are assigned to keywords, the argument order cannot be confused, which could otherwise lead to wrong output that slips through unnoticed if the wrong order does not cause a crash.

  6. formatting and linting: tools like black and ruff help keep the code base clean and also take care of some of the previous points. Setting them up takes almost no effort.
    Sticking to points 1 - 6 would be the first step towards a state that allows for actually improving the functionality of the code:

  7. many critical parts are painfully slow, even though no complex computations are involved: the reason is that they waste a significant amount of memory and runtime, e.g.,

    (sign, logdet) = np.linalg.slogdet(K)
    return 0.5*( Y.T@np.linalg.solve(K,Y) + logdet + self.Ny*np.log(2*np.pi))

    where a from-scratch solve with K follows the computation of its log-determinant, so K is effectively factorised twice. Both steps could be combined by computing a Cholesky factorization or a pivoted LU decomposition of K once and deriving the log-determinant and the solve from that single factorisation in virtually one go.
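
    A minimal sketch of this idea, assuming K is symmetric positive definite once the jitter is added and using SciPy's Cholesky helpers:

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def neg_log_likelihood_via_cholesky(K: np.ndarray, Y: np.ndarray) -> float:
        """Gaussian negative log-likelihood that factorises K only once."""
        cho, lower = cho_factor(K, lower=True)       # single Cholesky factorisation of K
        logdet = 2.0 * np.sum(np.log(np.diag(cho)))  # log|K| from the factor's diagonal
        alpha = cho_solve((cho, lower), Y)           # solves K @ alpha = Y without forming K^-1
        return 0.5 * (Y.T @ alpha + logdet + len(Y) * np.log(2.0 * np.pi))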

    Another example is

    dlogp_dsigma = sigma * 0.5*np.trace(H@dKdsigma)

    A matrix product never needs to be fully formed if only the trace, and thus the main diagonal, has to be computed. Selectively evaluating the matrix product reduces the code to

    dlogp_dsigma = sigma * 0.5*(H * dKdsigma.transpose()).sum()

    which only uses a tiny (almost vanishing) amount of compute compared to the full product.

    Another line offering room for improvement is:

    Gram = Spec_Mix(self.times_y,self.times_y,self.gamma,self.theta,self.sigma) + 1e-8*np.eye(self.Ny)

    where the main diagonal is updated by adding a matrix that consists mostly of zeros. For Ny=100, only 1% of the added elements are non-zero, and this drops to 0.1% already for Ny=1_000. On top of that, the overhead of initialising a dense diagonal matrix - whose non-zero elements are by nature scattered widely across memory - is significant.
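
    A sketch of an in-place alternative (keeping the original Spec_Mix call):

    Gram = Spec_Mix(self.times_y, self.times_y, self.gamma, self.theta, self.sigma)
    # add the jitter directly to the Ny diagonal entries instead of building a dense Ny x Ny identity matrix
    Gram[np.diag_indices_from(Gram)] += 1e-8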

    One final example is:

    def outersum(a,b):
         return np.outer(a,np.ones_like(b))+np.outer(np.ones_like(a),b)

    for which NumPy already offers a powerful alternative. The code above multiplies each vector with a vector of ones just to reshape it. This is highly inefficient, because the outer sum can also be achieved with NumPy's broadcasting:

    def outer_sum(a, b):
        return a[:, np.newaxis] + b[np.newaxis, :]

    Implementing custom logic for problems that so many others must have solved already is both inefficient and error-prone.

  8. some parts are inconsistent: some methods are meant to behave the same, yet differ in subtle ways, e.g.,

    def neg_log_likelihood(self):        
        Y = self.y
        Gram = Spec_Mix(self.times_y,self.times_y,self.gamma,self.theta,self.sigma) + 1e-8*np.eye(self.Ny)
        K = Gram + self.sigma_n**2*np.eye(self.Ny)
        (sign, logdet) = np.linalg.slogdet(K)
        return 0.5*( Y.T@np.linalg.solve(K,Y) + logdet + self.Ny*np.log(2*np.pi))
    
    def nlogp(self, hypers):
        sigma = np.exp(hypers[0])
        gamma = np.exp(hypers[1])
        theta = np.exp(hypers[2])*0 #multiply by zero to force SE kernel (and not spectral mixture)
        sigma_n = np.exp(hypers[3])
    
        Y = self.y
        Gram = Spec_Mix(self.times_y,self.times_y,gamma,theta,sigma)
        K = Gram + sigma_n**2*np.eye(self.Ny) + 1e-5*np.eye(self.Ny)
        (sign, logdet) = np.linalg.slogdet(K)
        return 0.5*( Y.T@np.linalg.solve(K,Y) + logdet + self.Ny*np.log(2*np.pi))

    Can someone spot the difference? In neg_log_likelihood, 1e-8 is added to the main diagonal, while in nlogp it is even 1e-5. Why is that? Wouldn't a class attribute like jitter (as this small summand is sometimes called) help keep everything consistent? The user could then even specify it instead of relying on a hard-coded 1e-5 or 1e-8 being a good choice.
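
    A sketch of such an attribute - the class and method names below are made up for illustration and are not the ones used in this repository:

    import numpy as np

    class GPDeconvolver:
        def __init__(self, jitter: float = 1e-8):
            # single, user-configurable jitter used everywhere a Gram matrix is regularised
            self.jitter = jitter

        def _add_jitter(self, gram: np.ndarray) -> np.ndarray:
            """Return a copy of the Gram matrix with the jitter added to its main diagonal."""
            K = gram.copy()
            K[np.diag_indices_from(K)] += self.jitter
            return K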

This was what I found from skimming the code, but I don't just want to complain. On the contrary, I want to offer my help, because in my opinion the state of the example code also seriously calls the credibility of the publication into question: all of these issues make the code hard to maintain, read, and debug (independent of whether the code was meant to be of production quality or not). As an added bonus, the improvements will widen the audience and allow people to actually make use of the publication code.

Please let me know if you are interested in a collaboration. Otherwise, I will work in a fork of the repository.

Kind regards

felipe-tobar commented 5 months ago

Dear Niklas,

Many thanks for your input, I greatly appreciate it. I agree the code is not in the best shape for sharing and I appreciate your willingness to get involved. Let me share this with the rest of the team and I will come back to you; some in the team might already be working on an improved version.

Would you tell us a bit more about yourself? Where are you based? What do you work on?

We are currently based at Universidad de Chile, but later this year I will move to Imperial College and the team will be based there.

Best, Felipe.

MothNik commented 5 months ago

Dear Felipe,

Thanks for your reply! You're welcome and I really want to emphasize that what I wrote was not criticism, but feedback for improvement 🙃

I'm actually an environmental engineer who wanted to do a PhD in environmental process engineering, focusing on bioprocess analytics, mainly by means of spectroscopy. I failed 😵 and ended up as a data scientist at a German startup for infrared spectroscopy in biopharma applications. However, I do the real spectroscopy development as a hobby for open source software in my free time 🧑‍💻 So, whenever you need a practical tester/reader/reviewer, I'm happy to help!

Since the first day I started doing spectroscopy, I've been struggling to deconvolve the Gaussian peak shape that infrared spectra can usually be approximated by, in order to enhance the spectral resolution and thereby improve the results and stability of algorithms that rely on the peaks being separated. However, many deconvolution algorithms cannot handle Gaussian shapes well, and I even found a publication stating that they are the most difficult to deconvolve. So, when I found your publication, where you deconvolved an audio signal convolved with a Gaussian peak shape and obtained a faithful Fourier transform, my eyes almost popped out 🤩 and I immediately sank deep into the publication and skimmed the codebase 🤓 I guess you know the rest of the story 😅

Would you be open to me porting the algorithm to chemotools (your work would of course be cited)? It would be a great addition and I think it fits in quite nicely there, because deconvolution is a common task in chemometrics (and I don't mean the peak decomposition that the community somehow also calls deconvolution, but the actual deconvolution, i.e., the inversion of a convolution). The package builds on top of scikit-learn, which already offers a Gaussian process regression, so I think it would be quite straightforward to inherit from that class and make use of all the kernels and regression logic already available there.

I would tackle it in baby steps, starting with the implementation of a base class that can handle 1D non-blind deconvolution and then iterating from there over time.
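
A very rough sketch of the direction I have in mind - all names are placeholders chosen by me, and the actual deconvolution posterior from your publication is of course not implemented here:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    class GaussianProcessDeconvolution(GaussianProcessRegressor):
        """1D non-blind deconvolution built on top of scikit-learn's GP regression."""

        def __init__(self, lineshape_width: float = 1.0, kernel=None, alpha: float = 1e-10):
            # reuse scikit-learn's kernels, hyperparameter optimisation, and fit/predict machinery
            super().__init__(kernel=kernel, alpha=alpha)
            # width of the known Gaussian lineshape that blurred the measured spectrum
            self.lineshape_width = lineshape_width

        def predict_deconvolved(self, X: np.ndarray) -> np.ndarray:
            # placeholder where the GP deconvolution posterior would be evaluated
            raise NotImplementedError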

I'm looking forward to your reply and wish you a great start to the weekend!