Eggeling-Lab-Microscope-Software / TRAIT2D

TRAIT2D is a cross-platform Python software package with compilable graphical user interfaces (GUIs) to support Single Particle Tracking experiments.
GNU General Public License v3.0

Proposing a better MSD calculation #52

Closed: Zatyrus closed this issue 2 years ago

Zatyrus commented 3 years ago

While coding my own project I came up with a version of the MSD calculation that runs significantly faster and tends to be more stable than the one currently implemented in TRAIT2D.

Currently used in TRAIT2D: track.calculate_msd()

Proposed MSD Calculation

from tqdm import tqdm
import numpy as np

def My_MSD(data):
    N = len(data)

    # one MSD value per time lag 1 .. N-3
    col_Array = np.zeros(N - 3)

    for i in tqdm(range(1, N - 2)):
        # mean squared displacement over all point pairs separated by lag i
        col_Array[i - 1] = np.mean(np.sum(np.abs((data[1 + i:N, :] - data[1:N - i, :]) ** 2), axis=1))

    return col_Array

Analyzing the attached 20000-step Brownian diffusion CSV file, the legacy MSD takes about 8.4 seconds and tends to crash for larger tracks. The proposed method takes about 1.1 seconds and runs stably even for tracks as large as 100000 steps (43 seconds).
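The proposed loop can be checked against a direct translation of the MSD definition (averaging squared displacements per time lag). This is only a sketch: `fast_msd` and `naive_msd` are hypothetical names, the tqdm progress bar is omitted, and the synthetic random walk stands in for the attached CSV data.

```python
import numpy as np

def fast_msd(data):
    """Proposed single-loop MSD (same indexing as My_MSD, tqdm omitted)."""
    N = len(data)
    msd = np.zeros(N - 3)
    for i in range(1, N - 2):
        # vectorised squared displacements for all pairs separated by lag i
        msd[i - 1] = np.mean(np.sum((data[1 + i:N, :] - data[1:N - i, :]) ** 2, axis=1))
    return msd

def naive_msd(data):
    """Literal MSD definition: average squared displacement per time lag."""
    N = len(data)
    out = np.zeros(N - 3)
    for lag in range(1, N - 2):
        sq = [np.sum((data[1 + t + lag] - data[1 + t]) ** 2) for t in range(N - 1 - lag)]
        out[lag - 1] = np.mean(sq)
    return out

rng = np.random.default_rng(42)
track = np.cumsum(rng.normal(size=(200, 2)), axis=0)  # synthetic 2D Brownian track
assert np.allclose(fast_msd(track), naive_msd(track))  # both agree lag by lag
```

Both functions follow the proposed indexing (lags 1 to N-3, starting from the second data point), so the check exercises exactly the pairing used above.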

Please also see the attached VS_plot.png for result verification, created for the Brownian_Diff_20000.csv dataset.

Attachments: MSD_VS_Plot, Brownian_Diff_Test_20000.csv, Brownian_Diff_Test_100000.csv

john-wigg commented 3 years ago

I did some profiling of my own and you're absolutely right. I thought we could gain some performance by using NumPy's optimised matrix operations instead of manually looping, but this naturally causes the memory requirements to explode (kind of obvious in hindsight), which is probably also what causes the crashes for bigger tracks.
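If the vectorised version materialises an all-pairs displacement matrix, the blow-up is easy to estimate with a back-of-envelope calculation (a sketch; the numbers only illustrate the scale):

```python
N = 20_000                       # track length of the attached CSV
bytes_per_float = 8              # float64
pairwise = N * N * bytes_per_float / 1024**3  # one N x N float64 matrix, in GiB
print(f"{pairwise:.1f} GiB per N x N float64 matrix")  # prints "3.0 GiB per N x N float64 matrix"
```

A handful of such temporaries (per coordinate, plus squares and sums) would be consistent with the ~9 GiB peak memory in the profiling output below, whereas the lag-by-lag loop only ever holds one O(N) slice at a time.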

As the proposed method behaves exactly the same as the old one but is much faster, I don't see any reason against changing it. If you want, you can make the changes and create a pull request yourself.

This is the section that would need changing: https://github.com/Eggeling-Lab-Microscope-Software/TRAIT2D/blob/27e61ba398dab0351c5259fb16273fbdf7d3bcda/trait2d/analysis/__init__.py#L995-L1015

On another note, this might be a good opportunity to write some unit tests so we can automatically check whether new methods like this one work correctly. I don't have much time right now, but maybe I'll get around to it when I work on the other issues.
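Such a unit test could pin an MSD implementation against a track with a known closed-form answer, e.g. constant-velocity motion, where MSD(lag) = (v·lag)². A minimal pytest-style sketch, assuming the candidate function follows the indexing proposed above (the `msd` and test names are hypothetical, not part of TRAIT2D):

```python
import numpy as np

def msd(data):
    """Candidate implementation under test (same indexing as the proposed My_MSD)."""
    N = len(data)
    out = np.zeros(N - 3)
    for i in range(1, N - 2):
        out[i - 1] = np.mean(np.sum((data[1 + i:N, :] - data[1:N - i, :]) ** 2, axis=1))
    return out

def test_msd_linear_motion():
    # a particle moving at constant velocity v has MSD(lag) = (v * lag)**2
    v = 0.5
    t = np.arange(100)
    data = np.stack([v * t, np.zeros_like(t)], axis=1)
    lags = np.arange(1, len(data) - 2)
    assert np.allclose(msd(data), (v * lags) ** 2)

test_msd_linear_motion()
```

With a test like this in place, a CI run would catch regressions automatically whenever the MSD code is swapped out again.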

Below is the output of the attached notebook I used for profiling:

msd_test_notebook_inside.zip

%load_ext memory_profiler
from trait2d.analysis import Track

track = Track.from_file("Brownian_Diff_Test_20000.csv", unit_length='micrometres')

%timeit track.calculate_msd()
# 8.72 s ± 196 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%memit track.calculate_msd()
# peak memory: 9300.71 MiB, increment: 9186.85 MiB

import numpy as np

def My_MSD(data):
    N = len(data)

    col_Array = np.zeros(N - 3)

    for i in range(1, N - 2):
        col_Array[i - 1] = np.mean(np.sum(np.abs((data[1 + i:N, :] - data[1:N - i, :]) ** 2), axis=1))

    return col_Array

import pandas as pd
data = np.array(pd.read_csv("Brownian_Diff_Test_20000.csv")[["x", "y"]])

%timeit My_MSD(data)
# 995 ms ± 24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%memit new = My_MSD(data)
# peak memory: 115.93 MiB, increment: 0.04 MiB

np.sum(np.abs(track.get_msd() - new))
# 6.740269893435732e-11
FReina commented 2 years ago

Will close this when the related PR is merged.

FReina commented 2 years ago

This has now been implemented.