LabPresse / BNP-Track

4 stars 1 forks source link

Uncertainty Regarding Output Structure of MAP-tracks in BNP-Track Analysis #5

Open Michael-shannon opened 2 months ago

Michael-shannon commented 2 months ago

Hiya!

I am using the BNP-Track tool and analyzing the output MAT file, with the goal of importing it to a pandas dataframe to continue in python, with the goal of comparing to my existing tracking workflow. I see the clear benefits based on your benchmarking in the paper.

Having generated the MAP-track file and exported it from the GUI, I see that the .mat file contains several variables including t (time), X, Y, and Z. Three things are unclear to me:

  1. What the Z variable represents and how I should handle it. From the paper, its seems like for 2D data like mine, the Z might be an estimated axial position based on shape of the PSF. Is that right and if so, should I apply some filter to remove totally out of focus points, or should I keep them in as tracks? Or alternatively, should I estimate some measure of uncertainty from this Z coordinate, and filter the data based on that?

  2. If the rows of the X, Y and Z variables represent the timepoint (which I suspect they do) then why do all of the tracks have the same (full) number of elements and therefore the same length? Each point seems to be registered in every frame, which is unexpected for me in heterogeneous single-particle tracking (SPT) data, particularly in a small region of interest (ROI) within a cell. Is this because the data has not been filtered/thresholded as per (1)?

  3. There are always 25 columns in each of X, Y and Z, no matter what data I process. I presume the different columns represent different particle tracks, but its strange that there are always the same number. Could it be that there is some arbitrary limit for this, or am I doing something wrong in my mat to pandas function in python? (below).

`def mat_to_pandas(file_path): mat = scipy.io.loadmat(file_path)

# Extract the relevant data fields
x_data = mat['X']
y_data = mat['Y']
z_data = mat['Z']
time_data = mat['t'].flatten()

# Initialize an empty list to collect data rows
data_rows = []

# Iterate over each frame
for frame_idx, time_value in enumerate(time_data):
    for particle_idx in range(x_data.shape[1]):  # Loop over each particle

        # Extract x, y, and z positions
        x_pos = x_data[frame_idx, particle_idx]
        y_pos = y_data[frame_idx, particle_idx]
        z_pos = z_data[frame_idx, particle_idx]

        # Only include valid data (non-NaN or non-placeholder values)
        if not np.isnan(x_pos) and not np.isnan(y_pos):  # Assuming NaN means not tracked
            data_rows.append({
                'x_um': x_pos,
                'y_um': y_pos,
                'z': z_pos,
                'time_s': time_value,
                'particle': particle_idx + 1  # Assign particle ID
            })

# Convert list of rows into a DataFrame
df = pd.DataFrame(data_rows)

return df`

Thanks for anything you can help me with!

p.s. Although I was unable to attend SMLMS this year, I discovered BNP-track through it - it looks like a very promising tool.

lanceXwq commented 1 month ago

Hello Michael,

Thanks for being interested in our work, and I'm sorry for my late response; I had a couple of deadlines last week. Here are my responses to my questions.

  1. What the Z variable represents and how I should handle it. From the paper, its seems like for 2D data like mine, the Z might be an estimated axial position based on shape of the PSF. Is that right and if so, should I apply some filter to remove totally out of focus points, or should I keep them in as tracks? Or alternatively, should I estimate some measure of uncertainty from this Z coordinate, and filter the data based on that?

You're right. BNP-Track infers a particle's z-position based on the shape of its PSF, and that's why the uncertainty in the axial direction is typically higher than that in the lateral direction. It is up to the users how they want to post-process BNP-Track's output, and removing out-of-focus particles based on their z-positions from further analysis is one valid way. The precise definition of "out-of-focus" depends on a million factors; a rough estimate from my personal experience would be 300 nm from the focal plane.

A more accurate approach is to check the uncertainty in the lateral direction by collecting many samples of particle tracks, calculating the credible intervals of particle positions at each frame using quantiles, and then only keeping the tracks with uncertainties below some threshold you choose. However, this approach requires more calculation than the former one. If possible, I recommend doing this calculation once for your experimental setup, understanding what z makes it "out-of-focus," and then using the former approach.

  1. If the rows of the X, Y and Z variables represent the timepoint (which I suspect they do) then why do all of the tracks have the same (full) number of elements and therefore the same length? Each point seems to be registered in every frame, which is unexpected for me in heterogeneous single-particle tracking (SPT) data, particularly in a small region of interest (ROI) within a cell. Is this because the data has not been filtered/thresholded as per (1)?

You are mostly correct. As BNP-Track infers particle tracks in one step rather than performing a more traditional and sequential localize-filter-link-filter tracking, we leave all filtering for potential post-processing, which is discussed to some extent in my response to (1).

  1. There are always 25 columns in each of X, Y and Z, no matter what data I process. I presume the different columns represent different particle tracks, but its strange that there are always the same number. Could it be that there is some arbitrary limit for this, or am I doing something wrong in my mat to pandas function in python?

25 here is the maximum number of particles allowed. chain.params.M stores this value, and you can change it here. Part of these tracks are from actual (active) particles, and the rest are candidate (inactive) tracks that are of no importance in almost all post-processings. The variable marking whether a particle is "active" is called "load" in the manuscript and chain.sample.b in the code. Each sample of b is a boolean vector of length chain.params.M, with true or 1 representing "active".

I hope my responses above help you in some way! As some of your questions are related to post-processing, I am happy to talk to you for more specific examples as well!

Michael-shannon commented 4 weeks ago

Thanks for your response. I understand more now, especially for points 1 and 2. Point 3 is less clear to me, and perhaps I can ask a couple of things that would clarify it further (for me):

Why are there 25 tracks allowed only? How does BNP pick which 25 to save if there are more than 25 tracks? Why are there inactive tracks included in that list of 25 tracks? (relates to 2)

Also, I have a high end PC, and the run took a long time even for 3 tiny 50x50 px ROIs of only 100 frames (cut from the actual data which are 150 x 150 px, 6000 frames). Could I tighten the priors, fix certain parameters and/or calculate some good posteriors based on a subset of data and apply it to a larger dataset? Are these things possible and if so where might i find them in the code?

Thanks for any help! I can also open the speed up as a separate issue if you think its relevant.

lanceXwq commented 2 weeks ago

Hello Michael, apologies for the delayed response; I was tied up with tight deadlines.

Regarding the 25 tracks, 25 is simply an arbitrary upper limit that users can adjust as needed. (I have tested the code with 100 as the limit.) You can modify this limit using the link I shared in my previous message. We initially set 25 as a practical balance---many of the datasets in our paper are based on relatively small fields of view (tens-by-tens pixels), where tracking more than 25 particles becomes impractical. Increasing this limit also raises the computational cost, so 25 was chosen as an empirical trade-off.

If more than 25 real tracks exist (e.g., 30), BNP-Track will attempt to recover the 25 tracks that best explain the observed frames. In such cases, particles near the edge of the field of view may be ignored, and closely spaced particles could result in an averaged track.

The choice to save inactive tracks was made for simplicity, allowing all tracks across samples to be saved in a uniform array. However, I’ve implemented a more efficient approach in an ongoing project that only saves active tracks. I hope to integrate this improvement into the repository in a few months.

Regarding performance, your ideas are valid. Based on my experience, once the correct number of particles has been identified (when active track counts stabilize), I recommend stopping the run, commenting out the sampler updating loads, and resuming from where it was paused. I can double-check with my co-author, who implemented that section, to see if there’s a more streamlined solution.

Another potential strategy is to provide BNP-Track with initial guesses from other tools like TrackMate or u-track. However, I understand this can be challenging, as these tools sometimes generate tracks with missing segments.

In the ongoing project mentioned earlier, I’ve also rewritten major components to offload some computational tasks to the GPU, resulting in (preliminarily) a 50x speedup. I realize this isn’t a perfect answer for now, but improvements are underway---thank you for your patience, and please stay tuned! 😃