First (straight line) hough transform

asalzburger commented 3 years ago

Implement a first straight line based hough transform, e.g. from sci-kit learn:

https://scikit-image.org/docs/0.3/auto_examples/plot_hough_transform.html

[x] try detect the straight lines in the r-z plane and compare to ground truth
[x] define efficiency/fake/duplicate rate (for the moment, hopefully only efficiency matters)
[x] think of expanding the straight-line hough transform to a helical conformal mapping transform.

Further refinements:

[x] write at purely truth based efficiency function (1)
[x] implement the book-keeping

(1) Truth based efficiency function

Given a set of hits found [f], look up how many of the hits are actually from the same particle using the truth association. I.e. every hit in the production file has an identifier, which particle produced it.

every hit has weight, assume weight 1
if multiple particles are present, calculate fraction of all particles, and take leading particle sum up the weight of the hit w_i for every particle / truth sum, results in a truth probability function in [0,1)

Updates:

[x] Plot a histogram of "matching probability"
[x] Calculate the efficiency per tracks, efficiency vs eta, efficiency vs pT
[x] Start writing a selection/classification function (2)

(2) So far, you are taking the 25 best tracks per event, because you know (falsely) that there are 25 particles produced.

To overcome this, we use selection criteria without looking at truth information:

minimum number of hits
holes
shared hits, are the hits already used by another particle
compatibility with track model, in general via a least squares estimator, yields a chi2
- ML: trained classifier

AndrewSpano commented 3 years ago

Task 1

The idea is to start with single-particle events and then move to many-particle events. Let's take a look at a single particle event:

Now we have to "take" those points to the Hough space (aka parameter space), by using the following transformation: y = mx + b <=> b = -xm + y. This basically converts each point to a line, like so:

We can then try to find the intersection points of all those lines, and the "most common" one will coefficients where the points align best in the original space. This yields:

Now it's time to move to many-particle events. Let's take a look at their tracks:

Their corresponding lines in the Hough space will look like this:

It's a bit messy. Maybe finding exact intersections will not yield accurate results. Let's give it a shot:

Indeed the results are not very good. The rounding error introduced in order to count close lines as similar, is way too big. This can be solved by introducing the method of binning. We will discretize the 2D space into small bins, which all accumulate scores (votes) for every intersection that lies inside them. Each bin will represent a pair of (m, b) parameters. After the "voting procedure", we will select the bins with the top 24 values. This will yield:

which looks like it fits the data way better than the previous method. By looking at it with the eye, it looks like only 3 tracks were not found.
Task 2

Firstly, we have to define when two tracks are approximately equal. Two tracks will be equal when their lines are close to each other. Of course, a small threshold will be allowed. The function that does this is this one:
- For the efficiency rate, we define it as the percentage of true (ground truth) tracks that match with at least any (non-previously-matched) estimated track. With this, the efficiency rate we get is 0.75:
- For the fake rate, we define it as the percentage of estimated tracks that are nowhere close to any true track. With this, we get a fake rate of 0:
- For the duplicate rate, we define it as the percentage of estimated tracks that are similar to each other (i.e. really close, practically representing a the same true track). With this, we get a duplicate rate of 0.21:
The Hough transform is a good method overall. A way to possibly make it more efficient is to store the points of intersection for every bin and later if a bin is selected, then the parameters "m" and "b" get estimated as the median of all the intersections of that bin. This will increase the accuracy up to 0.001, which could be a slight (yet noticeable) improvement.
Task 3

Took a look at the Helix equation. Made some notes. Have some questions to ask.

AndrewSpano commented 3 years ago

Tasks 4 & 5

Tasks 4 and 5 are linked (actually 4 can't be done without doing 5 first). I uploaded the code in src/utils/metrics.py (in the new pull request). I don't think it makes much sense to post code here, so I will post the results:

Efficiency Rate. Implemented exactly as described above. Computed it separately for every estimated track:
Fake Rate. An estimated track is defined as fake if the number of hits of the leading particle is less than a fraction of the total number of hits (book-kept hits) for that estimated track. Eg: If a track has 10 hits but the leading particle has only 2 hits (could be 5 different particles where each had 2 hits), then the track is considered fake. This was not the case for any estimated track:
Duplicate Rate. Two tracks are defined duplicates if they have the same leading particle (correct me if I'm wrong!). The result I got for the estimated tracks are:

ToDo: As mentioned yesterday, the number of different (truth) particles in an event is unknown. Therefore, it is wrong to sample the "top 25" tracks, since in real-life scenarios we will not have this information. In order to fix this, I thought of picking all the tracks where their corresponding bins have a minimum number of book-kept hits inside them. This will probably give us a higher number of tracks than needed, but at least we won't undershoot. Maybe discuss this in the next meeting.

asalzburger commented 3 years ago

Efficiency rate per track we call matching probability, then we use efficiency only for the ensemble of tracks.

AndrewSpano commented 3 years ago

Updates tasks:

Plot has per matching probability has been done (also the fixes regarding the selection of the leading particle were implemented):
Efficiency per eta-pT values almost done, need to fix some bugs, likely will upload the code tomorrow
Selection function: still working on it

noemina commented 3 years ago

That's good!! Can you try to put the matching probability on the x axis? Something like this for example ;)

asalzburger commented 3 years ago

Exactly, there we can se how we can make a cut.

asalzburger commented 3 years ago

Screenshot 2021-07-12 at 10 18 45

asalzburger commented 3 years ago

This is purely in the (x/y)-plane, you ignore that the hit or the track have a longitudinal (z) component.

That's justified for this example, because the magnetic field is constant in z-axis. -> the helix is a circle in the transverse

AndrewSpano commented 3 years ago

Task 7

In order to complete this task I parsed all the dataset (.csv) files and grouped together (in new dataframes) the objects that have similar values (for eta and p_T). Then I run the whole pipeline for each range of values in order to get the efficiency. I got the following results:

The algorothm seems to be performing ok for the different eta values. For the transverse momentum, it's not performing well for lower values (could be a bug in my code?). Maybe the tracks are so similar that the bin size of (0.0001, 0.0001) can't distinguish them. I will have to look into this further.
Task 6

Fixed the plot, placed the matching probability in the x-axis:
Tasks 8 & 9 (selection/classification & helix Hough transform)

Ongoing, I have some ideas for the selection phase which I will try to implement tomorrow.

AndrewSpano commented 3 years ago

Task 8

Selection: Most methods have been implemented, I have a few questions though for the holes.
Task 9

Ongoing.

AndrewSpano commented 3 years ago

Regarding the helical (circular when projected in 2d space) Hough transform:

I started by tackling the one-particle files. Let's take for example this one:

one_p_xy

Since in real-life scenarios we will not know the momentum of a particle for any given hit, I tried to solve this fitting problem without using any extra information. The main idea I came up with (after consulting with Noemi) is to utilize the fact that the circular track must always go through the origin (0, 0). Since finding a circle requires findings 3 variables (a radius and a center), this information above reduces the problem to finding 2 variables that are linearly connected. This means that we can apply the methods from the previous task. For a more detailed explanation of the math:

explanation

The lines in the Hough Space look like this:

one_p_tracks

For the selection of the candidate bins, I selected those that had at least 10-11 hits, since previous analysis showed that there are on average 14 hits per particle. The result I got is:

one_p_fit

I tried to estimate the emittance angle from the circle as the angle of the tangent-line-at-the-origin with x-axis:

one_p_phi

The ground truth value I got (which is computed using: phi = arctan2(py, px)) is slightly different from the estimated one:

fail_phi

I might have to look for bugs in the code.

Now regarding the many-particle files, I picked this one at random:

many_p_xy

The lines in the Hough Space look like this:

many_p_tracks

Running the algorithm will yields the following results:

many_p_fit

By assessing the estimated helical tracks we get the following results:

results

which look kind of promising.

There are some notes I have to make:

I want to make the matching-probability and efficiency-per-eta-pT-values plots. I wanted to use PyRoot for the plots, but I can't get it to work. It says that it is not compatible with the Python version I am using. I tried python 2.7, 3.4, 3.5, 3.6, 3.9 and none of the worked. I will try again tomorrow to fix it.
I found a possible bug in my efficiency function. When computing the efficiency, I didn't consider the case where 2 estimated tracks might have the same leading particle. In this case, both tracks will be considered as legit when this shouldn't be the case because they are technically duplicates. I was thinking of fixing it, just wanted to consult with you before I pushed the code with the changes in these functions.

AndrewSpano commented 3 years ago

After trying to install ROOT for over 3 hours (I even tried compiling the source code, run into many errors, checked the forum, couldn't solve some of them as there weren't relevant posts), I concluded that there is no point in installing it in this laptop, as I recently bought a new one (which I am currently setting up). I will try to install it in the new laptop once I start using it.
I plotted (for one estimated track) the lines in the Hough Space and the estimated intersection point from the binning:

The same results can be seen for any other estimated track, if chosen.
I refactored the metrics (now they are almost as modular as it can get).
I couldn't find the reason why the estimated emittance angle is incorrect. One thing that I noticed is that for single-particle files, since we know that all the hits belong to the same particle (with the same curvature and momentum), I believe (?) it should be expected that all of them have the same emittance angle. This is not the case:

Is this behavior expected?
I did a bit of research on other algorithms for line/curve fitting, and I found out about the RANSAC algorithm.

This is another algorithm that will work out nice for ideal data, but probably not for real-world data. The idea behind it is detect outliers by iteratively sampling random data points and fitting a model's parameter to them. Then we can use that model to find the number of outliers. The parameters that have the least outliers (i.e. the most inliers) will be stored and preferred in the future.

I implemented it for the x-y plane (circle fitting) and the r-z plane (line fitting). The results are quite good for such a simple algorithm. For the circle (x-y plane) fitting, it managed to find all of the tracks in just 19 seconds:

For the line (r-z plane) fitting, it managed to find 21/25 tracks in just 16 seconds:

The notebook can be found in src/notebooks/issue3/RANSAC.ipynb.
PS. I haven't uploaded the code yet because the previous pull request hasn't been merged and I think (?) that if I switch branches (in order to upload the new code) then I will lose all the changes. We can sort this out in the morning!.

noemina commented 3 years ago

Effect of binning on the evaluation of the crossing point. Screenshot_20210720_093902

asalzburger commented 3 years ago

[x] estimate the phi_0 & compare this with 'particles_final['particle_id'].momentum.phi`

AndrewSpano commented 3 years ago

Indeed, I had to check the initial*.csv file for the tangent at the origin:

finally_works

asalzburger commented 3 years ago

That's re-assuring

asalzburger / sms2021-tra-tra

First (straight line) hough transform #5