Task 1
The idea is to start with single-particle events and then move to many-particle events. Let's take a look at a single-particle event:
Now we have to "take" those points to the Hough space (aka parameter space) using the transformation y = mx + b <=> b = -xm + y, i.e. treating (m, b) as the variables for each fixed hit (x, y). This converts each point into a line, like so:
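For reference, a minimal sketch of this point-to-line mapping (the array names and slope range are illustrative, not the actual code in the repo):

```python
import numpy as np
import matplotlib.pyplot as plt

def hit_to_hough_line(x, y, m_range=(-2.0, 2.0), n_samples=200):
    """Map a single hit (x, y) to the line b = -x*m + y in (m, b) space."""
    m = np.linspace(*m_range, n_samples)
    b = -x * m + y
    return m, b

# hypothetical hit coordinates of one event
x_hits = np.array([1.0, 2.0, 3.0])
y_hits = np.array([0.5, 1.0, 1.5])

for xh, yh in zip(x_hits, y_hits):
    m, b = hit_to_hough_line(xh, yh)
    plt.plot(m, b, lw=0.5)
plt.xlabel("m")
plt.ylabel("b")
plt.show()
```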
We can then try to find the intersection points of all those lines, and the "most common" one will give the coefficients (m, b) for which the points align best in the original space. This yields:
Now it's time to move to many-particle events. Let's take a look at their tracks:
Their corresponding lines in the Hough space will look like this:
It's a bit messy. Maybe finding exact intersections will not yield accurate results. Let's give it a shot:
Indeed the results are not very good. The rounding error introduced in order to count close lines as similar is way too big. This can be solved by binning: we discretize the 2D parameter space into small bins, each of which accumulates a score (vote) for every intersection that lies inside it. Each bin represents a pair of (m, b) parameters. After the voting procedure, we select the bins with the top 24 values. This yields:
which looks like it fits the data much better than the previous method. By eye, it looks like only 3 tracks were not found.
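A rough sketch of the binning (voting) idea described above, assuming the pairwise line intersections have already been computed (bin size, ranges and the value k=24 are illustrative):

```python
import numpy as np

def vote_in_bins(intersections, m_range, b_range, bin_size):
    """Accumulate votes for (m, b) intersection points in a 2D histogram."""
    m_edges = np.arange(m_range[0], m_range[1] + bin_size, bin_size)
    b_edges = np.arange(b_range[0], b_range[1] + bin_size, bin_size)
    ms = [p[0] for p in intersections]
    bs = [p[1] for p in intersections]
    acc, _, _ = np.histogram2d(ms, bs, bins=[m_edges, b_edges])
    return acc, m_edges, b_edges

def top_k_bins(acc, m_edges, b_edges, k=24):
    """Return the bin centres with the k highest vote counts."""
    flat = np.argsort(acc.ravel())[::-1][:k]
    i, j = np.unravel_index(flat, acc.shape)
    m_centres = 0.5 * (m_edges[i] + m_edges[i + 1])
    b_centres = 0.5 * (b_edges[j] + b_edges[j + 1])
    return list(zip(m_centres, b_centres))
```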
Task 2
Firstly, we have to define when two tracks are approximately equal. Two tracks will be considered equal when their lines are close to each other, within a small threshold. The function that does this is this one:
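Roughly, the comparison could look like the following sketch (the threshold values are illustrative, not the ones actually used in the repo):

```python
def tracks_approx_equal(track_a, track_b, m_tol=1e-2, b_tol=1e-2):
    """Two (m, b) tracks are considered equal if both parameters
    agree within a small threshold."""
    m_a, b_a = track_a
    m_b, b_b = track_b
    return abs(m_a - m_b) <= m_tol and abs(b_a - b_b) <= b_tol
```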
The Hough transform is a good method overall. One way to make it more accurate is to store the intersection points for every bin; if a bin is later selected, the parameters m and b are estimated as the median of all intersections inside that bin. This improves the parameter resolution to roughly 0.001, which could be a slight (yet noticeable) improvement.
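A sketch of that refinement, assuming the intersections falling inside each selected bin have been kept (names are hypothetical):

```python
import numpy as np

def refine_bin_estimate(bin_intersections):
    """Estimate (m, b) for a selected bin as the median of all
    intersection points that were binned into it."""
    pts = np.asarray(bin_intersections)  # shape (N, 2): columns are (m, b)
    return np.median(pts[:, 0]), np.median(pts[:, 1])
```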
Task 3
Took a look at the Helix equation. Made some notes. Have some questions to ask.
Tasks 4 & 5
Tasks 4 and 5 are linked (actually 4 can't be done without doing 5 first). I uploaded the code in src/utils/metrics.py (in the new pull request). I don't think it makes much sense to post code here, so I will post the results:
Efficiency Rate. Implemented exactly as described above. Computed it separately for every estimated track:
Fake Rate. An estimated track is classified as fake if the number of hits from its leading particle is less than a fraction of the total number of (book-kept) hits for that track. E.g. if a track has 10 hits but the leading particle contributes only 2 (the remaining hits could come from 5 different particles with 2 hits each), the track is considered fake. This was not the case for any estimated track:
Duplicate Rate. Two tracks are defined as duplicates if they have the same leading particle (correct me if I'm wrong!). The results I got for the estimated tracks are:
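As a rough illustration of the fake-rate and duplicate-rate definitions above (the per-track `particle_ids` come from the book-kept truth associations; the fraction threshold and the denominator choice for the duplicate rate are assumptions):

```python
from collections import Counter

def leading_particle(particle_ids):
    """Particle contributing the most hits to an estimated track."""
    pid, n_hits = Counter(particle_ids).most_common(1)[0]
    return pid, n_hits

def is_fake(particle_ids, min_fraction=0.5):
    """A track is fake if the leading particle contributes less than
    a given fraction of the track's book-kept hits."""
    _, n_lead = leading_particle(particle_ids)
    return n_lead / len(particle_ids) < min_fraction

def duplicate_rate(tracks_particle_ids):
    """Fraction of tracks sharing their leading particle with another track."""
    leaders = [leading_particle(pids)[0] for pids in tracks_particle_ids]
    counts = Counter(leaders)
    n_dup = sum(1 for pid in leaders if counts[pid] > 1)
    return n_dup / len(leaders)
```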
ToDo: As mentioned yesterday, the number of different (truth) particles in an event is unknown. Therefore, it is wrong to sample the "top 25" tracks, since in real-life scenarios we will not have this information. To fix this, I thought of picking all tracks whose corresponding bins contain a minimum number of book-kept hits. This will probably give us more tracks than needed, but at least we won't undershoot. Maybe discuss this in the next meeting.
The efficiency rate per track we call matching probability; we then use efficiency only for the ensemble of tracks.
Task updates:
The plot per matching probability has been done (the fixes regarding the selection of the leading particle were also implemented):
Efficiency per eta/pT values: almost done; I need to fix some bugs and will likely upload the code tomorrow
Selection function: still working on it
That's good!! Can you try to put the matching probability on the x axis? Something like this for example ;)
Exactly, there we can see how we can make a cut.
This is purely in the (x/y)-plane; you ignore that the hit or the track has a longitudinal (z) component.
That's justified for this example, because the magnetic field is constant along the z-axis -> the helix is a circle in the transverse plane.
Task 7
In order to complete this task I parsed all the dataset (.csv) files and grouped together (into new dataframes) the objects that have similar values of eta and p_T. Then I ran the whole pipeline for each range of values to get the efficiency. I got the following results:
The algorithm seems to be performing OK for the different eta values. For the transverse momentum, it's not performing well for lower values (could be a bug in my code?). Maybe the tracks are so similar that the bin size of (0.0001, 0.0001) can't distinguish them. I will have to look into this further.
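For reference, the grouping step could be sketched like this with pandas (the column names, file name and bin edges are assumptions, not necessarily those of the actual dataset):

```python
import pandas as pd

# hypothetical bin edges
eta_bins = [-4, -2, 0, 2, 4]
pt_bins = [0.5, 1, 2, 5, 10]  # GeV

df = pd.read_csv("hits_with_truth.csv")  # hypothetical file
df["eta_bin"] = pd.cut(df["eta"], eta_bins)
df["pt_bin"] = pd.cut(df["pt"], pt_bins)

for (eta_rng, pt_rng), group in df.groupby(["eta_bin", "pt_bin"], observed=True):
    # run the existing Hough-transform + metrics pipeline on this subset,
    # e.g. efficiency = run_pipeline(group)  (run_pipeline is hypothetical)
    ...
```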
Task 6
Fixed the plot, placed the matching probability in the x-axis:
Tasks 8 & 9 (selection/classification & helix Hough transform)
Ongoing, I have some ideas for the selection phase which I will try to implement tomorrow.
Task 8
Selection: Most methods have been implemented, I have a few questions though for the holes.
Task 9
Ongoing.
Regarding the helical (circular when projected in 2d space) Hough transform:
I started by tackling the one-particle files. Let's take for example this one:
Since in real-life scenarios we will not know the momentum of a particle for any given hit, I tried to solve this fitting problem without using any extra information. The main idea I came up with (after consulting with Noemi) is to use the fact that the circular track must always go through the origin (0, 0). Since finding a circle requires finding 3 variables (a radius and a 2D centre), this constraint reduces the problem to finding 2 variables that are linearly related. This means we can apply the methods from the previous task. For a more detailed explanation of the math:
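My reading of the reduction, as a sketch (with (a, b) the circle centre and R its radius):

```latex
% Circle through the origin with centre (a, b):
(x-a)^2 + (y-b)^2 = R^2, \qquad R^2 = a^2 + b^2
\;\Longrightarrow\; x^2 - 2ax + y^2 - 2by = 0
\;\Longrightarrow\; b = -\frac{x}{y}\,a + \frac{x^2 + y^2}{2y} \quad (y \neq 0)
```

So each hit (x, y) again maps to a straight line, now in (a, b) space, and the same voting machinery applies; the radius follows as R = sqrt(a^2 + b^2).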
The lines in the Hough Space look like this:
For the selection of the candidate bins, I selected those that had at least 10-11 hits, since previous analysis showed that there are on average 14 hits per particle. The result I got is:
I tried to estimate the emittance angle from the circle as the angle of the tangent-line-at-the-origin with x-axis:
The ground truth value I got (which is computed using: phi = arctan2(py, px)) is slightly different from the estimated one:
I might have to look for bugs in the code.
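For reference, the geometric relation I'm assuming: the tangent at the origin is perpendicular to the vector from the origin to the circle centre (a, b), so up to a sign ambiguity (travel direction along the track) the candidate angles are:

```python
import numpy as np

def phi_at_origin(a, b):
    """Angle of the tangent at the origin of a circle with centre (a, b)
    that passes through (0, 0). The tangent is perpendicular to (a, b);
    both candidates are returned since the travel direction is ambiguous."""
    phi_radius = np.arctan2(b, a)
    return phi_radius + np.pi / 2, phi_radius - np.pi / 2
```

Comparing both candidates (modulo 2*pi) against arctan2(py, px) from the truth file might show whether the discrepancy is just a sign/branch choice or a genuine bug.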
Now regarding the many-particle files, I picked this one at random:
The lines in the Hough Space look like this:
Running the algorithm yields the following results:
By assessing the estimated helical tracks we get the following results:
which look kind of promising.
There are some notes I have to make:
After trying to install ROOT for over 3 hours (I even tried compiling from source, ran into many errors, and checked the forum but couldn't solve some of them as there weren't relevant posts), I concluded that there is no point in installing it on this laptop, as I recently bought a new one (which I am currently setting up). I will try to install it on the new laptop once I start using it.
I plotted (for one estimated track) the lines in the Hough Space and the estimated intersection point from the binning:
The same results can be seen for any other estimated track, if chosen.
I refactored the metrics (now they are about as modular as they can get).
I couldn't find the reason why the estimated emittance angle is incorrect. One thing that I noticed is that for single-particle files, since we know that all the hits belong to the same particle (with the same curvature and momentum), I believe (?) it should be expected that all of them have the same emittance angle. This is not the case:
Is this behavior expected?
I did a bit of research on other algorithms for line/curve fitting, and I found out about the RANSAC algorithm.
This is another algorithm that should work nicely on ideal data, but probably not on real-world data. The idea behind it is to detect outliers by iteratively sampling random subsets of data points and fitting the model's parameters to them. The fitted model is then used to count the outliers; the parameters with the fewest outliers (i.e. the most inliers) are kept and preferred.
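A minimal sketch of the circle-fitting variant (not the exact notebook code; the tolerance and iteration count are illustrative):

```python
import numpy as np

def circle_from_3_points(p1, p2, p3):
    """Centre and radius of the circle through three points (None if collinear)."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    d = 2 * (x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2))
    if abs(d) < 1e-12:
        return None
    ux = ((x1**2 + y1**2) * (y2 - y3) + (x2**2 + y2**2) * (y3 - y1)
          + (x3**2 + y3**2) * (y1 - y2)) / d
    uy = ((x1**2 + y1**2) * (x3 - x2) + (x2**2 + y2**2) * (x1 - x3)
          + (x3**2 + y3**2) * (x2 - x1)) / d
    return (ux, uy), np.hypot(x1 - ux, y1 - uy)

def ransac_circle(points, n_iter=1000, tol=0.01, seed=None):
    """RANSAC: repeatedly fit a circle to 3 random hits and keep the
    model with the most inliers (hits within `tol` of the circle)."""
    rng = np.random.default_rng(seed)
    pts = np.asarray(points)
    best, best_inliers = None, 0
    for _ in range(n_iter):
        idx = rng.choice(len(pts), 3, replace=False)
        model = circle_from_3_points(*pts[idx])
        if model is None:
            continue
        (cx, cy), r = model
        dist = np.abs(np.hypot(pts[:, 0] - cx, pts[:, 1] - cy) - r)
        n_in = int(np.sum(dist < tol))
        if n_in > best_inliers:
            best, best_inliers = model, n_in
    return best, best_inliers
```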
I implemented it for the x-y plane (circle fitting) and the r-z plane (line fitting). The results are quite good for such a simple algorithm. For the circle (x-y plane) fitting, it managed to find all of the tracks in just 19 seconds:
For the line (r-z plane) fitting, it managed to find 21/25 tracks in just 16 seconds:
The notebook can be found in src/notebooks/issue3/RANSAC.ipynb.
PS. I haven't uploaded the code yet because the previous pull request hasn't been merged, and I think (?) that if I switch branches (in order to upload the new code) I will lose all the changes. We can sort this out in the morning!
Effect of binning on the evaluation of the crossing point.
Indeed, I had to check the initial*.csv file for the tangent at the origin:
That's reassuring.
Implement a first straight-line based Hough transform, e.g. from scikit-image:
https://scikit-image.org/docs/0.3/auto_examples/plot_hough_transform.html
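For reference, a minimal usage sketch of the scikit-image straight-line Hough transform (note it parametrises lines by angle/distance rather than slope/intercept; the rasterisation of hits into a binary image is an assumption here):

```python
import numpy as np
from skimage.transform import hough_line, hough_line_peaks

# Rasterise the hits into a binary image (illustrative resolution)
img = np.zeros((200, 200), dtype=bool)
# img[row, col] = True for every hit ...

tested_angles = np.linspace(-np.pi / 2, np.pi / 2, 360, endpoint=False)
hspace, angles, dists = hough_line(img, theta=tested_angles)

# Strongest line candidates (accumulator peaks)
for _, angle, dist in zip(*hough_line_peaks(hspace, angles, dists)):
    if abs(np.sin(angle)) < 1e-6:
        continue  # vertical line, cannot be written as y = m*x + b
    m = -np.cos(angle) / np.sin(angle)
    b = dist / np.sin(angle)
    print(f"candidate line: m={m:.3f}, b={b:.3f}")
```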
Further refinements:
(1) Truth-based efficiency function
Given a set of found hits, look up how many of the hits are actually from the same particle using the truth association, i.e. every hit in the production file has an identifier for the particle that produced it.
(2) So far, you are taking the 25 best tracks per event, because you know (falsely) that there are 25 particles produced.
To overcome this, we use selection criteria without looking at truth information:
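One possible truth-free selection, along the lines of the minimum-hits idea above, could be sketched like this (the data structure and threshold are assumptions to be tuned):

```python
def select_tracks_by_votes(accumulator_bins, min_hits=10):
    """Keep every bin whose vote count (number of book-kept hits)
    passes a minimum threshold, instead of taking a fixed top-25.
    `accumulator_bins` is assumed to be a list of (params, hits) pairs."""
    return [(params, hits) for params, hits in accumulator_bins
            if len(hits) >= min_hits]
```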