AndrewSpano opened 3 years ago
After doing some testing, I found the following patterns:
> Do you have some plots for the purity?
Purity vs count (how many tracks had purity falling in the ranges 0 - 0.1, 0.1 - 0.2, etc.)
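For reference, a minimal sketch (not the actual plotting code of this repository) of how such a purity histogram can be binned and drawn with numpy/matplotlib, assuming the per-track purities have already been computed:

```python
import numpy as np
import matplotlib.pyplot as plt

# per-track purities, i.e. the fraction of each reconstructed track's hits
# that come from its majority truth particle (dummy values for illustration)
purities = np.random.default_rng(0).uniform(0.0, 1.0, size=500)

# count how many tracks fall in each purity range 0-0.1, 0.1-0.2, ..., 0.9-1.0
edges = np.linspace(0.0, 1.0, 11)
counts, _ = np.histogram(purities, bins=edges)

plt.bar(edges[:-1], counts, width=0.1, align="edge", edgecolor="black")
plt.xlabel("purity")
plt.ylabel("number of tracks")
plt.title("Purity vs count")
plt.show()
```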
Regarding the "deterministic" approach, while I was on the plane to Greece I had a very stupid idea: for every x-y bin selected, run the r-z Hough Transform on *only* the hits inside that bin. This will help purify the hits. The idea was inspired by this plot:
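A minimal sketch of this per-bin purification step, assuming a straight-line r-z model z = z0 + r·cot(θ); the function name, bin counts and parameter ranges below are illustrative and not taken from the actual implementation:

```python
import numpy as np

def rz_hough_purify(r, z, n_theta_bins=100, n_z0_bins=100,
                    cot_theta_range=(-10.0, 10.0), z0_range=(-200.0, 200.0)):
    """Keep only the hits (of one selected x-y bin) that vote for the dominant
    straight line z = z0 + r * cot(theta) in the r-z Hough space.

    r, z: 1-D arrays with the radial / longitudinal coordinates of the hits
    that fell inside the x-y bin."""
    r, z = np.asarray(r, dtype=float), np.asarray(z, dtype=float)
    cot_thetas = np.linspace(*cot_theta_range, n_theta_bins)
    z0_edges = np.linspace(*z0_range, n_z0_bins + 1)

    # accumulator[i, j] counts hits voting for (cot_theta_i, z0 bin j)
    accumulator = np.zeros((n_theta_bins, n_z0_bins), dtype=int)
    # z0 bin each hit votes for at each cot(theta) value (-1 = out of range)
    votes = np.full((len(r), n_theta_bins), -1, dtype=int)

    for i, ct in enumerate(cot_thetas):
        z0 = z - r * ct                       # intercept implied by this slope
        bins = np.digitize(z0, z0_edges) - 1  # map intercepts to z0 bins
        valid = (bins >= 0) & (bins < n_z0_bins)
        votes[valid, i] = bins[valid]
        np.add.at(accumulator[i], bins[valid], 1)

    # the most-voted cell is the most plausible r-z track for this x-y bin
    best_theta, best_z0 = np.unravel_index(accumulator.argmax(), accumulator.shape)

    # keep only the hits that voted for that cell
    return votes[:, best_theta] == best_z0
```

The point is simply that the r-z accumulator is filled with the hits of one x-y bin at a time, so hits incompatible with that bin's dominant r-z line get dropped.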
The result was good:
The Purity vs count plot now looks like this:
The performance (for this one event) can be assessed by the metrics for the following configurations:
- Just purification
- Purification + duplicate-removal-1 algorithm
- Purification + duplicate-removal-2 algorithm
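The metric tables themselves are not reproduced here; the quantities involved (efficiency, duplicate rate, fake rate) follow the usual tracking-style definitions, sketched below under the assumption that each reconstructed track is matched to its majority truth particle when its purity passes some threshold (the exact matching rule used in this repository may differ):

```python
def track_metrics(track_matches, truth_particles):
    """track_matches: one entry per reconstructed track, holding the id of the
    truth particle it is matched to (purity above threshold) or None for a
    fake track. truth_particles: set of reconstructable truth-particle ids."""
    matched = set()
    n_duplicates = n_fakes = 0
    for pid in track_matches:
        if pid is None:
            n_fakes += 1          # no truth particle dominates this track
        elif pid in matched:
            n_duplicates += 1     # particle already claimed by another track
        else:
            matched.add(pid)
    efficiency = len(matched) / len(truth_particles)
    duplicate_rate = n_duplicates / len(track_matches)
    fake_rate = n_fakes / len(track_matches)
    return efficiency, duplicate_rate, fake_rate
```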
So I tried doing it for all the events in the with-material and non-homogenous-magnetic-field dataset. The results I got were pretty surprising:
By analyzing the results later, I saw that for every event at most 1 or 2 particles are not identified. This could be due to approximation error. Either way, for more than half of the events the efficiency is 1.0, which should be good enough. I will postpone the Neural Network development, as this approach is already yielding very good results.
[x] Implement baseline methods for removing duplicate tracks from the Hough Transform output.
For both baseline approaches implemented, the efficiency drops along with the duplicate and fake rates. This happens because some tracks that are not duplicates are mistakenly flagged as duplicates and therefore removed. To solve this, we must fine-tune those baseline algorithms a bit further:
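For illustration, a plausible baseline of this kind (not necessarily either of the two implemented here) keeps tracks in order of quality and drops a candidate when it shares more than a tunable fraction of its hits with an already-kept track; that fraction is exactly the kind of knob that trades duplicate/fake rate against efficiency:

```python
def remove_duplicates(tracks, shared_hit_fraction=0.5):
    """tracks: list of hit-id sets, assumed sorted by decreasing track quality.
    A candidate is dropped when it shares more than `shared_hit_fraction` of
    its hits with a track that has already been kept; raising the threshold
    removes fewer genuine tracks (better efficiency) but leaves more duplicates."""
    kept = []
    for hits in tracks:
        is_duplicate = any(
            len(hits & other) / len(hits) > shared_hit_fraction
            for other in kept
        )
        if not is_duplicate:
            kept.append(hits)
    return kept
```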
[x] Implement a more sophisticated (yet somewhat deterministic) method of filtering out duplicate tracks. Maybe build on top of the baseline and also use geometry information?
[ ] Implement a Machine Learning approach to duplicate removal. For this: