dhoule / CS533-Big-Data-Data-Mining

Parallel, loosely coupled DBSCAN algorithm using the disjoint-set data structure, from Patwary, M. M. A., et al. The MAKEFILE has been fixed. The compile-time errors have been fixed. The run-time error (incorrect cluster numbering) has been fixed.

Update run_dbscan_algo_uf_mpi_interleaved() to use pruning #65

Open dhoule opened 4 years ago

dhoule commented 4 years ago

This will take some time. Have to look into exactly where to do it.

Will update accordingly.

dhoule commented 4 years ago

Current problem is "over pruning":

mpiexec -n 16 ./mpi_dbscan -i clus50k.bin -b -m 5 -e 25

Dataset used: clus50k.bin
Number of process cores 16
Epsilon: 25 MinPts: 5
Dimensions of each point: 10
Init time 0.00420398
Local computation took 2.3343
Merging took 0.0646524
Parallel DBSCAN (init, local computation, and merging) took 2.45995 seconds

Points in clusters 46787 Noise 3213 Total points 50000
Total number of clusters 566

The number of clusters is supposed to be 51. Like before, this is just fine-tuning the code...

dhoule commented 4 years ago

Well, the latest test gave accurate results, but was HORRIBLE for time:

mpiexec -n 16 ./mpi_dbscan -i clus50k.bin -b -m 5 -e 25

Dataset used: clus50k.bin
Number of process cores 16
Epsilon: 25 MinPts: 5
Dimensions of each point: 10
Init time 0.00422827
Local computation took 105.643
Merging took 0.058158
Parallel DBSCAN (init, local computation, and merging) took 105.856 seconds

Points in clusters 46914 Noise 3086 Total points 50000
Total number of clusters 51
dhoule commented 4 years ago

There are 2 things I can think of. I'm using the C++ standard library functions std::set_intersection and std::copy, and they seem to add a lot of time to the code. Like Josh did, and like I did earlier with binary search, I'm going to have to write my own "intersection" function to speed things up.

For the copy function, I'm just going to use a plain for loop.

This will be done after I've committed the code. I've met the accuracy requirement. Now, I have to meet the time requirement.

dhoule commented 4 years ago

I reworked the code. I got the timing down to a more "reasonable" range, but the accuracy is off again:

mpiexec -n 16 ./mpi_dbscan -i clus50k.bin -b -m 5 -e 25

Dataset used: clus50k.bin
Number of process cores 16
Epsilon: 25 MinPts: 5
Dimensions of each point: 10
Init time 0.000541579
Local computation took 1.6466
Merging took 0.155283
Parallel DBSCAN (init, local computation, and merging) took 1.80828 seconds

Points in clusters 38250 Noise 11750 Total points 50000
Total number of clusters 4630
dhoule commented 4 years ago

Some really great news! I used Josh's kdtree_set_difference function, and got a massive increase in speed! After moving some code around, I got the accuracy back again. Will be committing soon.

dhoule commented 4 years ago

The latest test results:

mpiexec -n 16 ./mpi_dbscan -i clus50k.bin -b -m 5 -e 25

Dataset used: clus50k.bin
Number of process cores 16
Epsilon: 25 MinPts: 5
Dimensions of each point: 10
Init time 0.00283968
Local computation took 30.2367
Merging took 0.055811
Parallel DBSCAN (init, local computation, and merging) took 30.3308 seconds

Points in clusters 45795 Noise 4205 Total points 50000
Total number of clusters 51
dhoule commented 4 years ago

The next move needs to be getting rid of the std::copy function. It is used twice in the dbscan.cpp file. ~105 sec --> ~30 sec with just 1 custom function. Can't wait to see what happens with this one!

dhoule commented 4 years ago

There was no improvement by getting rid of the std::copy() function.

My next plan is to add "core points" to the main loop vector if they aren't already there, and to remove culled points from it. I need to figure out how to keep the dbs.neededIndices vector sorted while doing this. If points have already been seen not to be "core points," there is no sense in revisiting them.

dhoule commented 4 years ago

I got the time down, but the accuracy is off again:

mpiexec -n 16 ./mpi_dbscan -i clus50k.bin -b -m 5 -e 25

Dataset used: clus50k.bin
Number of process cores 16
Epsilon: 25 MinPts: 5
Dimensions of each point: 10
Init time 0.00460396
Local computation took 4.86409
Merging took 0.0447575
Parallel DBSCAN (init, local computation, and merging) took 4.91621 seconds

Points in clusters 41450 Noise 8550 Total points 50000
Total number of clusters 198
dhoule commented 4 years ago

Making a new checkpoint. I find it odd that the time increased when going from 2 nodes to 16 nodes. It might be due to the extra message passing.

mpiexec -n 2 ./mpi_dbscan -i clus50k.bin -b -m 5 -e 25

Dataset used: clus50k.bin
Number of process cores 2
Epsilon: 25 MinPts: 5
Dimensions of each point: 10
Init time 0.000403266
Local computation took 24.4247
Merging took 0.00231618
Parallel DBSCAN (init, local computation, and merging) took 24.4278 seconds

Points in clusters 45735 Noise 4265 Total points 50000
Total number of clusters 51

mpiexec -n 16 ./mpi_dbscan -i clus50k.bin -b -m 5 -e 25

Dataset used: clus50k.bin
Number of process cores 16
Epsilon: 25 MinPts: 5
Dimensions of each point: 10
Init time 0.00824141
Local computation took 30.9175
Merging took 0.0582847
Parallel DBSCAN (init, local computation, and merging) took 30.9877 seconds

Points in clusters 45795 Noise 4205 Total points 50000
Total number of clusters 51
dhoule commented 4 years ago

There was a problem rebuilding the ne vector. My last report is totally wrong. This is the latest report, as of the current corrections:

mpiexec -n 16 ./mpi_dbscan -i clus50k.bin -b -m 5 -e 25

Dataset used: clus50k.bin
Number of process cores 16
Epsilon: 25 MinPts: 5
Dimensions of each point: 10
Init time 0.00673148
Local computation took 61.8609
Merging took 0.0613414
Parallel DBSCAN (init, local computation, and merging) took 61.9743 seconds

Points in clusters 35729 Noise 14271 Total points 50000
Total number of clusters 58

I need to work on the timing and the accuracy again. But the time is down from the ~120 sec range.

dhoule commented 4 years ago

The same npid is being checked numerous times for different pid values. This causes redundant searches, massively slowing down the code.

Maybe a way to skip these points is to keep track of them? If an npid value already exists in some kind of tracking vector, there won't be a need to check it again, because it is a known core point.

But it comes from a different pid, though? That could mean that overlap points might be missed.

dhoule commented 4 years ago

The latest benchmark has taken the time down to ~7 sec, but the cluster count is hanging around 52-53 when it's supposed to be 51.

dhoule commented 4 years ago

The latest push has led to over pruning.

dhoule commented 4 years ago

Still getting 51-52 clusters with the clus50k.bin dataset...