Thanks a lot @bwvdnbro for all the very useful details! The list of commits was very useful for having a quick look, although I think towards the end it got confusing -- most of those commits are not in the ICRAR fork, but on pelahi's. The bug-fixing content, however, has made it more or less into the ICRAR repo.
Anyway, I went quickly through the diff and from what I can see I'd guess the particular commit that changes SO-related behavior is https://github.com/ICRAR/VELOCIraptor-STF/commit/bf9d9a1a0ddc6deb51b635df7e13e747aa8e1d84. The commit message mentions a refactoring because some code was put into a separate function, but there's also a change in behavior in how `rc` (a radius value, probably eventually used to determine SO containment) is calculated. It goes from:

```c++
rc=posparts[indices[j]].Length();
```

to

```c++
auto jj = indices[j];
auto rc=radii[jj];
```

This is obviously confusing -- based on the variable names alone it would look like the second (the current) version is correct, but I don't really know better. This is also one of the commits that is not part of ICRAR's master.
The "corresponding" change in the ICRAR repo is 9af51df5634dc417781d1884a133bb33e4d292d5. Here there's some more information about what's happening, saying that it "fixes the issues found in the SO masses", and that the problem was that "rc was defined as length rather than radii".
Note that the change above applies to the `GetSOMasses` function. In the same file there are other similarly-looking functions that seem to also use a radius criterion on groups to perform several actions. However, they have different variations on how this `rc` value is calculated. What follows is more of a brain dump rather than an actual pointer to any solution, but hopefully it will help us find what's going on:
- `GetSOMasses` uses `rc = radii[indices[j]]`
- `GetInclusiveMasses` uses `rc = posparts[indices[j]].Length()` (the old `GetSOMasses` behavior)
- `CalculateSphericalOverdensity` (taking vectors of radii, masses and indices) is called by both `GetInclusiveMasses` and `GetSOMasses` and uses `rc = radii[indices[j]]`
- `CalculateSphericalOverdensity` (taking a `Part *` input) is called by `GetInclusiveMasses` and uses `rc = Part[j].Radius()`
- Where `radii` is used, it seems to be equally defined in both `GetInclusiveMasses` and `GetSOMasses`, roughly like:

```c++
radii[j]=0;
for (k=0;k<3;k++) {
    dx=Part[taggedparts[j]].GetPosition(k)-pdata[i].gcm[k];
    if (opt.p>0) {
        if (dx>opt.p*0.5) dx-=opt.p;
        else if (dx<-opt.p*0.5) dx+=opt.p;
    }
    if (opt.iextrahalooutput) {
        posparts[j][k]=dx;
        velparts[j][k]=Part[taggedparts[j]].GetVelocity(k)-pdata[i].gcmvel[k];
    }
    radii[j]+=dx*dx;
}
radii[j]=sqrt(radii[j]);
```
`posparts[j].Length()` translates to `sqrt(coord[0]^2 + coord[1]^2 + coord[2]^2)` (here), so again it seems that in principle using either `posparts[j].Length()` or `radii[j]` should yield the same result? Except of course that: a) `radii` is already calculated, and b) `posparts` is only meaningful when `opt.iextrahalooutput` is set. There was also some mention of OpenMP problems in one of the commits; maybe one of these variables is not thread-safe.
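To make point (b) concrete, here is a tiny standalone sketch (made-up types and values, not VR code) of why the two expressions only agree when the extra-output flag is set:

```c++
// Hypothetical sketch (not VR code): posparts.Length() and radii[j] only
// agree when the flag analogous to opt.iextrahalooutput is set.
#include <cmath>
#include <cstdio>
#include <vector>

struct Coord {
    double x[3] = {0, 0, 0};
    double Length() const { return std::sqrt(x[0]*x[0] + x[1]*x[1] + x[2]*x[2]); }
};

int main() {
    const bool iextrahalooutput = false;   // flag is off in this example
    const double dx[3] = {0.3, -0.4, 1.2}; // particle offset from the halo centre

    std::vector<Coord> posparts(1);
    std::vector<double> radii(1, 0.0);

    for (int k = 0; k < 3; k++) {
        if (iextrahalooutput) posparts[0].x[k] = dx[k]; // only filled when the flag is on
        radii[0] += dx[k] * dx[k];                      // always filled
    }
    radii[0] = std::sqrt(radii[0]);

    // With the flag off, the first value is 0 while the second is the true radius.
    std::printf("posparts.Length() = %g, radii[j] = %g\n",
                posparts[0].Length(), radii[0]);
}
```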
Thanks for digging through the commits!

Maybe a word of caution here from an external bystander: we (or at least I) still think the SO masses themselves are correct, indicating that `GetSOMasses()` is likely doing the right thing, and that it is hence the other function that is at fault.
Hi @rtobar, thanks for having a quick look at this. I spent a lot of time yesterday looking at the code, but since I am very unfamiliar with it, I could not identify any obvious problems.
From what I have seen, I don't think the radius change can explain the large SO particle lists. This radius is only used to compute SO properties, and does not affect how particles are searched.
All the runs I mentioned use `Inclusive_halo_masses=3`, so the SO lists originate from `GetSOMasses` via substructureproperties.cxx:5124. As far as I can tell, the SO lists are then determined by

```c++
// substructureproperties.cxx:3380
taggedparts=tree->SearchBallPosTagged(posref,pow(maxrdist[i],2.0));
```

The tagged particles are sorted by distance from the centre of the SO (we use `Reference_frame_for_properties=2`, so `posref=pdata[i].gposminpot`) and the particle list should then be trimmed by the call to `CalculateSphericalOverdensity()` on line 3535. This is the version on line 6954, so the first overload you mention. If the particle list is too large, then I would think that is either because of the `SearchBallPosTagged()` or the `CalculateSphericalOverdensity()` call. Since the former seems unlikely to produce a wrong result, I would think the problem is in the latter. Unless of course `maxrdist[i]` has changed.
As Matthieu mentioned, it does look like the SO properties themselves are correct (or not obviously wrong), so whatever is affecting the trimming of the particle list is not causing any obvious other problems.
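For reference, my mental model of that trimming step is roughly the sketch below: walk the radius-sorted particles, accumulate the enclosed mass, and stop growing the list once the enclosed density falls below the chosen overdensity threshold. The names and structure are schematic assumptions, not the actual `CalculateSphericalOverdensity()` code.

```c++
// Schematic sketch (not the actual VR implementation) of trimming a
// radius-sorted particle list at a spherical-overdensity threshold.
#include <cmath>
#include <cstdio>
#include <vector>

// Returns the index of the last particle whose enclosed density is still above
// rho_threshold; particles beyond that index would be dropped from the SO list.
std::size_t TrimAtOverdensity(const std::vector<double>& radii,   // sorted ascending
                              const std::vector<double>& masses,  // same ordering
                              double rho_threshold) {
    const double pi = 3.14159265358979323846;
    double enclosed_mass = 0.0;
    std::size_t llindex = 0;  // analogous in spirit to the llindex discussed above
    for (std::size_t j = 0; j < radii.size(); ++j) {
        enclosed_mass += masses[j];
        if (radii[j] <= 0) continue;  // avoid dividing by a zero volume
        const double volume = 4.0 / 3.0 * pi * std::pow(radii[j], 3);
        if (enclosed_mass / volume >= rho_threshold) llindex = j;  // still dense enough
        else break;  // enclosed density dropped below the threshold: stop here
    }
    return llindex;
}

int main() {
    // Toy data: unit-mass particles at increasing radii.
    const std::vector<double> radii = {0.1, 0.2, 0.4, 0.8, 1.6, 3.2};
    const std::vector<double> masses(radii.size(), 1.0);
    std::printf("last index inside the overdensity: %zu\n",
                TrimAtOverdensity(radii, masses, 10.0));
}
```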
As for thread-safety: since I am still suffering from negative density errors (#53), all of my tests were run on a single thread. So thread-safety cannot be the issue here.
Also, when I was going through the diff yesterday, the following line caught my attention:
```c++
// search.cxx:1926
Partsubset[i].SetID(i);
```
To my uninformed eye, it looks like the particle ID is being overwritten here, and I found this weird. I even ran a test with this line commented out, to check if this is somehow related to the problem. Unfortunately, this test produced exactly the same results as before, so this is probably in a part of the code that we don't use in our runs. Still, I'm not sure what this line is supposed to do, and the fact that particle IDs get overwritten in some parts of the code makes me uneasy.
@bwvdnbro thanks for all the insights and the clear explanation. I dug a bit more into the code following your pointers, and I think I understand what you mean -- although I don't know what goes into fixing it other than "you need `llindex` to become lower". I think your hunch is correct too: most likely it's `CalculateSphericalOverdensity` that needs fixing, not `SearchBallPosTagged`.
Having said that, maybe a (relatively) recent candidate commit to have a closer look is 969d3f0, which changed some of the computations done here to be log10/pow(10)-based rather than log/exp-based, plus a refactoring of the code into a separate function.
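For what it's worth, a change of logarithm base should not by itself change interpolated radii, as long as the base is used consistently; `pow(10, ...)` applied to a log10-space interpolation gives the same answer as `exp(...)` applied to a natural-log one, up to floating-point rounding. A minimal standalone check of that equivalence (not VR code):

```c++
// Minimal check (not VR code): interpolating in log space gives the same radius
// whether natural logarithms or base-10 logarithms are used consistently.
#include <cmath>
#include <cstdio>

int main() {
    const double r1 = 0.37, r2 = 1.85; // two bracketing radii
    const double t  = 0.6;             // interpolation fraction between them

    const double r_ln    = std::exp(std::log(r1) + t * (std::log(r2) - std::log(r1)));
    const double r_log10 = std::pow(10.0,
        std::log10(r1) + t * (std::log10(r2) - std::log10(r1)));

    // The two results should agree to roughly machine precision.
    std::printf("ln-based: %.15g  log10-based: %.15g  diff: %.3g\n",
                r_ln, r_log10, r_ln - r_log10);
}
```

So if 969d3f0 did change the results, I'd suspect the surrounding refactoring rather than the base change itself.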
About the changing IDs: be aware that there are some attributes of `Particle` that are not exactly fixed. For example, a Particle's `Position` is not fixed throughout the life of the program's execution, but is changed in some places to be based on different points of reference (and then back to the original values, I think). The same happens with the IDs, which are constantly used for different purposes, then reset back to their original values (maybe?). You can see this happening in `GetInclusiveMasses`, for example, where the IDs are first set for a temporary purpose and later restored. In any case, there is a separate PID attribute that I think is meant to be fixed throughout the life of the program.
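To make the pattern concrete, here is a generic illustration with made-up types and names (not the actual `GetInclusiveMasses` code) of the "repurpose the ID, then restore it" idea:

```c++
// Generic illustration (made-up types, not VELOCIraptor code) of temporarily
// repurposing a particle's ID and restoring it from the permanent PID afterwards.
#include <cstdio>
#include <vector>

struct Particle {
    long long id;   // scratch ID, repurposed by different routines
    long long pid;  // permanent ID, meant to stay fixed
};

int main() {
    std::vector<Particle> parts = {{10, 10}, {11, 11}, {12, 12}};

    // Temporarily overwrite the IDs, e.g. to tag group membership or array indices.
    for (std::size_t i = 0; i < parts.size(); ++i)
        parts[i].id = static_cast<long long>(i);

    // ... work that relies on the temporary IDs would happen here ...

    // Restore the original IDs from the permanent PID before moving on.
    for (auto& p : parts) p.id = p.pid;

    for (const auto& p : parts) std::printf("id=%lld pid=%lld\n", p.id, p.pid);
}
```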
@rtobar I did notice that the IDs are set and reset all the time. In the particular instance I pointed out, the IDs are set, but never reset as far as I can tell. That's what got me worried.
I have currently given up on trying to figure out why the `llindex` is so large. The radii that are used to decide when to stop increasing `llindex` in `CalculateSphericalOverdensity` are all written to the output, and based on that I would expect the maximum particle radius to be about 4 times smaller than it appears to be. However, that maximum radius itself was already quite a lot larger than I thought it would be (`R_vir` is quite large), so maybe there is no problem and the SO list is supposed to be this large. I think for our purpose it might be better to just trim the SO list based on a desired radius rather than outputting everything; a small sketch of what I mean is below. I am giving that a try now.
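Since the tagged particles are already sorted by distance from the SO centre, trimming the output at a chosen radius would be a cheap post-processing step. Here is a sketch of the idea (made-up names and values, not existing VR code):

```c++
// Sketch (made-up names, not existing VR code): trim a distance-sorted
// SO particle list at a desired physical radius.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    // Radii of the tagged particles, sorted ascending (as after the distance sort).
    std::vector<double> radii = {0.05, 0.12, 0.30, 0.75, 1.40, 2.90, 6.10};
    const double rmax = 1.0;  // e.g. some chosen multiple of an SO radius

    // First particle beyond rmax; everything before it is kept.
    const auto cut = std::upper_bound(radii.begin(), radii.end(), rmax);
    const std::size_t nkept = static_cast<std::size_t>(cut - radii.begin());

    std::printf("keeping %zu of %zu particles within r <= %g\n",
                nkept, radii.size(), rmax);
}
```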
It is possible that the change from natural to base 10 logarithms changed the radii, but I have not been able to test this, since my snapshots don't work with these old versions.
@EvgeniiChaikin could you give the details of your very hungry run here as well? That may help identify what is going on.
These much-too-extended SO lists seem to cause trouble when analysing big runs, as we need >6x the memory of the run itself to run VR, but only in the SO-list construction bit.
> If the particle list is too large, then I would think that is either because of the `SearchBallPosTagged()` or the `CalculateSphericalOverdensity()` call

I was reading the previous comments and this immediately jumped out at me. Given what we found in #73, this issue might well be a different symptom of the same problem.
> @EvgeniiChaikin could you give the details of your very hungry run here as well? That may help identify what is going on. These much-too-extended SO lists seem to cause trouble when analysing big runs, as we need >6x the memory of the run itself to run VR, but only in the SO-list construction bit.
Actually, this seems to be true for every run and every box size.
For debugging purposes, a good example would be `/cosma7/data/dp004/dc-chai1/HAWK/01l_reference_L006N0094` (it runs in 1-2 minutes).

The size of the z=0.0 snapshot of this run (`colibre_0023.hdf5`) is 402 MiB.
However, when I run Velociraptor on it, in the log file I get
```
[ 0.045] [ info] main.cxx:187 There are 1661154 particles in total that require 278.819 [MiB]
[ 0.045] [ info] main.cxx:189 There are 830570 baryon particles in total that require 139.408 [MiB]
[ 8.886] [debug] io.cxx:128 Memory report at io.cxx:128@void ReadData(Options &, std::vector<NBody::Particle, std::allocator<NBody::Particle>> &, long long, NBody::Particle *&, long long): Average: 2.790 [GiB] Data: 2.691 [GiB] Dirty: 0 [B] Library: 0 [B] Peak: 2.839 [GiB] Resident: 2.689 [GiB] Shared: 7.457 [MiB] Size: 2.790 [GiB] Text: 3.090 [MiB]
[ 8.886] [ info] main.cxx:276 1661154 particles loaded in 8.853 [s]
[ 9.038] [debug] search.cxx:81 Memory report at search.cxx:81@long long *SearchFullSet(Options &, long long, std::vector<NBody::Particle, std::allocator<NBody::Particle>> &, long long &): Average: 2.790 [GiB] Data: 2.691 [GiB] Dirty: 0 [B] Library: 0 [B] Peak: 2.839 [GiB] Resident: 2.689 [GiB] Shared: 7.629 [MiB] Size: 2.790 [GiB] Text: 3.090 [MiB]
[ 27.615] [debug] search.cxx:3469 Memory report at search.cxx:3469@long long *SearchBaryons(Options &, long long &, NBody::Particle *&, long long, std::vector<NBody::Particle, std::allocator<NBody::Particle>> &, long long *&, long long &, long long &, int, int, PropData *): Average: 2.848 [GiB] Data: 2.778 [GiB] Dirty: 0 [B] Library: 0 [B] Peak: 2.906 [GiB] Resident: 2.777 [GiB] Shared: 8.348 [MiB] Size: 2.877 [GiB] Text: 3.090 [MiB]
[ 60.021] [ info] main.cxx:27 Memory report at main.cxx:27@void finish_vr(Options &): Average: 2.859 [GiB] Data: 2.801 [GiB] Dirty: 0 [B] Library: 0 [B] Peak: 3.041 [GiB] Resident: 2.799 [GiB] Shared: 9.383 [MiB] Size: 2.901 [GiB] Text: 3.090 [MiB]
```
where the memory usage at the peak is 3.041 GiB, i.e. Velociraptor uses about 7.5 times more memory than the size of the snapshot.
How to reproduce:

```
/cosma7/data/dp004/dc-chai1/vrconfig_3dfof_subhalos_SO_hydro_final3.cfg
cmake -DVR_USE_GAS=ON -DVR_USE_STAR=ON -DVR_USE_BH=ON
```
Also, note that we already need much more memory while the data are being loaded by VR.
@EvgeniiChaikin I had a look at the example you pointed out, and while I see the big memory usage, I don't think it's related to the problem discussed here.
Here's a look at the memory usage of VR using your input data/conf (didn't run it until the end, but the input data has been read already and the initial KDTrees are being created):
Highlighted is the 278 MB of memory used by the Particle objects themselves, which agrees with what you see in the logs below. However, there are two pieces of memory bigger than that, of 1.02 GB and 358 MB respectively, that correspond to the `std::map`s used to keep track of the extra data of each particle (the first is the maps themselves, the second is the strings for the keys in the maps). Note that this has nothing to do with SO lists being calculated or not afterwards -- the problem we're trying to track in this issue would show up as a spike in memory usage that would come after what is shown here.
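As a back-of-the-envelope illustration (the sizes and field count below are assumptions, not measurements from VR), a per-particle `std::map<std::string, float>` with a handful of entries easily dwarfs the particle data itself, because every entry is a separately allocated tree node plus, potentially, a heap-allocated key string:

```c++
// Back-of-the-envelope sketch (assumed sizes, not measured from VR): rough cost
// of storing extra per-particle data in a std::map<std::string, float>.
#include <cstdio>
#include <map>
#include <string>
#include <utility>

int main() {
    const std::size_t nparticles = 1661154; // from the log above
    const std::size_t nfields    = 6;       // assumed number of extra fields

    // Rough per-entry cost of a node-based map: tree-node bookkeeping
    // (parent/left/right pointers + colour) plus the key/value pair itself,
    // plus a heap allocation for the key text when it outgrows the SSO buffer.
    const std::size_t node_overhead = 3 * sizeof(void*) + sizeof(int);
    const std::size_t payload       = sizeof(std::pair<const std::string, float>);
    const std::size_t key_heap      = 32; // assumed allocation for the key string

    const std::size_t per_particle = nfields * (node_overhead + payload + key_heap);
    std::printf("~%zu B of map data per particle, ~%.2f GiB for %zu particles\n",
                per_particle,
                per_particle * nparticles / (1024.0 * 1024.0 * 1024.0),
                nparticles);
}
```

The exact numbers depend on the standard library, but the order of magnitude is consistent with the ~1.4 GB of map-related memory seen in the profile.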
This is consistent with the extracts of the logs you uploaded: you reach peak memory usage almost as soon as you finish reading the input data.
Don't get me wrong, it would be very nice if memory usage were lower than it is at the moment, but I just wanted to clarify that the particular example you are pointing out, even though it shows high memory usage compared to the initial file size (which might also be compressed?), doesn't suffer from the problem described in this ticket.
I'll open a new issue to keep track of this other problem though; it would be good to improve the data structures used to keep track of the extra information attached to each Particle, but that's a fairly invasive change that would require a non-trivial amount of work.
@rtobar thank you very much for your very clear and detailed answer!
I didn't know where that big extra-memory chunk could come from and my initial guess was the SO list. After reading your answer, I now understand the memory usage in VR much better!
Describe the bug

The SO lists produced by the current master appear to be about a factor 10 larger than those produced by older versions for the same snapshot. I have a particular example where the number of particles in the SO list is almost 8e8, for a simulation that has less than 1e8 particles. Unsurprisingly, the majority of the particles are reported to be part of multiple SOs (on average 9, with an extreme case of a particle being part of 53 SOs).

I manually checked the distances of the reported SO particles to the SO centre and found that the maximum distance is a factor 10-15 larger than the maximum SO radius (in this case `R_100_rhocrit`). I wonder whether this has an impact on the reported SO properties, since I am unable to reproduce the various mass values using the particles from the SO list within the various SO radii (they are typically off by a factor 3). A schematic version of this check is shown below.

By comparing legacy snapshots processed with different versions, I managed to constrain the problem to the following diff: https://github.com/ICRAR/VELOCIraptor-STF/compare/ICRAR:64348794522c16f96ef4890ee94a08615fbd06c4...ICRAR:a43325cec3108d40f61214bab0cab50069dcb258. The oldest version produces sensible SO lists for which we were able to manually confirm that the SO properties are correct, while the newest one produces the same results as current master, which might or might not be correct, but definitely have excessively large SO lists. This is based on the commit hashes reported in the `.configuration` file; I was never actually able to run any of these versions myself due to various errors.

I tried this both with and without MPI enabled (all single node, single thread runs) and the results look similar (not exactly the same). The MPI-enabled output file is 3 times larger, but I guess this is due to compression.
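For reference, the manual check mentioned above was conceptually along the following lines; this is only a schematic with made-up positions and names (the real check reads the SO list, the halo centre and `R_100_rhocrit` from the output catalogues):

```c++
// Schematic of the manual sanity check (made-up data, not the real analysis):
// compare the largest member distance in an SO list against the SO radius.
// Periodic wrapping is ignored here for brevity.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

struct Vec3 { double x, y, z; };

double distance(const Vec3& a, const Vec3& b) {
    const double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx*dx + dy*dy + dz*dz);
}

int main() {
    const Vec3 centre = {25.0, 25.0, 25.0};  // SO centre from the catalogue
    const double so_radius = 0.8;            // e.g. R_100_rhocrit
    // Positions of the particles listed as members of this SO.
    const std::vector<Vec3> members = {
        {25.1, 25.0, 24.9}, {25.5, 24.6, 25.2}, {33.0, 25.0, 25.0}};

    double rmax = 0.0;
    std::size_t noutside = 0;
    for (const auto& p : members) {
        const double r = distance(p, centre);
        rmax = std::max(rmax, r);
        if (r > so_radius) ++noutside;
    }
    std::printf("max distance / SO radius = %.2f; %zu of %zu members lie outside\n",
                rmax / so_radius, noutside, members.size());
}
```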