The edge labels are samples, not calls. See Edge Information. The slowdown is only 2x on rzansel but 1000x on sierra. Why is it so much bigger? And why are we slow on rzansel and sierra but not on other systems like rzgenie and jade?

We discovered the issue when regression tests on sierra that normally take 1 minute were killed by ats after running for 60 minutes. Which tests "time out" differs every night; e.g., on Monday 10 tests timed out and on Tuesday 8 tests timed out, and only 1 test timed out on both days. For about 80% of the "timeout" tests, the last message printed to stdout says Writing Restart or Writing Graphics.
Well, I may be misinterpreting the numbers on the edges. If those are samples and not actual calls, then I apologize; I interpreted them as actual calls.

That said, I am able to produce cases on my macOS system where strncat is taking ~30% of the total time. And HDF5-1.12 is about 2-3x slower than HDF5-1.8.14 for the same operations. So, there is definitely some issue that I am sure I can improve.
I have investigated this issue with a modified `dir.c` test, a snippet of which is captured below...
```c
int main(int argc, char *argv[])
{
    int     meshtypes[3], nmesh;
    char   *meshnames[3], original_dir[128];
    DBfile *dbfile, *dbfile2, *dbfile3, *dbfile4, *dbfile5;
    char   *filename = "dir.pdb";
    char   *filename2 = "dir2.pdb";
    int     i, driver = DB_PDB, driver2 = DB_PDB;
    int     show_all_errors = FALSE;
    char   *objname = 0;
    int     ndirs = 0;
    int     ntocs = 0;

    /* parse command-line options */
    for (i = 1; i < argc; i++) {
        if (!strncmp(argv[i], "DB_PDB", 6)) {
            driver = StringToDriver(argv[i]);
            driver2 = driver;
            filename = "dir.pdb";
            filename2 = "dir2.pdb";
        } else if (!strncmp(argv[i], "DB_HDF5", 7)) {
            driver = StringToDriver(argv[i]);
            driver2 = driver;
            filename = "dir.h5";
            filename2 = "dir2.h5";
        } else if (!strcmp(argv[i], "mix-driver-copy")) {
            driver2 = driver == DB_HDF5 ? DB_PDB : DB_HDF5;
            filename2 = driver == DB_HDF5 ? "dir2.pdb" : "dir2.h5";
        } else if (!strcmp(argv[i], "show-all-errors")) {
            show_all_errors = 1;
        } else if (!strncmp(argv[i], "ndirs=", 6)) {
            ndirs = (int) strtol(argv[i]+6, 0, 10);
        } else if (!strncmp(argv[i], "ntocs=", 6)) {
            ntocs = (int) strtol(argv[i]+6, 0, 10);
    .
    .
    .
    /* create ndirs sibling directories */
    for (i = 0; i < ndirs; i++)
    {
        char dirname[64];
        sprintf(dirname, "domain_%08d", i);
        DBMkDir(dbfile, dirname);
    }
    .
    .
    .
    /* make some hellaciously long directory names which are almost the same */
    DBClose(dbfile);
    dbfile = DBOpen(filename, driver, DB_APPEND);
    if (ntocs)
    {
char tmp[] = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";
        DBSetDir(dbfile, "/");
        DBMkDir(dbfile, "testtoc");
        DBSetDir(dbfile, "/testtoc");
        for (i = 0; i < 26; i++)
        {
            tmp[235] = (char) (97+i);   /* names differ only in this one character */
            DBMkDir(dbfile, tmp);
        }
        /* alternate cwd between "/" and "/testtoc" so each DBGetToc() call
           rebuilds the table of contents instead of reusing Silo's cached TOC */
        for (i = 0; i < ntocs/2; i++)
        {
            DBtoc *dbtoc;
            DBSetDir(dbfile, "/");
            dbtoc = DBGetToc(dbfile);
            DBSetDir(dbfile, "/testtoc");
            dbtoc = DBGetToc(dbfile);
        }
    }
```
I have then tested Silo 4.11 and 4.10.3 with HDF5-1.8 and HDF5-1.14 on my macOS laptop and on a Lassen login node, in my home dir on Lassen, in /p/gpfs1, and in /p/vast, and with various of the HDF5 VFDs Silo supports. I do not find any cases where the toc loop above is 1000x slower. I don't even see cases where performance is more than 2x slower.

In my tests, I was doing 10,000 `DBGetToc()` calls and calling `DBSetDir()` to switch the current working directory between each call to ensure Silo's TOC caching was not coming into play.
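As a concrete illustration of that measurement pattern, here is a minimal timing sketch, assuming a `dir.h5` file already populated by the snippet above; the `elapsed_tocs` helper, the use of `clock()`, and the hard-coded file name are illustrative and not part of the actual test:

```c
#include <stdio.h>
#include <time.h>
#include <silo.h>

/* Time ntocs DBGetToc() calls, switching the current working directory
   between calls so Silo cannot hand back a cached TOC. */
static double elapsed_tocs(DBfile *dbfile, int ntocs)
{
    clock_t t0 = clock();
    for (int i = 0; i < ntocs/2; i++)
    {
        DBSetDir(dbfile, "/");
        (void) DBGetToc(dbfile);
        DBSetDir(dbfile, "/testtoc");
        (void) DBGetToc(dbfile);
    }
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    /* assumes dir.h5 was already created and populated by the dir.c test above */
    DBfile *dbfile = DBOpen("dir.h5", DB_HDF5, DB_READ);
    if (!dbfile) return 1;
    printf("10000 DBGetToc() calls: %.3f s\n", elapsed_tocs(dbfile, 10000));
    DBClose(dbfile);
    return 0;
}
```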
I tried creating directories which contain a slew of long-named objects that differ only in the last few characters. This did have some effect on performance (as would be expected) but again, nothing approaching 1000x slower or even 10x slower. More like 2-3x slower (on Lassen).
I did discover that better args to `H5Literate()`, which does the traversal of named entries in an HDF5 group, can actually improve performance by an order of magnitude. However, it does mean that the objects are iterated in a different order than in previous versions of Silo. On the other hand, this is all used to populate a `DBtoc*` which gets returned to the caller, so I don't think the caller cares about the order of traversal the HDF5 library uses to do that.
I did compare the stack dumps provided, and there is a difference in the `load_toc()` function implementation in the two cases captured. The `load_toc()` function in the Silo driver is a callback used by `H5Literate()` to visit all the entries in an HDF5 group (dir). The newer Silo includes a couple of additional calls to the HDF5 library to interrogate the entries found and decide whether they are symlinks or not. These are the calls to `H5Gget_objinfo()` and `H5Lget_info()`.
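To make that concrete, here is a small standalone sketch (assumed; it is not Silo's actual `load_toc()`) written against the HDF5 1.8-era API names. It shows where the index-type and iteration-order arguments to `H5Literate()` appear, and that the `H5L_info_t` the library already hands to the callback carries the link type, so a soft link can be recognized without an extra per-entry `H5Lget_info()`/`H5Gget_objinfo()` round trip, which is one place added cost can creep in. The file and group names are placeholders:

```c
#include <stdio.h>
#include <hdf5.h>

/* Per-link callback: classify each entry using the H5L_info_t HDF5 passes in. */
static herr_t classify_link(hid_t group, const char *name,
                            const H5L_info_t *info, void *op_data)
{
    (void)group;
    if (info->type == H5L_TYPE_SOFT)
        printf("  %s -> soft link\n", name);
    else
        printf("  %s -> hard link or other (type %d)\n", name, (int)info->type);
    (*(int *)op_data)++;
    return 0;                        /* 0 means keep iterating */
}

int main(void)
{
    hid_t file = H5Fopen("dir.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t grp  = H5Gopen2(file, "/testtoc", H5P_DEFAULT);
    int nlinks = 0;

    /* H5_INDEX_NAME + H5_ITER_NATIVE lets HDF5 walk the links in whatever
       order is cheapest; that order can differ from older Silo versions,
       but the DBtoc handed back to the caller is not ordered anyway. */
    H5Literate(grp, H5_INDEX_NAME, H5_ITER_NATIVE, NULL, classify_link, &nlinks);
    printf("%d links visited\n", nlinks);

    H5Gclose(grp);
    H5Fclose(file);
    return 0;
}
```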
Here is some performance data I gathered while in `~` on Lassen (clang compilers, using blueos3 installs of HDF5-1.8.16)...
case | timing (from home dir on Lassen) |
---|---|
Silo-4.10.3, ndirs=100000, PDB | 1.676u 0.598s 0:02.30 98.2% 0+0k 19840+15488io 2pf+0w |
Silo-4.10.3, ndirs=100000, HDF5 | 1.985u 1.122s 0:03.27 94.8% 0+0k 4352+225024io 2pf+0w |
Silo-4.11.0, ndirs=100000, PDB | 2.500u 0.550s 0:03.08 99.0% 0+0k 7552+15488io 0pf+0w |
Silo-4.11.0, ndirs=100000, HDF5 | 2.148u 1.044s 0:03.36 94.6% 0+0k 0+225024io 0pf+0w |
Silo-4.10.3, ntocs=10000, PDB | 1.094u 0.118s 0:01.22 98.3% 0+0k 0+896io 0pf+0w |
Silo-4.10.3, ntocs=10000, HDF5 | 1.329u 0.000s 0:01.33 99.2% 0+0k 0+1280io 0pf+0w |
Silo-4.11.0, ntocs=10000, PDB | 2.078u 0.179s 0:02.26 99.1% 0+0k 0+896io 0pf+0w |
Silo-4.11.0, ntocs=10000, HDF5 | 3.415u 1.557s 0:04.97 99.7% 0+0k 0+1152io 0pf+0w |
case | timing (from /p/gpfs1 on Lassen) |
---|---|
Silo-4.10.3, ndirs=100000, PDB | 1.682u 1.261s 0:02.95 99.6% 0+0k 640+0io 1pf+0w |
Silo-4.10.3, ndirs=100000, HDF5 | 2.182u 2.521s 0:04.76 98.7% 0+0k 0+0io 0pf+0w |
Silo-4.11.0, ndirs=100000, PDB | 2.550u 1.235s 0:03.79 99.7% 0+0k 0+0io 0pf+0w |
Silo-4.11.0, ndirs=100000, HDF5 | 2.238u 2.667s 0:04.98 98.1% 0+0k 0+0io 0pf+0w |
Silo-4.10.3, ntocs=10000, PDB | 1.108u 0.229s 0:01.34 98.5% 0+0k 0+0io 0pf+0w |
Silo-4.10.3, ntocs=10000, HDF5 | 1.436u 0.000s 0:01.44 99.3% 0+0k 0+0io 0pf+0w |
Silo-4.11.0, ntocs=10000, PDB | 2.160u 0.280s 0:02.44 100.0% 0+0k 0+0io 0pf+0w |
Silo-4.11.0, ntocs=10000, HDF5 | 3.595u 3.645s 0:07.25 99.7% 0+0k 0+0io 0pf+0w |
I have not tested with HDF5-1.10 or HDF5-1.12, both of which have known performance issues in certain circumstances.
Next, I tested Silo 4.11.1-pre1 with both HDF5-1.8 and HDF5-1.14:
case | timing (from home dir on Lassen) |
---|---|
HDF5-1.8, ntocs=10000 DB_PDB | 1.825u 0.235s 0:02.08 98.5% 0+0k 11264+896io 3pf+0w |
HDF5-1.8, ntocs=10000 DB_HDF5 | 1.087u 0.057s 0:01.42 79.5% 0+0k 896+1280io 1pf+0w |
HDF5-1.8, ntocs=10000 DB_HDF5_SILO | 0.323u 0.007s 0:00.33 96.9% 0+0k 0+640io 0pf+0w |
HDF5-1.14, ntocs=10000 DB_PDB | 1.702u 0.151s 0:01.88 98.4% 0+0k 11648+896io 4pf+0w |
HDF5-1.14, ntocs=10000 DB_HDF5 | 0.680u 0.225s 0:01.16 77.5% 0+0k 3968+1152io 2pf+0w |
HDF5-1.14, ntocs=10000 DB_HDF5_SILO | 0.340u 0.026s 0:00.37 97.2% 0+0k 0+768io 0pf+0w |
OK, so I discovered an issue with the really good results in some of the tests above. I had adjusted the args to `H5Literate()` to try to use creation-order indexing, but without having specified group creation properties to track and build the creation-order index. The effect was that `DBGetToc()` went really, really fast because it was failing immediately and not doing any work. So, the results in the above tables for the new Silo, 4.11.1-pre1, are experimental error.
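For reference, here is a minimal sketch of what creation-order iteration actually requires on the writing side, using plain HDF5 calls rather than anything Silo currently does: the group must be created with a group creation property list that both tracks and indexes link creation order. Without that, `H5Literate(..., H5_INDEX_CRT_ORDER, ...)` fails immediately, which is exactly what produced the too-good-to-be-true timings above. File and group names are placeholders:

```c
#include <hdf5.h>

int main(void)
{
    hid_t file = H5Fcreate("crt_order.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* Ask HDF5 to both track and index link creation order in this group. */
    hid_t gcpl = H5Pcreate(H5P_GROUP_CREATE);
    H5Pset_link_creation_order(gcpl, H5P_CRT_ORDER_TRACKED | H5P_CRT_ORDER_INDEXED);

    hid_t grp = H5Gcreate2(file, "/testtoc", H5P_DEFAULT, gcpl, H5P_DEFAULT);

    /* ... create links/objects in grp; they can then be iterated with
       H5Literate(grp, H5_INDEX_CRT_ORDER, H5_ITER_INC, NULL, cb, data); ... */

    H5Gclose(grp);
    H5Pclose(gcpl);
    H5Fclose(file);
    return 0;
}
```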
I also used `gprof` on the modified `dir.c` to see what is reported as the most expensive operations with Silo-4.11.1-pre1 on both HDF5-1.8.22 and HDF5-1.14.0.
gprof profile of `./dir ntocs=10000 DB_HDF5` with HDF5-1.8.22:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
15.53 0.16 0.16 1971298 0.00 0.00 H5G__component
10.68 0.27 0.11 5550891 0.00 0.00 H5C_protect
4.85 0.32 0.05 2513483 0.00 0.00 H5O_assert
4.85 0.37 0.05 2160336 0.00 0.00 H5F_addr_decode_len
3.88 0.41 0.04 5550891 0.00 0.00 H5C_unprotect
3.88 0.45 0.04 3435469 0.00 0.00 H5SL_search
2.91 0.48 0.03 5550891 0.00 0.00 H5AC_unprotect
2.91 0.51 0.03 2512721 0.00 0.00 H5O_unprotect
2.91 0.54 0.03 980550 0.00 0.00 H5G_traverse_real
2.91 0.57 0.03 535277 0.00 0.00 H5SL_insert_common
2.91 0.60 0.03 385037 0.00 0.00 H5G_get_objinfo
1.94 0.62 0.02 2987595 0.00 0.00 H5G_name_free
1.94 0.64 0.02 1850210 0.00 0.00 H5E_clear_stack
1.94 0.66 0.02 965532 0.00 0.00 H5B_find
1.94 0.68 0.02 517782 0.00 0.00 H5SL_remove
1.94 0.70 0.02 200000 0.00 0.00 H5Lget_info
1.94 0.72 0.02 200000 0.00 0.00 load_toc
0.97 0.73 0.01 6613964 0.00 0.00 H5FL_blk_find_list
0.97 0.74 0.01 5880343 0.00 0.00 H5FL_reg_free
0.97 0.75 0.01 5550891 0.00 0.00 H5AC_protect
0.97 0.76 0.01 3306982 0.00 0.00 H5FL_blk_malloc
0.97 0.77 0.01 2243467 0.00 0.00 H5I__find_id
0.97 0.78 0.01 1920728 0.00 0.00 H5RS_own
0.97 0.79 0.01 1713008 0.00 0.00 H5I_object
0.97 0.80 0.01 1353837 0.00 0.00 H5FL_reg_calloc
0.97 0.81 0.01 1190706 0.00 0.00 H5WB_unwrap
0.97 0.82 0.01 1160203 0.00 0.00 H5G__ent_to_link
0.97 0.83 0.01 1010855 0.00 0.00 H5O_msg_read_oh
0.97 0.84 0.01 980815 0.00 0.00 H5O_msg_read
0.97 0.85 0.01 765553 0.00 0.00 H5G_oloc
0.97 0.86 0.01 735858 0.00 0.00 H5C_flush_single_entry
0.97 0.87 0.01 735080 0.00 0.00 H5F__accum_read
0.97 0.88 0.01 705108 0.00 0.00 H5FD_sec2_read
0.97 0.89 0.01 445050 0.00 0.00 H5G_node_decode_key
0.97 0.90 0.01 385037 0.00 0.00 H5G_get_objinfo_cb
0.97 0.91 0.01 375778 0.00 0.00 H5O_msg_free_real
0.97 0.92 0.01 206436 0.00 0.00 H5P_get
gprof profile of `./dir ntocs=10000 DB_HDF5` with HDF5-1.14.0:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
15.22 0.21 0.21 5753410 0.00 0.00 H5C_protect
10.14 0.35 0.14 5753410 0.00 0.00 H5C_unprotect
9.42 0.48 0.13 2723503 0.00 0.00 H5O__assert
9.42 0.61 0.13 1961154 0.00 0.00 H5G__component
2.90 0.65 0.04 5753410 0.00 0.00 H5AC_unprotect
2.90 0.69 0.04 965532 0.00 0.00 H5B_find
2.90 0.73 0.04 748370 0.00 0.00 H5C__flush_single_entry
2.90 0.77 0.04 747604 0.00 0.00 H5C__load_entry
2.17 0.80 0.03 1920740 0.00 0.00 H5RS__xstrdup
2.17 0.83 0.03 430066 0.00 0.00 H5G__node_decode_key
2.17 0.86 0.03 197533 0.00 0.00 H5O__chunk_deserialize
1.45 0.88 0.02 16752014 0.00 0.00 H5AC_tag
1.45 0.90 0.02 16752014 0.00 0.00 H5CX_set_tag
1.45 0.92 0.02 975832 0.00 0.00 H5HL_protect
1.45 0.94 0.02 975478 0.00 0.00 H5G__traverse_real
1.45 0.96 0.02 747964 0.00 0.00 H5C__tag_entry
1.45 0.98 0.02 385037 0.00 0.00 H5VL_group_optional
1.45 1.00 0.02 320326 0.00 0.00 H5C_set_slist_enabled
0.72 1.01 0.01 11507540 0.00 0.00 H5C_verify_tag
0.72 1.02 0.01 9394157 0.00 0.00 H5FL_reg_free
0.72 1.03 0.01 8667277 0.00 0.00 H5F_get_intent
0.72 1.04 0.01 6260439 0.00 0.00 H5FL_reg_calloc
0.72 1.05 0.01 5753410 0.00 0.00 H5AC_protect
0.72 1.06 0.01 5493576 0.00 0.00 H5HL_offset_into
0.72 1.07 0.01 5156773 0.00 0.00 H5MM_memcpy
0.72 1.08 0.01 4831234 0.00 0.00 H5_trace
0.72 1.09 0.01 1936090 0.00 0.00 H5G__node_cmp3
On Lassen, I am unable to reproduce the issues observed on sierra. I will close this issue for now but will arrange a time to investigate on sierra.
I got onto sierra and performed tests similar to those above, and I was not able to reproduce the `DBGetToc` performance issues that were originally reported.

I did expand my test to create many directories with very long (>200 character), almost identical names (differing only in the last few characters), and although there was some performance degradation for this case (as I would expect), there was nothing even approaching one order of magnitude, let alone three orders of magnitude.

I did notice that, for a similar test involving 10,000 directory names and `DBGetToc` calls, the PDB driver does about 5x better than HDF5-1.14.0 on sierra. Again, a problem to be investigated, but nothing worth holding up the release.
When we write a restart file in mercury on sierra, it looks like we are spending 61% of our time in DBGetToc and 48% of the time in strncat(), way down the call stack. This only shows up in the "new" Silo just released.

This Confluence page has 2 .pdf files comparing the profiles of the old vs. new Silo:
https://rzlc.llnl.gov/confluence/display/MER/Release+Issues