TravelMapping / DataProcessing

Data Processing Scripts and Programs for Travel Mapping Project
4 stars 6 forks source link

C++ site update to production #375

Closed jteresco closed 1 year ago

jteresco commented 3 years ago

I'd like to work toward using the C++ version of the site update program. @yakra are you confident in the threaded version's readiness for production site updates?

I think that moving to the C++ program its inherent efficiency advantages over Python and the ability to do some meaningful multithreading, in conjunction with the addition of SSDs and/or using memory filesystem could really speed the whole process significantly.

jteresco commented 3 years ago

A first question on this: what C++ compiler and version do you use when you build it for Linux?

g++ 9.3.0 compiles and links, but I get a run time error on noreaster:

ld-elf.so.1: /usr/local/lib/gcc5/libstdc++.so.6: version GLIBCXX_3.4.22 required by /tm-tmpfs/travelmapping/DataProcessing/siteupdate/cplusplus/a.out not found

clang 10.0.1 doesn't compile it successfully, being unhappy with the use of subscript operators.

yakra commented 3 years ago

@yakra are you confident in the threaded version's readiness for production site updates?

Other than #281 (just cosmetic, an easy fix for a clean readable siteupdate.log), yes! I believe I have mutexes around everything else I need to, and even some things I don't. The last crashes & glitches were fixed in #333. Since then I've run siteupdate many thousands of times during speed tests with no other known issues except #327, which "is super obscure, and not too big a deal in the real world". It shouldn't hold the C++ version back from going live, and may become moot anyway once multi-threaded HighwayGraph setup enters the picture.

I think that moving to the C++ program its inherent efficiency advantages over Python and the ability to do some meaningful multithreading, in conjunction with the addition of SSDs and/or using memory filesystem could really speed the whole process significantly.

No need for super-fast storage subsystems. NMP_merged files take about 0.5s on lab2. Unsure what the exact bottleneck is here, but everything else is CPU/memory bound. Graph files are written to disk at less MBps than nmp_merged. Edit: Initially reading WPTs. I keep forgetting about that. It's fast if the data has been read recently & is cached, but can crawl otherwise.

A first question on this: what C++ compiler and version do you use when you build it for Linux?

g++ 9.3.0 compiles and links, but I get a run time error on noreaster:

ld-elf.so.1: /usr/local/lib/gcc5/libstdc++.so.6: version GLIBCXX_3.4.22 required by /tm-tmpfs/travelmapping/DataProcessing/siteupdate/cplusplus/a.out not found

Yup. Same message I got. In the immediate term, you can disable threading by commenting out https://github.com/TravelMapping/DataProcessing/blob/deead7ac837e9463f5f64d3008c0b9ff0c91fdf7/siteupdate/cplusplus/enable_threading.cpp#L1

What versions do I use?

box(es) ver
lab1, lab2 g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)
lab3 g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39)
BiggaTomato g++ (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
lab1.5, epoch g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0

clang 10.0.1 doesn't compile it successfully, being unhappy with the use of subscript operators.

I'm new to clang *checks wikipedia article*. Interesting. What does it choke on? I'm using C++11, so on my linux boxes I use g++ siteupdate.cpp -std=c++11 -pthread -o siteupdate On noreaster, we don't even need those switches; once I comment out enable_threading.cpp, just g++ siteupdate.cpp -o siteupdate does the trick.

jteresco commented 3 years ago

All right, good. We can think about moving in this direction.

As far as clang, it gives pages of errors, many of which look like this:

In file included from siteupdate.cpp:72:
./classes/DatacheckEntry.cpp:69:14: error: type 'std::array' (aka 'array, allocator >, 6>') does not provide a subscript operator
                if (fpentry[0] != route->root)  return 0;
                    ~~~~~~~^~

I don't think it really matters to me if we compile with clang, g++, or something else, so I wouldn't consider that a high priority.

Here's a new Issue about getting this compiled and running with pthreads support: #376

yakra commented 3 years ago

My working tree on noreaster has (for the moment) some changes to 4b9c7198025f9dcb3a6d3a14c9f816b78018530a to enable compilation with clang++. siteupdateST compiles fine with g++, and runs to completion. With clang++, siteupdate & siteupdateST crash when Sorting colocation lists. Somehow, there's an invalid pointer for Waypoint::colocated. Strangely, always at la.la1088 LA36 (30.433911,-89.91344), even when multi-threaded.

Sorting WaypointQuadtree at (30.41015625,-90) to (30.5859375,-89.6484375) contains 37 waypoints Points sorted. Sorting colocation lists... ... 3 la.la0036 LA1088 (30.433911,-89.91344) 0x912e2a0e0 ... 19 la.la1088 LA36 (30.433911,-89.91344) 0x12e2a0e0

Even though this memory is allocated on the heap, I'm seeing consistent values for all Waypoint::colocated locations in the quadtree node, (No problem, that helps!) What's really weird here is how the 2 addresses above are the same except for the missing high-order nibble.

Even if I comment out Sorting colocation lists, there's still a crash when Setting up for graphs of highway data. No surprise here, as we'd still encounter the invalid Waypoint::colocated pointer and attempt to dereference it.

This will take a pretty deep dive, one that I don't have the attention span for right yet. Two other projects have each yielded an unmanageably large number of experimental branches. I want to play arborist, prune extraneous branches, and merge changes back into master before proceeding.

yakra commented 3 years ago

The bug above is reproducible with HighwayData @ d5d320322c2e46a722299b6bd1c8a6e25526fcbd UserData @ 3491f1b4e5dbe11a4d470ed67257bb929442af22 & 618d6c4616543c3fdee9273ddc012264e46e5303

Starting to get the impression it's related to a specific point (or range rather) in memory.

Checking out an older HighwayData commit to cause all the waypoint objects (and thus that particular colocation list pointer) to be stored in different places seems to do the trick. The pointer stays valid and siteupdate runs to completion.

So what's the deal? Is there bad RAM? Why do we lose exactly 0x900000000, with two ones turning to zeros?

yakra commented 3 years ago

What memory locations?

I'll take a closer look at where exactly that colocation list pointer is itself stored...

0x912d86208 consistently with siteupdateST. 0x914848208 - 0x914898208 with 7 trials of siteupdate -t 1. 0x919287768 - 0x91a957428 with 8 trials of siteupdate -t 4.

Doing a git bisect from HighwayData @ 27a6254 which ran OK, the crash first appears...

siteupdate.log with some debug text:

usala...................................................................................................................................................................................
........................................................................................................................................................................................
........................................................................................................................................................................................
........................................................................................................................................................................................
..................................................................................................................................................................DEBUG:  this=la.la1088
 LA36 (30.433911,-89.91344)
DEBUG:  this->colocated=0x9000aee40
DEBUG: other->colocated=0x9000aee40
........................................................................................................................................................................................
.............................................................................................................................................................!
usanyp......................!
usapr...................................................................................................................................................................................
........................................................................................................................................................................................
........................................................................................................................................................................................
........................................................................................................................................................................................
........................................................................................................................................................................................
........................................................................................................................................................................................
..................!
usaush...................................................................................................................!
zafr..........................................................!
zmbt......!
zmbm..............!
zwer.........!
zwep............!
afrtah.............................................!
argrn......!
asiaho...........!
asimo.................................!
ausb.....!
cannss...................!
codn..............................!
fraarad03.!
frabfcd25..!
frabfcd39...!
frabfcd58.!
frabfcd70...!
fragesd51....!
frahdfd02.....!
mexsf....................................................................!
naca..............!
ngae.!
rusr....................!
[23.0] Sorting waypoints in Quadtree.
#0      *0x87d2e8cc8    la.i059 11 (30.46041,-89.697297)
         0x883b6a540    la.us011 I-59(11) (30.46041,-89.697297)
         0x883b6a540    la.i059 11 (30.46041,-89.697297)
#1      *0x87d2e8d68    la.i059 LA/MS [alt: ['+999']] (30.46262,-89.695)
         0x87cf03580    ms.us011 LA/MS (30.46262,-89.695)
         0x87cf03580    la.us011 LA/MS (30.46262,-89.695)
         0x87cf03580    ms.i059 LA/MS [alt: ['+0']] (30.46262,-89.695)
         0x87cf03580    la.i059 LA/MS [alt: ['+999']] (30.46262,-89.695)
#2      *0x8fe902948    la.la0036 AirSt (30.452215,-89.988221)
#3      *0x8fe9029e8    la.la0036 LA1088 (30.433911,-89.91344)
         0xaee40        la.la1088 LA36 (30.433911,-89.91344)
         0x9000aee40    la.la0036 LA1088 (30.433911,-89.91344)
#4      *0x8fe902b28    la.la0036 LA41 (30.416481,-89.784994)
         0x87d71cdc0    la.la0041 LA36 (30.416481,-89.784994)
         0x87d71cdc0    la.la0036 LA41 (30.416481,-89.784994)
#5      *0x8fe902a88    la.la0036 LA434 (30.427176,-89.885888)
         0x8ff582240    la.la0434 LA36 (30.427176,-89.885888)
         0x8ff582240    la.la0036 LA434 (30.427176,-89.885888)
#6      *0x8fe9725e8    la.la0041 FirRd (30.465284,-89.811687)
#7      *0x8fe9724a8    la.la0041 LA36 (30.416481,-89.784994)
         0x87d71cdc0    la.la0041 LA36 (30.416481,-89.784994)
         0x87d71cdc0    la.la0036 LA41 (30.416481,-89.784994)
#8      *0x8fe972728    la.la0041 LA435 (30.534172,-89.868121)
         0x8ff582340    la.la0435 LA41 (30.534172,-89.868121)
         0x8ff582340    la.la0041 LA435 (30.534172,-89.868121)
#9      *0x8fe9727c8    la.la0041 LA435Spr (30.542387,-89.875653)
         0x8ff582420    la.la0435sprtal LA41 (30.542387,-89.875653)
         0x8ff582420    la.la0041 LA435Spr (30.542387,-89.875653)
#10     *0x8fe972688    la.la0041 MaxMerRd (30.499715,-89.814252)
#11     *0x8fe972548    la.la0041 TeatBlaRd (30.438203,-89.787827)
#12     *0x8ff53e288    la.la0434 LA36 (30.427176,-89.885888)
         0x8ff582240    la.la0434 LA36 (30.427176,-89.885888)
         0x8ff582240    la.la0036 LA434 (30.427176,-89.885888)
#13     *0x8ff53e5a8    la.la0435 +X335304 (30.534154,-89.902582)
#14     *0x8ff53e508    la.la0435 HilBlvd (30.50565,-89.97365)
#15     *0x8ff53e6e8    la.la0435 LA41 (30.534172,-89.868121)
         0x8ff582340    la.la0435 LA41 (30.534172,-89.868121)
         0x8ff582340    la.la0041 LA435 (30.534172,-89.868121)
#16     *0x8ff53e648    la.la0435 LA435Spr (30.532564,-89.875481)
         0x8ff582360    la.la0435sprtal LA435 (30.532564,-89.875481)
         0x8ff582360    la.la0435 LA435Spr (30.532564,-89.875481)
#17     *0x8ff53e8c8    la.la0435sprtal LA41 (30.542387,-89.875653)
         0x8ff582420    la.la0435sprtal LA41 (30.542387,-89.875653)
         0x8ff582420    la.la0041 LA435Spr (30.542387,-89.875653)
#18     *0x8ff53e828    la.la0435sprtal LA435 (30.532564,-89.875481)
         0x8ff582360    la.la0435sprtal LA435 (30.532564,-89.875481)
         0x8ff582360    la.la0435 LA435Spr (30.532564,-89.875481)
#19     *0x900080048    la.la1088 LA36 (30.433911,-89.91344)

What are we looking at here?

When does it happen?

How early in the process?

In the siteupdate.log excerpt above, note how when detecting colocations while reading in WPTs for usala, both la.la1088 LA36 (30.433911,-89.91344) and la.la0036 LA1088 (30.433911,-89.91344) have the same value in their colocated pointer. Just to show I did it right and am not going crazy, here's the code the produces that notice:

diff --git a/siteupdate/cplusplus/classes/WaypointQuadtree/WaypointQuadtree.cpp b/siteupdate/cplusplus/classes/WaypointQuadtree/WaypointQuadtree.cpp
index 334524e..296c17d 100644
--- a/siteupdate/cplusplus/classes/WaypointQuadtree/WaypointQuadtree.cpp
+++ b/siteupdate/cplusplus/classes/WaypointQuadtree/WaypointQuadtree.cpp
@@ -58,6 +58,19 @@ void WaypointQuadtree::insert(Waypoint *w, bool init)
                                }
                                other_w->colocated->push_front(w);
                                w->colocated = other_w->colocated;
+
+                               // <<< DEBUG
+                               if (w->str() == "la.la1088 LA36 (30.433911,-89.91344)")
+                               {       std::cout << "DEBUG:  this=la.la1088 LA36 (30.433911,-89.91344)" << std::endl;
+                                       std::cout << "DEBUG:  this->colocated=" << (void*)w->colocated << std::endl;
+                                       std::cout << "DEBUG: other->colocated=" << (void*)other_w->colocated << std::endl;
+                               }
+                               if (other_w->str() == "la.la1088 LA36 (30.433911,-89.91344)")
+                               {       std::cout << "DEBUG: other=la.la1088 LA36 (30.433911,-89.91344)" << std::endl;
+                                       std::cout << "DEBUG:  this->colocated=" << (void*)w->colocated << std::endl;
+                                       std::cout << "DEBUG: other->colocated=" << (void*)other_w->colocated << std::endl;
+
+                               }// DEBUG >>>
                        }
                }
                if (!w->colocated || w == w->colocated->back())

What happens after this?

Route::read_wpt finishing brings us to...

siteupdate.cpp

      #ifdef threading_enabled
    // stuff that never happens in siteupdateST because of the #ifdef
      #else
    for (HighwaySystem* h : highway_systems)
    {   std::cout << h->systemname << std::flush;
        bool usa_flag = h->country->first == "USA";
        for (Route* r : h->route_list)
            r->read_wpt(&all_waypoints, &el, args.highwaydatapath+"/hwy_data", usa_flag, &all_wpt_files);
        std::cout << "!" << std::endl;
    }
      #endif

    // DEBUG //
    list<Waypoint*> schmist = all_waypoints.point_list();
    schmist.sort();
    ofstream schmile("schmile");
    for (Waypoint *w : schmist)
    {   schmile << &w->colocated << '\t';
        schmile << w->colocated << '\t';
        if (!w->colocated) schmile << '\t';
        schmile << w->str() << std::endl;
    }

    cout << et.et() << "Sorting waypoints in Quadtree." << endl;

Top: existing code in main, for context, as "Reading waypoints for all routes." finishes off. Bottom: new debug code that iterates through all Waypoint objects by memory address, listing for each:

What do we see here?

[yakra@noreaster ~/TravelMapping/DataProcessing/siteupdate/cplusplus]$ egrep 'la.la1088 LA36|la.la0036 LA1088'  schmile
0x8fe9029e8 0x9000aee40 la.la0036 LA1088 (30.433911,-89.91344)
0x900080048 0xaee40 la.la1088 LA36 (30.433911,-89.91344)

It'd seem that nothing else happens after WaypointQuadtree insertion before this sanity check, and that this one pointer's value magically changes. I'd almost have no rational explanation other than possibly bad RAM, but not all is necessarily as it seems. To Be Continued...

yakra commented 3 years ago

Updating the repos to newer commits... HD @ e21d08b17773ae508b858220d23c6c4b9d830165 UD @ c9da0fa30ee592cad2bd9ee7c323b27c057093af ...siteupdate & siteupdateST crash when reading ausvictd.csv. This file only dates to Sun Jan 10 18:05:50 2021 +0000 per git log, so it's unrelated to crashes happening with June's data....

This problem persists even when compiled with g++ on noreaster. Does it also happen on...


yakra commented 3 years ago

New issue #402 for the chopped routes CSV bug. It should be addressed before any further investigation into the colocated pointer bug.

yakra commented 3 years ago

403 closes #402.

The issue above remains. I'll put this back on the backburner while tying up those other 2 projects.

yakra commented 3 years ago

Home come the pointer is fine if located < 0x900000000, but gets mangled if located > 0x900000000? (Worth noting that the g++ version doesn't store Waypoints in this memory range -- more 0x84####### ballpark. Could this be why it doesn't crash?)

Narrowing down, with main returning 0 after sorting quadtree. Assuming addresses are lower due to fewer instructions in memory

address status HwyData
commit
tweaks
0x8ffefbc08 OK 60d774f distance_update disabled
0x8fffb83c8 OK 60d774f first 3 datachecks, invalid_ends, selfref, slashes, sharp_angle disabled
0x8fffd2508 OK 60d774f first 3 datachecks, invalid_ends thru slashes, sharp_angle disabled
0x8fffe6928 OK 60d774f first 3 datachecks, invalid_ends thru lacks_generic, sharp_angle disabled
0x8fffe7508 OK 60d774f annoying US70
0x8fffe7b48 OK 60d774f ↑ all but last 3 records of annoying US60
0x8fffe7f08 crash 60d774f ↑ all but last 2 records of annoying US60
0x8fffe8c08 crash 60d774f ↑ all of annoying US60
0x8ffff6968 crash 60d774f ↓ deleted ME166A
0x8ffff9be8 crash 60d774f first 3 datachecks, invalid_ends, selfref, slashes disabled
0x8ffff9d28 crash 60d774f first 3 datachecks, invalid_ends, selfref disabled
0x90004ba68 crash 60d774f first 3 datachecks, invalid_ends disabled
0x90004bb08 crash 60d774f first 3 datachecks disabled
0x900052048 crash 60d774f no changes to read_wpt.cpp
yakra commented 3 years ago

If it's something in the code, it's in WaypointQuadtree::insert...

...Or is it? In Waypoint.h, moving std::list<Waypoint*> *colocated; earlier in the class definition, the 2nd item after only Route *route;, makes the problem go away. Was it a failure of a pointer to fall right on an 8-byte boundary, causing wacky antics?

yakra commented 2 years ago

A full pass thru siteupdate with valgrind will be worthwhile. It reported there were some errors / things to look at, almost all of which I glossed over in the rush to solve #443. It'll be good to see what may be lurking in the shadows. And maybe my memory management (and whatever else) is a bit sloppy.

jteresco commented 2 years ago

"[your] memory management" -> "C++'s memory management"

yakra commented 2 years ago

Heh. My memory mgmt too, considering what I've done (and not done) knowing the language's shortcomings. Like, take all the

foo* bar = new foo;
       // deleted on termination of program

littered throughout the source. I explicitly delete very little of it. Waypoints, HighwaySegments, HighwayGraph components, TravelerLists? I wait till the program ends, and System Monitor says I magically get all those gigs of space back, and my RAM use is back where it was before I ran siteupdate. (Don't know if that's a compiler thing, or an OS thing, or how that works...)

Looks like Valgrind has something to say about all this. What I'm doing may not exactly be best practice. :)


Another minor surprise I didn't look into was something about a branch relying on uninitialized data. Edge construction or compression IIRC. A bit of a surprise; I'd think I'd notice some effects...

yakra commented 2 years ago

Another minor surprise I didn't look into was something about a branch relying on uninitialized data. Edge construction or compression IIRC. A bit of a surprise; I'd think I'd notice some effects...

It happens exactly once, for US19TrkPit. https://github.com/TravelMapping/DataProcessing/blob/4d988b326b2c2069a71b43177c4c1721859ac0dd/siteupdate/cplusplus/classes/GraphGeneration/HGEdge.cpp#L23-L35 format needs to be initialized before the destructor is called.

We get 3 errors when the destructor calls detach. The branch itself is no-harm-no-foul: we either do or don't search the endpoint's adjacency lists for this, and find nothing. Then one more error: half the time we'll neglect to delete the s_written array, and leak 1 byte per thread.

yakra commented 2 years ago

Then one more error: half the time we'll neglect to delete the s_written array, and leak 1 byte per thread.

Nope. We neglect to delete it all the time, even with the fix in place, leaking 1 byte per thread per edge. (As of HwyData@0a07f53 & UserData@412dfaf2, that's 932211 bytes per thread.)

The destructor calls detach, which zeros out format, thus if (format & simple) delete[] s_written; never happens.

Best to delete s_written when the edge is detached from the simple graph. Similarly deleting c_written and t_written upon detach avoids having them hanging around being useless until the edge is deleted. This eliminates the need for an explicitly defined HGedge destructor altogether. Cleaner code for the win.

Better yet? Separate simple/collapsed/traveled arrays waste RAM when only one bit of each element is needed. Instead, have one single written array & use a bitmask. Adaptation should be easy; most of the heavy lifting is already done.

qty format description
299120 1 simple only
26 2 collapsed only
0 3 simple + collapsed
These should never exist. Any vertices collapsed for the
traveled graph are also collapsed for the collapsed graph.
15 4 traveled only
38 5 simple + traveled
109974 6 collapsed + traveled
523038 7 simple + collapsed + traveled
Everything built by the first constructor starts out as one of these.
932211 Total all edges in all master graphs
jteresco commented 1 year ago

We're there!