yakra opened this issue 5 years ago
Is the goal to have this added to some log files? If not, I think what you are suggesting is much more easily and efficiently achieved with SQL queries to the DB. The information is all there in the "clinched" table; it's just a matter of formulating the right queries. For example, "how many segments are marked as clinched by terescoj?" would be answered by
select COUNT(*) from clinched where traveler='terescoj';
which gives me my number, 59777.
Counting the travelers who have clinched a given segment:
select COUNT(DISTINCT traveler) from clinched where segmentid='32232';
gives
+--------------------------+
| COUNT(DISTINCT traveler) |
+--------------------------+
|                        5 |
+--------------------------+
or if you want a list of the travelers of a segment:
select DISTINCT traveler from clinched where segmentid='32232';
which gives
+----------+
| traveler |
+----------+
| julmac   |
| mapmikey |
| oscar    |
| w9tam    |
| opspe    |
+----------+
I've been meaning for a long time now to experiment with queries that would allow us to have the kinds of things CHM had on its "top stats" page. I'm extremely grateful that you've taken the time to learn Python and the ins and outs of siteupdate.py. It would be great if we could find someone who's much better at SQL than I am, or who is willing and able to learn, who will first probably point and laugh at my DB design, but then fix it up and be able to formulate the queries to provide all kinds of new ways to view our data.
Also, beware of assuming much benefit from threads in Python. While noreaster has 20 physical cores and 40 hyperthreaded cores visible to the OS, Python's Global Interpreter Lock greatly limits its ability to take advantage of them with threads. Threads can be used (and are, in siteupdate) to overlap computation with I/O, but when looking for pure computational speedup, it's nearly useless.
Sorry, meant to post this over on yakra/DataProcessing instead. Here, the link to the code snippet worked. ;(
This post was mostly just a long-winded "note to self" -- the goal is to remember to pay attention to how the clinched segment data are recorded in siteupdate. If we could get away with storing them only once, that would be a Good Thing in terms of CPU & RAM.
The question "What segments have I clinched that nobody else has?" isn't even one I'm too interested in answering (it would be easy with the info, code & knowledge I already have; I just haven't tried to), but just something I thought of in a brief tangent of daydreaming. I realized it could be more easily answered by recording the data that one particular way, so I jotted it down.
Also, beware of assuming much benefit from threads in Python. While noreaster has 20 physical cores and 40 hyperthreaded cores visible to the OS, Python's Global Interpreter Lock greatly limits its ability to take advantage of them with threads. Threads can be used (and are, in siteupdate) to overlap computation with I/O, but when looking for pure computational speedup, it's nearly useless.
Indeed. I remember making some early attempts at improving multithreading, not getting the desired results, and having a basic understanding why. Then you & @Thing342 mentioned the GIL, I read up on that, and had a better understanding.
Hence my mentions of C++ above. ;)
I've been rewriting siteupdate in C++. It's been fun, and a good opportunity to learn new things in the process. I've just finished off .list file processing (hence a post on the topic). Single-threaded, this task took about 21 s (siteupdate.py, ~43 s on my machine IIRC). Multithreading got it down to 6.5 s.
It took me a while to remember to get my mutexes in place to avoid concurrent writes to the same data structures. While troubleshooting this, I asked "Why not store segments per traveler, instead of travelers per segment?", only to later find your comment and realize we've already done that -- I'd blindly translated your self.clinched_segments.add(hs) to clinched_segments.insert(hs); and forgotten about it. :)
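For flavor, a minimal sketch of the pattern described here (not the actual siteupdate code; the filenames, queue, and process_list_file are made up for illustration): worker threads pull whole .list files off a shared queue, and the queue itself is what gets a mutex. Shared per-segment data would need its own locking, as discussed further down.

```cpp
// Sketch only: one worker per hardware thread, each consuming .list files
// from a shared queue. The queue is the only shared state shown here.
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

std::queue<std::string> list_files;   // .list files still to be processed
std::mutex queue_mtx;                 // guards list_files

void process_list_file(const std::string& filename)
{
    // Stand-in for the real work: parse lines, match waypoints/segments,
    // record what the traveler has clinched.
    std::cout << "processing " << filename << '\n';
}

void worker()
{
    while (true)
    {
        std::string filename;
        {
            std::lock_guard<std::mutex> lock(queue_mtx);
            if (list_files.empty()) return;
            filename = list_files.front();
            list_files.pop();
        }
        process_list_file(filename);
    }
}

int main()
{
    for (const char* f : {"yakra.list", "terescoj.list", "michih.list"})
        list_files.push(f);
    unsigned nthreads = std::thread::hardware_concurrency();
    if (!nthreads) nthreads = 4;
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < nthreads; t++)
        threads.emplace_back(worker);
    for (auto& t : threads)
        t.join();
}
```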
Ah, C++! That would be excellent, and very multithreaded. I'd be very happy to see what all those cores could do with real thread support.
When we get to graph generation, that should be fun. C++ being faster out of the box, plus 4 cores on my machine to play around with, should make for some good speed improvements.
There's still much work to be done, though. ...and maybe more work than I'd thought/hoped?
Right now, I'm investigating froggie's report of my pioneers.log not working, and praying that the segments he's been on are just concurrency augments, not explicitly .listed. I'm adding some code to siteupdate.py to produce pioneers.log at the same point in the process, and then I'll see what I get for DIFFs...
So yeah, that explains all my spamming GitHub with issues. I've indeed been going thru the code with a fine-tooth comb.
Whether we end up with a functional C++ replacement or not, this detailed review of the existing code is very helpful.
I have come close a couple times since the first complete Python version to rewriting everything in Java or even C, but when I get close enough to consider the size of the task, I back out and work on something else.
What's so bad about Python?
Very slow.
Sure, I can read. But why is it slow? Is it a general disadvantage of Python compared to C/JS, or is it just the wrong tool for this purpose?
General disadvantage of Python compared to C (and Java?). Mostly due to Python being an interpreted language. ...Perhaps also due to its type system?
yep, compiled code is executed faster
Yes, the type system can be problematic for a project of this size. Easy to have bugs linger that would be picked up by a compiler that makes sure you use variables consistent with their declarations. Even so, I think Python works well enough for us that I wouldn't put the time in to rewriting at this point. But I in no way discourage @yakra from doing so. Speeding up and bulletproofing the site update process would help make automatic updates less resource-intensive and less likely to fail and leave the site in a bad state.
Thanks, got it. I'm only used to working with a very strict language and wasn't really aware of this point. Having a compiled system was always a selling point when our customers compared us to our competitor with an interpreted system. Pros in runtime, but cons in engineering (time).
Taking another look at this old issue with an eye toward answering the question the OP takes on -- Do we need both containers? First, a few comments...
Working from a somewhat outdated version of the HighwayData repo here locally, with 4281202 traveler-segments. Just storing the memory addresses of the objects in a container, at 8 bytes a pop, is > 32 MB. If we were able to store this once instead of twice, that'd be dandy.
Today:
HighwayData @ 733b942daffc17571ef5ee13b9bc0be29ab24b20
UserData @ dfb4fbd3069b521e1ceec9b4c59ec3a93d6456ea
6772802 traveler-segments is > 51 MB, not counting the overhead of the containers themselves.
From a RAM use perspective, considering the overhead of the containers themselves, it makes more sense to store the segments inside a TravelerList than to store the travelers inside a HighwaySegment.
RAM footprint notwithstanding, it makes sense operationally to store the travelers inside a HighwaySegment. More details below.
Another RAM advantage comes into play if we want to benefit from multithreading in C++. ... For any significant time savings, we need to have a unique mutex for each HighwaySegment object. Writing only to TravelerLists, we don't need these.
If keeping HighwaySegment::clinched_by, TravelerList still needs a mutex either way, for when we're Processing traveler list files.
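A minimal sketch of what keeping HighwaySegment::clinched_by under multithreading implies, using the clinched_by / add_clinched_by names from this thread; the member layout and the little main() are illustrative, not the real classes.

```cpp
// Per-segment mutex sketch. sizeof(std::mutex) is typically 40 B with glibc
// on x86-64, which is where the per-segment mutex RAM cost comes from.
#include <mutex>
#include <unordered_set>

class TravelerList {};   // details irrelevant for this sketch

class HighwaySegment {
  public:
    std::unordered_set<TravelerList*> clinched_by;
    std::mutex clinched_by_mtx;   // one unique mutex per segment object

    // Returns whether the traveler was newly recorded, so callers can tell
    // a new segment/traveler combo from a repeat.
    bool add_clinched_by(TravelerList* traveler) {
        std::lock_guard<std::mutex> lock(clinched_by_mtx);
        return clinched_by.insert(traveler).second;
    }
};

int main() {
    HighwaySegment seg;
    TravelerList yakra, terescoj;
    bool a = seg.add_clinched_by(&yakra);      // true: new combo
    bool b = seg.add_clinched_by(&yakra);      // false: already recorded
    bool c = seg.add_clinched_by(&terescoj);   // true
    return !(a && !b && c);
}
// Writing only to TravelerList::clinched_segments instead, each container
// would be touched by exactly one thread, and these per-segment mutexes
// (and their RAM) could go away.
```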
On the downside is CPU time. Thinking out loud...
No longer relevant. Sets, not lists; hashing, not iterating. HighwaySegment.clinched_by was converted to a set in Spring 2019 during traveled graph implementation. TravelerList.clinched_segments has been a set since 2015.
What gets used where, when, why, how
HighwaySegment::clinched_by should be kept.
TravelerList::clinched_segments OTOH?
Could it be done away with now? Yes. Concurrency augments & crediting active/preview mileage per traveler can be refactored into HighwaySegment::add_clinched_by, and done right when a clinched segment is added. No need to keep a container of TravelerList->HighwaySegment pointers around to iterate through later.
The only difference should be concurrencies.log listing more augments, because travelers will be instantly credited for concurrencies, before the chance to read the next .list line.
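A rough sketch of that refactor, assuming (as above) that add_clinched_by reports whether the combo was new; length, concurrent, active_preview_mileage, and credit are hypothetical stand-ins for whatever the real members are, and double-counting of concurrent mileage is ignored for brevity.

```cpp
// Sketch: mileage credit and concurrency augments handled inside
// add_clinched_by, at the moment a clinched segment is recorded, instead of
// in a later pass over TravelerList::clinched_segments.
#include <mutex>
#include <unordered_set>
#include <vector>

class TravelerList {
  public:
    double active_preview_mileage = 0;   // hypothetical running total
    void credit(double miles) { active_preview_mileage += miles; }
    // (any locking of per-traveler totals is omitted here)
};

class HighwaySegment {
  public:
    double length = 1.0;                                // hypothetical, in miles
    std::vector<HighwaySegment*>* concurrent = nullptr; // concurrent segments, if any
    std::unordered_set<TravelerList*> clinched_by;
    std::mutex mtx;

    bool add_clinched_by(TravelerList* t) {
        bool newly_added;
        {
            std::lock_guard<std::mutex> lock(mtx);
            newly_added = clinched_by.insert(t).second;
        }
        if (!newly_added) return false;
        t->credit(length);               // credit the traveler immediately
        if (concurrent)                  // augment concurrencies immediately too,
            for (HighwaySegment* other : *concurrent)   // not in a later pass
                if (other != this)
                    other->add_clinched_by(t);   // repeats return false, so this terminates
        return true;
    }
};

int main() {
    TravelerList yakra;
    HighwaySegment a, b;
    std::vector<HighwaySegment*> conc = {&a, &b};   // a and b are concurrent
    a.concurrent = b.concurrent = &conc;
    a.add_clinched_by(&yakra);   // credits a, then augments & credits b
    return yakra.active_preview_mileage == 2.0 ? 0 : 1;
}
```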
All this thread stuff refers to C++. This is a little more easily done there, with most of the heavy lifting of refactoring already done, via splitting up the "Computing stats" process into 2 separate functions.
The only difference should be concurrencies.log listing more augments, because travelers will be instantly credited for concurrencies, before the chance to read the next .list line.
Nope. Logs:
yakra.log:
4c4
< Processed 967 good lines marking 13055 segments traveled.
---
> Processed 967 good lines marking 14325 segments traveled.
Both numbers are correct, just for different reasons:
Before: only those segments explicitly .listed
After: includes detected concurrencies
I see 3 ways around this:
Still relevant?
A big improvement staring me right in the face that I've somehow overlooked:
We know whether a given segment/traveler combo is unique based on the result from HighwaySegment::add_clinched_by. Thus TravelerList::clinched_segments can become a vector that we only add to if add_clinched_by returns true.
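A sketch of what that looks like, again assuming add_clinched_by reports whether the combo was new (mutexes left out here to keep the point visible); mark_clinched is a hypothetical wrapper, not a real function.

```cpp
#include <unordered_set>
#include <vector>

class TravelerList;

class HighwaySegment {
  public:
    std::unordered_set<TravelerList*> clinched_by;
    // single-threaded for brevity; the mutex sketches above still apply
    bool add_clinched_by(TravelerList* t) { return clinched_by.insert(t).second; }
};

class TravelerList {
  public:
    std::vector<HighwaySegment*> clinched_segments;   // was std::unordered_set

    void mark_clinched(HighwaySegment* hs) {
        if (hs->add_clinched_by(this))        // true only for a new combo...
            clinched_segments.push_back(hs);  // ...so push_back can never duplicate
    }
};

int main() {
    HighwaySegment hs;
    TravelerList traveler;
    traveler.mark_clinched(&hs);
    traveler.mark_clinched(&hs);   // repeat is a no-op
    return traveler.clinched_segments.size() == 1 ? 0 : 1;
}
```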
Surprisingly big time savings, despite the extra insertions into clinched_segments.* At least iterating through a vector will be less expensive than iterating through an unordered_set, though.†
* C++ only. Python doesn't have the extra insertions, because I never changed the way Computing stats iterates, because single-threaded.
† A per-TravelerList savings of 16 B on FreeBSD, 32 B on Ubuntu 16.04 (DK about Ubuntu 18.04 or CentOS), plus whatever the "hidden" costs of a std::unordered_set allocated on the heap are.
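A quick way to check those numbers on any given system; the 16 B vs. 32 B difference presumably comes down to libc++ (FreeBSD) vs. libstdc++ (Ubuntu), and this just prints whatever the local standard library gives.

```cpp
#include <cstdio>
#include <unordered_set>
#include <vector>

struct HighwaySegment;   // only the pointer type matters for sizeof

int main() {
    auto set_size = sizeof(std::unordered_set<HighwaySegment*>);
    auto vec_size = sizeof(std::vector<HighwaySegment*>);
    std::printf("unordered_set: %zu B\n", set_size);
    std::printf("vector:        %zu B\n", vec_size);
    std::printf("savings:       %zu B per TravelerList\n", set_size - vec_size);
}
```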
My first thought was to let this wait a bit and keep a cleaner commit history, as some commits for the region.php rankings bug conflict with this. But why not preserve my sanity and get it in sooner. It can serve as the new basis of comparison; it may affect what alternative I eventually select for that.
See also #580
https://github.com/TravelMapping/DataProcessing/blob/b8af79317d9ccdb21896b04da34c0a5e12b09658/siteupdate/python-teresco/siteupdate.py#L1002-L1004
Leaving a reminder to continue to pay attention to this as I pore through the code.
Working from a somewhat outdated version of the HighwayData repo here locally, with 4281202 traveler-segments. Just storing the memory addresses of the objects in a container, at 8 bytes a pop, is > 32 MB. If we were able to store this once instead of twice, that'd be dandy.
From a RAM use perspective, considering the overhead of the containers themselves, it makes more sense to store the segments inside a TravelerList than to store the travelers inside a HighwaySegment. The former case would require ~231 containers for ~231 travelers; the latter, ~723010 containers for ~723010 segments. That's a difference of 722779 containers. With a C++ list taking up 24 bytes on my system (I don't know how much space the containers use in Python), that's > 16 MB.
Another RAM advantage comes into play if we want to benefit from multithreading in C++. Processing multiple TravelerLists concurrently, we need to put a mutex (a Lock, in Pythonese) around HighwaySegment::add_clinched_by. For any significant time savings, we need to have a unique mutex for each HighwaySegment object. Writing only to TravelerLists, we don't need these. 723010 total segments × 40 bytes per mutex = > 27 MB savings. Getting rid of the mutex means one process doesn't have to wait for another to write to a given object, which means time savings. Small* time savings in practice, but more than zero.
On the downside is CPU time. Thinking out loud... When a traveler-segment is matched during .list processing, we must check to see if it's already stored as clinched for that traveler.
• Storing segments inside a TravelerList means iterating through longer lists. Starting with a list of 0 elements & ending with a list of, on average, 18533 segments = iterating through a list of 9267 elements on average.
• Storing travelers inside a HighwaySegment means iterating through shorter lists, max 128 elements.
How much time difference would this work out to in practice? Don't know. The old RAM vs. CPU trade-off...
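To make the two layouts concrete, here they are stripped down to just the containers in question (types simplified; as noted elsewhere in the thread, what the Python code actually used changed over the years).

```cpp
#include <unordered_set>

class HighwaySegment;

class TravelerList {
  public:
    // ~231 of these, one per traveler: few containers, but averaging
    // ~18533 entries each by the end of .list processing
    std::unordered_set<HighwaySegment*> clinched_segments;
};

class HighwaySegment {
  public:
    // ~723010 of these, one per segment: many containers, but each holds
    // at most ~128 travelers
    std::unordered_set<TravelerList*> clinched_by;
};

int main() {
    // Every clinched traveler-segment pair currently lands in both
    // containers; the question here is whether one of them could go.
    TravelerList traveler;
    HighwaySegment segment;
    (void)traveler;
    (void)segment;
}
```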
I'll have to see what's practicable, and what's needed where in the code...
A couple arguments for using the more RAM-intensive method:
• Concurrency augmentation already makes use of it, IIRC.
• More convenient for answering questions like "What segments have I clinched that nobody else has?"
In any case, if we're able to use one method instead of both, that can save some RAM.