TravelMapping / DataProcessing

Data Processing Scripts and Programs for Travel Mapping Project

recording per-traveler clinched segments #151

Open yakra opened 5 years ago

yakra commented 5 years ago

https://github.com/TravelMapping/DataProcessing/blob/b8af79317d9ccdb21896b04da34c0a5e12b09658/siteupdate/python-teresco/siteupdate.py#L1002-L1004 Leaving a reminder to continue to pay attention to this as I pore through the code.

Working from a somewhat outdated version of the HighwayData repo here locally, with 4281202 traveler-segments. Just storing the memory addresses of the objects in a container, at 8 bytes a pop, is > 32 MB. If we were able to store this once instead of twice, that'd be dandy.

From a RAM use perspective considering the overhead of the containers themselves, it makes more sense to store the segments inside a TravelerList than to store the travelers inside a HighwaySegment. The former case would require ~231 containers for ~231 travelers; the latter ~723010 containers for ~723010 segments. That's a difference of 722779 containers. With a C++ list taking up 24 bytes on my system (I don't know how much space the containers use in Python), that's > 16 MB.

Another RAM advantage comes into play if we want to benefit from multithreading in C++. Processing multiple TravelerLists concurrently, we need to put a mutex (a Lock, in Pythonese) around HighwaySegment::add_clinched_by. For any significant time savings, we need to have a unique mutex for each HighwaySegment object. Writing only to TravelerLists, we don't need these. 723010 total segments × 40 bytes per mutex is > 27 MB savings. Getting rid of the mutexes also means one thread doesn't have to wait for another to write to a given object, which means time savings. Small* time savings in practice, but more than zero.
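To make the per-segment mutex cost concrete, here's a minimal C++ sketch of the locking described above. The class shapes are simplified assumptions for illustration, not the actual siteupdate code:

```cpp
#include <mutex>
#include <unordered_set>

// Hypothetical, stripped-down shapes -- not the real siteupdate classes.
struct TravelerList {};

struct HighwaySegment {
    std::unordered_set<TravelerList*> clinched_by;
    std::mutex mtx;  // one per segment; ~40 bytes each on a typical Linux x86-64 build

    void add_clinched_by(TravelerList* t) {
        // Serialize writes from concurrent .list-processing threads.
        std::lock_guard<std::mutex> lock(mtx);
        clinched_by.insert(t);
    }
};
```

Storing clinched segments only inside each TravelerList would make these per-segment locks unnecessary, since each TravelerList is written by exactly one thread.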

On the downside is CPU time. Thinking out loud... When a traveler-segment is matched during .list processing, we must check to see if it's already stored as clinched for that traveler.

- Storing segments inside a TravelerList means iterating through longer lists: starting with a list of 0 elements & ending with a list of, on average, 18533 segments = iterating through a list of 9267 elements on average.
- Storing travelers inside a HighwaySegment means iterating through shorter lists, max 128 elements.

How much time difference would this work out to in practice? Don't know. The old RAM vs. CPU trade-off...

I'll have to see what's practicable, and what's needed where in the code...

A couple arguments for using the more RAM-intensive method:

- Concurrency augmentation already makes use of it, IIRC.
- More convenient for answering questions like "What segments have I clinched that nobody else has?"
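As a sketch of that last point: with both containers available, "segments clinched by me alone" is a simple filter. These are hypothetical simplified types, and `unique_clinches` is an illustrative helper, not anything in the real code:

```cpp
#include <unordered_set>
#include <vector>

struct HighwaySegment;

struct TravelerList {
    std::unordered_set<HighwaySegment*> clinched_segments;
};

struct HighwaySegment {
    std::unordered_set<TravelerList*> clinched_by;
};

// Segments clinched by `me` and nobody else.
std::vector<HighwaySegment*> unique_clinches(const TravelerList& me) {
    std::vector<HighwaySegment*> result;
    for (HighwaySegment* hs : me.clinched_segments)
        if (hs->clinched_by.size() == 1)  // the only traveler must be `me`
            result.push_back(hs);
    return result;
}
```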

In any case, if we're able to use one method instead of both, that can save some RAM.

jteresco commented 5 years ago

Is the goal to have this added to some log files? If not, I think what you are suggesting is much more easily and efficiently achieved with SQL queries to the DB. The information is all there in the "clinched" table; it's just a matter of formulating the right queries. For example, "how many segments are marked as clinched by terescoj?" would be answered by

select COUNT(*) from clinched where traveler='terescoj';

which gives me my number, 59777.

Counting the travelers who have clinched a given segment:

select COUNT(DISTINCT traveler) from clinched where segmentid='32232';

gives

+--------------------------+
| COUNT(DISTINCT traveler) |
+--------------------------+
|                        5 |
+--------------------------+

or if you want a list of the travelers who have clinched a segment:

select DISTINCT traveler from clinched where segmentid='32232';

which gives

+----------+
| traveler |
+----------+
| julmac   |
| mapmikey |
| oscar    |
| w9tam    |
| opspe    |
+----------+

I've been meaning for a long time now to experiment with queries that would allow us to have the kinds of things like CHM had on the "top stats" page. I'm extremely grateful that you've taken the time to learn Python and the ins and outs of siteupdate.py. It would be great if we could find someone who's much better at SQL than I am (or is willing and able to learn), who would, first, probably point and laugh at my DB design, but then fix it up and be able to formulate the queries to provide all kinds of new ways to view our data.

jteresco commented 5 years ago

Also, beware of assuming much benefit from threads in Python. While noreaster has 20 physical cores and 40 hyperthreaded cores visible to the OS, Python's Global Interpreter Lock greatly limits its ability to take advantage of them with threads. Threads can be used (and are, in siteupdate) to overlap computation with I/O, but when looking for pure computational speedup, it's nearly useless.

yakra commented 5 years ago

Sorry, meant to post this over on yakra/DataProcessing instead. Here, the link to the code snippet worked. ;(

This post was mostly just a long-winded "note to self" -- the goal is to remember to pay attention to how the clinched segment data are recorded in siteupdate. If we could get away with storing them only once, that would be a Good Thing in terms of CPU & RAM.

The question "What segments have I clinched that nobody else has?" isn't even one I'm too interested in answering (it would be easy with the info, code & knowledge I already have; I just haven't tried to), but just something I thought of in a brief tangent of daydreaming. I realized it could be more easily answered by recording the data that one particular way, so I jotted it down.

Also, beware of assuming much benefit from threads in Python. While noreaster has 20 physical cores and 40 hyperthreaded cores visible to the OS, Python's Global Interpreter Lock greatly limits its ability to take advantage of them with threads. Threads can be used (and are, in siteupdate) to overlap computation with I/O, but when looking for pure computational speedup, it's nearly useless.

Indeed. I remember making some early attempts at improving multithreading, not getting the desired results, and having a basic understanding why. Then you & @Thing342 mentioned the GIL, I read up on that, and had a better understanding.

Hence my mentions of C++ above. ;)

I've been rewriting siteupdate in C++. It's been fun, and a good opportunity to learn new things in the process. I've just finished off .list file processing (hence a post on the topic). Single-threaded, this task took about 21 s (siteupdate.py, ~43 s on my machine IIRC). Multithreading got it down to 6.5 s. It took a bit to remember to get my mutexes in place to avoid concurrent writes to the same data structures. While troubleshooting this, I asked "Why not store segments per traveler, instead of travelers per segment?", only to later find your comment and realize we've already done that -- I'd blindly translated your self.clinched_segments.add(hs) to clinched_segments.insert(hs); and forgotten about it. :)

jteresco commented 5 years ago

Ah, C++! That would be excellent, and very multithreaded. I'd be very happy to see what all those cores could do with real thread support.

yakra commented 5 years ago

When we get to graph generation, that should be fun. C++ being faster out of the box, plus 4 cores on my machine to play around with, should make for some good speed improvements.

There's still much work to be done, though. ...and maybe more work than I'd thought/hoped?

Right now, I'm investigating froggie's report of my pioneers.log not working, and praying that the segments he's been on are just concurrency augments, not explicitly .listed. I'm adding some code to siteupdate.py to produce pioneers.log at the same point in the process, and then I'll see what I get for DIFFs...

So yeah, that explains all my spamming GitHub with issues. I've indeed been going thru the code with a fine-tooth comb.

jteresco commented 5 years ago

Whether we end up with a functional C++ replacement or not, this detailed review of the existing code is very helpful.

I have come close a couple times since the first complete Python version to rewriting everything in Java or even C, but when I get close enough to consider the size of the task, I back out and work on something else.

michihdeu commented 5 years ago

What's so bad with Python?

yakra commented 5 years ago

Very slow.

michihdeu commented 5 years ago

Sure, I can read. But why is it slow? Is it a general disadvantage of Python compared to C/JS or is it just the wrong tool for this purpose?

yakra commented 5 years ago

General disadvantage of Python compared to C (and Java?). Mostly due to Python being an interpreted language. ...Perhaps also due to its type system?

michihdeu commented 5 years ago

yep, compiled code is executed faster

jteresco commented 5 years ago

Yes, the type system can be problematic for a project of this size. Easy to have bugs linger that would be picked up by a compiler that makes sure you use variables consistent with their declarations. Even so, I think Python works well enough for us that I wouldn't put the time in to rewriting at this point. But I in no way discourage @yakra from doing so. Speeding up and bulletproofing the site update process would help make automatic updates less resource-intensive and less likely to fail and leave the site in a bad state.

michihdeu commented 5 years ago

Thanks, got it. I'm only used to working with a very strict language and wasn't really aware of this point. Having a compiled system was always a selling point when our customers compared us to a competitor with an interpreted system. Pros in runtime, but cons in engineering (time).

yakra commented 3 years ago

Taking another look at this old issue with an eye toward answering the question the OP takes on -- Do we need both containers? First, a few comments...

Working from a somewhat outdated version of the HighwayData repo here locally, with 4281202 traveler-segments. Just storing the memory addresses of the objects in a container, at 8 bytes a pop, is > 32 MB. If we were able to store this once instead of twice, that'd be dandy.

Today: HighwayData @ 733b942daffc17571ef5ee13b9bc0be29ab24b20, UserData @ dfb4fbd3069b521e1ceec9b4c59ec3a93d6456ea. 6772802 traveler-segments is > 51 MB, not counting the overhead of the containers themselves.

From a RAM use perspective considering the overhead of the containers themselves, it makes more sense to store the segments inside a TravelerList than to store the travelers inside a HighwaySegment.

RAM footprint notwithstanding, it makes sense operationally to store the travelers inside a HighwaySegment. More details below.

Another RAM advantage comes into play if we want to benefit from multithreading in C++. ... For any significant time savings, we need to have a unique mutex for each HighwaySegment object. Writing only to TravelerLists, we don't need these.

If keeping HighwaySegment::clinched_by, TravelerList still needs a mutex either way, for when we're processing traveler list files.

On the downside is CPU time. Thinking out loud...

No longer relevant. Sets, not lists; hashing, not iterating. HighwaySegment.clinched_by was converted to a set in Spring 2019 during traveled graph implementation. TravelerList.clinched_segments has been a set since 2015.
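For illustration, the set-based duplicate check in C++ terms: unordered_set::insert both tests and records membership in one average-O(1) step, so nothing is ever scanned linearly. (A sketch with a hypothetical helper, not the actual code.)

```cpp
#include <unordered_set>

// Returns true if this segment ID was not already recorded as clinched.
// With a hash set, the membership test and the insertion are one operation;
// with the old lists, each check was a linear scan.
bool record_clinch(std::unordered_set<int>& clinched, int segment_id) {
    return clinched.insert(segment_id).second;  // .second: was it newly inserted?
}
```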


Looking at the different structures...

What gets used where when why how

HighwaySegment::clinched_by should be kept.

TravelerList::clinched_segments OTOH?

Could be done now

Summary

yakra commented 2 years ago

Concurrency augments & crediting active/preview mileage per-traveler can be refactored into HighwaySegment::add_clinched_by, and done right when a clinched segment is added. No need to keep a container of TravelerList->HighwaySegment pointers around to iterate through later.
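A rough C++ sketch of that refactoring, with simplified hypothetical shapes (the real mileage and concurrency logic is more involved): crediting happens inside add_clinched_by itself, and the set-insert result guards the recursion into concurrent segments so each traveler is credited at most once per segment.

```cpp
#include <unordered_set>
#include <vector>

struct TravelerList {
    double traveled_mileage = 0.0;  // simplified stand-in for per-traveler stats
};

struct HighwaySegment {
    double length = 0.0;
    std::vector<HighwaySegment*> concurrent;  // segments concurrent with this one
    std::unordered_set<TravelerList*> clinched_by;

    void add_clinched_by(TravelerList* t) {
        if (!clinched_by.insert(t).second)
            return;                      // already credited; also stops recursion
        t->traveled_mileage += length;   // credit the traveler immediately
        for (HighwaySegment* c : concurrent)
            c->add_clinched_by(t);       // concurrency augments, done right away
    }
};
```

Because the insert result short-circuits repeat visits, no separate pass over a TravelerList->HighwaySegment container is needed later.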

The only difference should be concurrencies.log listing more augments, because travelers will be instantly credited for concurrencies, before the chance to read the next .list line.

All this thread stuff refers to C++. This is a little more easily done there, with most of the heavy lifting of refactoring already done, via splitting up the "Computing stats" process into 2 separate functions.

yakra commented 2 years ago

The only difference should be concurrencies.log listing more augments, because travelers will be instantly credited for concurrencies, before the chance to read the next .list line.

Nope. Logs:

yakra.log:
4c4
< Processed 967 good lines marking 13055 segments traveled.
---
> Processed 967 good lines marking 14325 segments traveled.

Both numbers are correct, just for different reasons:

- Before: only those segments explicitly .listed
- After: includes detected concurrencies

I see 3 ways around this:

jteresco commented 1 year ago

Still relevant?

yakra commented 1 year ago

A big improvement staring me right in the face I've somehow overlooked: We know whether a given segment/traveler combo is unique based on the result from HighwaySegment::add_clinched_by. Thus TravelerList::clinched_segments can become a vector that we only add to if add_clinched_by returns True. Surprisingly big time savings:
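A sketch of what that looks like in C++ (simplified types; in the real code add_clinched_by does more, and `mark` is a hypothetical name for the .list-processing step): unordered_set::insert already reports uniqueness, so the traveler side can be a cheap append-only vector.

```cpp
#include <unordered_set>
#include <vector>

struct TravelerList;

struct HighwaySegment {
    std::unordered_set<TravelerList*> clinched_by;

    // Returns true iff this traveler/segment combo is new.
    bool add_clinched_by(TravelerList* t) {
        return clinched_by.insert(t).second;
    }
};

struct TravelerList {
    std::vector<HighwaySegment*> clinched_segments;  // vector, not a second hash set

    void mark(HighwaySegment* hs) {
        if (hs->add_clinched_by(this))        // uniqueness already determined here
            clinched_segments.push_back(hs);  // cheap append, no second hash lookup
    }
};
```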

My first thought was to let this wait a bit and keep a cleaner commit history, as some commits for the region.php rankings bug conflict with this. But why not preserve my sanity and get it in sooner. It can serve as the new basis of comparison; it may affect what alternative I eventually select for that.

yakra commented 1 year ago

See also #580