Mechanical-Advantage / AdvantageScope

Robot telemetry application
MIT License

Improve File Import performance #152

Open m10653 opened 6 months ago

m10653 commented 6 months ago

I noticed that when working with large files, AdvantageScope tends to take seemingly exponential time to import log (WPILOG) files.

This pull request is a rough, quick-and-dirty working draft aimed at speeding up the ingest process for AdvantageScope. I replaced the implementation of the LogField time and value arrays with a sorted set from js-sdsl. The PR has basic functionality working, but it still has some issues with data viewing, likely due to incorrect implementations of clearBeforeTime, getRange, toSerialized, and fromSerialized.
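
For context, a minimal sketch of that kind of change (not the PR's actual code): a LogField-like container keyed by timestamp on top of js-sdsl's OrderedMap. The `TimestampedField` name and the exact method calls shown are illustrative assumptions.

```typescript
import { OrderedMap } from "js-sdsl";

// Illustrative stand-in for a LogField's time/value storage, keyed by timestamp.
// A tree-backed OrderedMap keeps records sorted on insert, so out-of-order
// WPILOG records no longer force repeated scans of flat arrays.
class TimestampedField<T> {
  private data = new OrderedMap<number, T>();

  putValue(timestamp: number, value: T): void {
    this.data.setElement(timestamp, value); // O(log n) insert, overwrites a duplicate timestamp
  }

  // Walk records in timestamp order, e.g. to rebuild flat arrays for rendering.
  toArrays(): { timestamps: number[]; values: T[] } {
    const timestamps: number[] = [];
    const values: T[] = [];
    this.data.forEach(([timestamp, value]) => {
      timestamps.push(timestamp);
      values.push(value);
    });
    return { timestamps, values };
  }
}
```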

With this change I measured a 6x improvement for smaller WPILOG files (7 MB) and a 2x improvement for larger ones (40 MB).

I ran the importer through a CPU profiler, and it identified a few hot methods taking a large share of the computation time: putData, createBlankField, and parseURCLR2.

I tried to limit external API changes as much as possible, but I would love input on what changes you would be open to, along with any insight on where to look deeper, as it still seems to slow down quite a bit as files get larger. Maybe writing a parser in C++ would be an approach to getting much better performance?

m10653 commented 6 months ago

Updates: I moved away from red-black trees after finding it was much faster to load the data into memory, sort it, and then process it once sorted.
I discovered a few things. The optimization that removes duplicate repeating values had an issue where it assumed records were arriving sequentially, so a record out of sync would produce an incorrect array. It was also the main cause of the O(n^2) time complexity when importing log files. Doing all of this processing at the end lets the code assume that all keys are ordered, which speeds up the process a lot, as sketched below.
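
A rough sketch of that end-of-import pass, with a hypothetical `LogRecord` shape and function name rather than the PR's actual code:

```typescript
// Hypothetical raw record shape collected during the single read pass.
interface LogRecord {
  timestamp: number;
  value: unknown;
}

// Sort once at the end of the import, then remove repeated values.
// The old per-insert dedup assumed records arrived in timestamp order,
// so one out-of-order record could corrupt the arrays, and the repeated
// searching made the import roughly O(n^2).
function sortAndDeduplicate(records: LogRecord[]): LogRecord[] {
  records.sort((a, b) => a.timestamp - b.timestamp);
  const result: LogRecord[] = [];
  for (const record of records) {
    const last = result[result.length - 1];
    if (last === undefined || last.value !== record.value) {
      result.push(record);
    }
  }
  return result;
}
```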

The current commit improves log importing but no longer correctly updates the log progress bar, since that functionality lives in the workers rather than in the LogField or Log class. Having the LogField only hold valid values after a call to getSerilizable is a bit of a hack and might warrant some refactoring, but I wanted to limit the impact on other parts of the program for this POC.

This improved file loading times on a 70 MB file from 4 minutes to roughly 5-10 seconds.

Gold872 commented 6 months ago

I got bored and started playing around with this fork, and I really love the improved speed. My only complaint is that when LogField first calls sortAndProcess(), it can take a bit of time (like 5-10 seconds) while the progress bar remains at 100%. What I think could be useful is having some sort of message or progress bar saying "Processing Data".

I haven't been able to find any bugs yet (although the logs I was using don't have any weird anomalies), but I think this could be a big improvement to AdvantageScope.

m10653 commented 6 months ago

> I got bored and started playing around with this fork, and I really love the improved speed. My only complaint is that when LogField first calls sortAndProcess(), it can take a bit of time (like 5-10 seconds) while the progress bar remains at 100%. What I think could be useful is having some sort of message or progress bar saying "Processing Data".
>
> I haven't been able to find any bugs yet (although the logs I was using don't have any weird anomalies), but I think this could be a big improvement to AdvantageScope.

There are currently a few issues with this PR. I'm pretty sure it breaks streaming functionality, as pointed out by jwbonner. It also uses 2x the memory of the old code and causes out-of-memory errors when processing data over roughly 1 GB, whereas the previous code would allow that to run even if it took hours. The memory fix could just be to work around it by allocating more memory, which does work but is a bad solution.

A possible solution for the streaming issue is to split the log class in two, one for streaming and one for file reading, and use different optimizations for each. Overall the architecture is quite challenging to deal with.

Edit: the progress bar issue is known, and there is a not-too-hard way to fix it. The problem is that the current structure assumes the file is done after all the data is put in the log (i.e. kept correct the entire time), so things would need to be modified to allow progress callbacks to still be called. The progress can be estimated from the number of calls made to the comparator callback for Array.sort, which should be around n log n, as sketched below.
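
A sketch of that estimate, assuming a hypothetical `reportProgress` callback is available in the worker; the comparator-counting wrapper below is illustrative, not the PR's implementation:

```typescript
// Estimate sort progress from the number of comparator calls.
// Array.prototype.sort makes roughly n*log2(n) comparisons, so the ratio of
// calls made so far to that total gives a usable (if approximate) progress value.
function sortWithProgress<T>(
  items: T[],
  compare: (a: T, b: T) => number,
  reportProgress: (fraction: number) => void // hypothetical worker callback
): void {
  const expectedCalls = Math.max(1, items.length * Math.log2(Math.max(2, items.length)));
  let calls = 0;
  items.sort((a, b) => {
    calls++;
    if (calls % 10000 === 0) {
      reportProgress(Math.min(1, calls / expectedCalls));
    }
    return compare(a, b);
  });
  reportProgress(1);
}
```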

spacey-sooty commented 5 months ago

> I tried to limit external API changes as much as possible, but would love input on what changes you would be open to, along with any insight on where to look deeper, as it still seems to slow down quite a bit as files get larger. Maybe writing a parser in C++ would be an approach to getting much better performance?

Writing a parser in C++ seems like somewhat of a pain to integrate with an Electron app to me. If performance becomes enough of a bottleneck, I think a switch to something like Tauri, which lets you offload intensive tasks to a native backend, would be a good change while requiring fewer changes than writing a full native implementation.

m10653 commented 2 months ago

@jwbonner I am running into a bit of difficulty with the structure of how the log class is set up. I attempted to split it into two modes: one where the log can take advantage of not being kept sorted at all times and is sorted at the end, and an always-sorted mode for when logs are streaming.

This approach runs into problems with the custom schemas for PhotonVision and URCL. Some of these schemas attempt to read data from the log file, and those calls cannot be made on an unsorted log (at least not efficiently). Currently custom schemas are loaded on a per-record basis, and all of the generated fields (there can be many) produced by a single WPILOG record then have to be sorted. A more efficient approach would be to load in all the raw data (while retaining the type information) and then, after processing and sorting, parse the custom schemas. That way no sorting needs to be done on the resulting log fields, as they would already be sorted; see the sketch below.
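
A sketch of that ordering, with hypothetical names for the raw record type and the schema hook; the point is simply that custom-schema parsing runs only after one global sort:

```typescript
// Hypothetical raw record kept from the read pass, retaining the type info
// needed to decode it later.
interface RawRecord {
  key: string;
  timestamp: number;
  type: string;
  data: Uint8Array;
}

// Phase 1: read every record into memory without decoding custom schemas.
// Phase 2: sort once by timestamp.
// Phase 3: run the PhotonVision/URCL schema parsers over the sorted stream,
// so the fields they generate come out already ordered and need no extra sort.
function ingest(
  records: RawRecord[],
  parseCustomSchema: (record: RawRecord) => void // hypothetical schema hook
): void {
  records.sort((a, b) => a.timestamp - b.timestamp);
  for (const record of records) {
    parseCustomSchema(record);
  }
}
```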

I do have a working workaround that could be polished a bit (right now the only issue I see is the loading bar not correctly accounting for the parse time of custom schemas). I think it might be worth reworking the structure of how log files are ingested and read in the future, to better decouple the file parsing and log handling code, and likely to split up the streaming and file loading logic. It might also be worth trying to use a common interface to an intermediary file format such as MCAP, or a new revision of WPILOG, in the future.

jwbonner commented 2 months ago

Sorry that I haven't been keeping up with this very closely, but I really appreciate the work you've been doing.

I think the structure of separating live sorting from normal log loading makes sense. As log files get larger, I think it will ultimately be necessary to avoid storing the entire set of log data in memory. The issue is that this requires a log format that can be efficiently indexed, which would have to be an intermediate format as long as we're using the RIO (maintaining write efficiency is of utmost importance). Electron also makes file IO a bit tricky, though I wonder if IndexedDB could be an option? Regardless, I think that's a separate project from this PR.

In regard to the Photon/URCL schemas, maybe the parsers could be turned into separate objects with persistent state (which would need to be serialized along with the log). They wouldn't need to be able to read back from the log, but could instead just store data as required and generate records as they go (with the assumption that records fed into them aren't necessarily in timestamp order, which I think is reasonable since they need to work with NT anyway). Would that allow you to do a single pass of reading records from the file while processing schemas, then a sort at the end?
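
One way that could look, sketched with hypothetical names (not an existing AdvantageScope interface): the parser keeps its own state, accepts records in any order, and can be serialized alongside the log.

```typescript
// Hypothetical interface for a stateful schema parser (e.g. Photon or URCL).
// It never reads back from the log; it only receives raw records (possibly out
// of timestamp order, as with NT) and emits generated records as it goes.
interface StatefulSchemaParser {
  // Feed one raw record; returns any generated records, not necessarily sorted.
  processRecord(key: string, timestamp: number, data: Uint8Array): GeneratedRecord[];

  // Persist parser state so it can be restored along with a serialized log.
  toSerialized(): unknown;
  fromSerialized(state: unknown): void;
}

interface GeneratedRecord {
  key: string;
  timestamp: number;
  value: unknown;
}
```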

I believe we talked about it before, but I think it's likely that I'll transition the rest of the codebase to deal with duplicate values in the log data. I've been working on a redesign for most of the tabs to add more sophisticated options, and being able to explicitly show individual values, process them for stats, etc. is likely to be helpful as long as the performance is reasonable. I'm not 100% sure that it will work yet, but if so that should provide a bit of a performance improvement for data ingestion (memory is potentially still a concern).