abrignoni / iLEAPP

iOS Logs, Events, And Plist Parser
MIT License

Large Quantity Data Set Handling #607

Open JamesHabben opened 11 months ago

JamesHabben commented 11 months ago

Problem

Overall, there are modules that parse, or have the potential to parse, a large number of data records. Writing those records into HTML tags creates significant overhead in both file storage and processing, leading to bloated reports and reports that may fail to load on less powerful computers.

Data

I noticed that the Health - Heart Rate output from Josh's public image creates around 23.5k records and a 10 MB HTML file. It is also timing out on some of my computers due to memory. Health - Steps produces 15.5k records and a 4 MB HTML file.

Solution

I think we can address this with a relatively low-impact change by loading data from either a separate JSON file or possibly a SQLite DB file. JSON would be lower impact to the code base since it is native to JavaScript. I am exploring a structure where a module has the option to write its data to a JSON file and have the HTML file load it when rendered. It won't be 'true' JSON, since that typically repeats the field names in front of every value for every record. Instead, an array of data rows, each of which is itself just an array of data fields, will drastically reduce the size of the data set.

true json

[
  {
    "field1": "value1",
    "field2": ""
  }
]

array of array

[
  [ "value1", "" ]
]
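To see how much the two layouts differ in size, here is a quick measurement sketch (the field names and values are made up for illustration, not taken from an actual iLEAPP module):

```python
import json

# Hypothetical records resembling a heart-rate module's output.
fields = ["Timestamp", "Heart Rate (BPM)", "Device", "Manufacturer"]
records = [
    {"Timestamp": f"2023-01-01 00:{m:02d}", "Heart Rate (BPM)": 70 + m % 5,
     "Device": "Apple Watch", "Manufacturer": "Apple"}
    for m in range(60)
]

# "True" JSON repeats every field name in every record.
true_json = json.dumps(records)

# Array-of-arrays stores the field names once, then bare value rows.
compact = json.dumps({"columns": fields,
                      "rows": [[r[f] for f in fields] for r in records]})

# The compact form omits 60 copies of the field names, so it is smaller.
print(len(true_json), len(compact))
```

The savings grow with the record count, since the per-record overhead of the field names is constant.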

Tag: @Johann-PLW , @abrignoni

JamesHabben commented 11 months ago

Here is a really rough pass. It errors on some modules, but it processes enough to do some testing. https://github.com/JamesHabben/iLEAPP/tree/dynamicreport-dataarrays

The Health - Heart Rate module gets 23.5k records from Josh's public image. The previous HTML file was around 10-11 MB; using this branch, that file cuts down to around 6 MB. On the larger file, my browser was timing out with all the data. With this branch, I get 12 loops of the timer circle and then the data is all loaded. Once it loads, the pages are actually quite responsive when moving around, and sorting is quite quick. It's just the initial load that takes some time.

@Johann-PLW can you try this against your larger data set to see if it helps?

JamesHabben commented 11 months ago

I am less concerned about the size of the HTML file itself; this is more about the browser being able to load and work with the large amount of data. We could do some other things to reduce file size, but something like gzip compression might actually add overhead in the browser and make it worse. If this branch's code doesn't work to load all of your large data set, we may need to explore other approaches, such as breaking that data up into segments.
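As a rough sanity check on the gzip trade-off, the standard library makes it easy to measure the file-size side of the equation (illustrative data, not the actual report payload):

```python
import gzip
import json

# Illustrative payload: 20k rows of [timestamp, bpm] pairs.
rows = [[f"2023-01-01 00:00:{i % 60:02d}", 60 + i % 40] for i in range(20_000)]
raw = json.dumps(rows).encode("utf-8")
packed = gzip.compress(raw)

print(f"raw: {len(raw)} bytes, gzipped: {len(packed)} bytes")
# Compression shrinks the file on disk, but the browser still has to
# decompress and hold the full data set in memory, so the initial-load
# problem may not improve.
```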

Johann-PLW commented 11 months ago

> If this branch code doesn't work to load all of your large data set, we may need to explore other approaches of breaking that data up into segments.

That's something I've actually been thinking about: perhaps grouping a large amount of data by year, month, day, or hour. With the work you've already done on the HTML report, we could display less info on the screen and move the redundant details into a tooltip.
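A minimal sketch of that grouping idea, assuming flat (timestamp, value) records (the sample values here are invented):

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical flat records: (timestamp string, heart rate in BPM).
records = [
    ("2023-05-01 10:05:00", 72),
    ("2023-05-01 10:40:00", 95),
    ("2023-05-01 11:15:00", 64),
]

# Bucket the readings by hour; the bucket key could just as easily be
# a day, month, or year by changing the strftime format.
buckets = defaultdict(list)
for ts, bpm in records:
    hour = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").strftime("%Y-%m-%d %H:00")
    buckets[hour].append(bpm)

# Each bucket becomes one display row, with the details summarized.
for hour, values in sorted(buckets.items()):
    print(hour, min(values), max(values), sum(values) / len(values))
```

The report would then render one row per bucket instead of one row per raw reading.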

I'll do some tests on my dataset with the code from your 'dynamicreport-dataarrays' branch and let you know.

Johann-PLW commented 11 months ago

@JamesHabben Unfortunately, it doesn't work with my personal dataset (encrypted backup of an iPhone 13 mini with iOS 17.0.3).

The heart rate query matches 1,028,115 records. The previously generated HTML file was 240.9 MB; the new one (with your updated code) is 235.1 MB. The web browsers (Safari & Chrome) are still unresponsive after 10 minutes of trying to load the data.

The number-of-steps query matches 493,272 records. The previously generated HTML file was 66.7 MB; the new one is bigger at 81.5 MB. Both web browsers are unresponsive.

Tests were conducted on a MacBook Pro 2019 (2.4 GHz Intel Core i9, 8 cores, 32 GB RAM) with macOS 13.5.1. Web browsers were Safari 6.6 (18615.3.12.11.2) and Google Chrome 119.0.6045.159.

JamesHabben commented 11 months ago

Oof. Not sure why heart rate didn't reduce more, and frustrated at the steps increase. I can reduce some of that by using less text in the structure, but I don't think that will make much difference to the browser loading this data set. What do you think about sampling the data on the Python side? 1 million records is a lot of data and will be hard to process within a broader framework like this. I wonder if we can find a framework that can do some time-based sampling, averaging, and anomaly highlighting, and pass a reduced set of data to the browser.

JamesHabben commented 11 months ago

@Johann-PLW What's the time range and frequency of your heart rate data? If we summarized the data, say every 15 minutes, how much would that reduce the record count? We might have to adjust the period based on the frequency.

We can provide typical summary numbers like minimum, maximum, average, etc., and if the user wants to investigate in more detail, the TSV output is available.

While typing this, though, I wanted to do some math. I think hourly summary periods might really be the way to go.

Here are my calcs:

1.  Hourly Records:
•   1 record per hour
•   24 records per day
•   In 3 years: 26,280 records
•   In 5 years: 43,800 records
•   In 10 years: 87,600 records
2.  Half-hourly Records:
•   2 records per hour
•   48 records per day
•   In 3 years: 52,560 records
•   In 5 years: 87,600 records
•   In 10 years: 175,200 records
3.  Every 15 Minutes:
•   4 records per hour
•   96 records per day
•   In 3 years: 105,120 records
•   In 5 years: 175,200 records
•   In 10 years: 350,400 records
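The estimates above can be reproduced in a couple of lines (using 365-day years, ignoring leap days):

```python
# Record counts per summary period over multi-year ranges (365-day years).
for label, per_hour in [("hourly", 1), ("half-hourly", 2), ("every 15 min", 4)]:
    for years in (3, 5, 10):
        print(f"{label}, {years} years: {per_hour * 24 * 365 * years:,} records")
```

Even the 15-minute summary over 10 years stays well under the million-record range that is choking the browser.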

Johann-PLW commented 11 months ago

@JamesHabben I have records since April 2015, and the frequency depends on my activity:

I think we could also remove some columns, like Device and Manufacturer.

[image]

Since the device and/or software used to collect the data, and the timezone, are very repetitive, could we use an array to store the information once and reference it from all records?
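A minimal sketch of that deduplication idea, with hypothetical rows and field ordering: each distinct (device, manufacturer, timezone) combination is stored once in a lookup table, and the data rows carry only an index into it.

```python
import json

# Hypothetical rows where Device/Manufacturer/Timezone repeat on every record.
rows = [
    ["2023-05-01 10:05:00", 72, "Apple Watch", "Apple", "Europe/Paris"],
    ["2023-05-01 10:06:00", 74, "Apple Watch", "Apple", "Europe/Paris"],
]

# Store each distinct source tuple once; replace it with an index per row.
lookup, encoded = [], []
for ts, bpm, *source in rows:
    if source not in lookup:
        lookup.append(source)
    encoded.append([ts, bpm, lookup.index(source)])

payload = json.dumps({"sources": lookup, "rows": encoded})
print(payload)
```

The HTML report's JavaScript would then resolve the index back to the full source details, for example when rendering a tooltip.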