thehale commented 1 year ago

Feature Request

The currently used Attempt Library stores attempts in AsyncStorage, which has a total limit of 6MB. Solvers will quickly exceed this threshold, after as few as a few thousand solves (even fewer when also storing smartcube data). As such, we need a more spacious place to store the attempt data.

This issue proposes using react-native-fs to store the attempts in CSV files on each device's local filesystem.

Proposed Implementation (See https://github.com/SpeedcuberOSS/speedcuber-timer/issues/68#issuecomment-1700349889)

/SpeedcuberTimer
|-- <event>
|   `-- summary.json
|   `-- attempts.jsonl
|   `-- hashes.json
|   `-- recordings/
|       `-- <attempt_1 id>.json
|       `-- <attempt_2 id>.json

Attempts are persisted per event in a folder named with the event's id (e.g. 333 for standard 3x3x3 solving). The STIF.Attempt data itself is stored in a file called attempts.jsonl -- jsonl is chosen so that new attempts can be appended to the file without reserializing the entire list of objects (edits will require some additional optimization).
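
A minimal sketch of the append-only idea, assuming a trimmed-down `AttemptLike` shape (hypothetical, not the real STIF.Attempt) and two hypothetical helpers. The in-memory string stands in for the file; on device the new line would presumably be written with an append call from react-native-fs rather than string concatenation.

```typescript
// Hypothetical, trimmed-down stand-in for a STIF.Attempt.
interface AttemptLike {
  id: string;
  inspectionStart: number;
  comment?: string;
}

// One attempt becomes exactly one line. JSON.stringify escapes newlines
// inside string values, so a record can never span multiple lines.
function toJsonlLine(attempt: AttemptLike): string {
  return JSON.stringify(attempt) + "\n";
}

// Parse a whole attempts.jsonl payload, skipping blank lines.
function parseJsonl(text: string): AttemptLike[] {
  return text
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as AttemptLike);
}

// Appending a new attempt only adds a suffix; earlier lines are untouched.
let file = "";
file += toJsonlLine({ id: "a1", inspectionStart: 1693454067123 });
file += toJsonlLine({ id: "a2", inspectionStart: 1693454245123, comment: "line\nbreak" });
const attempts = parseJsonl(file);
```

Editing an existing attempt still means rewriting the file from the edited line onward, which is the "additional optimization" caveat above.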

To help with boot times for large attempt libraries, there will also be a summary.json which includes a cache of the latest values of key statistics (Ao5|12|50|100, PB, etc.) that will be shown on the home screen for the event.

As a validation mechanism, there will also be a hashes.json which will store a SHA1 hash of summary.json and attempts.jsonl.

TODO evaluate performance impact of re-computing the hash on each save. If too great for attempts.jsonl, perhaps use a hashes.jsonl where each line is a hash of the corresponding line in attempts.jsonl.

Additionally, since the STIF specification no longer includes solve recordings in-line with an Attempt, solve recordings will be stored in a dedicated recordings/ folder with the name <attempt_id>.json where <attempt_id> is the id of the attempt associated with the recording.

This separation will drastically reduce the size of the attempts.jsonl file, resulting in faster boot times.
The recorded data will still be available on-demand (a better fit since it's only needed when requesting a reconstruction for the first time OR when re-computing the reconstruction -- e.g. when gyroscope support is eventually added).

In this model, user-categories will be omitted in favor of custom events.

Alternative Implementation

The file system would look like the following

/SpeedcuberTimer
|-- <event>
|   `-- index.json
|   `-- <category>
|       `-- times.csv
|       `-- <attempt_1 id>.json
|       `-- <attempt_2 id>.json

Full Example:

The app will have a dedicated folder on the file system (the "Root Folder"). Within that folder will be one sub-folder for each event (an "Event Folder"). Within each Event Folder will be an index.json file alongside one sub-folder for each category of solves (a "Category Folder").

The index.json will contain a list of objects mapping the custom names of each user category to a specific folder in the file system. This choice will make it easier to support a Rename Category feature.

[
  {
    "category": "Default",
    "folder": "default"
  },
  {
    "category": "4x4 Practice",
    "folder": "776a1688-57fe-4d2f-947f-3b02f44b17f1"
  }
]
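
To illustrate why the indirection helps with renaming, here is a rough sketch. The `CategoryEntry` type and `renameCategory` helper are hypothetical names. The point is that a rename only touches index.json; the folder on disk (and everything in it) stays where it is.

```typescript
interface CategoryEntry {
  category: string; // user-facing name, free to change
  folder: string;   // stable folder name on disk
}

// Returns a new index with one category renamed; the folder is untouched.
function renameCategory(
  index: CategoryEntry[],
  from: string,
  to: string
): CategoryEntry[] {
  if (index.some((entry) => entry.category === to)) {
    throw new Error(`Category "${to}" already exists`);
  }
  return index.map((entry) =>
    entry.category === from ? { category: to, folder: entry.folder } : entry
  );
}

const index: CategoryEntry[] = [
  { category: "Default", folder: "default" },
  { category: "4x4 Practice", folder: "776a1688-57fe-4d2f-947f-3b02f44b17f1" },
];
const renamed = renameCategory(index, "4x4 Practice", "4x4 Drills");
```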

Each Category Folder will contain a times.csv file listing a few key fields for each attempt in that Event/Category. Additionally, the full details of each attempt will be stored in a collection of .json files, named by the attempt id.

The times.csv file will look like this:

id,timestamp,durationMillis,penalties,checksum
<uuid>,1234567890,14632,<bar `|` separated penalty strings>,<md5 hash of attempt data>

The id allows us to look up the associated .json file with the full attempt data.
The timestamp field allows us to sort attempts by recency without loading each attempt's full data.
The durationMillis and penalties fields allow us to sort attempts by duration and compute averages without loading each attempt's full data.
The checksum enables validation of the .json file containing the full attempt data.
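
A sketch of parsing one times.csv row into those fields. `TimesRow` and `parseTimesRow` are hypothetical names, and the sketch assumes penalty strings never contain a comma or a `|`.

```typescript
interface TimesRow {
  id: string;
  timestamp: number;
  durationMillis: number;
  penalties: string[]; // e.g. ["+2", "+2"], empty when there are none
  checksum: string;
}

// Split on commas, then split the penalties column on "|".
function parseTimesRow(line: string): TimesRow {
  const [id = "", timestamp = "", durationMillis = "", penalties = "", checksum = ""] =
    line.split(",");
  return {
    id,
    timestamp: Number(timestamp),
    durationMillis: Number(durationMillis),
    penalties: penalties.length > 0 ? penalties.split("|") : [],
    checksum,
  };
}

const plain = parseTimesRow("3b05329b-6e5d-41d2-8241-08adbae61146,1234567000,14632,,");
const double = parseTimesRow("329ce1a0-b85e-4258-a309-bfb0af2b36bc,1234571000,12514,+2|+2,");
```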

Full Example:

```csv
id,timestamp,durationMillis,penalties
3b05329b-6e5d-41d2-8241-08adbae61146,1234567000,14632,,
c5fa1673-c4dd-4a00-bce8-b995ddd86f73,1234568000,13421,+2,
5e33fd46-3c02-4e80-99e8-1fea76c9cf55,1234569000,15612,DNF,
ae6d883f-771a-4c06-95e3-9b0a8147fa6f,1234570000,11008,DNS,
329ce1a0-b85e-4258-a309-bfb0af2b36bc,1234571000,12514,+2|+2,
```

Potential Alternatives

1) Store the STIF Attempt data in-line with the CSV (e.g. encoded as base64)

Makes the full Attempt data immediately accessible.
base64 increases memory usage by 33%.
Parsing the CSV will take longer.

2) SQLite

Full power of SQL.
Backing up the db to a single file is generally built-in.
But, SQL... Do we really want to add another language to this project?
Harder for hobby power users to fiddle with.

Pros

Relatively easy to implement
It's easy to back up a folder of CSVs locally or to a remote file storage.
CSVs are an easy format for power users to analyze on their own.
react-native-fs works across all operating systems. Many SQL datastores are Android/iOS exclusive.

Cons

Updating old records may be slow since large portions of the CSV may have to be re-written to disk
Merges must compare the entire CSV document.

thehale commented 1 year ago

~~Also consider adding an index.json to each folder listing all the known CSVs for the event and a checksum to enable detection of unexpected changes to the attempt data.~~

EDIT: index.json has been added to the specification. Checksums have been added directly to the times.csv.

thehale commented 1 year ago

Consider adding some extra columns to the CSV to cache relevant statistics (Ao5, Ao12, Ao50, Ao100, best, worst) similar to how a bank ledger stores the current balance alongside each transaction.
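
A rough sketch of the ledger idea, limited to a running "best" column to keep it short. `LedgerRow` and `withRunningBest` are hypothetical names, and the sketch assumes a DNF never counts as a best.

```typescript
type Result = number | "DNF";

interface LedgerRow {
  result: Result;
  best: number | null; // best time seen up to and including this row
}

// Like a bank balance: each row carries the statistic as of that row,
// so the latest value is always available from the last line alone.
function withRunningBest(results: Result[]): LedgerRow[] {
  let best: number | null = null;
  return results.map((result) => {
    if (result !== "DNF" && (best === null || result < best)) {
      best = result;
    }
    return { result, best };
  });
}

const ledger = withRunningBest([14632, "DNF", 13421, 15612]);
```

The tradeoff is that editing or deleting an old attempt invalidates the cached columns of every later row.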

thehale commented 1 year ago

The implementation did not use the directory structure specified in this ticket. Instead, a large JSON list is read/written to the file system for every change to the attempt library.

Once the file reaches 0.5 MB (as little as a few dozen smart solves with gyroscope data) the application begins to behave strangely -- missing animations, slower frame refreshes, etc.

We need to reduce the amount of file system writing to help improve the performance.

thehale commented 1 year ago

New Proposal

Background

The most common operations on a database of attempts are:

List the attempts by date and result.
Compute averages on the results (e.g. AoX)

Other datapoints, like the scramble, move count, etc. are much less important -- typically only viewed when opening an "attempt details" view. The solve recording, if it exists, is rarely accessed -- only whenever it's needed to re-compute a reconstruction.

We can optimize for these use cases by providing fast access to the date and result of each attempt, then following up with subsequent queries for more information.
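
As an illustration of what can be computed from results alone, here is a sketch of an AoX calculation. `averageOf` is a hypothetical helper, and it assumes one common convention that this issue does not spell out: trim the best and worst 5% of the window (rounded up) and treat a DNF as the worst possible result.

```typescript
type Result = number | "DNF";

// Average of the last n results; null when there are fewer than n.
function averageOf(results: Result[], n: number): Result | null {
  if (results.length < n) return null;
  const trim = Math.ceil(n * 0.05); // 1 for Ao5 and Ao12, 3 for Ao50, 5 for Ao100
  const sorted = results
    .slice(-n)
    .map((r) => (r === "DNF" ? Infinity : r))
    .sort((a, b) => a - b);
  const counted = sorted.slice(trim, n - trim);
  // More DNFs than the trim can absorb makes the whole average a DNF.
  if (counted.some((t) => t === Infinity)) return "DNF";
  return counted.reduce((sum, t) => sum + t, 0) / counted.length;
}

const ao5 = averageOf([13456, 12000, "DNF", 14000, 15000], 5);
const ao5Dnf = averageOf([13456, "DNF", "DNF", 14000, 15000], 5);
```

Nothing here needs the scramble, move count, or recording, which is why only the date and result need fast access.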

Specification

/SpeedcuberTimer
|-- <event>
|   `-- results.csv
|   `-- details/
|       `-- <attempt_1 id>.json
|       `-- <attempt_2 id>.json
|   `-- recordings/
|       `-- <solution_1 id>.json
|       `-- <solution_2 id>.json

results.csv

In the format of id,inspectionStart,result

ee1e0de7-f729-451e-a9e1-88d81344541f,1693454067123,13456
4ae2e55c-c3f7-4e11-bd12-0537ea63533d,1693454245123,DNF

If we (safely) assume that each attempt takes less than 27.777... hours (i.e. 8 digits of milliseconds), then we can guarantee an upper bound of 60 characters on the line length.

36 characters of uuid
13 characters of timestamp
8 characters of result
2 commas
1 new line

So, every 1,000 solves will require at most 60 kB of storage in this root CSV. For comparison, a Twisty Timer export of 8728 3x3x3 solves uses 818.4 kB, or roughly 93.8 kB per 1,000 solves (the extra space is used for quotes, a more verbose timestamp, and the scramble).
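
The 60-character bound can be checked with a small sketch. `formatResultRow` and `MAX_LINE_LENGTH` are hypothetical names; the sketch assumes a 13-digit millisecond timestamp and a result of at most 8 digits, as described above.

```typescript
type Result = number | "DNF";

const MAX_LINE_LENGTH = 60; // 36 (uuid) + 13 (timestamp) + 8 (result) + 2 commas + 1 newline

// Format one results.csv row and enforce the upper bound on its length.
function formatResultRow(id: string, inspectionStart: number, result: Result): string {
  const line = `${id},${inspectionStart},${result}\n`;
  if (line.length > MAX_LINE_LENGTH) {
    throw new Error(`Row is ${line.length} characters, expected at most ${MAX_LINE_LENGTH}`);
  }
  return line;
}

// Worst case: the longest result that still fits in 8 digits.
const worstCase = formatResultRow("ee1e0de7-f729-451e-a9e1-88d81344541f", 1693454067123, 99999999);
const bytesPerThousand = 1000 * worstCase.length; // 60,000 characters, i.e. 60 kB of ASCII
```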

Saving new solves would only need to append to the file, as opposed to the current implementation which re-writes the entire file for every change to any attempt. Only edits to a historical attempt would require re-writing the file.
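
A sketch contrasting the two write paths, using plain strings as a stand-in for the file contents. `appendRow` and `editResult` are hypothetical helpers; on device the append would presumably map to an append call in react-native-fs and the edit to a full rewrite.

```typescript
// New solve: existing content is an untouched prefix of the new content.
function appendRow(csv: string, row: string): string {
  return csv + row;
}

// Edit to a historical solve: every line has to be re-emitted.
function editResult(csv: string, id: string, newResult: string): string {
  return csv
    .split("\n")
    .filter((line) => line.length > 0)
    .map((line) => {
      const [rowId = "", inspectionStart = ""] = line.split(",");
      return rowId === id ? `${rowId},${inspectionStart},${newResult}` : line;
    })
    .map((line) => line + "\n")
    .join("");
}

const before = "a,1693454067123,13456\nb,1693454245123,DNF\n";
const appended = appendRow(before, "c,1693454300000,12000\n");
const edited = editResult(appended, "b", "14000");
```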

Given previous performance measures, this suggests that the app would only begin to exhibit sluggish behavior if it were constantly making changes to random solves' key data on datasets of greater than 5k attempts. Standard usage will exhibit no such sluggishness.

details folder

The full STIF JSON payload of each attempt will be saved in a details folder as a standalone file named <attempt_id>.json. Since the uuid id is stored in results.csv, this means any attempt's details can be queried in O(1) time with respect to the number of attempts.

Writes (e.g. editing a comment) can also be completed in O(1).
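
The O(1) claim follows from the path being computable directly from the id, with no scan of results.csv or directory listing. A small sketch, where `detailsPath` is a hypothetical helper and the root path is only an example:

```typescript
// Build the path of an attempt's details file directly from its id.
function detailsPath(root: string, eventId: string, attemptId: string): string {
  return [root, eventId, "details", `${attemptId}.json`].join("/");
}

const path = detailsPath(
  "/SpeedcuberTimer",
  "333",
  "ee1e0de7-f729-451e-a9e1-88d81344541f"
);
```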

recordings folder

Same logic as the details folder, but for solve-specific recordings.

Tradeoffs

By making attempt details only available after an extra file system read, batch statistics on other attempt data points (e.g. average inspection time, move count, etc.) will become more expensive unless included in the results.csv at the cost of more storage. This could be mitigated with additional indices at the cost of extra complexity when writing changes to an attempt.

SpeedcuberOSS / speedcuber-timer

Store Attempts in the File System #68
