SpeedcuberOSS / speedcuber-timer

The smart, offline-ready speedcubing Android/iOS app made for speedcubers, by speedcubers.
Mozilla Public License 2.0

Store Attempts in the File System #68

Closed thehale closed 11 months ago

thehale commented 1 year ago

Feature Request

The Attempt Library currently stores attempts in AsyncStorage, which has a total limit of 6MB. Solvers will exceed this limit after as few as a few thousand solves (even fewer when also storing smartcube data), so we need a more spacious place to store the attempt data.

This issue proposes using react-native-fs to store the attempts in CSV files on each device's local filesystem.

Proposed Implementation (See https://github.com/SpeedcuberOSS/speedcuber-timer/issues/68#issuecomment-1700349889)

/SpeedcuberTimer
|-- <event>
|   `-- summary.json
|   `-- attempts.jsonl
|   `-- hashes.json
|   `-- recordings/
|       `-- <attempt_1 id>.json
|       `-- <attempt_2 id>.json

Attempts are persisted per event in a folder named with the event's id (e.g. 333 for standard 3x3x3 solving). The STIF.Attempt data itself is stored in a file called attempts.jsonl -- jsonl is chosen so that new attempts can be appended to the file without reserializing the entire list of objects (edits will require some additional optimization).
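
The append-only property of attempts.jsonl can be sketched as follows. This is a minimal illustration under stated assumptions, not the app's implementation: `AttemptLike` is a hypothetical, trimmed-down stand-in for STIF.Attempt, and the file is simulated with a string. With react-native-fs, the serialized line would be handed to `appendFile` instead.

```typescript
// Hypothetical stand-in for STIF.Attempt (assumption, for illustration only).
interface AttemptLike {
  id: string;
  inspectionStart: number; // epoch milliseconds
  durationMillis: number;
  comment?: string;
}

// Serialize one attempt as a single JSONL line. JSON.stringify escapes
// newlines inside string values, so each attempt occupies exactly one line.
function toJsonlLine(attempt: AttemptLike): string {
  return JSON.stringify(attempt) + "\n";
}

// Parse a whole attempts.jsonl payload, skipping blank lines.
function parseJsonl(text: string): AttemptLike[] {
  return text
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as AttemptLike);
}

// Appending only adds text at the end of the file; earlier lines are
// never re-serialized.
let jsonlFile = "";
jsonlFile += toJsonlLine({ id: "a1", inspectionStart: 1693454067123, durationMillis: 13456 });
jsonlFile += toJsonlLine({
  id: "a2",
  inspectionStart: 1693454245123,
  durationMillis: 15012,
  comment: "line\nbreak",
});
```

Editing a historical attempt still means rewriting the file, which is the "additional optimization" the paragraph above leaves open.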

To help with boot times for large attempt libraries, there will also be a summary.json which includes a cache of the latest values of key statistics (Ao5|12|50|100, PB, etc.) that will be shown on the home screen for the event.

As a validation mechanism, there will also be a hashes.json which will store a SHA1 hash of summary.json and attempts.jsonl.

Additionally, since the STIF specification no longer includes solve recordings in-line with an Attempt, solve recordings will be stored in a dedicated recordings/ folder with the name <attempt_id>.json where <attempt_id> is the id of the attempt associated with the recording.

In this model, user-categories will be omitted in favor of custom events.

Alternative Implementation

The file system would look like the following

/SpeedcuberTimer
|-- <event>
|   `-- index.json
|   `-- <category>
|       `-- times.csv
|       `-- <attempt_1 id>.json
|       `-- <attempt_2 id>.json

Full Example:

```
/SpeedcuberTimer
|-- 222
|   `-- index.json
|   `-- default
|       `-- times.csv
|       `-- afd2d2b8-8141-446a-a9c1-a800912694f5.json
|       `-- 744af4dc-45c3-4da8-b315-98da54de9369.json
|   `-- c177b5f1-cf8d-4355-a41e-0180634b06b8
|       `-- times.csv
|       `-- 20b616c7-6a54-479c-b4f6-4e410440e641.json
|       `-- df2ac54e-6f5d-499f-b390-f6c9bf86e7cd.json
|-- 333
|   `-- index.json
|   `-- default
|       `-- times.csv
|       `-- 338b9b15-d6b1-4ea8-a99a-dd60934fa59a.json
|       `-- 9e649041-947a-4711-9c8e-5582a7c1d599.json
|   `-- 776a1688-57fe-4d2f-947f-3b02f44b17f1
|       `-- times.csv
|       `-- d186a97a-6d65-4611-83d2-f8ce4cb7696a.json
|       `-- fac2c572-6e7c-4fd9-9216-0bb24039f302.json
|-- 333oh
|   `-- index.json
|   `-- default
|       `-- times.csv
|       `-- 81402c99-4c97-4f44-9f29-21e617199737.json
|       `-- 6e77c9d5-50cc-42df-aa7d-8b006715bf28.json
```

The app will have a dedicated folder on the file system (the "Root Folder"). Within that folder will be one sub-folder for each event (an "Event Folder"). Within each Event Folder will be an index.json file alongside one sub-folder for each category of solves (a "Category Folder").

The index.json will contain a list of objects mapping the custom names of each user category to a specific folder in the file system. This choice will make it easier to support a Rename Category feature.

[
  {
    "category": "Default",
    "folder": "default"
  },
  {
    "category": "4x4 Practice",
    "folder": "776a1688-57fe-4d2f-947f-3b02f44b17f1"
  }
]
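
To illustrate why this mapping helps a Rename Category feature, here is a minimal sketch. The `CategoryEntry` type and `renameCategory` helper are hypothetical names, not part of the app. The point is that a rename only touches the display name in index.json, while the folder on disk stays put.

```typescript
// One index.json entry, mirroring the example above.
interface CategoryEntry {
  category: string; // user-facing name
  folder: string;   // folder name on disk; unchanged by a rename
}

// Return a new index with one category renamed. No folder is moved or
// renamed, so the attempt files themselves are never touched.
function renameCategory(
  index: CategoryEntry[],
  oldName: string,
  newName: string,
): CategoryEntry[] {
  if (index.some((entry) => entry.category === newName)) {
    throw new Error(`Category "${newName}" already exists`);
  }
  return index.map((entry) =>
    entry.category === oldName ? { ...entry, category: newName } : entry,
  );
}

const exampleIndex: CategoryEntry[] = [
  { category: "Default", folder: "default" },
  { category: "4x4 Practice", folder: "776a1688-57fe-4d2f-947f-3b02f44b17f1" },
];
const renamedIndex = renameCategory(exampleIndex, "4x4 Practice", "Weekend Practice");
```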

Each Category Folder will contain a times.csv file listing a few key fields for each attempt in that Event/Category. Additionally, the full details of each attempt will be stored in a collection of .json files, named by the attempt id.

The times.csv file will look like this:

id,timestamp,durationMillis,penalties,checksum
<uuid>,1234567890,14632,<bar `|` separated penalty strings>,<md5 hash of attempt data>

Full Example:

```csv
id,timestamp,durationMillis,penalties,checksum
3b05329b-6e5d-41d2-8241-08adbae61146,1234567000,14632,,
c5fa1673-c4dd-4a00-bce8-b995ddd86f73,1234568000,13421,+2,
5e33fd46-3c02-4e80-99e8-1fea76c9cf55,1234569000,15612,DNF,
ae6d883f-771a-4c06-95e3-9b0a8147fa6f,1234570000,11008,DNS,
329ce1a0-b85e-4258-a309-bfb0af2b36bc,1234571000,12514,+2|+2,
```
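
Reading a times.csv row back could look roughly like the sketch below. It assumes no field ever contains a comma (true for UUIDs, integers, and the penalty strings shown here), so a plain split is enough. `TimesRow`, `parseTimesRow`, and `parseTimesCsv` are hypothetical names.

```typescript
interface TimesRow {
  id: string;
  timestamp: number;
  durationMillis: number;
  penalties: string[]; // e.g. ["+2", "+2"]; empty when no penalty applies
  checksum: string;    // empty string when the column is blank or missing
}

// Parse one data row. Penalties are separated by `|` inside their column.
function parseTimesRow(line: string): TimesRow {
  const [id = "", timestamp = "", durationMillis = "", penalties = "", checksum = ""] =
    line.trim().split(",");
  return {
    id,
    timestamp: Number(timestamp),
    durationMillis: Number(durationMillis),
    penalties: penalties === "" ? [] : penalties.split("|"),
    checksum,
  };
}

// Parse a whole file: drop the header row and skip blank lines.
function parseTimesCsv(text: string): TimesRow[] {
  return text
    .split("\n")
    .slice(1)
    .filter((line) => line.trim() !== "")
    .map(parseTimesRow);
}
```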

Potential Alternatives

1) Store the STIF Attempt data in-line with the CSV (e.g. encoded as base64)

Pros

Cons

thehale commented 1 year ago

Also consider adding an index.json to each folder listing all the known CSVs for the event and a checksum to enable detection of unexpected changes to the attempt data.

EDIT: index.json has been added to the specification. Checksums have been added directly to the times.csv.

thehale commented 1 year ago

Consider adding some extra columns to the CSV to cache relevant statistics (Ao5, Ao12, Ao50, Ao100, best, worst) similar to how a bank ledger stores the current balance alongside each transaction.
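
A rough sketch of the ledger idea, under a few assumptions that are not in the proposal: a DNF is modeled as `Infinity`, penalties are ignored, and only the running best and Ao5 are cached. `LedgerRow` and `buildLedger` are hypothetical names.

```typescript
const DNF_TIME = Infinity; // assumption: a DNF behaves like an infinitely slow solve

// Average of 5: drop the best and the worst result, then take the mean of
// the remaining three. Two or more DNFs yield Infinity (a DNF average).
function ao5(lastFive: number[]): number {
  if (lastFive.length !== 5) throw new Error("ao5 needs exactly 5 results");
  const sorted = [...lastFive].sort((a, b) => a - b);
  return sorted.slice(1, 4).reduce((sum, t) => sum + t, 0) / 3;
}

interface LedgerRow {
  durationMillis: number;
  best: number;       // best single up to and including this row
  ao5: number | null; // null until five results exist
}

// Like a bank ledger, each row carries the statistics as of that attempt,
// so the latest values can be read from the last row alone.
function buildLedger(times: number[]): LedgerRow[] {
  const rows: LedgerRow[] = [];
  let best = Infinity;
  times.forEach((t, i) => {
    best = Math.min(best, t);
    rows.push({
      durationMillis: t,
      best,
      ao5: i >= 4 ? ao5(times.slice(i - 4, i + 1)) : null,
    });
  });
  return rows;
}
```

The cost of this layout is that editing or deleting an old attempt invalidates the cached columns of every later row.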

thehale commented 1 year ago

The implementation did not use the directory structure specified in this ticket. Instead, a large JSON list is read/written to the file system for every change to the attempt library.

Once the file reaches 0.5 MB (as little as a few dozen smart solves with gyroscope data) the application begins to behave strangely -- missing animations, slower frame refreshes, etc.

We need to reduce the amount of file system writing to help improve the performance.

thehale commented 1 year ago

New Proposal

Background

The most common operations on a database of attempts are:

Other datapoints, like the scramble, move count, etc. are much less important -- typically only viewed when opening an "attempt details" view. The solve recording, if it exists, is rarely accessed -- only whenever it's needed to re-compute a reconstruction.

We can optimize for these use cases by providing fast access to the date and result of each attempt, and then following up with subsequent queries for more information.

Specification

/SpeedcuberTimer
|-- <event>
|   `-- results.csv
|   `-- details/
|       `-- <attempt_1 id>.json
|       `-- <attempt_2 id>.json
|   `-- recordings/
|       `-- <solution_1 id>.json
|       `-- <solution_2 id>.json

results.csv

In the format of id,inspectionStart,result

ee1e0de7-f729-451e-a9e1-88d81344541f,1693454067123,13456
4ae2e55c-c3f7-4e11-bd12-0537ea63533d,1693454245123,DNF
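
A minimal sketch of reading and writing these lines, assuming the result column holds either an integer number of milliseconds or the literal `DNF` as in the two rows above. Other result markers are not specified here and are not handled. `ResultRow`, `parseResultLine`, and `formatResultLine` are hypothetical names.

```typescript
type Result = number | "DNF";

interface ResultRow {
  id: string;
  inspectionStart: number; // epoch milliseconds
  result: Result;          // milliseconds, or "DNF"
}

// Parse one `id,inspectionStart,result` line.
function parseResultLine(line: string): ResultRow {
  const [id = "", inspectionStart = "", result = ""] = line.trim().split(",");
  return {
    id,
    inspectionStart: Number(inspectionStart),
    result: result === "DNF" ? "DNF" : Number(result),
  };
}

// Format one row, newline-terminated so it can be appended to the file as-is.
function formatResultLine(row: ResultRow): string {
  return `${row.id},${row.inspectionStart},${row.result}\n`;
}
```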

If we (safely) assume that each attempt takes less than 27.777... hours (i.e. 8 digits of milliseconds), then we can guarantee an upper bound of 60 characters on the line length.

So, every 1,000 solves will require at most 60 kB of storage in this root csv. For comparison, a Twisty Timer export of 8728 3x3x3 solves uses 818.4 kB, or about 93.77 kB per 1,000 solves (the extra space is used for quotes, a more verbose timestamp, and the scramble).
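
The 60-character bound can be checked with a little arithmetic. The breakdown below is an assumption about how the figure was reached: a 36-character UUID, a 13-digit epoch-millisecond timestamp (as in the example rows), at most 8 digits of result, two commas, and one trailing newline. All characters are ASCII, so characters and bytes coincide.

```typescript
const UUID_CHARS = 36;       // e.g. ee1e0de7-f729-451e-a9e1-88d81344541f
const TIMESTAMP_CHARS = 13;  // epoch milliseconds, e.g. 1693454067123
const MAX_RESULT_CHARS = 8;  // up to 99999999 ms; "DNF" is shorter
const SEPARATOR_CHARS = 2;   // two commas
const NEWLINE_CHARS = 1;

const maxLineBytes =
  UUID_CHARS + TIMESTAMP_CHARS + MAX_RESULT_CHARS + SEPARATOR_CHARS + NEWLINE_CHARS;

// The longest attempt that still fits in 8 digits of milliseconds, in hours.
const maxAttemptHours = 99999999 / (1000 * 60 * 60);

// Worst-case size of results.csv for a given number of attempts.
function estimateResultsCsvBytes(attempts: number): number {
  return attempts * maxLineBytes;
}
```

Under these assumptions, `estimateResultsCsvBytes(1000)` gives 60,000 bytes, which matches the 60 kB figure above.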

Saving new solves would only need to append to the file, as opposed to the current implementation which re-writes the entire file for every change to any attempt. Only edits to a historical attempt would require re-writing the file.
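
The difference between the two write paths can be sketched as follows. The file is simulated with a string and `editResult` is a hypothetical helper. With react-native-fs, saving a new solve would map to a single `appendFile` call, while an edit would write the whole rebuilt string back with `writeFile`.

```typescript
// Rewrite path: replace the result of one historical attempt. Every line is
// scanned and the whole file content is rebuilt, so the cost grows with the
// number of attempts in the file.
function editResult(csv: string, id: string, newResult: string): string {
  let found = false;
  const lines = csv.split("\n").map((line) => {
    const fields = line.split(",");
    if (fields[0] !== id) return line;
    found = true;
    return [fields[0], fields[1], newResult].join(",");
  });
  if (!found) throw new Error(`No attempt with id ${id}`);
  return lines.join("\n");
}

// Append path: a new solve only adds one line at the end of the file.
let resultsCsv = "";
resultsCsv += "ee1e0de7-f729-451e-a9e1-88d81344541f,1693454067123,13456\n";
resultsCsv += "4ae2e55c-c3f7-4e11-bd12-0537ea63533d,1693454245123,DNF\n";

// Rewrite path in action: the second attempt is corrected from DNF to a time.
const editedCsv = editResult(resultsCsv, "4ae2e55c-c3f7-4e11-bd12-0537ea63533d", "14999");
```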

Given previous performance measures, this suggests that the app would only begin to exhibit sluggish behavior if it were constantly making changes to random solves' key data on datasets of greater than 5k attempts. Standard usage will exhibit no such sluggishness.

details folder

The full STIF JSON payload of each attempt will be saved in a details folder as a standalone file named <attempt_id>.json. Since the uuid id is stored in results.csv, any attempt's details can be queried in O(1) time with respect to the number of attempts.

Writes (e.g. editing a comment) can also be completed in O(1).
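
The constant-time lookup follows from the path being computable from the id alone, with no scan of results.csv or directory listing. A minimal sketch, where `detailsPath` and the root argument are hypothetical and the UUID check is an added safeguard rather than part of the specification:

```typescript
const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Build the path of an attempt's details file directly from its id.
// Rejecting non-UUID ids also keeps values like "../x" out of the path.
function detailsPath(root: string, eventId: string, attemptId: string): string {
  if (!UUID_PATTERN.test(attemptId)) {
    throw new Error(`Not a UUID: ${attemptId}`);
  }
  return `${root}/${eventId}/details/${attemptId}.json`;
}

const examplePath = detailsPath(
  "/SpeedcuberTimer",
  "333",
  "ee1e0de7-f729-451e-a9e1-88d81344541f",
);
```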

recordings folder

Same logic as the details folder, but for solve-specific recordings.

Tradeoffs

By making attempt details only available after an extra file system read, batch statistics on other attempt data points (e.g. average inspection time, move count, etc.) will become more expensive unless included in the results.csv at the cost of more storage. This could be mitigated with additional indices at the cost of extra complexity when writing changes to an attempt.