Large Test Case Data Files Solution #910

Open · JamesHabben opened this issue 2 weeks ago

JamesHabben commented 2 weeks ago

Background

Our test data files fall into two categories:

  1. Regular Files (<100 MB): stay in the git repository
  2. Large Files (>100 MB): need an alternative storage solution

Example large file entry from testdata.webkit.json (this single case zips to roughly 300 MB, far too large for a regular git repository):

{
  "josh_ios15_ffs": {
    "artifacts": {
      "webkitCacheRecords": {
        "file_count": 4137
      }
    }
  }
}
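
To make the threshold concrete, here is a minimal sketch of the check a test harness could use to decide whether a file can live in git; the SIZE_LIMIT constant and needs_external_storage helper are illustrative names, not existing iLEAPP code:

from pathlib import Path

# Illustrative threshold matching the policy above: GitHub rejects
# individual files over 100 MB in a regular repository.
SIZE_LIMIT = 100 * 1024 * 1024  # 100 MB in bytes

def needs_external_storage(path: Path) -> bool:
    """Return True when a test data file is too large to keep in git."""
    return path.stat().st_size > SIZE_LIMIT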

Storage Options for Large Files (>100 MB)

1. GitHub LFS

2. Azure Blob Storage

3. Google Drive

4. GitHub Releases

Questions for Discussion

  1. Is cost or automation more important?
  2. Do we anticipate many large test files?
  3. How important is version control for test data?
  4. Who will maintain the storage solution?
  5. Are there other cloud services to consider?
JamesHabben commented 2 weeks ago

A related underlying issue could be a subtopic here, but I think we will all agree on it. Brigs and I spoke about it in the video: we either cut the test data down to fit under the size limit, or we figure out how to host the full test data. Taking webkit as the example again, we could cut down the number of files included in the test case, but we would risk excluding the different formats and variations present in those files. I can't think of a sensible way to intelligently filter the file count down without manual carve-outs.

My thoughts:

  1. GitHub LFS: I think we would exceed the storage quota quickly, since a single webkit test case is 300 MB. We would also hit the bandwidth cap fast, because every fetch counts, whether one of us grabs the file or a GitHub Action does. If we choose this route, we would likely need at least one data pack almost immediately. Are there other benefits to the project in having a data pack? (Rough bandwidth math in the first sketch below.)
  2. Google Drive: 15 GB seems like a reasonable storage allowance, especially with no bandwidth limits on top. We could create a new Gmail account for each of the LEAPPs to spread the data out, with Brigs holding the credentials. Can it generate an API key for us all to use? Can we make a read-only key? (See the download sketch below.)
  3. Azure Blob: I have an Azure account and currently use Blob Storage, so I am familiar with it. I know API keys are not an issue to generate; multiple keys and read-only keys are fine. I have also used GitHub Actions directly with the Blob Storage API in my 4n6appfinder project repo. We may not need the direct integration, though, depending on how the file splitting shakes out, since that might be handled as conditions in the Python code the action runs. (See the SAS-token sketch below.)
  4. Additional thought: do we consider splitting a 300 MB zip file into three or four segmented 95 MB files? That could be a solution depending on how many of these large test cases we end up facing. (A split-and-rejoin sketch is below.)
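
On point 1, some rough bandwidth arithmetic. The 300 MB figure is the webkit case above; the download counts are guesses for illustration only:

# Back-of-the-envelope LFS bandwidth estimate. Only the 300 MB case
# size comes from this issue; the fetch counts are assumptions.
case_size_mb = 300          # one webkit test case zip
dev_pulls_per_month = 10    # guessed contributor fetches
ci_runs_per_month = 50      # guessed GitHub Actions runs that fetch it

monthly_gb = case_size_mb * (dev_pulls_per_month + ci_runs_per_month) / 1024
print(f"~{monthly_gb:.1f} GB/month")  # ~17.6 GB for these guesses

Even at these modest guesses, a single large case would burn through the free LFS bandwidth tier many times over.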
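
On point 2, read-only consumers may not even need an API key: files shared as "anyone with the link" can be pulled through Drive's direct-download endpoint. A rough sketch using requests; the file ID is a placeholder, and this omits the virus-scan confirmation step Drive adds for large files:

import requests

FILE_ID = "PLACEHOLDER_DRIVE_FILE_ID"  # hypothetical shared test-data zip

# Stream the download so a 300 MB zip never sits fully in memory.
with requests.get("https://drive.google.com/uc",
                  params={"export": "download", "id": FILE_ID},
                  stream=True, timeout=60) as resp:
    resp.raise_for_status()
    with open("webkit_testdata.zip", "wb") as out:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            out.write(chunk)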
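
On point 3, a sketch of what a read-only pull could look like with the azure-storage-blob package and a SAS URL; the account, container, and token below are placeholders:

from azure.storage.blob import BlobClient  # pip install azure-storage-blob

# Hypothetical blob URL with a read-only (sp=r) SAS token appended.
SAS_URL = ("https://EXAMPLEACCOUNT.blob.core.windows.net/testdata/"
           "webkit_josh_ios15_ffs.zip?sp=r&sig=PLACEHOLDER")

blob = BlobClient.from_blob_url(SAS_URL)
with open("webkit_josh_ios15_ffs.zip", "wb") as out:
    blob.download_blob().readinto(out)  # streams the blob to disk

A read-only SAS token could be injected into the GitHub Action as a secret, so no contributor ever touches the account keys.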
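
And on point 4, splitting is mechanically simple if we split raw bytes rather than trying to build several valid zips: the segments stay under GitHub's limit and get concatenated back before the test run. A minimal sketch with placeholder naming:

from pathlib import Path

SEGMENT_SIZE = 95 * 1024 * 1024  # stay under GitHub's 100 MB file limit

def split_file(src: Path) -> list[Path]:
    """Write src out as src.part000, src.part001, ... byte segments."""
    parts = []
    with src.open("rb") as f:
        for i, chunk in enumerate(iter(lambda: f.read(SEGMENT_SIZE), b"")):
            part = src.with_name(src.name + f".part{i:03d}")
            part.write_bytes(chunk)
            parts.append(part)
    return parts

def join_files(parts: list[Path], dest: Path) -> None:
    """Concatenate the segments back into the original zip."""
    with dest.open("wb") as out:
        for part in sorted(parts):
            out.write(part.read_bytes())

The downside is that the individual segments are not usable on their own, so anything fetching test data has to know to rejoin them first.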