HolmesProcessing / Holmes-Storage

The Storage Planner manages access to all data within the Holmes Processing system. It orchestrates the interaction across multiple Databases, serves the files for analysis, etc.

object collection "schema" proposal #1

Closed cynexit closed 8 years ago

cynexit commented 9 years ago

Proposed schema:

{
   "md5":"098f6bcd4621d373cade4e832627b4f6",
   "sha1":"a94a8fe5ccb19ba61c4c0873d391e987982fbbd3",
   "sha256":"9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
   "submission":[
      {
         "user_id":1,
         "source":"x",
         "name":"test.exe",
         "date":"2015-11-24T15:23:29Z"
      },
      {
         "user_id":4,
         "source":"y",
         "name":"test23.exe",
         "date":"2015-11-24T19:23:29Z"
      }
   ]
}

We should use sha256 as the shard key.
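For illustration, the three hash fields of the proposed document could be derived from the raw object bytes roughly as follows. This is only a sketch: `Digests` and `ComputeDigests` are hypothetical names, not part of Holmes-Storage. The sample values in the proposal are the digests of the four-byte string "test", so that input is used here.

```go
package main

import (
	"crypto/md5"
	"crypto/sha1"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Digests holds the three hash fields of the proposed object document.
// (Hypothetical type, introduced only for this example.)
type Digests struct {
	MD5    string
	SHA1   string
	SHA256 string
}

// ComputeDigests hashes the raw object bytes with all three algorithms
// and returns the lowercase hex encodings.
func ComputeDigests(data []byte) Digests {
	m := md5.Sum(data)
	s1 := sha1.Sum(data)
	s256 := sha256.Sum256(data)
	return Digests{
		MD5:    hex.EncodeToString(m[:]),
		SHA1:   hex.EncodeToString(s1[:]),
		SHA256: hex.EncodeToString(s256[:]),
	}
}

func main() {
	d := ComputeDigests([]byte("test"))
	// The sha256 value is the one proposed as the shard key.
	fmt.Println("md5:   ", d.MD5)
	fmt.Println("sha1:  ", d.SHA1)
	fmt.Println("sha256:", d.SHA256)
}
```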

cynexit commented 9 years ago

Update: Split it into two collections:

objects

{
   "md5":"098f6bcd4621d373cade4e832627b4f6",
   "sha1":"a94a8fe5ccb19ba61c4c0873d391e987982fbbd3",
   "sha256":"9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"
}

submissions

{
    "sha256":"9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
    "user_id":1,
    "source":"x",
    "name":"test.exe",
    "date":"2015-11-24T15:23:29Z"
}

and have multiple submissions for each object.
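A rough Go sketch of the split, assuming the two document shapes above: the structs mirror the JSON fields, and `SubmissionsFor` is a hypothetical in-memory stand-in for the query that would fetch all submissions of one object by its sha256. It does not talk to any database.

```go
package main

import "fmt"

// Object mirrors the proposed "objects" document.
type Object struct {
	MD5    string `json:"md5"`
	SHA1   string `json:"sha1"`
	SHA256 string `json:"sha256"`
}

// Submission mirrors the proposed "submissions" document.
type Submission struct {
	SHA256 string `json:"sha256"`
	UserID int    `json:"user_id"`
	Source string `json:"source"`
	Name   string `json:"name"`
	Date   string `json:"date"`
}

// SubmissionsFor returns every submission that references the given
// sha256, preserving input order. (Hypothetical helper.)
func SubmissionsFor(subs []Submission, sha256 string) []Submission {
	var out []Submission
	for _, s := range subs {
		if s.SHA256 == sha256 {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	obj := Object{
		MD5:    "098f6bcd4621d373cade4e832627b4f6",
		SHA1:   "a94a8fe5ccb19ba61c4c0873d391e987982fbbd3",
		SHA256: "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
	}
	subs := []Submission{
		{SHA256: obj.SHA256, UserID: 1, Source: "x", Name: "test.exe", Date: "2015-11-24T15:23:29Z"},
		{SHA256: obj.SHA256, UserID: 4, Source: "y", Name: "test23.exe", Date: "2015-11-24T19:23:29Z"},
	}
	// One object, many submissions.
	for _, s := range SubmissionsFor(subs, obj.SHA256) {
		fmt.Println(s.Name, s.Date)
	}
}
```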

webstergd commented 9 years ago

Do we want to add additional meta information such as the following?

object_reference - captures whether the object was a primary or secondary (dropped or carved) submission
object_category - category the object would fall under
object_type - MIME type [1]

[1] http://www.iana.org/assignments/media-types/media-types.xhtml

webstergd commented 8 years ago

proposed as final for review

New proposal for object as follows:

{
"_id": UUID,
"sha1": str,
"sha256": str,
"md5": str,
"mime": str,
"source": []str,
"obj_name": []str,
"submissions": []UUID,
}

New proposal for submission as follows:

{
"_id": UUID,
"object": UUID,
"user_id": str,
"source": str,
"date": ISO8601,
"obj_name": str,
"tags": []str,
"comment": str,
}

Note: This schema duplicates data: each object also keeps the source and obj_name values and the IDs of its submissions. That costs extra processing time on writes, but it decreases the number of queries on reads. This is beneficial with MongoDB but probably not needed on, say, Cassandra.
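To make the write overhead concrete, here is a minimal in-memory sketch. `Store`, `AddSubmission` and `appendUnique` are hypothetical names, UUIDs are represented as plain strings, and deduplicating `source` and `obj_name` is an assumption the proposal does not state. The point is only that one new submission touches two documents, while a later read of the object needs no second query.

```go
package main

import "fmt"

// Object keeps denormalized copies of data from its submissions.
type Object struct {
	ID          string
	SHA256      string
	Source      []string
	ObjName     []string
	Submissions []string
}

// Submission references its object by ID.
type Submission struct {
	ID      string
	Object  string
	UserID  string
	Source  string
	ObjName string
}

// Store is a hypothetical in-memory stand-in for the two collections.
type Store struct {
	Objects     map[string]*Object
	Submissions map[string]*Submission
}

// appendUnique appends v only if it is not already present
// (deduplication is an assumption of this sketch).
func appendUnique(list []string, v string) []string {
	for _, x := range list {
		if x == v {
			return list
		}
	}
	return append(list, v)
}

// AddSubmission performs both writes: it stores the submission and then
// updates the denormalized lists on the referenced object.
func (s *Store) AddSubmission(sub Submission) error {
	obj, ok := s.Objects[sub.Object]
	if !ok {
		return fmt.Errorf("unknown object %q", sub.Object)
	}
	s.Submissions[sub.ID] = &sub                         // write 1: submission
	obj.Source = appendUnique(obj.Source, sub.Source)    // write 2: object
	obj.ObjName = appendUnique(obj.ObjName, sub.ObjName) // (same document)
	obj.Submissions = append(obj.Submissions, sub.ID)
	return nil
}

func main() {
	st := &Store{
		Objects:     map[string]*Object{"o1": {ID: "o1"}},
		Submissions: map[string]*Submission{},
	}
	_ = st.AddSubmission(Submission{ID: "s1", Object: "o1", UserID: "1", Source: "x", ObjName: "test.exe"})
	_ = st.AddSubmission(Submission{ID: "s2", Object: "o1", UserID: "4", Source: "y", ObjName: "test23.exe"})
	// A single read of the object already yields all names and sources.
	fmt.Println(st.Objects["o1"].ObjName, st.Objects["o1"].Source)
}
```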

webstergd commented 8 years ago

The final schema is as follows:

Objects

{
"_id": UUID,
"sha1": str,
"sha256": str,
"md5": str,
"mime": str,
"source": []str,
"obj_name": []str,
"submissions": []UUID,
}

Submissions

{
"_id": UUID,
"object": UUID,
"user_id": str,
"source": str,
"date": ISO8601,
"obj_name": str,
"tags": []str,
"comment": str,
}
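One possible Go mapping of the final schema, as a sketch only: the struct and function names are hypothetical, UUIDs are kept as plain strings, and the UUID values in the sample are placeholders. `time.Time` is used for the ISO8601 field because `encoding/json` decodes RFC 3339 timestamps such as `2015-11-24T15:23:29Z` into it directly.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Object mirrors the final "Objects" document.
type Object struct {
	ID          string   `json:"_id"`
	SHA1        string   `json:"sha1"`
	SHA256      string   `json:"sha256"`
	MD5         string   `json:"md5"`
	MIME        string   `json:"mime"`
	Source      []string `json:"source"`
	ObjName     []string `json:"obj_name"`
	Submissions []string `json:"submissions"`
}

// Submission mirrors the final "submission" document.
type Submission struct {
	ID      string    `json:"_id"`
	Object  string    `json:"object"`
	UserID  string    `json:"user_id"`
	Source  string    `json:"source"`
	Date    time.Time `json:"date"`
	ObjName string    `json:"obj_name"`
	Tags    []string  `json:"tags"`
	Comment string    `json:"comment"`
}

// ParseSubmission decodes one submission document and fails if the
// date is not a valid RFC 3339 timestamp. (Hypothetical helper.)
func ParseSubmission(raw []byte) (Submission, error) {
	var s Submission
	err := json.Unmarshal(raw, &s)
	return s, err
}

func main() {
	// Placeholder UUIDs; the other values are taken from the earlier example.
	raw := []byte(`{
		"_id": "00000000-0000-0000-0000-000000000002",
		"object": "00000000-0000-0000-0000-000000000001",
		"user_id": "1",
		"source": "x",
		"date": "2015-11-24T15:23:29Z",
		"obj_name": "test.exe",
		"tags": [],
		"comment": ""
	}`)
	s, err := ParseSubmission(raw)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(s.ObjName, s.Date.Format(time.RFC3339))
}
```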