Hi, thanks for your attention, but I'm not very clear on what you need. Could you please give more details?
Thanks for your response. So if I'm understanding this, it needs to run constantly to perform better at detecting anomalies on a network, e.g., take streaming data, get a new edge, score it, then classify it. But maybe I don't understand: what if the system this is running on goes down? Is there a way to store what the algorithm has learned as a backup or something? I read about the Count-Min Sketch; is it only created in memory and lost if a failure happens, or does this not matter?
I think you would want to periodically back up the algorithm state and the CMS state to a local file. The current implementation is a rather minimal version, so everything is kept in memory, except for the outputs. For a backup, most variables are worth saving, except for a small index array that carries hashing results back from the CMS.
OK, I understand a lot better now. Do you know of a way I could do this with the current implementation? How should I go about storing these states?
For example, you can save the states to a local file (whatever format you prefer) every 10M edges. You don't need to modify the core, since those data structures only use public members. Just add a wrapper, like example/Demo.cpp, and do it there.
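As a rough illustration, here is a minimal sketch of what such a wrapper-side backup could look like. It assumes the sketch exposes its dimensions and counter array as public members; the names `numRow`, `numColumn`, and `counts` below are hypothetical stand-ins, not the repo's actual identifiers.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical stand-in for the CMS state; member names are assumptions.
struct SketchState {
    int numRow = 0, numColumn = 0;
    std::vector<float> counts;  // numRow * numColumn counters
};

// Dump the dimensions and counters to a binary file so the learned state
// survives a restart.
bool SaveSketch(const SketchState& s, const char* path) {
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    std::fwrite(&s.numRow, sizeof(s.numRow), 1, f);
    std::fwrite(&s.numColumn, sizeof(s.numColumn), 1, f);
    std::fwrite(s.counts.data(), sizeof(float), s.counts.size(), f);
    std::fclose(f);
    return true;
}

// Restore the counters from a previous backup.
bool LoadSketch(SketchState& s, const char* path) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    std::fread(&s.numRow, sizeof(s.numRow), 1, f);
    std::fread(&s.numColumn, sizeof(s.numColumn), 1, f);
    s.counts.resize(static_cast<size_t>(s.numRow) * s.numColumn);
    std::fread(s.counts.data(), sizeof(float), s.counts.size(), f);
    std::fclose(f);
    return true;
}
```

In the wrapper's edge loop you would call `SaveSketch` every 10M processed edges, and call `LoadSketch` once at startup if a backup file exists.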
Thanks. Another question: how should a threshold be defined with this? Is there an implementation available for that?
If you mean a threshold to decide whether an edge is anomalous, no, the algorithm only gives raw scores. But you can use a small sample of scores as the baseline.
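One possible way to use such a sample as a baseline (an assumption about usage, not something the repo ships) is to collect the raw scores of an initial batch of edges and flag later edges whose score exceeds a high percentile of that sample.

```cpp
#include <algorithm>
#include <vector>

// Pick a high percentile of a sample of raw scores to use as a cutoff.
// `percentile` is in [0, 1], e.g. 0.999; the sample must be non-empty.
double PercentileThreshold(std::vector<double> sample, double percentile) {
    std::sort(sample.begin(), sample.end());
    size_t index = static_cast<size_t>(percentile * (sample.size() - 1));
    return sample[index];
}

// Usage: collect, say, the first 100k scores into `sample`, then
//   double threshold = PercentileThreshold(sample, 0.999);
//   bool isAnomalous = score > threshold;
```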
Sorry, another question, but would this be effective on sampled NetFlow data, i.e., data aggregated at n-length intervals?
Sorry, I can't give a clear answer. Maybe you can try it once and see if there's any problem.
Hi, first off, this is really cool. I'm a novice coder, and for research I would like to implement this on NetFlow data in real time. The only thing is, I'm unsure how this can be integrated into a live environment rather than run on some local dataset. Maybe it's a dumb question, but how should or could this be implemented?