CamFlow / camflow-dev

Generates kernel patch for CamFlow Linux Provenance Capture.
http://camflow.org/
GNU General Public License v2.0

Getting values for what has been read and written #98

Closed fabio-oesch closed 5 years ago

fabio-oesch commented 5 years ago

I'm trying to build a system that collects all the data for a workstation, so that I can reconstruct, for example, how a user opened Firefox and then visited a website. In general this works really well with CamFlow.

I have one problem: I cannot figure out how to get the values that have been read from and written to files. I have read the papers (ccs-2018 and socc-2017), but could not work out how to do this. The socc-2017 paper (section 7.3) mentions that collecting provenance can prevent data loss. Since I cannot figure out how the values of each file are managed, I do not fully understand how this could be achieved. How could you restore a system if we do not know what has changed in each file?

I have also tried to understand how the logs could be used to reconstruct the system, but neither the W3C PROV nor the SPADE JSON output seemed to contain any such information.

I might just be missing something obvious, and I would very much appreciate it if you could point me in the right direction.

michael-hahn commented 5 years ago

How could you restore a system if we do not know what has changed in each file?

Here, "data loss prevention" does not mean restoring a file whose data was corrupted; it means preventing data exfiltration. More precisely, Data Loss Prevention (DLP) is enterprise software that seeks to minimize the leakage of sensitive data by monitoring and controlling information flow in large, complex organizations. Take a quick look at Section 2.1 of this paper.
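To make the "monitoring information flow" idea concrete, here is a minimal sketch in Python. It is not CamFlow code: the edge list, the node names, and the `can_flow` helper are all hypothetical, and a real deployment would work on the captured provenance graph rather than a hand-written list. The point is only that a DLP-style check asks whether data could have flowed from a sensitive source to an outside sink, which needs the flow relations but not the file contents.

```python
from collections import deque


def can_flow(edges, source, sink):
    """Return True if a chain of recorded flows leads from source to sink.

    `edges` is a list of (from_node, to_node) pairs, each meaning
    "information may have flowed from from_node to to_node".
    """
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)

    seen = {source}
    queue = deque([source])
    while queue:  # plain breadth-first search
        node = queue.popleft()
        if node == sink:
            return True
        for nxt in successors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Hypothetical flows: a process reads a sensitive file, writes a temporary
# file, and another process reads that file and sends to a remote socket.
flows = [
    ("file:payroll.csv", "proc:script"),
    ("proc:script", "file:tmp.out"),
    ("file:tmp.out", "proc:uploader"),
    ("proc:uploader", "socket:remote"),
    ("file:readme.txt", "proc:viewer"),
]

leak = can_flow(flows, "file:payroll.csv", "socket:remote")
harmless = can_flow(flows, "file:readme.txt", "socket:remote")
```

In this toy example the check flags the path from `file:payroll.csv` to `socket:remote` and finds none from `file:readme.txt`, without ever looking at what the files contain.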

How the logs could be used to reconstruct the system?

I am not sure what you mean by "reconstruct the system". Did you mean the provenance graph? If you are asking how CamFlow reconstructs provenance graphs from the logs, the answer is that it does not; instead, it directly outputs the graph based on our interpretation of the semantics of the system operations that trigger the LSM hooks.
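As a rough illustration of that last point, the Python sketch below turns hook events directly into graph edges. Everything in it is an assumption made for illustration: the `ProvenanceGraph` class, the `on_hook` function, the `"read"`/`"write"` hook names, and the node labels are hypothetical and do not reflect CamFlow's actual kernel data structures or output format. It shows that each event contributes a relation between a process and a file, and that the bytes read or written are not part of the record.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceGraph:
    """Toy graph: nodes are names, edges are (source, relation, target)."""
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)

    def record(self, source, relation, target):
        self.nodes.update((source, target))
        self.edges.append((source, relation, target))


def on_hook(graph, hook, process, file):
    """Translate one hypothetical hook event into one graph edge.

    Only the direction of the flow is recorded; no file content is stored.
    """
    if hook == "read":
        graph.record(file, "read", process)    # data flows file -> process
    elif hook == "write":
        graph.record(process, "write", file)   # data flows process -> file
    else:
        raise ValueError(f"unknown hook: {hook}")


graph = ProvenanceGraph()
on_hook(graph, "read", "proc:editor", "file:notes.txt")
on_hook(graph, "write", "proc:editor", "file:report.txt")
```

After these two events the graph says that `file:report.txt` may depend on `file:notes.txt` via `proc:editor`, but nothing about which values were read or written.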

fabio-oesch commented 5 years ago

I must have misunderstood how this works. I expected that, if data were corrupted, DLP could recover a "backup" of the data by following how the current data came to be. What I am asking is whether it is possible to see what has been written to a file and what was deleted from it.

If I understand correctly, then the provenance does not tell us what has been written to or read from a file. I am trying to get this data as well, so that I can track what a user wrote in a document and how the document came into existence. I assume that I would need to get this data from the sh_read and sh_write calls.

Thank you for your help. I hope this clears up what I am trying to accomplish.

tfjmp commented 5 years ago

Wikipedia gives a good definition of what DLP is: https://en.m.wikipedia.org/wiki/Data_loss_prevention_software

Sadly it does not quite align with your vision.

Regarding shared memory: it is more complicated than that, as you would not have full mediation the way you do with read/write-like data flows through system calls.

Do you want to get in touch via e-mail with a slightly more developed description of your vision and the context of your project? I am happy to see if I can help you get there.

fabio-oesch commented 5 years ago

That sounds good to me. I will contact you via e-mail. Thank you for taking the time.