Open tmbrbr opened 6 months ago
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
2 out of 7 committers have signed the CLA.
:white_check_mark: leeN
:white_check_mark: tmbrbr
:x: Samuel Groß
:x: alexbara2000
:x: 0drai
:x: LukasHock
:x: saelo
As the PR combines several things (that are intertwined) in various stages of being complete, I was wondering the following:
As a transitional phase, we could consider enabling the primitive tainting aspect selectively, either via a preference or a mozconfig setting. This would incur additional testing costs as this is a change set that touches a large part of the tainting implementation, but our testing setup seems to be in a reasonable shape.
Especially how we build/combine taint flows for primitive taints and what to include (i.e., by not having all operations we lose a lot of information but tracking them blows up the runtime/memory cost for e.g., hash functions) seems to be something that is somewhat unclear on how it is done optimally.
So this PR contains some easy wins which would be useful right away (Adding sources via IDL attributes) and some parts which require more work. As splitting this PR in two is probably significant effort, maybe that would be a compromise?
@leeN, not a bad idea to split the PR into two parts, especially the IDL taint source attributes piece. The only tricky part will be decoupling the two, as we developed them in parallel and the commits are intertwined.
I think most of the generator modification was done in 9f242f65fcaa5c09dbb6d89fdfe44f5c80526fa7, but there may be more that needs extracting.
On the other hand, the PR in its current state compiles and runs. Perhaps we can keep the fingerprinting sources disabled by default.
I noticed another thing during debugging #213. Currently, we can't disable sources related to primitive tainting in about:config.
So, as an update/status recap from my point of view:
Playwright integration seems to be fixed due to af544cf.
Memory Leaks seem to be resolved due to 301b663
See whether existing sources can be moved to the webidl generator
I looked at adding more sources/sinks to the WebIDL generator. My first impression was this would be a fairly involved change that certainly does not work for every operation. I will investigate further to check for cost/benefit tradeoffs.
Enable generated source enable/disable via about:config options
This works as is; it is just user-unfriendly in its current form. This is fixed once #2 lands.
So the major remaining points seem to be:
I think we can get a good grasp by running a) the benchmarks (e.g., Ares 6) that mach supports and perform the same tests we did for FP tracer. There, the primitive tainting was fairly fast (<15% overhead on page load times)
Reenable the JIT. I think this would generally be beneficial, although we would probably have to disable a bunch of sources by default; otherwise, we gain fairly little, as my impression is that, at some point, every script has tainted values. That's something we might want to discuss @tmbrbr; I think having too many sources/sinks active by default overwhelms the user and makes it difficult to store all the flows. With #2, we can make them directly present in the config, and the user can tailor her experiment to it. I will add an example of how to configure this to the Wiki.
Rework the taint flow storage/representation We had a chat about this at PETS. I spoke with Simon regarding storing taint flows in a graph database, and I was left with the impression that our use case seems not to fit perfectly well (millions of small graphs vs. large graphs) and that the question of selecting a graph database for this use case is fraught with issues. Say neo4j, the best-supported graph db according to my understanding, is fairly restrictive regarding features in the free version. The product brief is all over the place, but it seems like it can only use up to 4 CPU cores. While academic licenses are supposedly available, this probably does not help SAP/industrial users, and the getting Foxhound up and running hurdle is already fairly high. Adding "buy expensive storage engine" as step 2 seems like we'd hinder adoption.
I will look further, but I believe it might be best to decouple this change from this PR. Reworking the flow representation has benefits that go beyond primitive tainting. I certainly recall fighting this for the hand sanitizer study. As this change seems fairly disruptive (i.e., all tooling would require a significant rework), we probably have to get this right. What might be an option would be to store the flow internally as a directed graph and then offer the code to lineralize the flows to their current form. My impression is that the current Foxhound users mainly use Playwright (we can make an example available here) or extensions (same applies). This would preserve compatibility with existing tooling while offering the benefits of storing a graph instead of a bunch of vectors. However, this impression might be completely wrong, as I have limited insight into how others use Foxhound!
So, to sum up this point: I believe we should decouple the flow representation from this PR and offer to disable flow creation for primitive sources via the configuration, similar to how we allow disabling taint sources. This would allow us to merge this PR earlier and avoid divergence between mainline Foxhound and primitaint Foxhound.
I agree that we should avoid diverging the main and primitaint branches, the idea of this PR is to have something we can merge to main.
I would also be in favour of splitting up the work if possible. As you suggest I think a full blown move to graph based taint flows is going to be a big effort and should be handled in a separate PR (if at all...).
This PR enables tainting for numbers, and is a mergable version of the https://github.com/SAP/project-foxhound/tree/primitaint branch.
This PR adds the following features:
There is still some work to do before the merge:
a = b + c
- if b and c are tainted, how does the taint flow look for c?See whether existing sources can be moved to the webidl generator