Closed Dentrax closed 5 months ago
Hey @Dentrax
I would love to see your project under the Falcosecurity organization :star_struck: And thank you for considering donating your project to this organization!
So big +1 from me!
Just one question: I see that falco-gpt
is MIT-licensed. Would it be ok for you to switch the license to Apache2? Per the CNCF IP Policy (section 11), all contributions must be made under Apache2, whereas third-party dependencies may use a broader set of allowed licenses.
cc @falcosecurity/core-maintainers
+1 for me too.
@Dentrax Do you agree to my becoming a co-owner with you? I think I'm the one who knows the code base best, along with you.
+1
:+1: from me! Bleeding edge technologies melt together :rocket:
amazing work, thanks +1!
Hi @Dentrax thank you for all the hard work!
While privacy concerns are clearly stated in the current repo's README https://github.com/Dentrax/falco-gpt#disclaimer, I have concerns about sponsoring a tool under The Falco Project that suggests sending sensitive real-life production data to OpenAI. This is likely to go against the privacy policies of most adopters.
Therefore, my vote is a conditional +1 if we were to make significant adjustments to the project.
Instead of recommending making calls against the OpenAI API with real data, why don't we explore how far we can get by feeding synthetic data from our existing e2e tests? Do OpenAI's recommendations actually depend on the data fields, or do they only depend on the rule names or descriptions, given that it is a generic LLM?
The project could benefit from a clearer motivation and justification for the methodology chosen, as well as an expansion of its use cases and examples.
I would require at least a best effort attempt to perform quality control and model validation. For example, each existing upstream rule should be tested multiple times, and the incident response actions suggested by OpenAI should be deemed at least somewhat valid for real-life incident response actions. This is crucial because by promoting this project, we are indirectly approving its validity, even though OpenAI clearly states that data can be wrong.
Lastly, the term "AI + Falco" seems too far-fetched at the moment, as it could be misunderstood to mean that the Falco runtime tool now uses AI to generate detections. I would hold off on using this messaging until we actually do something like that.
Hey, thanks for the interest everyone!
Would it be ok for you to switch the license to Apache2? - @leogr
Sure, I just updated the license.
Do you agree I become owner with you? - @Issif
Definitely! It'd be great to collaborate since you are already familiar with the code base.
While privacy concerns are clearly stated in the current repo's README Dentrax/falco-gpt#disclaimer, I have concerns about sponsoring a tool under The Falco Project that suggests sending sensitive real-life production data to OpenAI. This is likely to go against the privacy policies of most adopters. - @incertum
Thanks for the review, Melissa. I hadn't thought of it from that point of view; that's a great point. I'd like to address your concerns as best I can.
Instead of recommending making calls against the OpenAI API with real data, why don't we explore how far we can get by feeding synthetic data from our existing e2e tests?
That makes sense. We could create a large list of example audit logs to feed to OpenAI, so that real audit data is never sent. This could be enabled via a flag. But we should think carefully about how well this dummy data matches real-world scenarios.
Do OpenAI's recommendations actually depend on the data fields, or do they only depend on the rule names or descriptions, given that it is a generic LLM?
I'm not really sure. This would require technical knowledge of how ChatGPT works under the hood. Basically, ChatGPT uses those fields in the final output message to enrich the recommendation. Do you mean we should redact them?
I would require at least a best effort attempt to perform quality control and model validation.
Ah, yes. OpenAI can be wrong sometimes, which means this project is only as accurate as OpenAI itself.
For example, each existing upstream rule should be tested multiple times, and the incident response actions suggested by OpenAI should be deemed at least somewhat valid for real-life incident response actions. This is crucial because by promoting this project, we are indirectly approving its validity, even though OpenAI clearly states that data can be wrong.
This would be challenging. Covering this with unit tests could also be misleading, since ChatGPT's responses are sampled with a temperature setting and can vary between runs (even if you set it to 0). Maybe we should add a "Risks & Mitigations" section to the README stating that OpenAI can sometimes be wrong and should not be blindly trusted. TBH, I have no idea how we should tackle that.
could create a big example-audit-log list to feed OpenAI
Yes, this would be great to get started. Before making changes and deciding on flags and other details, let's first experiment and see what we can find.
I'm not really sure. This would require technical knowledge of how ChatGPT works under the hood. Basically, ChatGPT uses those fields in the final output message to enrich the recommendation. Do you mean we should redact them?
We absolutely need to perform black-box testing, similar to how you find exploits and such. This means feeding in all sorts of example logs, from synthetic or complete logs to redacted ones. Afterwards, we need to manually inspect the answers and assess how useful the suggestions are, especially because it is a generic LLM and not particularly trained for IR and Falco purposes.
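For the redacted end of that spectrum, a minimal sketch of one redaction step could look like this, assuming a simple deny-list of Falco output fields (the field names and placeholder are illustrative, not falco-gpt's actual behavior):

```go
package main

import "fmt"

// sensitiveFields is an illustrative deny-list; a real deployment
// would tune this to its own privacy policy.
var sensitiveFields = map[string]bool{
	"proc.cmdline": true,
	"fd.name":      true,
	"user.name":    true,
	"container.id": true,
	"k8s.pod.name": true,
}

// redact replaces the values of sensitive output fields with a
// placeholder before the alert is forwarded to an external LLM,
// leaving non-sensitive fields untouched.
func redact(fields map[string]string) map[string]string {
	out := make(map[string]string, len(fields))
	for k, v := range fields {
		if sensitiveFields[k] {
			out[k] = "<redacted>"
		} else {
			out[k] = v
		}
	}
	return out
}

func main() {
	fields := map[string]string{
		"proc.cmdline": "curl http://10.0.0.5/secrets",
		"evt.type":     "execve",
	}
	fmt.Println(redact(fields))
}
```

Comparing the model's suggestions on raw versus redacted versions of the same alert would directly answer the question of whether the recommendations depend on the data fields or only on the rule metadata.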
Ah, yes. OpenAI can be wrong sometimes, which means this project is only as accurate as OpenAI itself.
Related to the comment above, as project maintainers, we at least need to provide recommendations on how useful the model OpenAI outputs are at the moment. We also need to keep this guidance up-to-date.
Follow-up question: Have you been exposed to working on large incidents? I'd be happy to help with the assessment, as I've been around the block a bit in this regard.
Covering this with unit tests could also be misleading, since ChatGPT's responses are sampled with a temperature setting and can vary between runs.
Happy to clarify. This comment was referring to manual assessment by experts after checking the results of our existing Falco rules e2e and unit tests.
In summary, my proposed next steps are:
Test some Falco inputs, share IR suggestions by OpenAI, then define next steps.
Love it @Dentrax! Thank you.
Thanks also @incertum for all the detailed points to focus on. I agree with this:
Instead of recommending making calls against the OpenAI API with real data, why don't we explore how far we can get by feeding synthetic data from our existing e2e tests? Do OpenAI's recommendations actually depend on the data fields, or do they only depend on the rule names or descriptions, given that it is a generic LLM?
About this @incertum:
I would require at least a best effort attempt to perform quality control and model validation. For example, each existing upstream rule should be tested multiple times, and the incident response actions suggested by OpenAI should be deemed at least somewhat valid for real-life incident response actions. This is crucial because by promoting this project, we are indirectly approving its validity, even though OpenAI clearly states that data can be wrong.
I think it's important, but it might not be required at the sandbox maturity level. To make the lack of testing against all official core rules clear, we could state that this is an experimental project and is not meant to be used in production right now. In the meantime, validation could continue. What do you think?
I agree with the proposed next steps:
Test some Falco inputs, share IR suggestions by OpenAI, then define next steps.
I think it's important, but it might not be required at the sandbox maturity level. To make the lack of testing against all official core rules clear, we could state that this is an experimental project and is not meant to be used in production right now. In the meantime, validation could continue. What do you think?
I agree :+1:
@incertum, do you still have any concerns regarding accepting this sandbox request?
@incertum, do you still have any concerns regarding accepting this sandbox request?
@leogr yes my previous guidance and conditional +1 remains valid https://github.com/falcosecurity/evolution/issues/311#issuecomment-1707390167.
May I kindly ask what the challenges are regarding testing at least, let's say, 20-30 rules? Is no one else interested in at least verifying whether the IR suggestions are even remotely useful?
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Provide feedback via https://github.com/falcosecurity/community.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Provide feedback via https://github.com/falcosecurity/community.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Provide feedback via https://github.com/falcosecurity/community.
/close
@poiana: Closing this issue.
As you might have seen from the announcements ^1, and as I also introduced in the community meeting ^3, I have created a new PoC tool called falco-gpt.
Repository: https://github.com/Dentrax/falco-gpt
Motivation
Transferring ownership of the project to the Falco ecosystem would allow it to grow faster, more efficiently, and in a more organized way, since I don't have much free time to maintain it. By taking advantage of the great community and maintainers, I believe we'd do much better.
Next steps would be (some of my ideas):
Waiting for your feedback.