crewAIInc / crewAI

Framework for orchestrating role-playing, autonomous AI agents. By fostering collaborative intelligence, CrewAI empowers agents to work together seamlessly, tackling complex tasks.
https://crewai.com
MIT License
20.59k stars 2.85k forks

[BUG] CrewAI telemetry breaks EU data locality #1178

Closed pjaol closed 1 month ago

pjaol commented 2 months ago

During a review of CrewAI, we identified issues regarding the telemetry data collection and transfer processes, which may not fully comply with GDPR requirements.

Telemetry Data Transfer:

Telemetry data is currently being collected and transferred to https://telemetry.crewai.com, a location outside the EU. While CrewAI has made an effort to categorize the telemetry data and provide transparency in their documentation, several potential issues under GDPR still need to be addressed.

Potential GDPR Concerns:

  1. Implicit Personal Data Collection: CrewAI's documentation claims that no data concerning prompts, task descriptions, or agent backstories is collected unless the share_crew feature is enabled. However, even without this feature, the collected data could still potentially be considered personal if linked to identifiable individuals or actions.

Example: Desktop Implementations: User interactions could potentially be linked to a user through IP addresses or other unique identifiers, making the data personal under GDPR, even if the share_crew flag is not enabled.

  2. Lack of Explicit Consent: While CrewAI's documentation outlines the data collected and categorizes it to some extent, there is still a lack of explicit consent mechanisms for users.

Issues Identified:

  1. Data Transfer to the US: The telemetry data is transferred to the US, which is outside the EU. GDPR requires that such transfers are subject to strict safeguards to ensure data protection.

Recommendations for Improvement:

  1. Data Review and Classification:

    • Assess Data: Reevaluate whether any of the collected telemetry data could be classified as personal data under GDPR, particularly considering indirect identifiers like IP addresses.
    • Clarify Data Collection: Ensure that the data categories are clearly defined and that there is no ambiguity regarding what data is collected with or without the share_crew feature.
  2. Explicit Consent Mechanism:

    • Implement Opt-In Consent: Introduce a clear, explicit opt-in mechanism for all telemetry data collection, especially for users within the EU.
    • Consent Management: Provide users with a straightforward way to manage their consent, including the ability to opt out or revoke consent at any time.
  3. Transparency and Documentation:

    • Update Documentation: Revise the documentation to explicitly state where the data is stored, how users can access or delete their data, and how consent is managed.
    • Clarify Opt-Out Options: As the documentation states that an opt-out mechanism will be provided in the future, prioritize the development of this feature to comply with GDPR requirements.
  4. Legal Safeguards for Data Transfer:

    • Review Data Transfer Protocols: Ensure that any transfer of telemetry data to the US complies with GDPR by implementing appropriate safeguards such as Standard Contractual Clauses (SCCs) or reliance on an adequacy decision like the EU-US Data Privacy Framework.
    • Conduct a Data Protection Impact Assessment (DPIA): Given the scope of telemetry data collection, consider conducting a DPIA to evaluate and mitigate any risks associated with the current practices.
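
The opt-in and consent-management recommendations above can be illustrated with a minimal sketch. `ConsentRegistry` and its methods are hypothetical names invented for this example and are not part of CrewAI; the point is only that the default answer is "no" and that revocation is always possible.

```python
from datetime import datetime, timezone


class ConsentRegistry:
    """Hypothetical in-memory consent store (illustration only, not CrewAI code)."""

    def __init__(self):
        # Maps a user identifier to the UTC time at which consent was granted.
        self._granted = {}

    def grant(self, user_id: str) -> None:
        self._granted[user_id] = datetime.now(timezone.utc)

    def revoke(self, user_id: str) -> None:
        # Revoking must succeed even if consent was never recorded.
        self._granted.pop(user_id, None)

    def telemetry_allowed(self, user_id: str) -> bool:
        # Opt-in semantics: no record means no telemetry.
        return user_id in self._granted


registry = ConsentRegistry()
```

A caller would check `registry.telemetry_allowed(user_id)` before emitting any telemetry event; a real implementation would also need to persist the record.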

Related to:

#1177 #266 #372 #241

joaomdmoura commented 2 months ago

@pjaol Great issue / proposed solutions! I've also sent this to our legal counsel to help us navigate it and make sure we are complying with any requirements. I'm expecting to hear back from them in the next 7 - 15 days so we can prioritize any necessary work :)

pjaol commented 2 months ago

Great @joaomdmoura - appreciate it.

A significant number of people in the linked issues, and apparently on the Discord as well, believe telemetry should be opt-in. That's your call, but since several comparable products let users disable all telemetry, I encourage you to offer a similar option.

Chroma https://docs.trychroma.com/telemetry ANONYMIZED_TELEMETRY = FALSE

Langchain https://docs.smith.langchain.com/old/tracing/quick_start LANGCHAIN_TRACING_V2 = False

I'm scheduled to open a CVE on it next week, but will hold off until August 30th

joaomdmoura commented 2 months ago

Thanks for pointing it out. I'm waiting on legal to better understand the actual requirements given it's open source, so we do the right thing.

You can disable telemetry with OTEL_SDK_DISABLED=true, so maybe we just need to better document it. :)

Just so I understand, is the idea behind the CVE about disabling telemetry or #1177? I assume it's only #1177, but I want to double-check so I pass all the correct information along.

Thanks again, appreciate the work

pjaol commented 2 months ago

The CVE would be both, although I know I'll be asked to probably split it into two.

As a client IP is defined as a personal identifier, I don't see any way around that other than having the ability to disable telemetry; as it stands, it is in active breach.

OTEL_SDK_DISABLED=True 

I'll do some testing with that. It looks like it returns a NoOpTracer from OpenTelemetry, but that definitely needs to be clear and unambiguous in the documentation.
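
A minimal sketch of setting the flag from Python rather than from the shell. It assumes the SDK reads OTEL_SDK_DISABLED when telemetry is initialised and treats only the case-insensitive string "true" as enabled, which is how the OpenTelemetry specification describes Boolean environment variables. The helper `otel_sdk_disabled` is a hypothetical name used for illustration, not a CrewAI or OpenTelemetry function.

```python
import os


def otel_sdk_disabled(environ=os.environ) -> bool:
    """Return True when OTEL_SDK_DISABLED is "true" in any letter case."""
    return environ.get("OTEL_SDK_DISABLED", "").lower() == "true"


# Set the variable before importing any package that configures telemetry,
# so the value is already present when the SDK reads it.
os.environ["OTEL_SDK_DISABLED"] = "true"

print(otel_sdk_disabled())
```

Under that assumption both spellings used in this thread ("true" and "True") behave the same, while values such as "1" or "yes" would not disable anything.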

#1177 should just be the removal of the BASE_URL from the allowed sensitive data.

It could contain all or any of https://user:password@secret_project.somewhere.com:12345/next_iphone_model_ai_v0.1
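
To make that concrete, here is a small sketch using Python's standard `urllib.parse` to show which parts of such a URL carry sensitive information. The helpers `sensitive_parts` and `redact_url` are hypothetical and only illustrate the concern; they are not how CrewAI handles the value.

```python
from urllib.parse import urlsplit


def sensitive_parts(url: str) -> dict:
    """Break a base URL into the components that may leak information."""
    parts = urlsplit(url)
    return {
        "username": parts.username,
        "password": parts.password,
        "hostname": parts.hostname,
        "port": parts.port,
        "path": parts.path,
    }


def redact_url(url: str) -> str:
    """Keep only the scheme; drop credentials, host, port and path."""
    scheme = urlsplit(url).scheme
    return f"{scheme}://<redacted>" if scheme else "<redacted>"


example = "https://user:password@secret_project.somewhere.com:12345/next_iphone_model_ai_v0.1"
print(sensitive_parts(example))
print(redact_url(example))
```

Every field returned by `sensitive_parts` is populated for the example URL, which is why dropping the whole value is safer than trying to filter it.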

Appreciate you looking at this!

joaomdmoura commented 2 months ago

As a client IP is defined as a personal identifier, I don't see any way around that other than having the ability to disable telemetry; as it stands, it is in active breach.

Once we remove the BASE_URL, though, I think there won't be any IP being collected, right? So there would be no breach? I think all the other data points are generic. I will also ask counsel about this specifically and send them our docs with the list of collected data points, but thanks for going over it with me!

joaomdmoura commented 2 months ago

Oh, now that I think about it, people could add an IP into a model name even if it doesn't include the URL. But yup, given we already offer a way to disable it, I think it's just a matter of better docs.

pjaol commented 2 months ago

I think there are a couple of issues:

  1. Documenting the explicit ability to turn off telemetry. Currently the wording is "We don't offer a way to disable it now, but we will in the future." So obviously change that, providing:

    OTEL_SDK_DISABLED=true
  2. The order of documenting what's collected. Right now, stating "we're not collecting private data unless..." followed by "and here's the data we are collecting" is definitely ambiguous. If you are using US counsel, they will tell you it's the implementer's responsibility to read the whole document and understand it. EU lawyers will tell you that's why the wording used is "unambiguous". It's fun being on those calls with specialized outside counsel.

Just my opinion but a simple table of default collected data, optionally shared data makes it explicit and clear, including things like the output and if human input is included.

Defaulted | Data | Reason
Yes | Version of CrewAI | Assessing the adoption rate of our latest version helps us understand user needs and guide our updates.
Yes | Python Version | Identifying the Python versions our users operate with assists in prioritizing our support efforts for these versions.
Yes | General OS Information | Details like the number of CPUs and the operating system type (macOS, Windows, Linux) enable us to focus our development on the most used operating systems.
Yes | Number of Agents and Tasks in a Crew | Ensures our internal testing mirrors real-world scenarios, helping us guide users towards best practices.
Yes | Crew Process Utilization | Understanding how crews are utilized aids in directing our development focus.
Yes | Memory and Delegation Use by Agents | Insights into how these features are used help evaluate their effectiveness and future development.
Yes | Task Execution Mode | Knowing whether tasks are executed in parallel or sequentially influences our emphasis on enhancing parallel execution capabilities.
Yes | Language Model Utilization | Supports our goal to improve support for the most popular languages among our users.
Yes | Roles of Agents within a Crew | Understanding the various roles agents play aids in crafting better tools, integrations, and examples.
Yes | Tool Usage | Identifying which tools are most frequently used allows us to prioritize improvements in those areas.
No | Goal (Opt-In) | Part of detailed crew and task execution data, enabling deeper insight into usage patterns.
No | Backstory (Opt-In) | Part of detailed crew and task execution data, providing context for task execution and improving user experience.
No | Context (Opt-In) | Part of detailed crew and task execution data, essential for understanding how tasks are set up and executed.
No | Output (Opt-In) | Part of detailed crew and task execution data, offering insights into the final results of task execution.
No | Human Input (Opt-In) | Captures whether human input was required during task execution, helping to improve human-agent interaction mechanisms.
No | Agent Verbosity (Opt-In) | Indicates whether agents were set to verbose mode, providing insights into detailed logging and communication preferences.
No | Max Iterations (Opt-In) | Records the maximum number of iterations allowed for agents, aiding in the analysis of task complexity and agent efficiency.
No | Max RPM (Opt-In) | Tracks the maximum RPM (Requests Per Minute) settings, useful for understanding performance constraints and resource allocation.
No | Tools Names (Opt-In) | Identifies the specific tools used by agents during tasks, helping to prioritize tool development and support.
No | Tool Reuse (Opt-In) | Logs repeated usage of tools by agents, helping to identify potential areas for tool optimization or additional support.
No | Task Description (Opt-In) | Logs the description of each task, providing context for understanding task objectives and expected outcomes.
No | Expected Output (Opt-In) | Captures the expected output for tasks, useful for comparing against actual outcomes and measuring task success.
No | Task Output (Opt-In) | Records the actual output of tasks, enabling the assessment of task completion and quality of results.

  3. There's still the issue of supporting querying and deletion of the collected data, and of locality. As you're not explicitly targeting EU users, you have exemptions, but anyone implementing the software for an EU market doesn't. GDPR is always in flux, so I recommend that you include a note to developers to disable telemetry to be GDPR compliant.
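
The split between default and opt-in data proposed in the table could be expressed in code roughly as follows. This is a sketch under stated assumptions: the field keys and the `build_payload` function are hypothetical and only a subset of the table is shown; the only name taken from the discussion is the share_crew flag.

```python
# Hypothetical field keys, loosely named after rows of the table above.
DEFAULT_FIELDS = {
    "crewai_version",
    "python_version",
    "os_info",
    "agent_count",
    "task_count",
}
OPT_IN_FIELDS = {
    "goal",
    "backstory",
    "task_description",
    "expected_output",
    "task_output",
}


def build_payload(data: dict, share_crew: bool = False) -> dict:
    """Keep default fields always and opt-in fields only when share_crew is set."""
    allowed = (DEFAULT_FIELDS | OPT_IN_FIELDS) if share_crew else DEFAULT_FIELDS
    # Anything not explicitly listed is dropped, so new fields never leak by default.
    return {key: value for key, value in data.items() if key in allowed}


sample = {"crewai_version": "x.y.z", "goal": "private goal", "base_url": "https://example.invalid"}
print(build_payload(sample))
print(build_payload(sample, share_crew=True))
```

Using an allow-list means an unlisted value such as `base_url` is never sent, whichever way the flag is set.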
joaomdmoura commented 2 months ago

Commits added, they are going out on the next version, probably cutting later today or over the weekend.

After clearing with legal we:

pjaol commented 2 months ago

This is great, really appreciate it! On the last bullet point:

Confirmed that data localization is not a problem due to the fact this is not considered personal information under GDPR

As long as IPs are not retained, you should be clear on PII. Consumers just need to be informed about locality; that can be a note in the docs.

joaomdmoura commented 1 month ago

Docs updated, new version with no LLM URL cut :)