Watts-Lab / atlas

The product of all our research cartography
https://atlas.seas.upenn.edu
GNU Affero General Public License v3.0

Feature generation UI and architecture #89

Open markwhiting opened 2 months ago

markwhiting commented 2 months ago

We have discussed a few options, but the best one for now is:

  1. Feature types are contributed open source on GitHub. These would be things like 'interface with GPT.' Feature types need to provide a few things, e.g., how to score features of that type and, if possible, a mechanism for contributing features of that type.
  2. Features are things like "ask GPT how many subjects there were in the experiment" and are stored in the DB, not on GitHub.
  3. If feature types provide editable features, they should provide an interface for designing them based on the parameters of the features of that type, e.g., properties of the prompt.
  4. Feature types provide some metric for thinking about their performance.

If a user wants to contribute a new feature type, they need to provide the back-end feature resolver as well as an interface to build features of that type.

If a user wants to contribute a new feature, they must select the type it uses and then leverage the provided interface to generate the feature they need.

markwhiting commented 3 weeks ago

We just talked about this more, and it sounds like we will go with the idea that there is one DB of features that has columns such as: feature (name), parent (name), provider (name), parameters (an object which conforms to a provider-specific protocol).
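
For concreteness, a row in that table might look something like the sketch below; the specific names and parameter keys are hypothetical, just to illustrate the shape.

example_feature = {
    "feature": "number_of_subjects",   # feature (name)
    "parent": "experiment_details",    # parent (name), if any
    "provider": "gpt_interface",       # provider (name)
    "parameters": {                    # object conforming to the provider-specific protocol
        "prompt": "How many subjects were in the experiment?",
        "model": "gpt-4",
    },
}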

markwhiting commented 2 weeks ago

Here's a proposal o1-preview came up with:

Protocol for Feature Providers in the PDF Evaluation System

Overview

This protocol defines the standards and interfaces that Feature Providers must adhere to within our system. The system processes PDFs of academic papers and evaluates them using various techniques such as Large Language Models (LLMs), regular expressions (RegEx), external APIs, and human ratings. Each structurally different technique is considered a Feature Provider. Features derived from these providers are stored in a database and can be contributed by users.

Key Concepts

• Feature Provider: A modular component responsible for extracting or computing features from documents using a specific technique (e.g., LLMs, RegEx, APIs, human input).
• Feature: A specific task or metric derived from a Feature Provider, such as “Determine the number of subjects in the experiment using GPT-4.”
• Feature Type: The category or kind of features that a Feature Provider can produce. For example, an LLM-based provider might offer features related to text summarization, question-answering, etc.

Protocol Requirements

  1. Standard Interface Implementation

Each Feature Provider must implement a standard interface to ensure consistent interaction with the system. The interface includes the following methods:

• Initialization: Setup any necessary configurations.

def initialize(config: dict) -> None: pass

• Feature Definition: Provide a mechanism for defining new features based on provider-specific parameters.

def define_feature(parameters: dict) -> FeatureDefinition: pass

• Parameter Validation: Validate the parameters for a feature.

def validate_parameters(parameters: dict) -> Union[bool, str]: pass

• Feature Execution: Execute the feature on a given document.

def execute_feature(document: Document, parameters: dict) -> Any: pass

• Performance Metrics: Offer metrics to evaluate feature performance.

def get_performance_metrics() -> dict: pass
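
Taken together, these five methods could be pinned down as an abstract base class. The sketch below is only one way to express the shapes above; Document and FeatureDefinition are assumed to be defined elsewhere by the system.

from abc import ABC, abstractmethod
from typing import Any, Union

class FeatureProvider(ABC):
    # Sketch of the provider interface described above; concrete providers
    # (LLM, RegEx, API, human) would subclass this and fill in each method.

    @abstractmethod
    def initialize(self, config: dict) -> None: ...

    @abstractmethod
    def define_feature(self, parameters: dict) -> "FeatureDefinition": ...

    @abstractmethod
    def validate_parameters(self, parameters: dict) -> Union[bool, str]: ...

    @abstractmethod
    def execute_feature(self, document: "Document", parameters: dict) -> Any: ...

    @abstractmethod
    def get_performance_metrics(self) -> dict: ...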

  2. Documentation

Feature Providers must include comprehensive documentation that covers:

• Usage instructions.
• Supported features and their descriptions.
• Accepted parameters and their formats.
• Examples of feature definitions and executions.

  3. Data Structure Compliance

Inputs and outputs must conform to the system’s standard data structures:

• Document: A standardized representation of the PDF content.
• Parameters: A JSON-serializable object specific to the provider.
• FeatureResult: The output format of feature execution.
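
One possible shape for these structures as Python dataclasses; the field names are illustrative assumptions, not a fixed spec.

from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class Document:
    # Standardized representation of a PDF's content.
    id: str
    text: str
    metadata: dict = field(default_factory=dict)

@dataclass
class FeatureDefinition:
    # A defined feature, as stored in the features database.
    name: str
    provider: str
    parameters: dict

@dataclass
class FeatureResult:
    # Output of executing one feature on one document.
    feature_name: str
    document_id: str
    value: Any
    error: Optional[str] = None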

  4. Versioning

Include version information to manage updates and maintain backward compatibility.

version = "1.0.0"

  5. Security and Privacy

• Securely handle API keys and sensitive information.
• Comply with data privacy policies, especially when handling user data or interfacing with external services.

  6. Licensing

Provide clear licensing information to ensure compliance with open-source licensing requirements.

  7. Contribution Guidelines for New Feature Providers

To contribute a new Feature Provider:

• Back-End Resolver: Submit the code responsible for feature computation.
• Interface for Feature Design: Provide a user interface or API for defining new features.
• Testing: Include test cases to validate functionality.
• Documentation: As specified above.
• Submission: Follow the project’s contribution guidelines and submit via GitHub.

Database Schema for Features

Features are stored in a centralized database with the following schema:

• feature_name (STRING): Unique identifier for the feature.
• parent_name (STRING): Optional reference to a parent feature.
• provider_name (STRING): Identifier of the Feature Provider.
• parameters (JSON): Provider-specific parameters.
• metadata (JSON): Optional additional information.
• version (STRING): Feature version.
• created_by (STRING): Contributor’s identifier.
• created_at (TIMESTAMP): Creation timestamp.
• updated_at (TIMESTAMP): Last update timestamp.

Process for Contributing New Features

Users can contribute new features by:

  1. Selecting a Feature Provider: Choose the appropriate provider for the desired feature.
  2. Defining the Feature: Use the provider’s interface to specify feature parameters.
  3. Validation: The provider validates the parameters.
  4. Submission: The new feature is stored in the database upon successful validation.
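
A sketch of that flow in code; PROVIDER_REGISTRY and the db accessor are hypothetical names for pieces the system would supply.

def contribute_feature(provider_name: str, parameters: dict, db) -> None:
    # 1. Select the provider from a registry of installed Feature Providers.
    provider = PROVIDER_REGISTRY[provider_name]

    # 2-3. The provider validates the parameters and builds the definition.
    validation = provider.validate_parameters(parameters)
    if validation is not True:
        raise ValueError(f"Invalid parameters: {validation}")
    definition = provider.define_feature(parameters)

    # 4. Store the validated feature in the features table.
    db.insert("features", {
        "feature_name": definition.name,
        "provider_name": definition.provider,
        "parameters": definition.parameters,
    })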

Example Implementation

Feature Provider: LLMFeatureProvider

Initialization

class LLMFeatureProvider:
    def initialize(self, config: dict) -> None:
        self.api_key = config.get("api_key")
        self.model = config.get("model", "gpt-4")

Define Feature

def define_feature(self, parameters: dict) -> FeatureDefinition:
    # Validate parameters
    if not self.validate_parameters(parameters):
        raise ValueError("Invalid parameters")
    return FeatureDefinition(
        name=parameters["feature_name"],
        provider="LLMFeatureProvider",
        parameters=parameters
    )

Validate Parameters

def validate_parameters(self, parameters: dict) -> bool:
    required_keys = ["prompt_template", "feature_name"]
    return all(key in parameters for key in required_keys)

Execute Feature

def execute_feature(self, document: Document, parameters: dict) -> Any:
    prompt = parameters["prompt_template"].format(document_text=document.text)
    response = call_llm_api(prompt, self.model, self.api_key)
    return response
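
Usage might then look like the following sketch; the document object is assumed to be a Document loaded elsewhere, and call_llm_api is the external-call helper referenced above.

# Example usage (sketch): wiring the provider together end to end.
provider = LLMFeatureProvider()
provider.initialize({"api_key": "<your-api-key>", "model": "gpt-4"})

parameters = {
    "feature_name": "NumberOfSubjects",
    "prompt_template": "How many subjects were in the experiment? {document_text}",
}
definition = provider.define_feature(parameters)
result = provider.execute_feature(document, definition.parameters)  # document: a loaded Document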

Feature Definition in Database

• feature_name: “NumberOfSubjects”
• provider_name: “LLMFeatureProvider”
• parameters:

{ "prompt_template": "How many subjects were included in the experiment described in the following paper: {document_text}", "model": "gpt-4", "temperature": 0.7 }

Performance Metrics

Feature Providers should implement a method to return performance metrics, which could include:

• Execution Time: Average time to compute the feature.
• Accuracy: For supervised features, the accuracy against labeled data.
• Error Rates: Frequency of errors or failures.

Example:

def get_performance_metrics(self) -> dict:
    return {
        "average_execution_time": "500ms",
        "accuracy": "95%",
        "error_rate": "2%"
    }

Security Considerations

• API Keys: Should be stored securely and not hard-coded.
• Data Privacy: Ensure that any data sent to external services is anonymized if necessary.
• Compliance: Adhere to GDPR or other relevant regulations when handling personal data.
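
For example, the provider's initialize step can pull the key from the environment (or a secrets manager) instead of a hard-coded value; LLM_API_KEY is a placeholder name, not a settled convention.

import os

def initialize(self, config: dict) -> None:
    # Prefer an environment variable or secrets manager over hard-coded keys.
    self.api_key = config.get("api_key") or os.environ.get("LLM_API_KEY")
    if not self.api_key:
        raise RuntimeError("No API key configured for this provider")
    self.model = config.get("model", "gpt-4")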

Conclusion

This protocol ensures that all Feature Providers are compatible with the system, secure, and maintainable. By standardizing interfaces and practices, we enable a collaborative environment where users can contribute new providers and features effectively.

Note: This protocol is subject to updates. Contributors are encouraged to check the latest version before implementing a new Feature Provider or feature.

I generally like it and it generally conforms to what I think we have talked about. To me the key bits we need to put in place are from section 7:

  1. Providers need to provide a Back-End Resolver: the core code that executes, or calls external systems to execute, features with that provider. Aggregation should happen at this level, so e.g., if we request 10 features from GPT-4, the resolver gets them as a set of 10 in one shot and has internal logic to decide how to split them (see the sketch after this list).
  2. Providers need to provide an Interface for Feature Design: this could be as simple as a component with a few fields and validation, or as sophisticated as a wizard that helps define features for particular goals, but it needs to make any relevant parameters available to the user. We also need to allow for defining a name and other metadata and for incrementing versions on features (though this can happen automatically).
  3. Providers need to provide some validation and fallback behavior: this helps us understand how much we can trust a feature and what to do if the feature fails. I think this can be optional for now, but it should be strongly encouraged.
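
A minimal sketch of the aggregation idea from point 1, assuming a GPT-style provider; the class and helper names (build_combined_prompt, parse_response, call_llm_api) are placeholders.

class GPTResolver:
    # Sketch of a back-end resolver that receives all requested features for a
    # document at once and decides internally how to batch them into calls.
    MAX_FEATURES_PER_CALL = 5  # assumed batching limit; tune per provider

    def resolve(self, document, features: list) -> dict:
        results = {}
        for i in range(0, len(features), self.MAX_FEATURES_PER_CALL):
            chunk = features[i : i + self.MAX_FEATURES_PER_CALL]
            prompt = self.build_combined_prompt(document, chunk)  # one prompt covering the chunk
            response = call_llm_api(prompt)                       # single external call
            results.update(self.parse_response(response, chunk))  # map answers back to features
        return results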

So on our end we need a place to store providers and a way to render their interfaces, e.g., a place for that component to be rendered. I don't think the end-to-end design experience needs to be perfect yet; let's prioritize letting people add and refine features instead.