crowdin / context-harvester

A CLI for the extraction of contextual information for your keys using AI
MIT License

Crowdin Context Harvester CLI

This tool is especially useful when translating UI projects with Crowdin. The Context Harvester CLI is designed to simplify the process of extracting context for Crowdin strings from your code. Using Large Language Models (LLMs), it automatically analyzes your project code to find out how each key is used. This information is extremely useful for the human linguists or AI that will be translating your project keys, and is likely to improve the quality of the translation.

[![npm](https://img.shields.io/npm/v/crowdin-context-harvester?logo=npm&cacheSeconds=1800)](https://www.npmjs.com/package/crowdin-context-harvester) [![npm](https://img.shields.io/npm/dt/crowdin-context-harvester?logo=npm&cacheSeconds=1800)](https://www.npmjs.com/package/crowdin-context-harvester) [![npm](https://img.shields.io/github/license/crowdin/context-harvester?cacheSeconds=50000)](https://www.npmjs.com/package/crowdin-context-harvester)

Demo

Crowdin Context Harvester CLI Demo

Features

Installation

npm i -g crowdin-context-harvester

Configuration

Environment Variables

Set the following environment variables for authentication:

To use OpenAI for context extraction, set the following variables:

To use Google Gemini (Vertex AI API) for context extraction, set the following variables:

To use MS Azure OpenAI for context extraction, set the following variables:

To use Anthropic for context extraction, set the following variables:

To use Mistral for context extraction, set the following variables:

Initial Setup

To configure the CLI, run:

crowdin-context-harvester configure

This command will guide you through setting up the necessary parameters for the harvest command.

Usage

After configuration, your command might look like this:

crowdin-context-harvester harvest \
    --token="<your-crowdin-token>" \
    --url="https://acme.api.crowdin.com" \
    --project=<project-id> \
    --ai="openai" \
    --openAiKey="<your-openai-token>" \
    --model="gpt-4o" \
    --localFiles="**/*.*" \
    --localIgnore="node_modules/**" \
    --crowdinFiles="*.json" \
    --screen="keys" \
    --output="csv" \
    --contextWindowSize="128000" \
    --maxOutputTokens="16384"

Note: The --url argument is required for Crowdin Enterprise only. Passing all credentials as environment variables is recommended.
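One way to keep credentials out of scripts and shell history is a small wrapper that reads them from the environment and stops early when one is missing. This is only a sketch: `HARVESTER_CROWDIN_TOKEN`, `HARVESTER_OPENAI_KEY`, and `HARVESTER_BIN` are hypothetical names invented for this example, not variables the CLI itself reads, and only flags from the command above are used.

```shell
#!/bin/sh
# Sketch only: HARVESTER_CROWDIN_TOKEN, HARVESTER_OPENAI_KEY and HARVESTER_BIN
# are hypothetical names used in this example; the CLI does not read them.

run_harvest() {
    # $1 = Crowdin project ID
    # ${VAR:?msg} stops with an error when VAR is unset or empty.
    : "${HARVESTER_CROWDIN_TOKEN:?set HARVESTER_CROWDIN_TOKEN first}"
    : "${HARVESTER_OPENAI_KEY:?set HARVESTER_OPENAI_KEY first}"

    # Set HARVESTER_BIN to "echo" for a dry run that only prints the command.
    # For Crowdin Enterprise, add the --url flag shown in the command above.
    "${HARVESTER_BIN:-crowdin-context-harvester}" harvest \
        --token="$HARVESTER_CROWDIN_TOKEN" \
        --project="$1" \
        --ai="openai" \
        --openAiKey="$HARVESTER_OPENAI_KEY" \
        --model="gpt-4o" \
        --localFiles="**/*.*" \
        --localIgnore="node_modules/**" \
        --crowdinFiles="*.json" \
        --screen="keys" \
        --output="csv"
}
```

After sourcing the function, `run_harvest <project-id>` runs the harvest. Setting `HARVESTER_BIN=echo` beforehand prints the assembled command instead of running it.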

When this command is executed, the CLI pulls strings from all Crowdin files that match the --crowdinFiles glob pattern, then goes through every local file that matches --localFiles. Because of --screen="keys", it checks each local file for the keys of those strings. For every file that contains a match, the matching strings and the code are sent to the LLM with a prompt asking it to extract contextual information: how the strings are used in the code, how they appear to the end user in the UI, and so on.
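The screening step can be approximated locally to preview which files are likely to be sent to the model. The sketch below is an assumption about the behavior, not the CLI's actual implementation: it takes a hypothetical `keys.txt` with one key per line and prints the files that contain at least one key as a literal substring.

```shell
#!/bin/sh
# Rough local approximation of --screen="keys"; not the CLI's real code.
# $1    = file with one key per line (a hypothetical export of your keys)
# $2... = candidate source files
screen_files() {
    keys_file="$1"
    shift
    # -F: treat each key as a fixed string rather than a regex
    # -f: read the keys from a file
    # -l: print only the names of files with at least one match
    # Note: an empty line in the keys file matches every file.
    grep -l -F -f "$keys_file" -- "$@"
}
```

For example, `screen_files keys.txt src/*.js` lists only the JavaScript files in `src` that mention one of the keys.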

Extracted context will be saved to a CSV file. Add the `--csvFile` argument to change the name of the resulting CSV file.

You can now review the extracted context and save the CSV. After reviewing, you can upload newly added context to Crowdin by running:

crowdin-context-harvester upload -p <project-id> --csvFile=<csv-file-name>

Custom Prompt

Use a custom prompt with:

crowdin-context-harvester harvest ... arguments ... --promptFile="<path-to-custom-prompt>"

or

cat <path-to-custom-prompt> | crowdin-context-harvester harvest ... arguments ...

Example custom prompt file:

Extract the context for the following strings. 
Context is useful information for linguists working on these texts or for an AI that will translate them.
If none of the strings are relevant (neither keys nor strings are found in the code), do not provide context!
Please only look for exact matches of either a string text or a key in the code, do not try to guess the context!
Any context you provide should start with 'Used as...' or 'Appears as...'.
Always call the setContext function to return the context.

Strings:
%strings%

Code:
%code%
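Before a long run, it can help to confirm that a custom prompt file still contains both placeholders used in the example above. The helper below is a hypothetical convenience, not part of the CLI, and it assumes that `%strings%` and `%code%` are the placeholders the tool fills in, as the example suggests.

```shell
#!/bin/sh
# Hypothetical pre-flight check for a custom prompt file; not part of the CLI.
# Assumes %strings% and %code% are the placeholders the tool substitutes.
check_prompt_file() {
    # $1 = path to the prompt file
    for placeholder in '%strings%' '%code%'; do
        # -q: no output, only an exit status; -F: match as a fixed string.
        if ! grep -q -F -- "$placeholder" "$1"; then
            echo "missing placeholder: $placeholder" >&2
            return 1
        fi
    done
}
```

The function returns a non-zero status when a placeholder is missing, so it can gate the harvest call, for example `check_prompt_file prompt.txt && crowdin-context-harvester harvest ... arguments ... --promptFile="prompt.txt"`.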

AI Providers

The CLI currently supports OpenAI, Google Gemini (Vertex AI), MS Azure OpenAI, Anthropic, and Mistral as AI providers. Provide required credentials or a Crowdin provider ID for context extraction. Consuming AI providers through Crowdin is useful for a quick start. Note, however, that in this case the code is uploaded to Crowdin before it is sent to the AI provider.

Handling Large Projects

For large projects, use the --screen option to filter keys or texts before sending them to the AI model:

crowdin-context-harvester harvest ... arguments ... --screen="keys"

Checking Context

The check command is designed to assess whether the strings in your Crowdin project have sufficient context for accurate translation. This process helps identify potential problems that may arise during translation, and ensures that translators have all the information they need to produce high-quality translations:

crowdin-context-harvester check \
    --token="<your-crowdin-token>" \
    --url="https://acme.api.crowdin.com" \
    --project=<project-id> \
    --ai="openai" \
    --openAiKey="<your-openai-token>" \
    --model="gpt-4o" \
    --contextWindowSize="128000" \
    --maxOutputTokens="16384" \
    --crowdinFiles="**/*.*" \
    --croql="<your-croql-query>" \
    --output="csv" \
    --csvFile="<path-to-your-csv-file>"

Customize options based on your specific needs and the AI provider you choose.

Removing AI Context

To remove previously added AI context, use the reset command:

crowdin-context-harvester reset

About Crowdin

Crowdin is a platform that helps you manage and translate content into different languages. Integrate Crowdin with your repo, CMS, or other systems. Source content is always up to date for your translators, and translated content is returned automatically.

License

The Crowdin Context Harvester CLI is licensed under the MIT License. 
See the LICENSE file distributed with this work for additional 
information regarding copyright ownership.

Except as contained in the LICENSE file, the name(s) of the above copyright
holders shall not be used in advertising or otherwise to promote the sale,
use or other dealings in this Software without prior written authorization.