castedo / copyaid

Mirror of https://gitlab.com/castedo/copyaid
MIT License

FEATURE: copybreak with optional subtask #1

Closed castedo closed 7 months ago

castedo commented 8 months ago

The following is copied from https://gitlab.com/castedo/copyaid/-/issues/6

MOTIVATION

Mass testing with prompts indicates that the quality of GPT output degrades as the inputs get longer. It is also more expensive to send all the text in a file. It is a bit of a pain to have to break up documents into smaller files merely to send less text to OpenAI. It is also quite annoying to have OpenAI suggest lots of changes to sections of text that have already been worked on when only other sections need copyediting/proofreading. This feature is relatively simple to implement and gives users a lot of flexibility to control behavior and mitigate these problems.

FEATURE

Allow specially marked lines to act as "copybreaks" within source text. These lines are not included in the OpenAI request text; instead, they force the source file to be broken up into separate chunks that become part of separate prompt texts for the OpenAI API.

For markdown (.md) an example copybreak line is:

<!-- copybreak -->

and for LaTeX (.tex) an example copybreak line is:

%% copybreak
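
A minimal Python sketch of the splitting described above. This is not CopyAId's actual implementation: the function name `split_on_copybreaks` is hypothetical, and it assumes a copybreak line is recognized when its first two whitespace-separated tokens are the prefix and the keyword.

```python
def split_on_copybreaks(text: str, prefix: str, keyword: str = "copybreak") -> list[str]:
    """Split text into chunks at copybreak lines (hypothetical helper).

    The copybreak lines themselves are dropped, so they never reach
    the request text.
    """
    chunks: list[str] = []
    current: list[str] = []
    for line in text.splitlines(keepends=True):
        tokens = line.split()
        # Assumption: a copybreak line starts with the prefix, then the keyword.
        if len(tokens) >= 2 and tokens[0] == prefix and tokens[1] == keyword:
            chunks.append("".join(current))
            current = []
        else:
            current.append(line)
    chunks.append("".join(current))
    return chunks


markdown_text = "Intro paragraph.\n<!-- copybreak -->\nSecond section.\n"
markdown_chunks = split_on_copybreaks(markdown_text, "<!--")

latex_text = "Intro paragraph.\n%% copybreak\nSecond section.\n"
latex_chunks = split_on_copybreaks(latex_text, "%%")
```

Each element of the returned list would then go into its own prompt text.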

The config file for Copyaid allows control of the exact line prefix per file type (based on file extension) and the keyword. For the above example, the config in TOML would be something like:

[copybreak]
md = ['<!--', 'copybreak']
tex = ['%%', 'copybreak']

Optionally, a subtask name can follow the marking prefix, after 'copybreak' and whitespace. Which prompts and requests are triggered, if any, for a given subtask name is controlled from the config file. Some subtask names can be configured so that the chunk of text is not sent to OpenAI and is left as is. For example:

<!-- copybreak skip -->

and

%% copybreak skip

will cause all further text to be skipped, rather than sent to OpenAI, until a different subtask name is encountered.

When no subtask name is specified, the last subtask name specified is used again. The configuration for a copyaid task can specify the initial subtask name to take effect. Some users might want it to be "on" and the skip subtask name to be "off".

During an initial experimental stage I plan to use "light" and "heavy" as subtask names corresponding to light/heavy copy-editing and will probably configure "skip" as the initial subtask.
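
The subtask rules above can be sketched in Python as a small state machine. This is only an illustration under stated assumptions, not the implemented behavior: `assign_subtasks` is a hypothetical name, a copybreak line is detected by its first two whitespace-separated tokens, and a third token of `-->` is treated as the Markdown comment close rather than a subtask name.

```python
def assign_subtasks(
    text: str, prefix: str, initial: str, keyword: str = "copybreak"
) -> list[tuple[str, str]]:
    """Pair each non-empty chunk with the subtask name in effect (hypothetical helper)."""
    pairs: list[tuple[str, str]] = []
    subtask = initial  # initial subtask name, as set in the task configuration
    current: list[str] = []
    for line in text.splitlines(keepends=True):
        tokens = line.split()
        if len(tokens) >= 2 and tokens[0] == prefix and tokens[1] == keyword:
            if current:
                pairs.append((subtask, "".join(current)))
                current = []
            # Assumption: "-->" closes a Markdown comment and is not a name.
            # With no name given, the previous subtask name stays in effect.
            if len(tokens) >= 3 and tokens[2] != "-->":
                subtask = tokens[2]
        else:
            current.append(line)
    if current:
        pairs.append((subtask, "".join(current)))
    return pairs


example = (
    "A\n"
    "<!-- copybreak light -->\n"
    "B\n"
    "<!-- copybreak -->\n"
    "C\n"
    "<!-- copybreak skip -->\n"
    "D\n"
)
pairs = assign_subtasks(example, "<!--", initial="skip")
# Chunks under the "skip" subtask are left as is and never sent.
to_send = [chunk for name, chunk in pairs if name != "skip"]
```

Here the chunk before the first copybreak falls under the initial subtask, and the unnamed copybreak between B and C reuses "light".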

RELATED

https://github.com/manubot/manubot-ai-editor/ automatically splits up files into "paragraphs" and sends them as separate chunks to OpenAI. I find the logic for parsing apart "paragraphs" too fragile, hard-coded, and error-prone to be acceptable as a default for entire files. As a future feature, I imagine some CopyAId subtask names could enable a similar automatic break-up, but not by default. The additional automatic breaking would only happen because a particular subtask of a copybreak has enabled it.

castedo commented 8 months ago

I am currently thinking of using "start" and "stop" as the pre-installed example subtask names.

castedo commented 8 months ago

Similar feature/format in vale.sh:

https://vale.sh/docs/topics/config/

<!-- vale off -->
<!-- vale on -->

castedo commented 8 months ago

I am currently thinking of using "start" and "stop" as the pre-installed example subtask names.

I worry "stop" implies the rest of the doc will not be processed. Better possibilities:

castedo commented 8 months ago

I'm thinking "instruction" is a better choice than "subtask". This feature does not need to be coupled with the "task" feature of CopyAId. The code implementing this could be used in a utility that has all of the CLI convenience features of CopyAId ripped out.

castedo commented 7 months ago

This feature has been implemented and released in v0.6 and v0.7 of Copyaid. Documentation at https://copyaid.it/copybreaks/

A related feature idea in the inspiration for copyaid is at https://github.com/manubot/manubot-ai-editor/issues/32