DormSoup / dormsoup-daemon

Parse the emails and add them to database
MIT License

Transfer to SIPB LLMs #9

Open almonds0166 opened 2 months ago

almonds0166 commented 2 months ago

MIT emails and dormspam event data are classified as medium-risk information, and sending this information to proprietary services is a privacy concern at the least and a violation of MIT policies (e.g., 11.0, 13.2) at worst. Although dormspam is a large, decentralized email ecosystem that already lives on Microsoft services (Exchange/Outlook), the privacy concerns remain valid.

To address these concerns, this issue tracks our migration from ChatGPT to SIPB LLMs.

The process involves rewriting the backend's OpenAI API calls as Fetch API requests to the SIPB LLMs endpoints (e.g., our Mixtral model). Output structure that was previously enforced with JSON schemas is instead enforced with GBNF grammars, which are already written.

almonds0166 commented 2 months ago

Grammar for detecting events (corresponds to HAS_EVENT_PREDICATE_FUNCTION schema):

const HAS_EVENT_PREDICATE_GRAMMAR = dedent`
   boolean ::= ("true" | "false") space
   char ::= [^"\\\\\\x7F\\x00-\\x1F] | [\\\\] (["\\\\bfnrt] | "u" [0-9a-fA-F]{4})
   has-event-kv ::= "\\"has_event\\"" space ":" space boolean
   has-event-rest ::= ( "," space rejected-reason-kv )?
   rejected-reason-kv ::= "\\"rejected_reason\\"" space ":" space string
   root ::= "{" space  (has-event-kv has-event-rest | rejected-reason-kv )? "}" space
   space ::= | " " | "\\n" [ \\t]{0,20}
   string ::= "\\"" char* "\\"" space`
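
Since this grammar makes both keys optional (it accepts `{}` as well as either key alone), the caller probably still wants to check what came back. A minimal sketch of that check; the names `HasEventResult` and `parseHasEventResult` are hypothetical and not in the repo:

```typescript
// Shape permitted by HAS_EVENT_PREDICATE_GRAMMAR: both keys are optional.
// (Hypothetical type name, for illustration only.)
interface HasEventResult {
  has_event?: boolean;
  rejected_reason?: string;
}

// Parse the raw completion text and keep only the keys the grammar allows.
function parseHasEventResult(raw: string): HasEventResult {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("Expected a JSON object");
  }
  const obj = parsed as Record<string, unknown>;
  const result: HasEventResult = {};
  if (typeof obj.has_event === "boolean") {
    result.has_event = obj.has_event;
  }
  if (typeof obj.rejected_reason === "string") {
    result.rejected_reason = obj.rejected_reason;
  }
  return result;
}
```

With this, an empty object from the model shows up as `has_event === undefined`, which the caller can treat however it likes (e.g., as "no event").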

Grammar for extracting the events (corresponds to EXTRACT_FUNCTION schema):

const EXTRACT_GRAMMAR = dedent`
   char ::= [^"\\\\\\x7F\\x00-\\x1F] | [\\\\] (["\\\\bfnrt] | "u" [0-9a-fA-F]{4})
   events ::= "[" space (events-item ("," space events-item)*)? "]" space
   events-item ::= "{" space  (events-item-title-kv events-item-title-rest | events-item-time-in-the-day-kv events-item-time-in-the-day-rest | events-item-date-time-kv events-item-date-time-rest | events-item-duration-kv events-item-duration-rest | events-item-location-kv events-item-location-rest | events-item-organizer-kv )? "}" space
   events-item-date-time-kv ::= "\\"date_time\\"" space ":" space string
   events-item-date-time-rest ::= ( "," space events-item-duration-kv )? events-item-duration-rest
   events-item-duration-kv ::= "\\"duration\\"" space ":" space integer
   events-item-duration-rest ::= ( "," space events-item-location-kv )? events-item-location-rest
   events-item-location-kv ::= "\\"location\\"" space ":" space string
   events-item-location-rest ::= ( "," space events-item-organizer-kv )?
   events-item-organizer-kv ::= "\\"organizer\\"" space ":" space string
   events-item-time-in-the-day-kv ::= "\\"time_in_the_day\\"" space ":" space string
   events-item-time-in-the-day-rest ::= ( "," space events-item-date-time-kv )? events-item-date-time-rest
   events-item-title-kv ::= "\\"title\\"" space ":" space string
   events-item-title-rest ::= ( "," space events-item-time-in-the-day-kv )? events-item-time-in-the-day-rest
   events-kv ::= "\\"events\\"" space ":" space events
   integer ::= ("-"? integral-part) space
   integral-part ::= [0] | [1-9] [0-9]{0,15}
   rejected-reason-kv ::= "\\"rejected_reason\\"" space ":" space string
   rejected-reason-rest ::= ( "," space events-kv )?
   root ::= "{" space  (rejected-reason-kv rejected-reason-rest | events-kv )? "}" space
   space ::= | " " | "\\n" [ \\t]{0,20}
   string ::= "\\"" char* "\\"" space`
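
For reference, here is roughly what the output of this grammar looks like as TypeScript types. Every key is optional in the grammar, so the sketch below also filters out events that are missing fields. Treating `title` and `date_time` as the minimum is my assumption, and the names `ExtractedEvent`, `ExtractResult`, and `usableEvents` are hypothetical:

```typescript
// Keys mirror EXTRACT_GRAMMAR; all are optional because the grammar allows omitting them.
interface ExtractedEvent {
  title?: string;
  time_in_the_day?: string;
  date_time?: string;
  duration?: number; // integer in the grammar
  location?: string;
  organizer?: string;
}

interface ExtractResult {
  rejected_reason?: string;
  events?: ExtractedEvent[];
}

// Keep only events that carry a title and a date_time.
// ASSUMPTION: these two fields are the minimum worth storing; adjust as needed.
function usableEvents(raw: string): ExtractedEvent[] {
  const result = JSON.parse(raw) as ExtractResult;
  return (result.events ?? []).filter(
    (event) => typeof event.title === "string" && typeof event.date_time === "string"
  );
}
```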

Talking with the SIPB LLMs endpoints is the same process as it has been before (see talk.py). So for example, we could have:

async function doCompletion(prompt: string, grammar: string): Promise<string> {
   try {
      const response = await fetch(SIPB_LLMS_API_ENDPOINT, {
         method: "POST",
         headers: {
            "Authorization": `Bearer ${SIPB_LLMS_API_TOKEN}`,
            "Content-Type": `application/json`,
         },
         body: JSON.stringify({
            "messages": [
               {"role": "user", "content": prompt},
            ],
            "stream": false,
            "tokenize": true,
            "stop": ["</s>", "### User Message", "### Assistant", "### Prompt"],
            "cache_prompt": false,
            "frequency_penalty": 0,
            "grammar": grammar,
            "image_data": [],
            //"model": "mixtral",
            "min_p": 0.05,
            "mirostat": 0,
            "mirostat_eta": 0.1,
            "mirostat_tau": 5,
            //"n_predict": 1000,
            "n_probs": 0,
            "presence_penalty": 0,
            "repeat_last_n": 256,
            "repeat_penalty": 1.18,
            "seed": -1,
            "slot_id": -1,
            "temperature": 0.7,
            "tfs_z": 1,
            "top_k": 40,
            "top_p": 0.95,
            "typical_p": 1,
         }),
      });

      if (!response.ok)
         throw new Error(`HTTP error: ${response.status}`);

      const data = await response.json();
      return data["choices"][0]["message"]["content"];

   } catch (error) {
      console.error(`Error with completion:`, error);
      throw error;
   }
}

Those stop tokens may need workshopping. The relevant file is llm/emailToEvents.ts.
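
While the stop tokens are being worked out, one option is a client-side guard that cuts the returned text at the first stop sequence, in case one leaks into the output. This is only a sketch; `truncateAtStop` is a hypothetical helper, not something in the repo:

```typescript
// Cut `text` at the earliest occurrence of any stop sequence.
// Returns `text` unchanged when no stop sequence is present.
function truncateAtStop(text: string, stops: readonly string[]): string {
  let cut = text.length;
  for (const stop of stops) {
    if (stop === "") continue; // an empty stop would match at index 0
    const index = text.indexOf(stop);
    if (index !== -1 && index < cut) {
      cut = index;
    }
  }
  return text.slice(0, cut);
}
```

It could be applied to the string returned by doCompletion before JSON.parse, using the same list passed as "stop" in the request body.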

A difficulty with migrating to SIPB LLMs is that there is not yet an easy way to develop DormSoup locally, which may deserve its own GitHub Issue. For starters, the most convenient approach may be to adapt the testEmailToEventsPrompt.ts script to read from a folder of plaintext emails (e.g., https://github.mit.edu/sipb/dormdigest-emails) instead of connecting to the inbox.
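
A rough sketch of the folder-reading part, using only Node's built-in fs and path modules. I'm assuming one email per `.txt` file in a flat directory, which may not match how dormdigest-emails is actually laid out; `LocalEmail` and `loadEmailsFromFolder` are hypothetical names:

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// One plaintext email loaded from disk (hypothetical shape).
interface LocalEmail {
  filename: string;
  body: string;
}

// Read every file in `dir` (non-recursive) whose name ends with `extension`.
// ASSUMPTION: one email per file; results are sorted by filename for reproducible runs.
function loadEmailsFromFolder(dir: string, extension = ".txt"): LocalEmail[] {
  return fs
    .readdirSync(dir, { withFileTypes: true })
    .filter((entry) => entry.isFile() && entry.name.endsWith(extension))
    .map((entry) => entry.name)
    .sort()
    .map((filename) => ({
      filename,
      body: fs.readFileSync(path.join(dir, filename), "utf8"),
    }));
}
```

The test script could then loop over the returned bodies and feed each one to the same prompt code that the inbox path uses.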

almonds0166 commented 1 month ago

Some progress on the sipb-llms branch: c2aa4d038ec43530cabe221640ce836e2cb9eb59

almonds0166 commented 1 month ago

Also e9d84cc3ebf340c991c38c417207ddd7fc28c526

Environment variables need to be added to .env: SIPB_LLMS_API_ENDPOINT, SIPB_LLMS_API_TOKEN
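
It may be worth failing fast when either variable is missing, rather than sending a request with an undefined token. A minimal sketch, assuming the values are already loaded into process.env (e.g., from .env); `requireEnv` and `loadSipbLlmsConfig` are hypothetical helpers:

```typescript
// Environment lookup type; process.env satisfies this.
type Env = Record<string, string | undefined>;

// Return the named variable or throw a descriptive error if it is unset or blank.
function requireEnv(name: string, env: Env = process.env): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Collect the two variables this issue introduces.
function loadSipbLlmsConfig(env: Env = process.env): { endpoint: string; token: string } {
  return {
    endpoint: requireEnv("SIPB_LLMS_API_ENDPOINT", env),
    token: requireEnv("SIPB_LLMS_API_TOKEN", env),
  };
}
```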