astronomer / ask-astro

An end-to-end LLM reference implementation providing a Q&A interface for Airflow and Astronomer
https://ask.astronomer.io/
Apache License 2.0

Need to specify tokenization for content #109

Open mpgreg opened 1 year ago

mpgreg commented 1 year ago

https://github.com/astronomer/ask-astro/blob/c45487c7f12a9424dbe885580c687e35e30b7de4/airflow/include/data/schema.json#L54

Without specifying a tokenization scheme, ingest will default to word (see https://weaviate.io/developers/weaviate/config-refs/schema#property-tokenization). This will split snake_case configuration parameters and environment variables, because the underscore is treated like whitespace.

Example as per https://github.com/weaviate/weaviate/blob/764935fe4b576c87750d6a16ea20fd6c349b20b8/adapters/repos/db/helpers/tokenizer.go#L67

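// The tokenize* helpers are assumed to be copied from the tokenizer.go linked above.
package main

import "fmt"

func main() {
    in := "THIS is my_env_variable"

    // Print each tokenizer's name, followed by the tokens it produces.
    fmt.Print("\nwhitespace")
    fmt.Print(tokenizeWhitespace(in))
    fmt.Print("\nlowercase")
    fmt.Print(tokenizeLowercase(in))
    fmt.Print("\nword")
    fmt.Print(tokenizeWord(in))
    fmt.Print("\nwildcards")
    fmt.Print(tokenizeWordWithWildcards(in))
}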

Results in...

whitespace[THIS is my_env_variable]
lowercase[this is my_env_variable]
word[this is my env variable]
wildcards[this is my env variable]
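
For anyone who wants to reproduce the difference without pulling in the Weaviate source, here is a minimal standalone sketch using only the Go standard library. The functions approxWhitespace, approxLowercase and approxWord are hypothetical stand-ins written for this illustration. They approximate the documented behaviour of the corresponding tokenization options and are not Weaviate's actual implementation.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// approxWhitespace splits on whitespace only and preserves case.
// Hypothetical approximation of the "whitespace" option.
func approxWhitespace(in string) []string {
	return strings.Fields(in)
}

// approxLowercase lowercases the input, then splits on whitespace.
// Hypothetical approximation of the "lowercase" option.
func approxLowercase(in string) []string {
	return strings.Fields(strings.ToLower(in))
}

// approxWord lowercases the input and splits on every rune that is
// neither a letter nor a number, so '_' acts as a separator.
// Hypothetical approximation of the "word" option.
func approxWord(in string) []string {
	isSeparator := func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsNumber(r)
	}
	return strings.FieldsFunc(strings.ToLower(in), isSeparator)
}

func main() {
	in := "THIS is my_env_variable"
	fmt.Println("whitespace", approxWhitespace(in))
	fmt.Println("lowercase ", approxLowercase(in))
	fmt.Println("word      ", approxWord(in))
}
```

With the sample input, only the word-style split breaks my_env_variable into three tokens, which matches the results listed above.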

To avoid splitting snake_case words and losing the casing of camelCase params, we need to switch to whitespace tokenization.
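
As a rough sketch of what the change could look like, the snippet below builds a Weaviate-style property definition with tokenization set to whitespace and prints it as JSON. The property name content and the data type text are assumptions based on the issue title, not copied from schema.json, and the Property struct and contentProperty function are hypothetical helpers for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Property mirrors the subset of a Weaviate property definition that
// matters here. Hypothetical type for illustration only.
type Property struct {
	Name         string   `json:"name"`
	DataType     []string `json:"dataType"`
	Tokenization string   `json:"tokenization"`
}

// contentProperty returns a property definition that opts into
// whitespace tokenization instead of relying on the default.
// The name "content" and type "text" are assumptions.
func contentProperty() Property {
	return Property{
		Name:         "content",
		DataType:     []string{"text"},
		Tokenization: "whitespace",
	}
}

func main() {
	out, err := json.MarshalIndent(contentProperty(), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

The printed object is the kind of fragment that would sit in the properties list of the class in schema.json. Whether an existing class accepts the change in place or needs to be recreated and re-ingested should be checked against the Weaviate documentation.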

shillion commented 11 months ago

@sunank200 — is this issue still relevant?