benbrandt / text-splitter

Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens, and is callable from Rust and Python.
MIT License

Bug (Maybe?): `CodeSplitter` with large files does not actually semantically chunk. #236

Open suptejas opened 2 weeks ago

suptejas commented 2 weeks ago

Hey Ben,

First off, thanks for your work on this incredible library. It's enabled us to achieve substantially better embedding results in our search pipeline.

When trying to embed a file, say the below one:

use miette::Result;
use std::env;
use std::path::{Path, PathBuf};

pub mod models;
pub mod validator;

fn main() -> Result<()> {
    // Retrieve command-line arguments, skipping the first one (program name)
    let changed_files: Vec<String> = env::args().skip(1).collect();

    // Process each file path provided
    for file_path in changed_files {
        let path = Path::new(&file_path);

        // Validate the directory structure
        if let Some(model_or_provider_dir) = path.parent() {
            if let Some(org_or_providers_dir) = model_or_provider_dir.parent() {
                if let Some(data_dir) = org_or_providers_dir.parent() {
                    validate_data_directory(
                        data_dir.to_path_buf(),
                        org_or_providers_dir.to_path_buf(),
                        model_or_provider_dir.to_path_buf(),
                        path,
                    )?;
                } else {
                    return Err(miette::Error::msg(format!(
                        "Missing 'data' directory for file: {}",
                        path.display()
                    )));
                }
            } else {
                return Err(miette::Error::msg(format!(
                    "Missing organizations or providers directory for file: {}",
                    path.display()
                )));
            }
        } else {
            return Err(miette::Error::msg(format!(
                "File does not have a parent directory: {}",
                path.display()
            )));
        }
    }

    println!("PASS");
    Ok(())
}

// Validate the data directory and its subdirectories
fn validate_data_directory(
    mut data_dir: PathBuf,
    org_or_providers_dir: PathBuf,
    model_or_provider_dir: PathBuf,
    path: &Path,
) -> Result<()> {
    let num_components = path.components().count();

    if num_components == 4 {
        data_dir = org_or_providers_dir.clone();
    }

    if data_dir.starts_with("data") {
        match data_dir.file_name().unwrap().to_str().unwrap() {
            "organizations" => {
                if num_components == 4 {
                    validator::directory::validate_organization_dir(&model_or_provider_dir)?;
                    validator::file::validate_organization_file(path)?;
                } else if num_components == 5 {
                    validator::directory::validate_model_dir(&model_or_provider_dir)?;
                    validator::file::validate_model_file(&path.to_path_buf())?;
                }
            }
            "providers" => {
                validator::directory::validate_provider_dir(&model_or_provider_dir)?;
                validator::file::validate_provider_file(path)?;
            }
            _ => {
                return Err(miette::Error::msg(format!(
                    "Unexpected directory: {}, in file: {}",
                    org_or_providers_dir.display(),
                    path.display()
                )));
            }
        }
    } else {
        return Err(miette::Error::msg(format!(
            "Expected 'data' directory, found: {}, in file: {}",
            data_dir.display(),
            path.display()
        )));
    }

    Ok(())
}

When using CodeSplitter as follows:

let splitter = CodeSplitter::new(
    tree_sitter_rust::language(),
    ChunkConfig::new(8192).with_sizer(cl100k_base().unwrap()),
)
.unwrap();

let generated_chunks: Vec<String> = splitter
    .chunks(&contents)
    .map(|s| s.to_string())
    .collect();

I get the following output:

"chunks": [
            "use miette::Result;\nuse std::env;\nuse std::path::{Path, PathBuf};\n\npub mod models;\npub mod validator;\n\nfn main() -> Result<()> {\n    // Retrieve command-line arguments, skipping the first one (program name)\n    let changed_files: Vec<String> = env::args().skip(1).collect();\n\n    // Process each file path provided\n    for file_path in changed_files {\n        let path = Path::new(&file_path);\n\n        // Validate the directory structure\n        if let Some(model_or_provider_dir) = path.parent() {\n            if let Some(org_or_providers_dir) = model_or_provider_dir.parent() {\n                if let Some(data_dir) = org_or_providers_dir.parent() {\n                    validate_data_directory(\n                        data_dir.to_path_buf(),\n                        org_or_providers_dir.to_path_buf(),\n                        model_or_provider_dir.to_path_buf(),\n                        path,\n                    )?;\n                } else {\n                    return Err(miette::Error::msg(format!(\n                        \"Missing 'data' directory for file: {}\",\n                        path.display()\n                    )));\n                }\n            } else {\n                return Err(miette::Error::msg(format!(\n                    \"Missing organizations or providers directory for file: {}\",\n                    path.display()\n                )));\n            }\n        } else {\n            return Err(miette::Error::msg(format!(\n                \"File does not have a parent directory: {}\",\n                path.display()\n            )));\n        }\n    }\n\n    println!(\"PASS\");\n    Ok(())\n}\n\n// Validate the data directory and its subdirectories\nfn validate_data_directory(\n    mut data_dir: PathBuf,\n    org_or_providers_dir: PathBuf,\n    model_or_provider_dir: PathBuf,\n    path: &Path,\n) -> Result<()> {\n    let num_components = path.components().count();\n\n    if num_components == 4 {\n        
data_dir = org_or_providers_dir.clone();\n    }\n\n    if data_dir.starts_with(\"data\") {\n        match data_dir.file_name().unwrap().to_str().unwrap() {\n            \"organizations\" => {\n                if num_components == 4 {\n                    validator::directory::validate_organization_dir(&model_or_provider_dir)?;\n                    validator::file::validate_organization_file(path)?;\n                } else if num_components == 5 {\n                    validator::directory::validate_model_dir(&model_or_provider_dir)?;\n                    validator::file::validate_model_file(&path.to_path_buf())?;\n                }\n            }\n            \"providers\" => {\n                validator::directory::validate_provider_dir(&model_or_provider_dir)?;\n                validator::file::validate_provider_file(path)?;\n            }\n            _ => {\n                return Err(miette::Error::msg(format!(\n                    \"Unexpected directory: {}, in file: {}\",\n                    org_or_providers_dir.display(),\n                    path.display()\n                )));\n            }\n        }\n    } else {\n        return Err(miette::Error::msg(format!(\n            \"Expected 'data' directory, found: {}, in file: {}\",\n            data_dir.display(),\n            path.display()\n        )));\n    }\n\n    Ok(())\n}"
        ]

Is this expected? It works as it should for smaller chunk sizes like 512 and 1024, but doesn't seem to do any chunking at all with a larger token limit.

As you can see, it's not actually chunking the code in this case; it's essentially returning the entire file's contents. I would still expect the output to separate functions, imports, etc., but I'm not sure if that matches the goal you had in mind.

I'd really appreciate some clarification on this. Thanks so much.

benbrandt commented 2 weeks ago

Hi @suptejas it makes me really happy to hear that this crate has been helpful for you! Thanks for reaching out.

I haven't tested this myself, but I think it is highly likely that, with a chunk size of 8192 and the GPT tokenizer, this entire code file would fit inside a single chunk.

It is important to note that this is a greedy algorithm by default: it tries to pack as many elements as it can while staying within the chunk size. Maybe you were expecting functionality similar to what this user asked for? https://github.com/benbrandt/text-splitter/discussions/226 Where it would return all top-level items and only split the ones that are too big?
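To illustrate the greedy behavior, here is a simplified, hypothetical sketch in plain Rust (packing by character count, not the crate's actual implementation, which works on semantic elements and supports token-based sizers):

```rust
// Hypothetical illustration of greedy packing: elements are appended to the
// current chunk until adding the next one would exceed the size budget.
fn greedy_pack(elements: &[&str], max_size: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    for el in elements {
        // If adding this element would exceed the budget, close the current chunk.
        if !current.is_empty() && current.len() + el.len() > max_size {
            chunks.push(std::mem::take(&mut current));
        }
        current.push_str(el);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    let items = ["fn a() {}\n", "fn b() {}\n", "fn c() {}\n"];
    // Small budget: one element per chunk.
    assert_eq!(greedy_pack(&items, 12).len(), 3);
    // Budget larger than the whole input: everything packs into a single chunk.
    assert_eq!(greedy_pack(&items, 8192).len(), 1);
}
```

With a large enough budget, greedy packing always collapses the file into one chunk, which is the behavior reported here.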

You might be able to do something similar by using the range syntax with a lower desired size and a higher max size (something like 256..8192), which might return similar results. But I may need to add a way to not be greedy by default.
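As a rough, untested sketch of what that could look like, adapting the snippet from the issue (this assumes `ChunkConfig::new` accepts the range syntax as described above):

```rust
// Untested sketch: a range capacity with a desired size of 256 tokens and a
// max of 8192, so the splitter aims for smaller chunks but may grow up to the max.
let splitter = CodeSplitter::new(
    tree_sitter_rust::language(),
    ChunkConfig::new(256..8192).with_sizer(cl100k_base().unwrap()),
)
.unwrap();
```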

Feel free to let me know more of what you are looking for and I can see if I can accommodate, because I think your use case is a likely one. Thanks again!

suptejas commented 2 weeks ago

Thanks for the quick response! That's right, my expectation was to receive top-level items chunked out, like functions, classes, etc., since I felt that would pass more semantically concise information into the embeddings model.

suptejas commented 2 weeks ago

I'm also currently working on benchmarking code search results with different embedding methods, so if it's useful, I'm happy to report back on which method (greedy or semantic splitting) works better for a code search use case (which is probably one of the most common uses for CodeSplitter, I presume).

benbrandt commented 2 weeks ago

Ok, this was the second request for this in a short time, so I think I need to find a better way to support this flow :)

And I would love to hear any results you want to share for your use case and what you find to be beneficial, and if there is anything I can do with this crate to help support what you are trying to do.

Thanks again!