asg017 / sqlite-xsv

the fastest CSV SQLite extension, written in Rust
Apache License 2.0
122 stars 2 forks source link

csv_reader declaration to include columns with spaces #6

Open opustecnica opened 1 year ago

opustecnica commented 1 year ago

While in agreement that spaces should never be used in a column name, at times, when dealing with externally sourced csvs, spaces are indeed present.

When attempting to create a virtual table that includes spaces, a 'vtable constructor failed: iavm_reader' error is returned.

e.g. CREATE VIRTUAL TABLE temp.iavm_reader using csv_reader(plugin, [plugin name])

Is there a way to work around this?

PS When creating a virtual table directly there are no issues with the header.

e.g. CREATE VIRTUAL TABLE temp.iavm_csv using csv(filename=iavm.csv)

asg017 commented 1 year ago

not yet, this is a limitation with the custom parser sqlite-xsv has that tries to parse column names + definitions on that case. It splits by word and doesn't take into account spaces/[] brackets/"" quotes, which the SQLite core library does.

But it should! Like you mentioned there are workaround if you use filename=, but that's not always possible. Will look into seeing how much effort it'll be to update that naive parser.

opustecnica commented 11 months ago

Alex,

To the best of my understanding, columns with (unfortunate) spaces can be handled in SQLite in 3 ways: .1 'Item Type' .2 "Item Type" .3 [Item Type] (this is my preferred way)

When creating a VT like:

CREATE VIRTUAL TABLE IF NOT EXISTS temp.xsv_5m_Sales_Records_Test USING csv(filename='D:\Temp\5m-Sales-Records.csv.zst', header=false,Region TEXT, Country TEXT, [Item Type] TEXT

it fails. When replacing "[]" with single quotes, the creation goes through but the results are:

Region Country Item
------ ------- ----
Europe Poland  Beverages
Europe Poland  Clothes

Using alternate delimiters like '`' (backtick) allows the table creation to complete but the column will assume the generic name of 'Column1'

Region Country Column1
------ ------- -------
Europe Poland  Beverages
Europe Poland  Clothes
Europe Poland  Cosmetics
Europe Poland  Snacks
opustecnica commented 11 months ago

Alex, if I understand correctly the code despite my turkey induced drowsiness, the column space issue should be resolved in the sqlite-loadable-rs (vtab_argparse.rs). This is an attempt. Let me know your thoughts.

pub fn arg_is_column_declaration(arg: &str) -> Result<Option<ColumnDeclaration>, String> {
    if arg.trim().is_empty() {
        return Ok(None);
    }

    // TODO this is a bit of a hack, but it works for now
    // NOTE Test if 'arg' contains characters that imply the column name has a space.
    let regex_check = Regex::new(r#"[\['"]"#).unwrap();
    // NOTE Assign an appropriate regex string based on the above check.
    let regex_string = if regex_check.is_match(arg) {
        r#"^[\['"](?<name>.*?)[\]'"]\s+(?<declared_type>.*?)\s+(?<constraints>.*)$"#
    } else {
        r#"^(?<name>.*?)\s+(?<declared_type>.*?)\s+(?<constraints>.*)$"#
    };
    // NOTE Compile the regex string into a regex object.
    let regex_result = Regex::new(regex_string);
    // NOTE Check if the regex compiled successfully and assign the results to variables.
    match regex_result {
        Ok(re) => match re.captures(arg) {
            Some(caps) => {
                /* 
               // TEST
                println!(
                    "Found match: Name: {0} - Declared Type: {1} - Constraints: {2}",
                    &caps["name"].trim(),
                    &caps["declared_type"].trim(),
                    &caps["constraints"].trim()
                );
                */
                // TODO Verify scope.
                name = &caps["name"].trim();
                declared_type = &caps["declared_type"].trim();
                constraints = &caps["constraints"].trim();
            }
            None => {
                println!("No match found!");
            }
        },
        Err(err) => {
            println!("Could not compile regex: {}", err);
        }
    }

    Ok(Some(ColumnDeclaration::new(
        name,
        declared_type,
        constraints,
    )))
}

Happy TG!