GSmithApps / csv-aligner

Use a CSV parser #1

Closed GSmithApps closed 3 weeks ago

GSmithApps commented 1 month ago

This will fix the following:

GSmithApps commented 1 month ago

Yes, you can use papaparse to parse a CSV string in TypeScript. papaparse can also auto-detect the delimiter if you don't specify one. Here's how you can use it:

  1. First, install papaparse (plus its type definitions, since this is TypeScript):
npm install papaparse
npm install --save-dev @types/papaparse
  2. Then, use it in your TypeScript code to parse the CSV string:
import Papa from 'papaparse';

function parseCSVWithPapa(csv: string): string[][] {
    const result = Papa.parse<string[]>(csv, {
        delimiter: "", // Leave as empty string for auto-detect
    });

    // `result.data` contains the parsed array of arrays
    return result.data;
}

// Example usage:
const csvString = `name,age,city
John Doe,29,New York
Jane Smith,34,Los Angeles`;

const parsedArray = parseCSVWithPapa(csvString);
console.log(parsedArray);

In this code:

papaparse is quite robust and can handle various complexities in CSV parsing, such as quoted fields, different line endings, and custom delimiters.

GSmithApps commented 3 weeks ago

Yes, when PapaParse infers the delimiter for a CSV file, it does provide the inferred delimiter in the result. The meta property of the result contains metadata about the parsing process, and one of the fields in meta is delimiter, which holds the inferred delimiter.

For example, after parsing a CSV file, you can access the inferred delimiter like this:

Papa.parse(file, {
    complete: function(results) {
        console.log("Inferred delimiter:", results.meta.delimiter);
    }
});

This will log the delimiter that PapaParse inferred during the parsing process.

GSmithApps commented 3 weeks ago

You're right! Delimiters inside quotes should not be treated as actual delimiters in CSV files. To handle this correctly, you'd need to account for the CSV parsing rules, where quoted sections are preserved as is, and delimiters inside them are ignored.

To address this, you can still use PapaParse to correctly parse the CSV, and then write logic to only record delimiter positions outside of quoted sections. Here's a solution that does this:

let csvContent = `name,age,city
"John, Doe",30,"New York"
Jane,25,"Los Angeles"`;

// Step 1: Infer the delimiter
let config = {
    delimiter: "", // Allows PapaParse to infer the delimiter
};

let result = Papa.parse(csvContent, config);
let delimiter = result.meta.delimiter; // Get the inferred delimiter

// Step 2: Find the positions of the delimiters (outside quotes)
let delimiterPositions = [];
let inQuotes = false;

for (let i = 0; i < csvContent.length; i++) {
    let char = csvContent[i];

    // Toggle inQuotes when encountering double quotes
    if (char === '"') {
        inQuotes = !inQuotes;
    }

    // If we're not inside quotes and we hit a delimiter, record its position
    if (char === delimiter && !inQuotes) {
        delimiterPositions.push(i);
    }
}

console.log("Delimiter Positions:", delimiterPositions);

How it works:

  1. Infer the delimiter using PapaParse.
  2. Track quote state with a boolean (inQuotes) that records whether the current character is inside a quoted field.
  3. Loop through the content: when a delimiter is encountered, record its position only if it falls outside a quoted section.

This should now handle delimiters correctly, ignoring those inside quoted fields.
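The same scan can be wrapped in a reusable function. This is a minimal sketch, not code from the repository: the name `findDelimiterPositions` is hypothetical, the delimiter is assumed to be a single character (for example the value read from `result.meta.delimiter`), and the quote character is assumed to be `"`. The sample also shows that an escaped quote (`""`) inside a quoted field toggles the flag twice, so the quote state stays correct.

```typescript
// Hypothetical helper: returns the index of every delimiter that sits
// outside a double-quoted section of `text`.
// Assumes a single-character delimiter and `"` as the quote character.
function findDelimiterPositions(text: string, delimiter: string): number[] {
    const positions: number[] = [];
    let inQuotes = false;

    for (let i = 0; i < text.length; i++) {
        const char = text[i];
        if (char === '"') {
            // An escaped quote ("") toggles twice, leaving the state unchanged.
            inQuotes = !inQuotes;
        } else if (char === delimiter && !inQuotes) {
            positions.push(i);
        }
    }
    return positions;
}

// Second row contains both a quoted delimiter and escaped quotes.
const sample = 'a,"b,c",d\n"say ""hi"", ok",e';
const positions = findDelimiterPositions(sample, ",");
console.log(positions); // [1, 7, 26]
```

Passing the delimiter in as a parameter keeps the scan independent of how the delimiter was obtained.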

Would you like any further adjustments to this solution?

GSmithApps commented 3 weeks ago

Got it! What you're asking for is to track the length of each value (including leading/trailing whitespace, quotes, etc.) between delimiters, while also handling delimiters inside quoted values properly.

Here's how we can achieve that:

  1. Parse the CSV with PapaParse (which correctly handles quoting).
  2. Reconstruct each value's length based on the raw text between delimiters.

This approach will involve:

Here's a solution that does this:

let csvContent = `name,age,city
"John, Doe", 30 ,"New York"
Jane, 25, "Los Angeles"`;

// Step 1: Parse the CSV to extract the values, and infer the delimiter
let config = {
    delimiter: "", // Allows PapaParse to infer the delimiter
    skipEmptyLines: false, // Keep empty lines so parsed rows line up with the raw text
};

// Parse the CSV data
let result = Papa.parse(csvContent, config);
let delimiter = result.meta.delimiter;

// Step 2: Reconstruct raw text and capture value lengths
let valueLengths = [];
// Split raw content by newlines (assumes no quoted value contains a line break)
let rows = csvContent.split(/\r?\n/);

rows.forEach((row) => {
    let inQuotes = false;
    let valueStart = 0;
    let rowLengths = [];

    for (let i = 0; i < row.length; i++) {
        let char = row[i];

        // Toggle inQuotes when encountering double quotes
        if (char === '"') {
            inQuotes = !inQuotes;
        }

        // If we're not in quotes and hit a delimiter, calculate the value length
        if (char === delimiter && !inQuotes) {
            rowLengths.push(i - valueStart); // Length from start to current delimiter
            valueStart = i + 1; // Start of next value
        }
    }

    // After the last delimiter, push the final value's length
    rowLengths.push(row.length - valueStart);

    // Collect the row's value lengths
    valueLengths.push(rowLengths);
});

console.log("Delimiter:", delimiter);
console.log("Value lengths per row:", valueLengths);

How it works:

  1. Step 1: Use PapaParse to correctly parse the CSV and infer the delimiter.
  2. Step 2: Split the original CSV content by lines, and manually scan through each line.
    • Track quoted sections so delimiters inside quotes are ignored.
    • Each time a delimiter is encountered outside of quotes, the length of the value (including whitespace and quotes) is recorded.
    • After processing each line, record the lengths of the values.
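One caveat with splitting the raw content by lines first: a quoted value that contains a line break would be cut in two. A possible variation, sketched below, scans the whole text once and only ends a row at a newline that falls outside quotes. The function name `rawValueLengths` is hypothetical, and a single-character delimiter and `"` as the quote character are assumed.

```typescript
// Hypothetical helper: raw length of every value, scanning the whole text so
// that a newline inside quotes stays part of its value.
// Assumes a single-character delimiter and `"` as the quote character.
function rawValueLengths(text: string, delimiter: string): number[][] {
    const rows: number[][] = [];
    let row: number[] = [];
    let inQuotes = false;
    let valueStart = 0;

    for (let i = 0; i < text.length; i++) {
        const char = text[i];
        if (char === '"') {
            inQuotes = !inQuotes;
        } else if (!inQuotes && char === delimiter) {
            row.push(i - valueStart);
            valueStart = i + 1;
        } else if (!inQuotes && char === "\n") {
            // Exclude the "\r" of a "\r\n" line ending from the last value.
            const end = i > valueStart && text[i - 1] === "\r" ? i - 1 : i;
            row.push(end - valueStart);
            rows.push(row);
            row = [];
            valueStart = i + 1;
        }
    }

    // Final row; a trailing newline yields one extra row holding a single 0,
    // just as splitting on newlines would.
    row.push(text.length - valueStart);
    rows.push(row);
    return rows;
}

// The quoted value spans two lines but is still counted as one value.
const multiline = 'a,"x\ny",b';
console.log(rawValueLengths(multiline, ",")); // [[1, 5, 1]]
```

For input without line breaks inside quotes, this should produce the same lengths as the per-line scan.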

Example Output:

For the input:

name,age,city
"John, Doe", 30 ,"New York"
Jane, 25, "Los Angeles"

You would get something like:

Delimiter: ,
Value lengths per row: 
[
  [4, 3, 4], // For "name,age,city"
  [11, 4, 10], // For '"John, Doe", 30 ,"New York"'
  [4, 3, 14]  // For 'Jane, 25, "Los Angeles"'
]

This gives you the exact length of the raw values between the delimiters, including any quotes and whitespace.
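If the end goal is aligning columns, one plausible next step is to take the widest raw value in each column. The sketch below is an assumption about how the lengths might be used, not something stated in this thread: `columnWidths` is a hypothetical name, and the input numbers are made up for illustration.

```typescript
// Hypothetical helper: widest raw value in each column, given per-row
// value lengths. Rows may have different numbers of values.
function columnWidths(valueLengths: number[][]): number[] {
    const widths: number[] = [];
    for (const row of valueLengths) {
        row.forEach((valueLength, column) => {
            widths[column] = Math.max(widths[column] ?? 0, valueLength);
        });
    }
    return widths;
}

// Illustrative lengths only (not derived from the sample CSV above).
const exampleLengths = [
    [4, 3, 4],
    [10, 2, 8],
    [6, 3, 12],
];
console.log(columnWidths(exampleLengths)); // [10, 3, 12]
```

Each value could then be padded by the difference between its column's width and its own raw length.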

Let me know if this works for your needs or if you'd like further refinements!

GSmithApps commented 3 weeks ago

Fixed in 0.1.0