adaltas / node-csv

Full featured CSV parser with simple api and tested against large datasets.
https://csv.js.org
MIT License
4.05k stars 267 forks

Read header only from CSV #366

Closed imhaeussler closed 1 year ago

imhaeussler commented 2 years ago

Summary

We need to read only parts of the CSV. That is, we must not keep the complete parsed CSVs in memory.

Specifically, we need to only store

Motivation

Alternative

Draft

Additional context

We plan to use csv-parse for many such lightweight analyses, because the data must not reach the backend if any of the specified criteria is not met.

This lightweight analysis would be very valuable because it can be done inexpensively on user machines.

wdavidw commented 2 years ago

I am trying to understand:

  1. you need to extract the column names from the first line
  2. you wish to keep in memory only one column out of the whole dataset

Regarding 1, it seems like the columns option could do just fine, or you can use the row position, as in the example below.

Regarding 2, you can iterate over the stream and put only the right column in memory.

Here is an example implementing 1 and 2:

import assert from 'assert'
import { parse } from 'csv-parse'
import { generate } from 'csv-generate'

// Expected data
let headers = null
const third_column = []
// Fake readable stream
const parser = generate({
  high_water_mark: 64 * 64,
  length: 100,
  seed: 1
}).pipe(
  parse()
);
// Initialise count
let count = 0;
// Iterate through each record
for await (const record of parser) {
  if(count++ === 0){
    // 1. Extract the column names from the first line
    headers = record
  } else {
    // 2. Keep in memory only one column out of the whole dataset
    third_column.push(record[2])
  }
}
// Validation
assert.deepStrictEqual(headers, [
  'OMH',
  'ONKCHhJmjadoA',
  'D',
  'GeACHiN',
  'nnmiN',
  'CGfDKB',
  'NIl',
  'JnnmjadnmiNL'
])
assert.deepStrictEqual(third_column, [
  'fENL',             'gGeBFaeAC',        'jPbhKCHhJn',
  'DKCHjONKCHi',      'LEPPbgI',          'dmkeACHgG',
  'BDLDLF',           'C',                'kdnmiLENJo',
  'A',                'PPaeACGfCIkcj',    'oABFaepBFbgGeBFb',
  'ENJnmj',           'dlhKAABGdlhJnn',   'OPPPbfDK',
  'PbgGeoACJmjPa',    'LD',               'lfDKBFbhKp',
  'LE',               'lhIkepCIkdmjbhI',  'jPPadnlhIl',
  'Ge',               'ONKACJnlhJnnm',    'NLF',
  'clfCGepBFaeBDL',   'kdmiN',            'AACIlf',
  'jPbfE',            'gHiN',             'GclgHgG',
  'CJnmjPbhJoA',      'nlhKCGeABFcjPbh',  'afCJnmiMIleBE',
  'fDMHgHhJnlhKBF',   'keAADJopDL',       'mjaclgFciO',
  'LE',               'EONKCH',           'gFckdnnnooABEN',
  'pCIlgGfCHhKAABDM', 'clgFaeBGdmjPbh',   'jPbgGdnnmjbgGfE',
  'MGdmkepE',         'GfDLEPa',          'JopEOMGfEOON',
  'jbfEPPbiMG',       'CHgGeACJnmkclfDK', 'bhJ',
  'gIjP',             'pDKACIjPaep',      'fDMG',
  'Kp',               'gFcj',             'DJpBFbgFaeA',
  'iLFbgGdnl',        'Jn',               'ADJopC',
  'eACGfCJnmjONLF',   'Ge',               'NKAADJnn',
  'iMHh',             'PadnlgHgHiMGd',    'ABDJpCIjaclfEMHi',
  'nlgGf',            'Ge',               'B',
  'kclgFbgFbhIlgGf',  'jbhIlgHgGf',       'MIjPadnmkclfD',
  'Hg',               'mjbgH',            'GfDLDKB',
  'ENLFafCGfEN',      'Gf',               'hKCHiMHgFbiMIl',
  'biLEPPPafCIkdm',   'hKpD',             'keoBE',
  'Hg',               'D',                'KACHhKAABDLF',
  'NKA',              'HiMIkep',          'C',
  'biMGdmkdnnlgFc',   'CGeAACJnlgG',      'FbgGfCIkdnnl',
  'PafCHgFbiMIkb',    'LD',               'gGfENLEOP',
  'IjOPPPa',          'eBFbhJpCIkdlhK',   'IjPbhIlfDMGf',
  'dnmlfCGepCHhIm',   'nmjPaeoACIkepEM',  'oBDLEOONIlh',
  'A',                'NKB',              'EOMHhIlhIkd'
])

wdavidw commented 1 year ago

Closing since there was no follow-up.