Read header only from CSV

adaltas / node-csv

Full featured CSV parser with simple api and tested against large datasets.

MIT License


Summary

We need to read only parts of a CSV file. That is, we need to avoid storing the complete "parsed" CSV in memory.

Specifically, we need to store only:

The header
A previously specified column (possibly read iteratively)

Motivation

We are processing heavy CSV files (5-50 GB), for which we use PySpark on the backend instead of JS.
We need to run fast verifications on the frontend before the user actually uploads the files to the backend.
The CSV should be processed minimally on the frontend to avoid using significantly more of the user's resources than needed.

Alternative

We could read the file "by hand" to get only the first line.
- However, that way we lose the CSV format handling that csv-parse does for us, which is very useful given that we deal with many different formats (separators, quotation marks, character sets).
We could also run the verifications in Python on the backend before the parallel processing, which is possible with pandas (both header only and specific columns only).

Draft

For the header functionality, I think only the first line of the file needs to be read. I found this thread implementing that with FileReader.
For reading a specific column, the algorithm would know the position of the column and would read each line until it reaches that position, without storing the entire file in memory, only the values in the target column.
- This sounds hard, but I found this post, which uses a ReadStream to iteratively read fixed-size (10 MB) chunks of data.
- I guess csv-parse has logic to precisely detect separator instances, which would allow this iterative reading to extract and store the correct column values.

Additional context

We plan to use csv-parse for many such lightweight analyses, because the data must not reach the backend if any of the specified criteria are not met.

This lightweight analysis would be very valuable because it can be done inexpensively on user machines.

I am trying to understand:

1. you need to extract the column names from the first line
2. you wish to keep in memory only one column out of the whole dataset

Regarding 1, it seems like the columns option could do just fine, or you can rely on the row position as in the example below.

Regarding 2, you can iterate over the stream and keep only the relevant column in memory.

Here is an example implementing 1 and 2:

import assert from 'assert'
import { parse } from 'csv-parse'
import { generate } from 'csv-generate'

// Expected data
let headers = null
const third_column = []
// Fake readable stream
const parser = generate({
  high_water_mark: 64 * 64,
  length: 100,
  seed: 1
}).pipe(
  parse()
);
// Initialise count
let count = 0;
// Iterate through each record
for await (const record of parser) {
  if(count++ === 0){
    // 1. Extract the column names from the first line
    headers = record
  } else {
    // 2. Keep in memory only one column out of the whole dataset
    third_column.push(record[2])
  }
}
// Validation
assert.deepStrictEqual(headers, [
  'OMH',
  'ONKCHhJmjadoA',
  'D',
  'GeACHiN',
  'nnmiN',
  'CGfDKB',
  'NIl',
  'JnnmjadnmiNL'
])
assert.deepStrictEqual(third_column, [
  'fENL',             'gGeBFaeAC',        'jPbhKCHhJn',
  'DKCHjONKCHi',      'LEPPbgI',          'dmkeACHgG',
  'BDLDLF',           'C',                'kdnmiLENJo',
  'A',                'PPaeACGfCIkcj',    'oABFaepBFbgGeBFb',
  'ENJnmj',           'dlhKAABGdlhJnn',   'OPPPbfDK',
  'PbgGeoACJmjPa',    'LD',               'lfDKBFbhKp',
  'LE',               'lhIkepCIkdmjbhI',  'jPPadnlhIl',
  'Ge',               'ONKACJnlhJnnm',    'NLF',
  'clfCGepBFaeBDL',   'kdmiN',            'AACIlf',
  'jPbfE',            'gHiN',             'GclgHgG',
  'CJnmjPbhJoA',      'nlhKCGeABFcjPbh',  'afCJnmiMIleBE',
  'fDMHgHhJnlhKBF',   'keAADJopDL',       'mjaclgFciO',
  'LE',               'EONKCH',           'gFckdnnnooABEN',
  'pCIlgGfCHhKAABDM', 'clgFaeBGdmjPbh',   'jPbgGdnnmjbgGfE',
  'MGdmkepE',         'GfDLEPa',          'JopEOMGfEOON',
  'jbfEPPbiMG',       'CHgGeACJnmkclfDK', 'bhJ',
  'gIjP',             'pDKACIjPaep',      'fDMG',
  'Kp',               'gFcj',             'DJpBFbgFaeA',
  'iLFbgGdnl',        'Jn',               'ADJopC',
  'eACGfCJnmjONLF',   'Ge',               'NKAADJnn',
  'iMHh',             'PadnlgHgHiMGd',    'ABDJpCIjaclfEMHi',
  'nlgGf',            'Ge',               'B',
  'kclgFbgFbhIlgGf',  'jbhIlgHgGf',       'MIjPadnmkclfD',
  'Hg',               'mjbgH',            'GfDLDKB',
  'ENLFafCGfEN',      'Gf',               'hKCHiMHgFbiMIl',
  'biLEPPPafCIkdm',   'hKpD',             'keoBE',
  'Hg',               'D',                'KACHhKAABDLF',
  'NKA',              'HiMIkep',          'C',
  'biMGdmkdnnlgFc',   'CGeAACJnlgG',      'FbgGfCIkdnnl',
  'PafCHgFbiMIkb',    'LD',               'gGfENLEOP',
  'IjOPPPa',          'eBFbhJpCIkdlhK',   'IjPbhIlfDMGf',
  'dnmlfCGepCHhIm',   'nmjPaeoACIkepEM',  'oBDLEOONIlh',
  'A',                'NKB',              'EOMHhIlhIkd'
])

Read header only from CSV #366