Closed imhaeussler closed 1 year ago
I am trying to understand:
Regarding 1, it seems like the columns option could do just fine, or you can rely on the row position as in the example below.
Regarding 2, you can iterate over the stream and put only the right column in memory.
Here is an example implementing 1 and 2:
import assert from 'assert'
import { parse } from 'csv-parse'
import { generate } from 'csv-generate'
// Expected data
let headers = null
const third_column = []
// Fake readable stream
const parser = generate({
  high_water_mark: 64 * 64,
  length: 100,
  seed: 1
}).pipe(
  parse()
);
// Initialise count
let count = 0;
// Iterate through each record
for await (const record of parser) {
  if(count++ === 0){
    // 1. Extract the column names from the first line
    headers = record
  } else {
    // 2. Keep in memory only one column out of the whole dataset
    third_column.push(record[2])
  }
}
// Validation
assert.deepStrictEqual(headers, [
'OMH',
'ONKCHhJmjadoA',
'D',
'GeACHiN',
'nnmiN',
'CGfDKB',
'NIl',
'JnnmjadnmiNL'
])
assert.deepStrictEqual(third_column, [
'fENL', 'gGeBFaeAC', 'jPbhKCHhJn',
'DKCHjONKCHi', 'LEPPbgI', 'dmkeACHgG',
'BDLDLF', 'C', 'kdnmiLENJo',
'A', 'PPaeACGfCIkcj', 'oABFaepBFbgGeBFb',
'ENJnmj', 'dlhKAABGdlhJnn', 'OPPPbfDK',
'PbgGeoACJmjPa', 'LD', 'lfDKBFbhKp',
'LE', 'lhIkepCIkdmjbhI', 'jPPadnlhIl',
'Ge', 'ONKACJnlhJnnm', 'NLF',
'clfCGepBFaeBDL', 'kdmiN', 'AACIlf',
'jPbfE', 'gHiN', 'GclgHgG',
'CJnmjPbhJoA', 'nlhKCGeABFcjPbh', 'afCJnmiMIleBE',
'fDMHgHhJnlhKBF', 'keAADJopDL', 'mjaclgFciO',
'LE', 'EONKCH', 'gFckdnnnooABEN',
'pCIlgGfCHhKAABDM', 'clgFaeBGdmjPbh', 'jPbgGdnnmjbgGfE',
'MGdmkepE', 'GfDLEPa', 'JopEOMGfEOON',
'jbfEPPbiMG', 'CHgGeACJnmkclfDK', 'bhJ',
'gIjP', 'pDKACIjPaep', 'fDMG',
'Kp', 'gFcj', 'DJpBFbgFaeA',
'iLFbgGdnl', 'Jn', 'ADJopC',
'eACGfCJnmjONLF', 'Ge', 'NKAADJnn',
'iMHh', 'PadnlgHgHiMGd', 'ABDJpCIjaclfEMHi',
'nlgGf', 'Ge', 'B',
'kclgFbgFbhIlgGf', 'jbhIlgHgGf', 'MIjPadnmkclfD',
'Hg', 'mjbgH', 'GfDLDKB',
'ENLFafCGfEN', 'Gf', 'hKCHiMHgFbiMIl',
'biLEPPPafCIkdm', 'hKpD', 'keoBE',
'Hg', 'D', 'KACHhKAABDLF',
'NKA', 'HiMIkep', 'C',
'biMGdmkdnnlgFc', 'CGeAACJnlgG', 'FbgGfCIkdnnl',
'PafCHgFbiMIkb', 'LD', 'gGfENLEOP',
'IjOPPPa', 'eBFbhJpCIkdlhK', 'IjPbhIlfDMGf',
'dnmlfCGepCHhIm', 'nmjPaeoACIkepEM', 'oBDLEOONIlh',
'A', 'NKB', 'EOMHhIlhIkd'
])
Closing since there was no follow-up.
Summary
We need to read only parts of the CSV. That is, we must not store the complete "parsed" CSV in memory.
Specifically, we need to only store
Motivation
Alternative
Draft
With respect to the header functionality, I think reading the first line of the file would be required. I found this thread implementing that functionality with FileReader.
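The header idea above can be sketched roughly as follows. This is only an illustration under several assumptions: the file is available as a Blob (a browser File is one), the header line fits in the first maxBytes bytes, and the header contains no quoted commas or newlines. The names readHeader and maxBytes are hypothetical, and the sketch uses Blob.slice and Blob.text rather than the FileReader approach from the thread mentioned above.

```javascript
// Hypothetical helper, not part of csv-parse.
// Reads only a small prefix of the file and returns the header cells.
// Assumes the header fits in `maxBytes` and has no quoted commas/newlines.
async function readHeader(blob, maxBytes = 4096) {
  // Decode just the first `maxBytes` bytes; the rest of the file is not read
  const prefix = await blob.slice(0, maxBytes).text();
  const newline = prefix.indexOf('\n');
  const firstLine = newline === -1 ? prefix : prefix.slice(0, newline);
  // Drop a trailing carriage return (Windows line endings), then split
  return firstLine.replace(/\r$/, '').split(',');
}
```

In a browser this could presumably be called with the File taken from a file input; a real CSV with quoted header cells would still need a proper parser such as csv-parse.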
With respect to reading a specific column, the algorithm would know the position of the column and would read each line up to that position. It would not store the entire file in memory, only the values in the right column.
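A minimal sketch of that column algorithm, assuming the input arrives as an iterable (sync or async) of string chunks, the first line is the header, and no field contains quoted commas or newlines. The name columnValues is hypothetical, and the naive comma split stands in for a real parser.

```javascript
// Hypothetical helper, not part of csv-parse.
// Yields the cell at `columnIndex` for every data row. Only the current
// incomplete line is buffered, never the whole file.
async function* columnValues(chunks, columnIndex) {
  let pending = '';    // text after the last complete line
  let isHeader = true; // the first line holds the column names, skip it
  // Naive cell lookup: strip a trailing CR, split on commas (no quoting)
  const cell = (line) => line.replace(/\r$/, '').split(',')[columnIndex];
  for await (const chunk of chunks) {
    const lines = (pending + chunk).split('\n');
    pending = lines.pop(); // the last piece may be an incomplete line
    for (const line of lines) {
      if (isHeader) {
        isHeader = false;
        continue;
      }
      yield cell(line);
    }
  }
  // Flush a final data line that has no trailing newline
  if (pending !== '' && !isHeader) {
    yield cell(pending);
  }
}
```

Because the values are yielded one at a time, the caller decides what to keep: it can collect them in an array, or check each value against a criterion and stop iterating early.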
Additional context
We plan to use csv-parse for many such lightweight analyses, because the data must not reach the backend if any of the specified criteria is not met.
This lightweight analysis would be very valuable because it can be done on user machines inexpensively.