LibertyDSNP / parquetjs

Fully asynchronous, pure JavaScript implementation of the Parquet file format with additional features
MIT License

JavaScript heap out of memory error when using openS3 and uploading to S3 with Upload from AWS SDK v3 #130

Open riddhi123 opened 6 days ago

riddhi123 commented 6 days ago

I am reading a snappy.parquet file with let reader = await ParquetReader.openS3(s3, params);
and then uploading the same data to S3 as csv.gz using the code below:

import { ParquetReader } from "@dsnp/parquetjs";
import { S3Client } from "@aws-sdk/client-s3";
import { Upload } from "@aws-sdk/lib-storage";
import { Readable } from "stream";
import { stringify } from "csv-stringify";
import zlib from "zlib";

async function main_function(params) {
    const s3 = new S3Client({
        region: "ap-southeast-2"
    });
    const reader = await ParquetReader.openS3(s3, params);
    const upload = new Upload({
        client: s3,
        params: {
            Bucket: "test-csv",
            Key: "originalCsv/test.csv.gz",
            Body: Readable.from(reader).pipe(stringify({ header: true })).pipe(zlib.createGzip())
        },
        queueSize: 9
    });
    upload.on("httpUploadProgress", (progress) => {
        console.log(progress);
    });
    const uploadResponse = await upload.done();
    console.log("uploadResponse::", uploadResponse);
    return "done";
}

Earlier (May 2023), this code worked fine with snappy.parquet files of 100 MB or more on Node 16. The code has since moved to Node 20, and it now fails with the error below for a 34 MB file on both Node 16 and Node 20:

<--- Last few GCs --->

[21836:00000131C04D35E0]   169720 ms: Scavenge 2042.1 (2048.8) -> 2041.3 (2048.5) MB, 1.5 / 0.0 ms (average mu = 0.170, current mu = 0.131) allocation failure;
[21836:00000131C04D35E0]   169725 ms: Scavenge 2042.9 (2049.5) -> 2041.7 (2051.0) MB, 1.2 / 0.0 ms (average mu = 0.170, current mu = 0.131) allocation failure;
[21836:00000131C04D35E0]   169731 ms: Scavenge 2044.3 (2052.9) -> 2042.2 (2051.5) MB, 1.3 / 0.0 ms (average mu = 0.170, current mu = 0.131) allocation failure;

<--- JS stacktrace --->

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
 1: 00007FF74150234F node_api_throw_syntax_error+179983
 2: 00007FF741486986 v8::internal::MicrotaskQueue::GetMicrotasksScopeDepth+61942
 3: 00007FF741488693 v8::internal::MicrotaskQueue::GetMicrotasksScopeDepth+69379
 4: 00007FF741FC6411 v8::Isolate::ReportExternalAllocationLimitReached+65
 5: 00007FF741FB1066 v8::internal::V8::FatalProcessOutOfMemory+662
 6: 00007FF741E17770 v8::internal::EmbedderStackStateScope::ExplicitScopeForTesting+144
 7: 00007FF741E24172 v8::internal::Heap::PublishPendingAllocations+1106
 8: 00007FF741E21963 v8::internal::Heap::PageFlagsAreConsistent+3171
 9: 00007FF741E13FA3 v8::internal::Heap::CollectGarbage+2723
10: 00007FF741E1C2AA v8::internal::Heap::GlobalSizeOfObjects+266
11: 00007FF741E6C80F v8::internal::StackGuard::HandleInterrupts+879
12: 00007FF741AF0F56 v8::internal::Runtime::SetObjectProperty+26918
13: 00007FF74206FA61 v8::internal::SetupIsolateDelegate::SetupHeap+606705
14: 00007FF6C2372D0A

What could be the issue?

wilwade commented 6 days ago

@riddhi123 Two quick questions:

  1. Which version of the library were you on when it was working?
  2. Which library version are you using now?
wilwade commented 6 days ago

Also could you try v1.6.2 and see if that has the issue?

My guess is it is something to do with v1.7.0 that had a lot of dependency updates and the issue is somewhere in there.

riddhi123 commented 6 days ago

@riddhi123 Two quick questions:

  1. Which version of the library were you on when it was working?
  2. Which library version are you using now?
  1. It was working on v1.2.1, but not now.
  2. The latest, v1.7.0.
riddhi123 commented 6 days ago

Same error with v1.6.2:

[screenshot: the same out-of-memory error]

wilwade commented 6 days ago

@riddhi123 Hmm... Lots of changes since v1.2.1. If it is easy could you test v1.5.0? 1.6 did some major updates to support v3 of the AWS sdk. I suspect that's where the issue appeared.

riddhi123 commented 5 days ago

@wilwade With v1.5.0 on Node 16 I get the same error; on Node 20 I instead get client.getObject is not a function at ParquetEnvelopeReader.readFn (/opt/nodejs/node_modules/@dsnp/parquetjs/dist/lib/reader.js:392:36), since v1.5.0 does not support the v3 SDK changes.

shannonwells commented 1 day ago

@riddhi123 I can't reproduce your error with our test files. Can you please post a link to the test file you have that's failing? Meanwhile I will attempt to create a test file like the one you've described. I've also modified your test script so it doesn't overwrite the parquet file with a gzipped CSV so I can keep rerunning the test.

shannonwells commented 1 day ago

@riddhi123 Have you tried increasing the memory allocation with --max-old-space-size? I was able to create a snappy-compressed parquet file of over 40 MB and upload it to an S3 bucket. I then ran your download-and-convert code above, which succeeded. I downloaded the resulting gzipped file; it uncompressed fine and has the expected CSV content. So we will need your test file to try to reproduce the error you are seeing. Below is the end of the output of this script, and a screenshot showing the two files in an S3 bucket; the gz file is the output of your stringify/zip line.

...
{
  loaded: 32129742,
  total: undefined,
  part: 7,
  Key: 'testBig.snappy.parquet.gz',
  Bucket: 'my-bucket'
}
uploadResponse:: {
  '$metadata': {
    httpStatusCode: 200,
    requestId: '55ZAJFKW48QVA811',
    extendedRequestId: '8FqKwastV2ZoTRCAjU5kNseMlmlf+XZmvQ2XYuukVhOFzlSGZ9gifYcYkT0ppjGYpMOVflDB8iM=',
    cfId: undefined,
    attempts: 1,
    totalRetryDelay: 0
  },
  ServerSideEncryption: 'AES256',
  Bucket: 'my-bucket',
  ETag: '"92a4f2c0ff33b6eadeadbeefdbdbdbdbb-7"',
  Key: 'testBig.snappy.parquet.gz',
  Location: 'https://my-bucket.s3.cn-east-99.amazonaws.com/testBig.snappy.parquet.gz'
}
[screenshot, 2024-07-01: the parquet file and the gzipped CSV output in the S3 bucket]
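For reference on the --max-old-space-size suggestion above: the flag raises V8's old-space heap limit (value in MB) and is passed straight to node. The script name in the comment is a placeholder for the actual conversion script:

```shell
# Allow the V8 old-space heap to grow to 4 GB before OOM-ing.
# "convert.js" is a placeholder for the actual conversion script:
#   node --max-old-space-size=4096 convert.js
# The flag is plain V8, so it can be sanity-checked with an inline script:
node --max-old-space-size=4096 -e "console.log('heap limit raised')"
```

Note this only buys headroom; if the pipeline buffers without bound, the process will still eventually exhaust whatever limit is set.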