apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Ruby] Segmentation Error Returned when Executing Get S3 File Method from Arrow Dataset Library #38156

Open bugbare opened 1 year ago

bugbare commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

The following method runs as part of a step definition. It retrieves an uploaded Parquet file and transforms it into a Ruby hash, and it works at runtime:

def retrieve_s3_records(bucket, uri)
    file = URI("s3://#{CGI.escape(Env.aws_access_key_id)}:#{CGI.escape(Env.aws_secret_access_key)}@#{bucket}/#{uri}")
    Arrow::FileSystem::FinalizeS3
    table = Arrow::Table.load(file)
    records = []
    table.length.times do |i|
      records[i] = table.slice(i).to_h
      # puts "\n\n#{records[i]}\n\n"
    end
    updated_records = convert_to_hash(records)
    CreateTestData.stringify_data(updated_records)
end

However, the following output is appended at the end of the run:

./cpp/src/arrow/filesystem/s3fs.cc:2829:  arrow::fs::FinalizeS3 was not called even though S3 was initialized.  This could lead to a segmentation fault at exit
/usr/local/bundle/bin/cucumber: [BUG] Segmentation fault at 0x0000000000000000
ruby 3.2.0 (2022-12-25 revision a528908271) [x86_64-linux]

-- Machine register context ------------------------------------------------
 RIP: 0x0000000000000000 RBP: 0x000055e9474b4600 RSP: 0x00007fff79686828
 RAX: 0x000055e94763db50 RBX: 0x00007fcacfa4ac00 RCX: 0x00007fcacfa35bb8
 RDX: 0x000055e9478d8c70 RDI: 0x000055e9474b4600 RSI: 0x0000000000000007
  R8: 0x0000000000000007  R9: 0x000000000000000d R10: 0x0000000000000001
 R11: 0x0000000000246678 R12: 0x0000000000000000 R13: 0x0000000000000334
 R14: 0x00007fcacfa36e08 R15: 0x000055e946a90dd0 EFL: 0x0000000000010246

-- C level backtrace information -------------------------------------------
corrupted size vs. prev_size in fastbins

I am using the following Arrow packages (taken from the Gemfile):

gem 'red-arrow', '~> 13.0'
gem 'red-arrow-dataset', '~> 13.0'
gem 'red-parquet', '~> 13.0'

I have also downgraded all three gems above to 12.0.1 and 11.0, but I get the same issue. The error above seems to be handled in Python, but I couldn't find a fix for it in a Ruby context.

Any help or pointers would be much appreciated.

Component(s)

Ruby

kou commented 1 year ago

Could you call Arrow.s3_finalize explicitly after you have finished using all S3-related features and before your program exits?
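
A minimal sketch of how that suggestion might be wired into a test suite. The S3Cleanup module and its finalize_once method are hypothetical names, not part of red-arrow; only Arrow.s3_finalize comes from this thread. The finalizer is passed in as a callable so the helper itself does not depend on the gem being loaded, and the run-once guard is a precaution, since the thread does not say whether calling the finalizer twice is safe.

```ruby
# Hypothetical helper (not part of red-arrow): runs the given S3 finalizer
# at most once, even if several exit hooks end up calling it.
module S3Cleanup
  @finalized = false

  class << self
    # finalizer: any callable. In a real suite this would be
    # -> { Arrow.s3_finalize }, as suggested in this thread.
    # Returns true if the finalizer ran, false if it had already run.
    def finalize_once(finalizer)
      return false if @finalized

      finalizer.call
      @finalized = true
    end
  end
end

# Possible usage, e.g. in features/support/env.rb (assumes red-arrow is
# loaded and that all S3 reads have completed by the time the hook fires):
#
#   at_exit { S3Cleanup.finalize_once(-> { Arrow.s3_finalize }) }
```

Registering the call in an exit hook keeps it out of retrieve_s3_records, so the method can be called repeatedly during a run while finalization happens once, after the last S3 access.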

andrewhampton commented 2 months ago

FWIW, I dropped the dataset gem and manually retrieved the files from S3 to avoid this issue.

kou commented 2 months ago

Arrow.s3_finalize didn't solve your case?