jpfeuffer opened this issue 2 months ago
@jpfeuffer thanks for the report!
Could you also share a (snippet of) the csv file (or a similar one with dummy data) to see if we can reproduce this?
Unfortunately I have nothing small enough to upload, and since it is not working I cannot easily produce a subset. (I'm not sure a 22 MB subset for GitHub would be very representative either.)
The data is from here (warning: 15 GB): https://ftp.enamine.net/download/REAL/Enamine_REAL_HAC_11_21_666M_CXSMILES.cxsmiles.bz2
By the way, I just saw that even if I do not limit the chunks (and the process succeeds without a segfault), the written file is always corrupt starting from a few thousand lines in, with lines being mixed together, etc.
Is something wrong with my code? Regular decompression works fine.
Hmm, so I found the "error": if I use exactly the schema from the first read batch, it works (no segfaults, no garbled output).
cnt = 0
writer_init = False
while cnt < args.num_chunks or args.num_chunks == 0:
    try:
        b = reader.read_next_batch()
        if not writer_init:
            print(b.schema)
            writer = csv.CSVWriter(
                output_file,
                write_options=writeopt,
                schema=b.schema
            )
            writer_init = True  # without this the writer would be recreated for every batch
        writer.write_batch(b)
        cnt += 1
        print("Finished chunk ", cnt)
    except StopIteration:
        break
print("Done. Closing file.")
if writer_init:
    writer.close()
But if I print the schema:
smiles: string
id: string
So, what is different from manually specifying the schema like I did in the beginning?
schema = pa.schema([
    ('smiles', pa.string()),
    ('id', pa.string())
])
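For what it's worth, a quick way to check whether the inferred schema really matches the manually specified one (a sketch; check_metadata=True also compares attached metadata, which a plain print of the schema does not show):

# Sketch: compare the first batch's schema against the manual one.
# Schema.equals with check_metadata=True also compares field/schema
# metadata, which print(b.schema) does not reveal.
manual = pa.schema([('smiles', pa.string()), ('id', pa.string())])
print(b.schema.equals(manual))                       # structural equality only
print(b.schema.equals(manual, check_metadata=True))  # including metadata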
@jpfeuffer I was able to reproduce the segfault with the big file you provided. Running with gdb gives the following backtrace:
If I unpack the archive, and read the csv file directly (so without the automatic decompression), it doesn't segfault.
Based on that, I tried to create a simpler reproducer with some generated data:
import pyarrow as pa
from pyarrow import csv
from pyarrow.tests.util import rands

size = 500_000
random_strings = [rands(10) for _ in range(size // 100)] * 100
table = pa.table({"col1": range(size), "col2": random_strings})

with pa.CompressedOutputStream("data.csv.bz2", "bz2") as out:
    csv.write_csv(table, out)
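As a quick sanity check (just a sketch), the generated file reads back fine in one non-incremental pass, so the compressed data itself is valid:

# Sanity check sketch: a one-shot read of the compressed file succeeds;
# pyarrow infers the bz2 compression from the file extension.
roundtrip = csv.read_csv("data.csv.bz2")
assert roundtrip.num_rows == size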
Reading this compressed file with a script similar to your original one:
import pyarrow as pa
from pyarrow import csv

input_file = "data.csv.bz2"
output_file = "test_out.csv"
num_chunks = 2

schema = pa.schema([('col1', pa.string()), ('col2', pa.string())])
convertopt = csv.ConvertOptions(column_types=schema)

with csv.open_csv(input_file, convert_options=convertopt) as reader:
    with csv.CSVWriter(output_file, schema=schema) as writer:
        cnt = 0
        while cnt < num_chunks or num_chunks == 0:
            try:
                writer.write_batch(reader.read_next_batch())
                cnt += 1
                print("Finished chunk ", cnt)
            except StopIteration:
                break
print("Done. Closing file.")
also segfaults (similarly, if I increase num_chunks so that the file is read until the end, it doesn't segfault).
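That also suggests a possible user-side workaround until the root cause is fixed (a sketch, assuming leftover readahead tasks are the problem): drain the reader to EOF before it goes out of scope, even once the chunk limit is reached:

# Workaround sketch: after the chunks we care about, consume the remaining
# batches so the reader reaches EOF before being destroyed.
try:
    while True:
        reader.read_next_batch()
except StopIteration:
    pass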
That's great (I guess 😅). By the way: did you see my comment with the update? It seems to have something to do with the schema used.
Yes, I was still trying to reproduce that observation ;) But as you mentioned, the schema is exactly the same, so it must be something else in the restructured code that causes the change.
By the way, I just saw that even if I do not limit the chunks (and the process succeeds without a segfault), the written file is always corrupt starting from a few thousand lines in, with lines being mixed together, etc.
That's not something I see with my reproducer above, I think (by "mixed together", do you mean that there are newline markers missing?)
Also, with the original file, after how many rows/chunks do you see this? (To see if I can reproduce this part with the original file.)
Mmh, I am not sure it was 'just' missing newlines. It felt even more garbled, as if the concurrent streams were not being flushed at the right time.
It was after about 1500 rows. I think if you write a few chunks and then tail the file, you should see it.
With the original bz2 file, reading 2 chunks with a 2**30 block size (as in your original example) and then running tail -n 50 test_out.csv gives:
I'm not familiar with the file, but that looks OK to me?
Also, do you see the garbled-output issue with the simplified reproducer (with generated data) that I posted above?
(FWIW, the incremental open_csv function is single-threaded, see https://arrow.apache.org/docs/python/csv.html#incremental-reading, so there should be no concurrent streams.)
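To check for garbled output programmatically, one option (a sketch) is to re-parse the written file and let the reader flag malformed rows:

# Integrity check sketch: mixed-together lines would typically surface as a
# column-count parse error, or as a wrong row count.
check = csv.read_csv("test_out.csv")
print(check.num_rows)
print(check.schema)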
I also still see the segfault when using the restructured read/write code from https://github.com/apache/arrow/issues/43604#issuecomment-2275356020 (using the schema of the first batch to create the writer)
Further simplification (there is no need to also write the data out; just reading a subset is enough to trigger the segfault):
Generate some data:
import pyarrow as pa
from pyarrow import csv
from pyarrow.tests.util import rands

size = 500_000
random_strings = [rands(10) for _ in range(size // 100)] * 100
table = pa.table({"col1": range(size), "col2": random_strings})

with pa.CompressedOutputStream("data.csv.bz2", "bz2") as out:
    csv.write_csv(table, out)
Read part of the file:
import pyarrow as pa
from pyarrow import csv

input_file = "data.csv.bz2"
num_chunks = 2

schema = pa.schema([('col1', pa.string()), ('col2', pa.string())])
convertopt = csv.ConvertOptions(column_types=schema)

with csv.open_csv(input_file, convert_options=convertopt) as reader:
    cnt = 0
    while cnt < num_chunks or num_chunks == 0:
        try:
            batch = reader.read_next_batch()
            cnt += 1
            print("Finished reading chunk ", cnt)
        except StopIteration:
            break
print("Done. Closing file.")
Backtrace from running the above with gdb:
The backtrace points to the threadpool / task-spawning code (I'm not very familiar with this part).
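As an aside, a Python-level trace of a hard crash can also be captured with the standard library's faulthandler (a sketch; it prints the Python stacks of all threads when a fatal signal like SIGSEGV arrives):

# Sketch: enable faulthandler at the top of the reproducer to get Python
# tracebacks on SIGSEGV, complementing the native gdb backtrace.
import faulthandler
faulthandler.enable()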
cc @pitrou
Ok, I looked a bit into this and it's quite complex.
The crash happens at shutdown, after the CSV streaming reader has been destroyed. Since the streaming reader does some readahead by default (ReadOptions.use_threads = true), some read tasks are still running on the thread pool. This is driven by the ReadaheadGenerator, which doesn't appear to be able to clean up when all references to it are lost. @westonpace
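Given that diagnosis, a possible user-side mitigation (a sketch, not a confirmed fix) would be to disable the threaded readahead entirely so no background tasks are spawned:

# Mitigation sketch: with use_threads=False the reader should not spawn
# readahead tasks, so nothing is left running on the thread pool at shutdown.
readopt = csv.ReadOptions(use_threads=False)
with csv.open_csv(input_file, read_options=readopt,
                  convert_options=convertopt) as reader:
    batch = reader.read_next_batch()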
Hmm, I might have a fix actually.
Describe the bug, including details regarding any error messages, version, and platform.
I am getting reproducible segfaults when trying to close a CSVWriter in the simplest code possible.
For example, with num_chunks = 2, it hangs for a long time after printing "Closing file."
Linux aarch64, pyarrow 17 (tried both pip and conda-forge). I also tried many combinations of filesystems, pa.OSFile, batch sizes, closing manually vs. using context managers, etc. My input file is bzip2-compressed.
The culprit seems to be the limit on the number of chunks that I added for testing. Writing ALL chunks seems to work.
Component(s)
Python