asfimport opened this issue 2 years ago
Nicola Crane / @thisisnic: Thanks for opening this issue @martindut. I think the problem here is that the CSV reader isn't expecting the footer row and is just treating it as data, so you get that error because the reader expects every row, including the footer, to have as many columns as the actual data. The C++ code includes the ability to skip footer rows, but this isn't exposed at the R level (yet).
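For illustration, a minimal R sketch (my own, not from the thread) of how a trailing footer line trips up the reader; the exact error wording may differ between arrow versions:

library(arrow)

# A tiny CSV with a one-column footer line appended after the data rows
csv_path <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,2,3", "4,5,6", "TOTAL ROWS: 2"), csv_path)

# Fails because the reader treats the footer as data and it doesn't have
# the expected three columns, e.g. "CSV parse error: Expected 3 columns, got 1"
try(read_csv_arrow(csv_path))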
Martin du Toit / @martindut: Hi @thisisnic, thanks for getting back to me.
I also tried it with pyarrow (not my preferred language) but got the same error. Is it possible with pyarrow, or is this not exposed there either?
Nicola Crane / @thisisnic: I think it is; the skip_rows_after parameter is mentioned in the docs here: https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html#pyarrow.csv.ReadOptions. If you share a reproducible example of the Python code you tried, I'll see if I can work out how to connect it up (or shout to someone else who knows more about pyarrow!)
Martin du Toit / @martindut: I actually used reticulate to call the pyarrow functions, not sure if that might be the problem?
# schema_file, col_names, delimiter, dl_path, schema_df, fs and .partition_cols
# are defined earlier in the (not shown) script
pa <- reticulate::import("pyarrow", convert = FALSE)
pyds <- reticulate::import("pyarrow.dataset", convert = FALSE)
skiphead <- 2 # skips header row and column names
fs_conv <- pa$csv$ConvertOptions(column_types = schema_file, strings_can_be_null = TRUE, include_missing_columns = TRUE, include_columns = col_names)
fs_pars <- pa$csv$ParseOptions(delimiter = delimiter, ignore_empty_lines = TRUE)
fs_read <- pa$csv$ReadOptions(skip_rows = skiphead, column_names = col_names)
csv_format <- pyds$CsvFileFormat(read_options = fs_read, parse_options = fs_pars, convert_options = fs_conv)
ds <- pyds$dataset(source = dl_path, schema = schema_df, format = csv_format, filesystem = fs, partitioning = .partition_cols)
Nicola Crane / @thisisnic: Sorry @martindut; I just realised that I'd misread the docs entirely there. The parameter is actually called skip_rows_after_names, and it skips rows after the column names rather than footer rows. I don't think the functionality for working with CSVs with footers is implemented in Arrow at all. I can open a ticket to request this functionality, though.
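To illustrate the distinction, a hedged sketch (assuming a pyarrow version that has skip_rows_after_names, and using a hypothetical file name) of setting that option through reticulate; it drops rows immediately after the column names, so it cannot remove a footer at the end of the file:

pacsv <- reticulate::import("pyarrow.csv", convert = FALSE)

# Skips the first 2 data rows that follow the column-name row;
# rows at the end of the file (footers) are still read as data
read_opts <- pacsv$ReadOptions(skip_rows_after_names = 2L)
tbl <- pacsv$read_csv("some_file.csv", read_options = read_opts)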
Martin du Toit / @martindut: No problem @thisisnic. The functionality is very important for me, so yes please, could you open a ticket to request it?
Nicola Crane / @thisisnic: Here it is: ARROW-15254
Martin du Toit / @martindut: Hi @thisisnic, the other option is to not throw an error, but to populate the "missing" columns with null values so they can be filtered out after reading the dataset. I thought that include_missing_columns = TRUE would do the trick, or am I not understanding this flag correctly?
Nicola Crane / @thisisnic: @martindut Looking at the docs here [1], I don't think it will, but I could be wrong. What happens when you try to run your code with that parameter set to TRUE?
[1] https://arrow.apache.org/docs/r/reference/CsvReadOptions.html
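For what it's worth, a small sketch (again via reticulate, my own illustration rather than code from the thread) of what include_missing_columns actually does: it null-fills columns that are listed in include_columns but absent from the file entirely; it does not pad individual short rows such as a footer:

pacsv <- reticulate::import("pyarrow.csv", convert = FALSE)

csv_path <- tempfile(fileext = ".csv")
writeLines(c("a,b", "1,2", "3,4"), csv_path)

# Column "c" is requested but not present in the file, so it comes back
# as all nulls; rows with the wrong number of columns are still an error
convert_opts <- pacsv$ConvertOptions(
  include_columns = list("a", "b", "c"),
  include_missing_columns = TRUE
)
tbl <- pacsv$read_csv(csv_path, convert_options = convert_opts)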
Martin du Toit / @martindut: @thisisnic I tried it with the parameter set to TRUE, but it still gives the error. No worries, I currently have a workaround, but it will be nice to have this in a future release.
Not sure if this is a bug, but if I open_dataset() a directory containing CSV files with a header and a footer, I specify the following convert options with include_missing_columns. The code works fine on files with no header and footer.
The dataset gets created, but any further operation on the dataset fails with
Reporter: Martin du Toit / @martindut
Note: This issue was originally created as ARROW-15252. Please see the migration documentation for further details.