apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[D] D programming language Implementation of Arrow #44515

Open kassane opened 1 month ago

kassane commented 1 month ago

Describe the enhancement requested

As with Swift and other languages from small communities, I would like to suggest adding a D language (v2) implementation to arrow-upstream (or to a separate repository as a library).

Outdated reference library: https://github.com/ananis25/darrow

Auto-generated: https://github.com/rostyboost/darrow

Component(s)

Integration

kou commented 1 month ago

Thanks for your suggestion. We need at least one maintainer/contributor to proceed with this. Do you know any candidates for it?

kassane commented 1 month ago

> Do you know any candidates for it?

I will try to do it.

I have also recently been contributing D support to Apache OpenDAL.

kou commented 1 month ago

Great!

It seems that https://github.com/ananis25/darrow generates bindings from C header files. But we don't need to do it because Apache Arrow C GLib supports GObject Introspection. https://gi.readthedocs.io/en/latest/

GObject Introspection provides API-related metadata. We can use it instead of parsing C header files.

It seems that there is a D tool for GObject Introspection: https://github.com/gtkd-developers/gir-to-d

Can we use it instead of parsing C header files?

kassane commented 1 month ago

> It seems that there is a D tool for GObject Introspection: https://github.com/gtkd-developers/gir-to-d

Works!

# $PWD =  arrow/d
$ girtod -g ../c_glib.build/arrow-glib/ -i Arrow-1.0.gir -o source
$ girtod -g ../c_glib.build/arrow-glib/ -i ../arrow-dataset-glib/ArrowDataset-1.0.gir -o source
$ girtod -g ../c_glib.build/arrow-glib/ -i ../arrow-flight-glib/ArrowFlight-1.0.gir -o source
tree-ls output:

```bash
$ tree .
.
├── README.md
├── dub.sdl
└── source
    ├── arrow
    │   ├── AggregateNodeOptions.d
    │   ├── Aggregation.d
    │   ├── Array.d
    │   ├── ArrayBuilder.d
    │   ├── ArrayDatum.d
    │   ├── ArraySortOptions.d
    │   ├── AzureFileSystem.d
    │   ├── BaseBinaryScalar.d
    │   ├── BaseListScalar.d
    │   ├── BinaryArray.d
    │   ├── BinaryArrayBuilder.d
    │   ├── BinaryDataType.d
    │   ├── BinaryDictionaryArrayBuilder.d
    │   ├── BinaryScalar.d
    │   ├── BooleanArray.d
    │   ├── BooleanArrayBuilder.d
    │   ├── BooleanDataType.d
    │   ├── BooleanScalar.d
    │   ├── Buffer.d
    │   ├── BufferInputStream.d
    │   ├── BufferOutputStream.d
    │   ├── CSVReadOptions.d
    │   ├── CSVReader.d
    │   ├── CallExpression.d
    │   ├── CastOptions.d
    │   ├── ChunkedArray.d
    │   ├── ChunkedArrayDatum.d
    │   ├── Codec.d
    │   ├── CompressedInputStream.d
    │   ├── CompressedOutputStream.d
    │   ├── CountOptions.d
    │   ├── DataType.d
    │   ├── Date32Array.d
    │   ├── Date32ArrayBuilder.d
    │   ├── Date32DataType.d
    │   ├── Date32Scalar.d
    │   ├── Date64Array.d
    │   ├── Date64ArrayBuilder.d
    │   ├── Date64DataType.d
    │   ├── Date64Scalar.d
    │   ├── Datum.d
    │   ├── DayMillisecond.d
    │   ├── DayTimeIntervalArray.d
    │   ├── DayTimeIntervalArrayBuilder.d
    │   ├── DayTimeIntervalDataType.d
    │   ├── DayTimeIntervalScalar.d
    │   ├── Decimal128.d
    │   ├── Decimal128Array.d
    │   ├── Decimal128ArrayBuilder.d
    │   ├── Decimal128DataType.d
    │   ├── Decimal128Scalar.d
    │   ├── Decimal256.d
    │   ├── Decimal256Array.d
    │   ├── Decimal256ArrayBuilder.d
    │   ├── Decimal256DataType.d
    │   ├── Decimal256Scalar.d
    │   ├── DecimalDataType.d
    │   ├── DenseUnionArray.d
    │   ├── DenseUnionArrayBuilder.d
    │   ├── DenseUnionDataType.d
    │   ├── DenseUnionScalar.d
    │   ├── DictionaryArray.d
    │   ├── DictionaryDataType.d
    │   ├── DoubleArray.d
    │   ├── DoubleArrayBuilder.d
    │   ├── DoubleDataType.d
    │   ├── DoubleScalar.d
    │   ├── EqualOptions.d
    │   ├── ExecuteContext.d
    │   ├── ExecuteNode.d
    │   ├── ExecuteNodeOptions.d
    │   ├── ExecutePlan.d
    │   ├── Expression.d
    │   ├── ExtensionArray.d
    │   ├── ExtensionDataType.d
    │   ├── ExtensionDataTypeRegistry.d
    │   ├── ExtensionScalar.d
    │   ├── FeatherFileReader.d
    │   ├── FeatherWriteProperties.d
    │   ├── Field.d
    │   ├── FieldExpression.d
    │   ├── FileIF.d
    │   ├── FileInfo.d
    │   ├── FileInputStream.d
    │   ├── FileOutputStream.d
    │   ├── FileSelector.d
    │   ├── FileSystem.d
    │   ├── FileT.d
    │   ├── FilterNodeOptions.d
    │   ├── FilterOptions.d
    │   ├── FixedSizeBinaryArray.d
    │   ├── FixedSizeBinaryArrayBuilder.d
    │   ├── FixedSizeBinaryDataType.d
    │   ├── FixedSizeBinaryScalar.d
    │   ├── FixedWidthDataType.d
    │   ├── FloatArray.d
    │   ├── FloatArrayBuilder.d
    │   ├── FloatDataType.d
    │   ├── FloatScalar.d
    │   ├── FloatingPointDataType.d
    │   ├── Function.d
    │   ├── FunctionDoc.d
    │   ├── FunctionOptions.d
    │   ├── GCSFileSystem.d
    │   ├── GIOInputStream.d
    │   ├── GIOOutputStream.d
    │   ├── HDFSFileSystem.d
    │   ├── HalfFloatArray.d
    │   ├── HalfFloatArrayBuilder.d
    │   ├── HalfFloatDataType.d
    │   ├── HalfFloatScalar.d
    │   ├── HashJoinNodeOptions.d
    │   ├── ISO8601TimestampParser.d
    │   ├── IndexOptions.d
    │   ├── InputStream.d
    │   ├── Int16Array.d
    │   ├── Int16ArrayBuilder.d
    │   ├── Int16DataType.d
    │   ├── Int16Scalar.d
    │   ├── Int32Array.d
    │   ├── Int32ArrayBuilder.d
    │   ├── Int32DataType.d
    │   ├── Int32Scalar.d
    │   ├── Int64Array.d
    │   ├── Int64ArrayBuilder.d
    │   ├── Int64DataType.d
    │   ├── Int64Scalar.d
    │   ├── Int8Array.d
    │   ├── Int8ArrayBuilder.d
    │   ├── Int8DataType.d
    │   ├── Int8Scalar.d
    │   ├── IntArrayBuilder.d
    │   ├── IntegerDataType.d
    │   ├── IntervalDataType.d
    │   ├── JSONReadOptions.d
    │   ├── JSONReader.d
    │   ├── LargeBinaryArray.d
    │   ├── LargeBinaryArrayBuilder.d
    │   ├── LargeBinaryDataType.d
    │   ├── LargeBinaryScalar.d
    │   ├── LargeListArray.d
    │   ├── LargeListArrayBuilder.d
    │   ├── LargeListDataType.d
    │   ├── LargeListScalar.d
    │   ├── LargeStringArray.d
    │   ├── LargeStringArrayBuilder.d
    │   ├── LargeStringDataType.d
    │   ├── LargeStringScalar.d
    │   ├── ListArray.d
    │   ├── ListArrayBuilder.d
    │   ├── ListDataType.d
    │   ├── ListScalar.d
    │   ├── LiteralExpression.d
    │   ├── LocalFileSystem.d
    │   ├── LocalFileSystemOptions.d
    │   ├── MapArray.d
    │   ├── MapArrayBuilder.d
    │   ├── MapDataType.d
    │   ├── MapScalar.d
    │   ├── MatchSubstringOptions.d
    │   ├── MemoryMappedInputStream.d
    │   ├── MemoryPool.d
    │   ├── MockFileSystem.d
    │   ├── MonthDayNano.d
    │   ├── MonthDayNanoIntervalArray.d
    │   ├── MonthDayNanoIntervalArrayBuilder.d
    │   ├── MonthDayNanoIntervalDataType.d
    │   ├── MonthDayNanoIntervalScalar.d
    │   ├── MonthIntervalArray.d
    │   ├── MonthIntervalArrayBuilder.d
    │   ├── MonthIntervalDataType.d
    │   ├── MonthIntervalScalar.d
    │   ├── MutableBuffer.d
    │   ├── NullArray.d
    │   ├── NullArrayBuilder.d
    │   ├── NullDataType.d
    │   ├── NullScalar.d
    │   ├── NumericArray.d
    │   ├── NumericDataType.d
    │   ├── ORCFileReader.d
    │   ├── OutputStream.d
    │   ├── PrimitiveArray.d
    │   ├── ProjectNodeOptions.d
    │   ├── QuantileOptions.d
    │   ├── RankOptions.d
    │   ├── ReadOptions.d
    │   ├── ReadableIF.d
    │   ├── ReadableT.d
    │   ├── RecordBatch.d
    │   ├── RecordBatchBuilder.d
    │   ├── RecordBatchDatum.d
    │   ├── RecordBatchFileReader.d
    │   ├── RecordBatchFileWriter.d
    │   ├── RecordBatchIterator.d
    │   ├── RecordBatchReader.d
    │   ├── RecordBatchStreamReader.d
    │   ├── RecordBatchStreamWriter.d
    │   ├── RecordBatchWriter.d
    │   ├── ResizableBuffer.d
    │   ├── RoundOptions.d
    │   ├── RoundToMultipleOptions.d
    │   ├── RunEndEncodeOptions.d
    │   ├── RunEndEncodedArray.d
    │   ├── RunEndEncodedDataType.d
    │   ├── S3FileSystem.d
    │   ├── S3GlobalOptions.d
    │   ├── Scalar.d
    │   ├── ScalarAggregateOptions.d
    │   ├── ScalarDatum.d
    │   ├── Schema.d
    │   ├── SeekableInputStream.d
    │   ├── SetLookupOptions.d
    │   ├── SinkNodeOptions.d
    │   ├── SlowFileSystem.d
    │   ├── SortKey.d
    │   ├── SortOptions.d
    │   ├── SourceNodeOptions.d
    │   ├── SparseUnionArray.d
    │   ├── SparseUnionArrayBuilder.d
    │   ├── SparseUnionDataType.d
    │   ├── SparseUnionScalar.d
    │   ├── SplitPatternOptions.d
    │   ├── StreamDecoder.d
    │   ├── StreamListener.d
    │   ├── StrftimeOptions.d
    │   ├── StringArray.d
    │   ├── StringArrayBuilder.d
    │   ├── StringDataType.d
    │   ├── StringDictionaryArrayBuilder.d
    │   ├── StringScalar.d
    │   ├── StrptimeOptions.d
    │   ├── StrptimeTimestampParser.d
    │   ├── StructArray.d
    │   ├── StructArrayBuilder.d
    │   ├── StructDataType.d
    │   ├── StructFieldOptions.d
    │   ├── StructScalar.d
    │   ├── SubTreeFileSystem.d
    │   ├── Table.d
    │   ├── TableBatchReader.d
    │   ├── TableConcatenateOptions.d
    │   ├── TableDatum.d
    │   ├── TakeOptions.d
    │   ├── TemporalDataType.d
    │   ├── Tensor.d
    │   ├── Time32Array.d
    │   ├── Time32ArrayBuilder.d
    │   ├── Time32DataType.d
    │   ├── Time32Scalar.d
    │   ├── Time64Array.d
    │   ├── Time64ArrayBuilder.d
    │   ├── Time64DataType.d
    │   ├── Time64Scalar.d
    │   ├── TimeDataType.d
    │   ├── TimestampArray.d
    │   ├── TimestampArrayBuilder.d
    │   ├── TimestampDataType.d
    │   ├── TimestampParser.d
    │   ├── TimestampScalar.d
    │   ├── UInt16Array.d
    │   ├── UInt16ArrayBuilder.d
    │   ├── UInt16DataType.d
    │   ├── UInt16Scalar.d
    │   ├── UInt32Array.d
    │   ├── UInt32ArrayBuilder.d
    │   ├── UInt32DataType.d
    │   ├── UInt32Scalar.d
    │   ├── UInt64Array.d
    │   ├── UInt64ArrayBuilder.d
    │   ├── UInt64DataType.d
    │   ├── UInt64Scalar.d
    │   ├── UInt8Array.d
    │   ├── UInt8ArrayBuilder.d
    │   ├── UInt8DataType.d
    │   ├── UInt8Scalar.d
    │   ├── UIntArrayBuilder.d
    │   ├── UTF8NormalizeOptions.d
    │   ├── UnionArray.d
    │   ├── UnionArrayBuilder.d
    │   ├── UnionDataType.d
    │   ├── UnionScalar.d
    │   ├── VarianceOptions.d
    │   ├── WritableFileIF.d
    │   ├── WritableFileT.d
    │   ├── WritableIF.d
    │   ├── WritableT.d
    │   ├── WriteOptions.d
    │   └── c
    │       ├── functions.d
    │       └── types.d
    ├── arrowdataset
    │   ├── CSVFileFormat.d
    │   ├── Dataset.d
    │   ├── DatasetFactory.d
    │   ├── DirectoryPartitioning.d
    │   ├── FileFormat.d
    │   ├── FileSystemDataset.d
    │   ├── FileSystemDatasetFactory.d
    │   ├── FileSystemDatasetWriteOptions.d
    │   ├── FileWriteOptions.d
    │   ├── FileWriter.d
    │   ├── FinishOptions.d
    │   ├── Fragment.d
    │   ├── HivePartitioning.d
    │   ├── HivePartitioningOptions.d
    │   ├── IPCFileFormat.d
    │   ├── InMemoryFragment.d
    │   ├── KeyValuePartitioning.d
    │   ├── KeyValuePartitioningOptions.d
    │   ├── ParquetFileFormat.d
    │   ├── Partitioning.d
    │   ├── PartitioningFactoryOptions.d
    │   ├── Scanner.d
    │   ├── ScannerBuilder.d
    │   └── c
    │       ├── functions.d
    │       └── types.d
    └── arrowflight
        ├── CallOptions.d
        ├── Client.d
        ├── ClientOptions.d
        ├── CommandDescriptor.d
        ├── Criteria.d
        ├── DataStream.d
        ├── Descriptor.d
        ├── DoPutResult.d
        ├── Endpoint.d
        ├── Info.d
        ├── Location.d
        ├── MessageReader.d
        ├── MetadataReader.d
        ├── MetadataWriter.d
        ├── PathDescriptor.d
        ├── RecordBatchReader.d
        ├── RecordBatchStream.d
        ├── RecordBatchWriter.d
        ├── ServableIF.d
        ├── ServableT.d
        ├── Server.d
        ├── ServerAuthHandler.d
        ├── ServerAuthReader.d
        ├── ServerAuthSender.d
        ├── ServerCallContext.d
        ├── ServerCustomAuthHandler.d
        ├── ServerOptions.d
        ├── StreamChunk.d
        ├── StreamReader.d
        ├── StreamWriter.d
        ├── Ticket.d
        └── c
            ├── functions.d
            └── types.d

8 directories, 349 files
```
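
The three girtod invocations in this comment differ only in the GIR file passed to `-i`, so they could be driven from one loop. This is a minimal sketch, assuming it runs from `arrow/d` as in the commands themselves. The `generate_bindings` function and the `GIRTOD` variable are hypothetical names introduced for illustration; only the `-g`, `-i` and `-o` flags come from the original commands.

```bash
#!/bin/sh
# Hypothetical helper: replays the girtod commands from this comment in a loop.
# GIRTOD (an assumption, not a girtod feature) lets the caller substitute
# another command, e.g. `echo`, for a dry run.
generate_bindings() {
    gir_dir=../c_glib.build/arrow-glib/
    for gir in \
        Arrow-1.0.gir \
        ../arrow-dataset-glib/ArrowDataset-1.0.gir \
        ../arrow-flight-glib/ArrowFlight-1.0.gir
    do
        # Same flags and argument order as the original commands.
        "${GIRTOD:-girtod}" -g "$gir_dir" -i "$gir" -o source || return 1
    done
}
```

Calling `generate_bindings` regenerates all three modules into `source`, while `GIRTOD=echo generate_bindings` only prints the arguments that would be passed.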
kassane commented 1 month ago

> Can we use it instead of parsing C header files?

It's possible. However, it needs manual fixes, like: https://github.com/kassane/arrow/commit/4bcfed11981b2552e07868d8bb8ea6022d0bcf0b

> [!NOTE]
> c/functions.d changes are affected by --use-runtime-linker

# $PWD = arrow
$ dub test --root=d -f
             Generating test runner configuration 'arrow-d-test-unittest' for 'unittest' (library).
     Pre-gen Running commands for glibd
    Existing package girtod found locally
0 packages fetched, 1 already present, 0 failed
             Building package girtod in /home/kassane/.dub/packages/girtod/0.23.2/girtod/
     Pre-gen Running commands for girtod
    Starting Performing "debug" build using /usr/bin/ldc2 for x86_64.
    Building girtod 0.23.2: building configuration [application]
     Linking girtod
     Running ../../../girtod/0.23.2/girtod/girtod -i src -o generated --use-runtime-linker
copying file [src/gtkd] to [generated/gtkd]
    Starting Performing "unittest" build using /usr/bin/ldc2 for x86_64.
    Building glibd 2.4.3+commit.2.g1546823: building configuration [library]
    Building arrow-d ~master: building configuration [arrow-d-test-unittest]
source/arrow/GIOOutputStream.d(12,8): Error: `OutputStream` matches conflicting symbols:
public class GIOOutputStream : OutputStream
       ^
source/arrow/OutputStream.d(18,8):        class `arrow.OutputStream.OutputStream`
public class OutputStream : ObjectG, FileIF, WritableIF
       ^
../../.dub/packages/glibd/1546823185334c4727d378baf890fa13d9fa4cbd/glibd/generated/gio/OutputStream.d(58,8):        class `gio.OutputStream.OutputStream`
public class OutputStream : ObjectG
       ^
source/arrow/GIOOutputStream.d(55,9): Error: `OutputStream` matches conflicting symbols:
        public this(OutputStream gioOutputStream)
        ^
source/arrow/OutputStream.d(18,8):        class `arrow.OutputStream.OutputStream`
public class OutputStream : ObjectG, FileIF, WritableIF
       ^
../../.dub/packages/glibd/1546823185334c4727d378baf890fa13d9fa4cbd/glibd/generated/gio/OutputStream.d(58,8):        class `gio.OutputStream.OutputStream`
public class OutputStream : ObjectG
       ^
source/arrow/GIOOutputStream.d(76,22): Error: `OutputStream` matches conflicting symbols:
        public OutputStream getRaw()
                     ^
source/arrow/OutputStream.d(18,8):        class `arrow.OutputStream.OutputStream`
public class OutputStream : ObjectG, FileIF, WritableIF
       ^
../../.dub/packages/glibd/1546823185334c4727d378baf890fa13d9fa4cbd/glibd/generated/gio/OutputStream.d(58,8):        class `gio.OutputStream.OutputStream`
public class OutputStream : ObjectG
       ^
source/arrow/LargeListArray.d(138,27): Error: function `DataType arrow.LargeListArray.LargeListArray.getValueType()` does not override any function, did you mean to override `arrow.c.types.GArrowType arrow.Array.Array.getValueType()`?
        public override DataType getValueType()
                          ^
source/arrow/ListArray.d(134,27): Error: function `DataType arrow.ListArray.ListArray.getValueType()` does not override any function, did you mean to override `arrow.c.types.GArrowType arrow.Array.Array.getValueType()`?
        public override DataType getValueType()
                          ^
cyrusmsk commented 1 month ago

There is also another, older implementation: https://github.com/rostyboost/darrow

Also, do you think it would be helpful to have separate bindings for https://github.com/apache/arrow-nanoarrow? It has a direct C API, so it could probably be used through the ImportC approach.

kassane commented 1 month ago

> Also, do you think it would be helpful to have separate bindings for https://github.com/apache/arrow-nanoarrow?

For a C library, D's ImportC solves this (e.g. the opendal_c header). However, a C++ API does require manual intervention, because the existing C++-to-D binding generators, such as cppconv, target specific use cases.

Edit: I don't plan to support nanoarrow at the moment.

kassane commented 1 month ago

> It's possible. However, it needs manual fixes, like

After some minor fixes, the biggest remaining issues are multiple inheritance in the auto-generated bindings and members that conflict across inherited modules.

commit tested: https://github.com/kassane/arrow/commit/86c9062cdcf441eaf9b731dd31e532542e28769f

Build: :ok: Run: :x:

# Arrow-libs: $PWD/build/release
# Arrow-glibs: $PWD/c_glib.build/arrow-glib, $PWD/c_glib.build/arrow-flight-glib, $PWD/c_glib.build/arrow-dataset-glib
$ LD_LIBRARY_PATH=$PWD/c_glib.build/arrow-glib:$PWD/build/release:$PWD/c_glib.build/arrow-flight-glib:$PWD/c_glib.build/arrow-dataset-glib dub test -f --root=d/
             Generating test runner configuration 'arrow-d-test-unittest' for 'unittest' (library).
     Pre-gen Running commands for glibd
    Existing package girtod found locally
0 packages fetched, 1 already present, 0 failed
             Building package girtod in /home/kassane/.dub/packages/girtod/0.23.2/girtod/
     Pre-gen Running commands for girtod
    Starting Performing "debug" build using /home/kassane/zig/ldc2-master/bin/ldc2 for x86_64.
    Building girtod 0.23.2: building configuration [application]
     Linking girtod
     Running ../../../girtod/0.23.2/girtod/girtod -i src -o generated --use-runtime-linker
copying file [src/gtkd] to [generated/gtkd]
    Starting Performing "unittest" build using /home/kassane/zig/ldc2-master/bin/ldc2 for x86_64.
    Building glibd 2.4.3+commit.2.g1546823: building configuration [library]
    Building arrow-d ~master: building configuration [arrow-d-test-unittest]
     Linking arrow-d-test-unittest
     Running arrow-d-test-unittest 
/home/kassane/arrow/d/arrow-d-test-unittest(+0x23f1b7) [0x65499791d1b7]
/usr/lib/libc.so.6(+0x3d1d0) [0x7a751a9151d0]
/home/kassane/arrow/c_glib.build/arrow-glib/libarrow-glib.so.1800(_Z20garrow_array_get_rawP12_GArrowArray+0xa) [0x7a751acb466a]
/home/kassane/arrow/c_glib.build/arrow-glib/libarrow-glib.so.1800(garrow_array_get_length+0x1e) [0x7a751acb522e]
/home/kassane/arrow/d/arrow-d-test-unittest(+0xf3db0) [0x6549977d1db0]
/home/kassane/arrow/d/arrow-d-test-unittest(+0xf3aeb) [0x6549977d1aeb]
/home/kassane/arrow/d/arrow-d-test-unittest(+0x23f1f8) [0x65499791d1f8]
/home/kassane/arrow/d/arrow-d-test-unittest(+0x24c417) [0x65499792a417]
/home/kassane/arrow/d/arrow-d-test-unittest(+0x24c949) [0x65499792a949]
/home/kassane/arrow/d/arrow-d-test-unittest(+0x24c3bc) [0x65499792a3bc]
/home/kassane/arrow/d/arrow-d-test-unittest(+0x24393f) [0x65499792193f]
/home/kassane/arrow/d/arrow-d-test-unittest(+0x23f0a4) [0x65499791d0a4]
/home/kassane/arrow/d/arrow-d-test-unittest(+0x246a1b) [0x654997924a1b]
/home/kassane/arrow/d/arrow-d-test-unittest(+0x246947) [0x654997924947]
/home/kassane/arrow/d/arrow-d-test-unittest(+0x24679d) [0x65499792479d]
/home/kassane/arrow/d/arrow-d-test-unittest(+0x14aec2) [0x654997828ec2]
/usr/lib/libc.so.6(+0x25e08) [0x7a751a8fde08]
/usr/lib/libc.so.6(__libc_start_main+0x8c) [0x7a751a8fdecc]
/home/kassane/arrow/d/arrow-d-test-unittest(+0xf3995) [0x6549977d1995]
Error Program exited with code -11

$ c++filt _Z20garrow_array_get_rawP12_GArrowArray
garrow_array_get_raw(_GArrowArray*)
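
The `LD_LIBRARY_PATH` value in the `dub test` command above is a colon-separated list of the four build directories. A small sketch of assembling it, assuming the same directory layout under the Arrow checkout; `join_paths`, `ARROW_ROOT` and `ARROW_LIB_PATH` are hypothetical names used only for illustration.

```bash
#!/bin/sh
# Hypothetical helper: join all arguments with ':'. The body runs in a
# subshell, so the IFS change does not leak into the caller.
join_paths() (
    IFS=:
    printf '%s\n' "$*"
)

# Assumption: ARROW_ROOT is the Arrow checkout, i.e. $PWD in the log above.
ARROW_ROOT=${ARROW_ROOT:-$PWD}

# Same four directories, in the same order, as the original command.
ARROW_LIB_PATH=$(join_paths \
    "$ARROW_ROOT/c_glib.build/arrow-glib" \
    "$ARROW_ROOT/build/release" \
    "$ARROW_ROOT/c_glib.build/arrow-flight-glib" \
    "$ARROW_ROOT/c_glib.build/arrow-dataset-glib")

# The tests can then be run as in the original command:
#   LD_LIBRARY_PATH="$ARROW_LIB_PATH" dub test -f --root=d/
```

Keeping the directories as separate arguments makes it easier to drop the flight and dataset entries when only arrow-glib is being tested.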
kou commented 1 month ago

Great!

Let's work on this step-by-step. How about supporting only arrow-glib as the first step? We can add support for other modules such as arrow-flight-glib later.

Could you open a PR for it? I'll try it too.

In general, we want to avoid changing auto-generated files. Instead, we want to improve the upstream, like you did for https://github.com/gtkd-developers/gir-to-d/issues/45.

FYI: We also use this approach for the C# Parquet bindings: https://github.com/apache/arrow/pull/41886 (e.g. https://github.com/gircore/gir.core/issues/1077)