Closed saminbassiri closed 1 month ago
The additional features mentioned earlier have now been implemented.
The following features, as outlined in the PR message, have been added, resolving the previously noted issues with the equality operator on DenseMatrix<std::string>.
Comparison Operators:
== and != between two std::string or two FixedStr16.
== and != between two DenseMatrix<std::string> or two DenseMatrix<FixedStr16>.
> and < between two std::string or two FixedStr16.
> and < between two DenseMatrix<std::string> or two DenseMatrix<FixedStr16>.
+ on two std::string or two FixedStr16. (Note: since the result of concatenation may not fit in FixedStr16, the output of Concat is always std::string.)
+ on two DenseMatrix<std::string> or two DenseMatrix<FixedStr16>. (Note: since the result of concatenation may not fit in FixedStr16, the output of Concat is always DenseMatrix<std::string>.)
UPPER and LOWER on std::string or FixedStr16.
UPPER and LOWER on DenseMatrix<std::string> or DenseMatrix<FixedStr16>.
Initial tests for DenseMatrix<std::string> and DenseMatrix<FixedStr16> were implemented, verifying functionality for the newly added features and data types.
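The element-wise operations listed above can be illustrated with a minimal sketch. It uses a plain std::vector<std::string> as a hypothetical stand-in for a DenseMatrix<std::string> column, and the function names ewEq, ewConcat, and ewUpper are assumptions made for this illustration; they are not the kernel names used in DAPHNE.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for one column of a DenseMatrix<std::string>.
using StrColumn = std::vector<std::string>;

// Shared shape check for the binary operations below.
inline void checkSameSize(const StrColumn &lhs, const StrColumn &rhs) {
    if (lhs.size() != rhs.size())
        throw std::invalid_argument("operands must have the same size");
}

// Element-wise, case-sensitive equality: 1 where equal, 0 otherwise.
inline std::vector<int64_t> ewEq(const StrColumn &lhs, const StrColumn &rhs) {
    checkSameSize(lhs, rhs);
    std::vector<int64_t> res(lhs.size());
    for (std::size_t i = 0; i < lhs.size(); ++i)
        res[i] = (lhs[i] == rhs[i]) ? 1 : 0;
    return res;
}

// Element-wise concatenation; the result type is always std::string.
inline StrColumn ewConcat(const StrColumn &lhs, const StrColumn &rhs) {
    checkSameSize(lhs, rhs);
    StrColumn res(lhs.size());
    for (std::size_t i = 0; i < lhs.size(); ++i)
        res[i] = lhs[i] + rhs[i];
    return res;
}

// Element-wise upper-casing (ASCII only in this sketch).
inline StrColumn ewUpper(const StrColumn &arg) {
    StrColumn res = arg;
    for (auto &s : res)
        std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
            return static_cast<char>(std::toupper(c));
        });
    return res;
}
```

With these definitions, ewEq({"abc", "Abc"}, {"abc", "abc"}) yields {1, 0}, which reflects the case-sensitive comparison, and ewConcat on the same inputs yields {"abcabc", "Abcabc"}.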
Thank you for the thorough review and detailed feedback, @pdamme. I have considered the points you raised, and here is a summary of the changes:
String Comparisons: I updated the code to make string comparisons case-sensitive.
FixedStr16 Constructor: I reviewed and corrected the constructor to ensure it doesn't read beyond the end of the string's data. Hence, excluding the null character, FixedStr16 contains 15 characters.
CSV Parsing Tests: I extended the CSV parsing tests to cover the corner cases you mentioned and fixed bugs related to these new tests:
Numeric Casting (castSca-kernel): I modified the casting logic to avoid relying solely on stold when converting string scalars to numeric types.
oneHot-Kernel: I updated the logic to ensure the recoded column maintains contiguous values, such as converting ["a", "a", "b", "b"] to [0, 0, 1, 1]. I also added a corresponding test case to verify the functionality.
Diversity in Test Data: I now avoid columns such as name and gender in the test data.
Platform-Specific C++ Types: I reviewed the code and replaced the platform-specific types with appropriate types.
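The contiguous-code behavior described for the recoded column can be sketched as follows. This is only an illustration under stated assumptions: the function name recodeContiguous is hypothetical, a std::vector<std::string> stands in for the (n x 1) string matrix, and codes are assigned in order of first appearance, which the discussion above does not confirm beyond the single example given.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: maps each distinct string to a code in 0..k-1,
// assigned in order of first appearance, so the codes have no gaps.
inline std::vector<int64_t> recodeContiguous(const std::vector<std::string> &col) {
    std::unordered_map<std::string, int64_t> dict;
    std::vector<int64_t> codes;
    codes.reserve(col.size());
    for (const auto &value : col) {
        // try_emplace inserts only if the value is new; the candidate code
        // is the dictionary size before insertion, i.e. the next free code.
        auto it = dict.try_emplace(value, static_cast<int64_t>(dict.size())).first;
        codes.push_back(it->second);
    }
    return codes;
}
```

Calling recodeContiguous({"a", "a", "b", "b"}) returns {0, 0, 1, 1}, matching the example in the comment above.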
PR Update:
I have applied several changes related to this PR:
Handling Unsupported Result Types During String Casting: Unsupported result types are now marked with the [[deprecated]] attribute, and a runtime error is thrown if the CastSca function is called with an invalid result type. This ensures safer and more predictable behavior.
FixedStr16 Buffer Size Update: The FixedStr16 constructor has been updated to support 16-character strings without requiring a null terminator. Additionally, I have updated the test cases in CastObjTest.cpp to reflect this change.
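To make the buffer-size change concrete, here is a minimal sketch of a fixed 16-byte string type that stores up to 16 characters with no null terminator. The name FixedStr16Sketch and all of its members are hypothetical and do not reproduce DAPHNE's actual FixedStr16. Throwing on over-long input is an assumption of this sketch, and it also assumes that strings contain no embedded null bytes.

```cpp
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical sketch; not DAPHNE's actual FixedStr16 class.
struct FixedStr16Sketch {
    static constexpr std::size_t N = 16;
    char buffer[N];

    explicit FixedStr16Sketch(const std::string &s) {
        // Assumption: over-long input is rejected rather than truncated.
        if (s.size() > N)
            throw std::length_error("string exceeds 16 characters");
        std::memset(buffer, '\0', N);            // zero-pad unused bytes
        std::memcpy(buffer, s.data(), s.size()); // copies only s.size() bytes
    }

    // A full buffer has no terminator, so the search is bounded by N.
    std::size_t size() const {
        const void *end = std::memchr(buffer, '\0', N);
        return end ? static_cast<std::size_t>(static_cast<const char *>(end) - buffer) : N;
    }

    std::string to_string() const { return std::string(buffer, size()); }

    // Zero padding makes a byte-wise comparison of the whole buffer valid.
    bool operator==(const FixedStr16Sketch &other) const {
        return std::memcmp(buffer, other.buffer, N) == 0;
    }
    bool operator!=(const FixedStr16Sketch &other) const { return !(*this == other); }
};

// Concatenation may exceed 16 characters, so the result is a std::string.
inline std::string concat(const FixedStr16Sketch &a, const FixedStr16Sketch &b) {
    return a.to_string() + b.to_string();
}
```

In this sketch, a 16-character input such as "0123456789abcdef" fills the buffer completely and round-trips through to_string() unchanged, while a 17-character input throws std::length_error.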
[DAPHNE-#629] Efficient Processing of String Data Sets in DAPHNE with FixedStr16 Class and std::string Class for DenseMatrix
Summary
This PR addresses issue #629 by enhancing the string support in DAPHNE, making it practical to process string data sets. The main addition is generalizing or specializing the current template structures for the FixedStr16 class and the std::string class. While significant progress has been made, additional features related to element-wise comparisons will be added in the coming days.
Key Features Implemented
oneHot: Applies one-hot encoding to the given (n x m) matrix of strings.
recode: Applies dictionary encoding to the given (n x 1) matrix.
Cast: String values and matrix objects can be cast to a particular numeric type.
fill: Creates a matrix and sets all elements to a particular value.
transpose: Transposes a given matrix.
Testing
Initial tests for DenseMatrix<std::string> and DenseMatrix<FixedStr16> have been implemented, verifying functionality for the newly added features and data types. However, tests for std::string are currently failing due to issues with the element-wise equality operator. These will be addressed with the upcoming features.
Upcoming Features
The following features will be added in the next few days:
DenseMatrix<std::string> and/or DenseMatrix<FixedStr16>.
DenseMatrix<std::string> and/or DenseMatrix<FixedStr16>.