This PR allows for the reactivation of the cpp inference categorify operator. This is required because the cpp categorify operator requires a different format for input data. This fix was implemented (instead of others) because I wanted to preserve the speed fo the cpp op and I wanted to get out a fix with the minimal amount of changes to the current code. There are other fixes that can be made if we want to spend more time on this that would be more correct. In an effort to make the categorify operation as fast as possible the operator relies on tuples representing ragged columns. This is important because in the current "regular" format for merlin, we use a dual column representation, where the the ragged column is expressed in two separate arrays, suffixed by "values" and "offsets". One more "correct" solution would be to change the c++ operator to understand the format used by merlin but that would require adding logic that may slowdown the overall operation, it would require string comparisons and looks up and the ability to track separate arrays that are part of the same column. Another possible solution would be to introduce a new format that brought back the ragged column tuple representation but that would require more code changes and could possibly have unforeseen effects to the rest of the library, we would need to rigorously test this change if we were to implement it. However, given that merlin is not an active priority, I have decided to go this route, for the reasons mentioned above. This solution only activates when the CategorifyTransform operator is present in the workflow, and it essentially reformats the data into what is expected by the operator and then reformats the result in the merlin "friendly" format.
This PR allows for the reactivation of the cpp inference categorify operator. This is required because the cpp categorify operator requires a different format for input data. This fix was implemented (instead of others) because I wanted to preserve the speed fo the cpp op and I wanted to get out a fix with the minimal amount of changes to the current code. There are other fixes that can be made if we want to spend more time on this that would be more correct. In an effort to make the categorify operation as fast as possible the operator relies on tuples representing ragged columns. This is important because in the current "regular" format for merlin, we use a dual column representation, where the the ragged column is expressed in two separate arrays, suffixed by "values" and "offsets". One more "correct" solution would be to change the c++ operator to understand the format used by merlin but that would require adding logic that may slowdown the overall operation, it would require string comparisons and looks up and the ability to track separate arrays that are part of the same column. Another possible solution would be to introduce a new format that brought back the ragged column tuple representation but that would require more code changes and could possibly have unforeseen effects to the rest of the library, we would need to rigorously test this change if we were to implement it. However, given that merlin is not an active priority, I have decided to go this route, for the reasons mentioned above. This solution only activates when the CategorifyTransform operator is present in the workflow, and it essentially reformats the data into what is expected by the operator and then reformats the result in the merlin "friendly" format.