MartinKoch123 / yaml

MATLAB YAML parser and emitter based on SnakeYAML
MIT License
22 stars 7 forks source link

No need to ConvertArray on numeric matrices #16

Closed AdamCooman closed 3 weeks ago

AdamCooman commented 1 month ago

yaml.dump performs num2cell and convertCell on every array. convertCell then uses a for-loop to add each of the elements of the cell array to an ArrayList. This has a lot of overhead, reducing performance when the amount of elements in the array is large. For numeric, logical and string vectors, this is not needed because of two reasons:

  1. Snakeyaml treats a object[] the same as an ArrayList of objects, so we can just pass a double[] or double[][] without needing to create an ArrayList first.
  2. Matlab converts its own arrays into the java equivalent when calling a java method (https://nl.mathworks.com/help/matlab/matlab_external/passing-data-to-java-methods.html). Yaml(dumperOptions).dump([1 2 3 4]) just works.

There is a caveat for the uints, where Matlab auto-casts them to Integer and Long, while it should be BigInteger as you implemented correctly.

Below, I implemented an alternative version of the convert function which tries to avoid the loops as much as possible, but which still passes all tests:

function result = convert(data,NULL_PLACEHOLDER)
if isempty(data)
    result = java.util.ArrayList();
    return
end
class_name = class(data);
switch class_name
    case {"double" "logical" "single" "int8" "int16" "int32" "int64" }
        result = nest(data);
    case {"uint8" "uint16" "uint32" "uint64" }
        % by default, matlab casts uint to Short/Integer/Long
        % it should cast to LongInteger to avoid issues with the maximum value
        result = nest(data);
        result = javaArray("java.math.BigInteger",size(result,1));
        for ind = 1 : numel(result)
            result = java.math.BigInteger(dec2hex(data(ind)), 16);
        end
    case "string"
        if any(contains(data, NULL_PLACEHOLDER))
            error("yaml:dump:NullPlaceholderNotAllowed", ...
                "Strings must not contain '%s' since it is used as a placeholder for null values.", NULL_PLACEHOLDER)
        end
        result = nest(data);
    case "cell"
        data = nest(data);
        result = java.util.ArrayList();
        for ind = 1:numel(data)
            result.add(convert(data{ind},NULL_PLACEHOLDER));
        end
    case "struct"
        if isscalar(data)
            result = java.util.LinkedHashMap();
            for key = string(fieldnames(data))'
                result.put(key, convert(data.(key),NULL_PLACEHOLDER));
            end
        else
            data = num2cell(data);
            result = convert(data,NULL_PLACEHOLDER);
        end
    case "char"
        result = java.lang.String(data);
    case "yaml.Null"
        result = java.lang.String(NULL_PLACEHOLDER);
    otherwise
        error("yaml:dump:TypeNotSupported", "Data type '%s' is not supported.", class(data))
end
end

I find that the performance of the dump function is better when using this convert function.

AdamCooman commented 1 month ago

To allow talking about performance, I created a bunch of benchmarks and ran these. The benchmarking repository can be found here: https://github.com/AdamCooman/yaml_benchmarking

When I compare using the latest version of the yaml code (https://github.com/AdamCooman/yaml/commit/625261f0bd6a1a6b3e4b22943f72831f871dcc41) versus a version where the dump file is optimized as explained here (https://github.com/AdamCooman/yaml/commit/fad929b5f3b5b6d07bd81fff5370e5b300ba0d5a) I obtain the following speedup:

improvement

MartinKoch123 commented 1 month ago

Hey, thanks for your work! I will incorporate it in a few days