cjdoris / ARFFFiles.jl

Load and save ARFF files
MIT License
5 stars 2 forks source link

Refactoring #12

Closed jbrea closed 3 years ago

jbrea commented 3 years ago

I played with some approaches to refactor the code just to understand what does and what doesn't matter for performance. I think performance has stayed the same with this version but I don't think the code has become any better really, it's just different :)

What puzzles me is the importance of the function that starts on line 864. I tried several alternatives but most of them lead to much more allocations and therefore slower loading. I don't really understand the magic that is happening there.

Feel free to close this PR without merging. I did this more for my own education and wanted to share it with you in case you are interested.

cjdoris commented 3 years ago

Interesting. I guess the main changes are you've added pushmissing! and pushval!, which use dispatch (on the element type) instead of branching (on kind). It's all equivalent though - either way the choice is known to the compiler.

The function at 864 parses a single item of data (e.g. a number or a string). It explicitly branches over the 8 types it needs to parse (numeric, numeric or missing, string, string or missing, etc.) because each case requires different code to do the parsing. If we didn't do this branch and used more generic code, then there would have to be some type instability or dynamic dispatch in order to handle the 8 type cases. Since this function is in a hot loop, this would slow things down a lot.

It's a lot like if you're calling some function f(x) where x is one of 8 types but the compiler doesn't know which, then it'll have to do dynamic dispatch to pick a method of f. If you're very lucky the compiler might know that x isa Union{T1, T2, ..., T8} and might generate fast dispatch code. It's much more likely the compiler will have given up and only know x isa Any and needs to do generic slow dynamic dispatch. In this scenario, you can essentially do the dispatch yourself by explicitly doing if x isa T1; f(x); elseif x isa T2; f(x); ...; end so that within each branch the type of x is known and you get a nice type-stable statically-dispatched function.

jbrea commented 2 years ago

Ah, very nice. Thanks for sharing these insights.