Open TheCedarPrince opened 6 months ago
So I have created PR https://github.com/JuliaHealth/IPUMS.jl/pull/30 to help improve performance. @TheCedarPrince is reviewing this. We will work on further improvements as well.
That was merged @00krishna -- thanks for adding this in. Are you still able to post that minimal working example to Discourse? I'd be curious to see how we could keep hacking at it someday to make it as fast as tidyr's reader.
Hey @00krishna,
These are my thoughts on your recent implementation for #21! As I mentioned over Slack, the functionality is sufficient, so I went ahead and merged the PR, but as we discussed, there do seem to be some places for improvement. I am going to copy some of our discussion here first so we don't lose it, and add my own thoughts and comments along the way.
Basic Benchmarking
For a dataset with about 70 million elements, I see this time and allocation count:
For a dataset with only 60,000 elements, I see this time and allocation count:
Interestingly, when I tried using ipumsr to do this, the loading for both files was nearly instantaneous. I did some digging and found that hipread seems to be doing a lot of the heavy lifting within ipumsr to do the parsing of this file.
Thoughts on Slowdowns
Per Krishna:
After doing some further analysis using Profile.jl and PProf.jl, I was able to find the following flamegraph profiling on allocations in the code:
Although the image is not so great, one can see that the majority of the allocations occur in the map function call, as you were imagining, Krishna. Here's the code I ran to generate this flamegraph:
I attached the allocation profile file here as well.
alloc-profile.pb.gz
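For anyone who wants to collect a similar allocation profile, here is a minimal sketch using only the `Profile` standard library (the allocation profiler needs Julia 1.8 or later). The `slice_records` function is a hypothetical stand-in for the loader, not the code from the PR; it only imitates a `map` that cuts each record into freshly allocated substrings.

```julia
using Profile

# Hypothetical stand-in for the loader under investigation: a `map` that
# slices every fixed-width record into newly allocated substrings.
slice_records(lines) = map(line -> [line[1:4], line[5:10]], lines)

lines = ["2019000010" for _ in 1:1_000]
slice_records(lines)                 # warm up so compilation is not profiled

Profile.Allocs.clear()
# sample_rate=1 records every allocation; lower it for large inputs.
Profile.Allocs.@profile sample_rate=1 slice_records(lines)
results = Profile.Allocs.fetch()

total_bytes = sum(a.size for a in results.allocs; init=0)
println("sampled allocations: ", length(results.allocs), ", bytes: ", total_bytes)

# With PProf.jl installed, calling `PProf.Allocs.pprof()` at this point
# should render the same samples as an interactive flame graph.
```

On a 70-million-element file a sample rate of 1 would probably be far too slow, so something much smaller is likely more practical there.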
I also did a quick analysis of the computation. It appears that for much of its run time, the function is busy with type inference. This flame graph is a bit inscrutable, but I figured I would include it here as well:
Concluding Thoughts
In short, it seems like there is a huge opportunity to optimize the loading functionality so that it takes more direct advantage of the fixed-width file format, although I am not entirely sure how to do that yet. I'd be curious to hear your thoughts, @00krishna.
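To make the idea a bit more concrete, here is a rough sketch of what leaning on the fixed-width layout could look like: since every column sits at a known byte range, each field can be parsed straight from the record's bytes into a concretely typed vector, with no intermediate substrings. Everything here is an assumption for illustration: the `layout` ranges are made up (real ones would come from the DDI codebook), all columns are treated as integers, and `parse_int_field` and `read_fixed_width` are hypothetical names, not part of IPUMS.jl.

```julia
# Parse an ASCII integer field directly from bytes, without allocating a substring.
# Blank padding is skipped; any other non-digit byte is an error.
function parse_int_field(bytes::AbstractVector{UInt8}, span::UnitRange{Int})
    value = 0
    for i in span
        b = bytes[i]
        b == UInt8(' ') && continue
        UInt8('0') <= b <= UInt8('9') || throw(ArgumentError("non-digit byte at position $i"))
        value = 10 * value + Int(b - UInt8('0'))
    end
    return value
end

# Read every record from `io` into one Vector{Int} per column.
# `layout` maps a column name to its byte range within a record.
function read_fixed_width(io::IO, layout::Vector{Pair{Symbol,UnitRange{Int}}})
    columns = [Int[] for _ in layout]
    for line in eachline(io)
        bytes = codeunits(line)          # byte view of the record, no copy
        for (j, (_, span)) in enumerate(layout)
            push!(columns[j], parse_int_field(bytes, span))
        end
    end
    return NamedTuple{Tuple(first.(layout))}(Tuple(columns))
end

# Made-up layout and data, for illustration only.
layout = [:year => 1:4, :serial => 5:10, :age => 11:13]
io = IOBuffer("2019000010 34\n2020000020  7\n")
table = read_fixed_width(io, layout)
# table.year == [2019, 2020], table.serial == [10, 20], table.age == [34, 7]
```

This still allocates one string per line through `eachline`, so it is only a starting point; reading the file into a single byte buffer (or memory-mapping it) and indexing by record length would be the natural next step. The returned named tuple of vectors can be passed to `DataFrame` directly.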
Otherwise, great stuff!