JuliaHealth / IPUMS.jl

A convenience package to work with IPUMS data
http://juliahealth.org/IPUMS.jl/
MIT License

[IMPROVEMENT] Thoughts on DDI Data Loader #22

Open TheCedarPrince opened 2 months ago

TheCedarPrince commented 2 months ago

Hey @00krishna,

These are my thoughts on your recent implementation for #21! As I mentioned over Slack, the functionality was sufficient for me to go ahead and merge the PR, but, as we discussed, there do seem to be some places for improvement. I am going to copy some of our discussion here first so as not to lose those thoughts, and add my own comments along the way.

Basic Benchmarking

For a dataset with about 70 million elements, I see this time and allocation count:

julia> @btime load_ipums_extract(ddi, "usa_00001.dat");
  130.577 s (475777044 allocations: 15.20 GiB) 

For a dataset with only 60,000 elements, I see this time and allocation count:

julia> @btime load_ipums_extract(ddi, "cps_00157.dat");
  118.328 ms (518521 allocations: 17.71 MiB)

Interestingly, when I tried using ipumsr to do this, loading both files was nearly instantaneous. I did some digging and found that hipread seems to do much of the heavy lifting within ipumsr when parsing these files.

Thoughts on Slowdowns

Per Krishna:

Most of the allocations must be happening in this part of the code, since this is where the data is getting loaded.

for line in eachline(extract_filepath)
    data = map((x, p, d) -> _parse_data(x, p, d),
               [strip(line[r]) for r in range_vec],
               [eltype(a) for a in dtype_vec],
               [d for d in dcml_vec])
    push!(df, data)
end
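A likely contributor here is that each loop iteration builds three temporary vectors (the comprehensions) plus a fresh `String` for every `line[r]` slice. As a minimal sketch, not the package's actual implementation, one could use `SubString` views to avoid the per-field copies and skip the intermediate vectors entirely; `_parse_field` below is a hypothetical stand-in for `_parse_data`, not its real signature:

```julia
# Hypothetical simplified stand-in for the package's `_parse_data`;
# the real function may differ.
_parse_field(s, ::Type{Int}, dcml)     = parse(Int, s)
_parse_field(s, ::Type{Float64}, dcml) = parse(Float64, s) / 10.0^dcml
_parse_field(s, ::Type{String}, dcml)  = String(s)

# Parse one fixed-width line without building intermediate vectors:
# `SubString` is a view into `line` (no String copy per field), and
# `ntuple` walks the columns once.
function parse_line(line, range_vec, type_vec, dcml_vec)
    ntuple(i -> _parse_field(strip(SubString(line, range_vec[i])),
                             type_vec[i], dcml_vec[i]),
           length(range_vec))
end
```

For example, `parse_line("  12  35 abc", [1:4, 5:8, 9:12], [Int, Float64, String], [0, 1, 0])` would yield `(12, 3.5, "abc")`. Whether this actually closes the gap with hipread would need benchmarking, but it removes the three per-line allocations that the flame graph points at.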

After some further analysis with Profile.jl and PProf.jl, I obtained the following flame graph of allocations in the code:

[allocation flame graph image]

Although the image is not great, one can see that the majority of the allocations occur in the map function call, as you were imagining, Krishna. Here's the code I ran to generate this flame graph:

using IPUMS
using Profile
using PProf

ddi = parse_ddi("cps_00157.xml");

Profile.Allocs.clear()
Profile.Allocs.@profile load_ipums_extract(ddi, "cps_00157.dat");
PProf.Allocs.pprof()

I attached the allocation profile file here as well.

alloc-profile.pb.gz

I also did a quick analysis of the computation time. It appears that much of the function's runtime is spent resolving types at runtime rather than doing actual parsing work. This flame graph is a bit inscrutable, but I figured I would include it here as well:

[time flame graph image]
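To make the runtime type-resolution cost concrete, here is a tiny self-contained illustration (not IPUMS.jl code) of why containers with abstract element types force dynamic dispatch:

```julia
# Hypothetical illustration: when element types are only known at runtime
# (Vector{Any}), every call to `f` below is dynamically dispatched; with
# concretely typed input the compiler can specialize each call site.
f(x::Int)     = x + 1
f(x::Float64) = x * 2
f(x::String)  = length(x)

untyped = Any[1, 2.0, "3"]         # element types opaque to the compiler
results = [f(x) for x in untyped]  # each call dispatches at runtime
```

The comprehensions in the loader over `range_vec`, `dtype_vec`, and `dcml_vec` may similarly produce vectors whose element types are not concrete, which would match the time the profile shows being spent on type resolution.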

Concluding Thoughts

In short, it seems like there is a huge opportunity to optimize the loading functionality by taking more direct advantage of the fixed-width files, though I am not entirely sure how to do that yet. I'd be curious to hear your thoughts, @00krishna.
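As one possible starting point for discussion, here is a hedged sketch of exploiting the fixed-width layout: parse each column from its fixed character range directly into a concretely typed vector, so column types are pinned once up front instead of being re-resolved on every row. The column specs below are hypothetical, not derived from an actual DDI:

```julia
# Hedged sketch, not the package's API: read fixed-width records into
# per-column vectors with concrete element types.
function read_fixed_width(io::IO, ranges::Vector{UnitRange{Int}},
                          types::Vector{DataType})
    cols = [T[] for T in types]  # one concretely typed vector per column
    for line in eachline(io)
        for (j, r) in enumerate(ranges)
            # SubString is a view into `line`; no copy until String() below.
            field = strip(SubString(line, r))
            T = types[j]
            push!(cols[j], T === String ? String(field) : parse(T, field))
        end
    end
    return cols
end

# Example with made-up column specs:
io = IOBuffer("  12  35 abc\n  34  70 def\n")
cols = read_fixed_width(io, [1:4, 5:8, 9:12], DataType[Int, Int, String])
```

This still pays a dispatch cost at the `push!`, but the column storage is concretely typed, which is the property hipread-style readers lean on. A function-barrier version (one specialized inner function per column type) might get closer still.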

Otherwise, great stuff!

00krishna commented 3 weeks ago

So I have created PR https://github.com/JuliaHealth/IPUMS.jl/pull/30 to help improve the performance. @TheCedarPrince is reviewing it. We will work on further improvements as well.

TheCedarPrince commented 3 weeks ago

That was merged, @00krishna -- thanks for adding this in. Are you still able to post that minimal working example to Discourse? I'd be curious to see how we could keep hacking at it to someday make it as fast as tidyr's reader.