gettalong / hexapdf

Versatile PDF creation and manipulation for Ruby
https://hexapdf.gettalong.org

Reduce string memory usage on Tokenizer#prepare_string_scanner #319

Closed rainerborene closed 1 month ago

rainerborene commented 1 month ago

I was able to reduce memory usage by about 5% in the Tokenizer#prepare_string_scanner method and cut some String object allocations as well. Here is the script I used to benchmark this change with the memory_profiler gem:

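require "hexapdf"

# Load an existing (large) PDF and add a short text overlay to every page.
doc = HexaPDF::Document.new(io: File.open(File.expand_path("./big_pdf.pdf"), "rb"))
doc.pages.each do |page|
  canvas = page.canvas(type: :overlay)
  canvas.font("Helvetica", size: 8, variant: :italic)
  canvas.text("Something", at: [10, 10])
end
# Write the changes as an incremental update, skipping validation.
doc.write("test.pdf", incremental: true, validate: false)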

Before:

Total allocated: 104.17 MB (1025017 objects)
Total retained:  18.14 MB (119724 objects)

After:

Total allocated: 99.69 MB (1024478 objects)
Total retained:  18.15 MB (119726 objects)

gettalong commented 1 month ago

Thank you for the pull request! I have added some comments and will benchmark the change as a whole later on.

Could you please sign the CLA - see https://hexapdf.gettalong.org/contributing.html - so that I can incorporate the changes into HexaPDF?

rainerborene commented 1 month ago

I've signed the CLA (see your inbox) and pushed the suggested changes. 🚀

Edit: I renamed the instance variable @io_partial to @io_chunk, which fits better in this context.
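
For readers following along, here is a minimal sketch of the general technique for cutting String allocations when reading an IO in chunks: passing a reusable output buffer as the second argument to IO#read. This is not the code from the pull request. The method names, the CHUNK_SIZE constant, and the use of StringIO as a stand-in for a PDF file are all assumptions made for illustration.

```ruby
require "stringio"

CHUNK_SIZE = 8192 # bytes per read; assumed here, matching the figure quoted in this thread

# Reads the IO chunk by chunk and lets IO#read allocate a new String per call.
# Returns the concatenated data and the number of chunks read.
def read_with_fresh_strings(io)
  data = +""
  chunks = 0
  while (chunk = io.read(CHUNK_SIZE))
    data << chunk
    chunks += 1
  end
  [data, chunks]
end

# Reads the IO chunk by chunk, handing the same String to IO#read as the
# output buffer so it is refilled in place instead of being allocated anew.
# Returns the concatenated data and the number of chunks read.
def read_with_reused_buffer(io)
  buffer = String.new(capacity: CHUNK_SIZE)
  data = +""
  chunks = 0
  while io.read(CHUNK_SIZE, buffer)
    data << buffer
    chunks += 1
  end
  [data, chunks]
end

source = "x" * (CHUNK_SIZE * 4 + 100)
fresh_data, fresh_chunks = read_with_fresh_strings(StringIO.new(source))
reused_data, reused_chunks = read_with_reused_buffer(StringIO.new(source))

puts fresh_data == reused_data # true
puts fresh_chunks              # 5
puts reused_chunks             # 5
```

Both methods return the same data. The second one hands a single String to every read call, which is the kind of saving that shows up as a lower "Total allocated" figure in memory_profiler.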

gettalong commented 1 month ago

Running my usual real-world benchmarks, I didn't find any specific memory savings or performance improvements due to this change. However, there is certainly a difference for larger PDF files when running the simple memory benchmark:

ruby -Ilib:. -r prof_memory -r hexapdf -e "HexaPDF::Document.open('path_to_file') {|doc| doc.each(only_current: false) {|o| } }"

(prof_memory is just a simple wrapper for using the memory_profiler gem).

Before:

Total allocated: 1.74 GB (14943439 objects)
Total retained:  456.19 kB (4842 objects)

After:

Total allocated: 1.24 GB (14884174 objects)
Total retained:  456.27 kB (4844 objects)

There is a difference of ~60,000 objects. We would expect these objects to be the ones saved at https://github.com/gettalong/hexapdf/pull/319/files#diff-e750dfc750e9c877f39d4174d2b388352eb139a5b1e9f6911edba0ff25afa659R443. And since each of those objects holds 8,192 bytes, that roughly accounts for the ~500 MB difference in memory usage.
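
As a quick sanity check of that estimate, the arithmetic can be worked through in a few lines of Ruby. The object counts are copied from the memory_profiler output above. The 8,192-byte size per saved object is taken from the comment itself, and the constant and method names are made up for this sketch.

```ruby
# Object counts copied from the "Before"/"After" memory_profiler output above.
OBJECTS_BEFORE = 14_943_439
OBJECTS_AFTER = 14_884_174
CHUNK_SIZE_BYTES = 8_192 # assumed size of each saved String, per the comment

# Number of objects no longer allocated after the change.
def saved_objects
  OBJECTS_BEFORE - OBJECTS_AFTER
end

# Estimated memory saved, in megabytes (10^6 bytes), rounded to one decimal.
def saved_megabytes
  (saved_objects * CHUNK_SIZE_BYTES / 1_000_000.0).round(1)
end

puts saved_objects   # 59265
puts saved_megabytes # 485.5
```

The result, about 485 MB, lines up with the reported drop from 1.74 GB to 1.24 GB in total allocated memory.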

rainerborene commented 1 month ago

You might want to try the snippet from the PR description for the benchmark as well, since it might change the results. I ran a test on my computer with a 15 MB PDF and there's a minor improvement.

gettalong commented 1 month ago

@rainerborene I have merged your changes and pushed to the devel branch - see https://github.com/gettalong/hexapdf/commit/3aeec254b1b4329d033a8318e50d6db5709c7b33