I just realised that for an array of 1000 strings, each 50-300 characters long (URL titles and descriptions generated by gottfrois/link_thumbnailer), the following causes a much higher memory load…
array.map {|string| PragmaticSegmenter::Segmenter.new(text: string, language: 'de').segment }
…than this here:
PragmaticSegmenter::Segmenter.new(text: array.join('\r'), language: 'de').segment
In my tests it's a 30-50 MB difference. I assume the objects created inside #map are not garbage collected one by one, but all at once after the entire array has been mapped.
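To check an assumption like this, a rough measurement helper can be useful. The sketch below counts object allocations with `GC.stat(:total_allocated_objects)`. The helper name `allocated_objects` and the `split` workload are placeholders made up for illustration; in a real test the two blocks would hold the two Segmenter calls from above. Allocation counts are also not the same thing as the resident memory difference reported here, so this is only a first indicator.

```ruby
# Hypothetical helper: returns the number of objects allocated while the
# block runs. GC.stat(:total_allocated_objects) is a running counter, so the
# difference before/after is the allocation count for the block.
def allocated_objects
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

# Placeholder data and workload, standing in for the real titles and for
# the Segmenter calls.
strings = Array.new(1000) { |i| "Title #{i}. Description number #{i}." }

per_string = allocated_objects { strings.map { |s| s.split('. ') } }
joined     = allocated_objects { strings.join("\r").split('. ') }

puts "per-string: #{per_string} objects, joined: #{joined} objects"
```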
@diasks2 would you consider updating the API to also support:
ps = PragmaticSegmenter::Segmenter.new(language: 'de'); array.map {|string| ps.segment(string) }
…which would allow reusing the Segmenter object and would most likely reduce memory load? The old API could still be supported by allowing initialisation without a text and adding an optional argument to #segment.
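A minimal sketch of how both call styles could coexist, assuming a default argument on #segment. `ReusableSegmenter` is a hypothetical stand-in, not the gem's actual class, and its sentence splitting is a naive placeholder; only the method signatures matter here.

```ruby
# Hypothetical stand-in, NOT PragmaticSegmenter's real implementation.
# It only illustrates the proposed, backward-compatible signatures.
class ReusableSegmenter
  # `text:` becomes optional so an instance can be created once and reused.
  def initialize(text: nil, language: 'en')
    @text = text
    @language = language
  end

  # Old style: no argument, segments the text given to the constructor.
  # New style: segments the text passed in.
  def segment(text = @text)
    return [] if text.nil? || text.empty?

    # Naive placeholder: split on whitespace that follows ., ! or ?
    text.split(/(?<=[.!?])\s+/)
  end
end

# Old API keeps working:
ReusableSegmenter.new(text: 'Hallo Welt. Wie geht es?', language: 'de').segment

# Proposed API, one instance for the whole array:
ps = ReusableSegmenter.new(language: 'de')
['Eins. Zwei.', 'Drei.'].map { |string| ps.segment(string) }
```

Whether reuse actually lowers memory depends on how much state the real Segmenter builds per instance, which would need to be measured.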
As a side note, I've noticed lots of #gsub calls which can probably be replaced with #gsub! to reduce the strain on the garbage collector. I'll submit a PR whenever I get to it; unfortunately my current workload only allows me to report the issue and not much more.
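For illustration, this is roughly what such a replacement could look like. The two cleanup steps are made up for the example and are not taken from the gem's source. One thing to watch out for: #gsub! returns nil when nothing was substituted, so its return value cannot be chained or returned directly.

```ruby
# Hypothetical cleanup steps, not from the gem's source.

# Chained #gsub: every call returns a new String.
def clean_with_gsub(text)
  text.gsub(/\s+/, ' ').gsub(/\.{2,}/, '.')
end

# #gsub! variant: one explicit copy, then in-place edits.
def clean_with_gsub_bang(text)
  result = text.dup            # keep the caller's string untouched
  result.gsub!(/\s+/, ' ')     # returns nil if nothing matched, so do not chain
  result.gsub!(/\.{2,}/, '.')
  result                       # return the string, not the #gsub! return value
end

puts clean_with_gsub_bang("a  b.. c")
```

Any saving comes from skipping the intermediate strings, so it would be worth measuring before and after.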
Thanks!