HazyResearch / fonduer

A knowledge base construction engine for richly formatted data
https://fonduer.readthedocs.io/
MIT License

Resolve a memory leak by large data on out_queue (related to #494) #545

Closed YasushiMiyata closed 3 years ago

YasushiMiyata commented 3 years ago

Description of the problems or issues

Is your pull request related to a problem? Please describe. Fonduer speeds up document parsing with multiprocessing. Each process takes documents from in_queue (shared memory) and puts the parsed data and the document name on out_queue (shared memory). This is a well-known pattern, but it can hang because of a memory leak in shared memory. The previous code put the parsed (relatively large) data on out_queue; another process then read the data from out_queue and committed it to the PostgreSQL database. See also #494.

Does your pull request fix any issue? See #494.

Description of the proposed changes

Change the out_queue payload to the document name only, without the parsed data. Instead of committing data via out_queue, each worker process commits its parsed data before putting the document name on out_queue.
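As a rough illustration of why the smaller payload matters: `multiprocessing.Queue` pickles every object put on it, so the size of the pickled message approximates what each `put` pushes through the queue. The sketch below compares the two message shapes. The names `message_size`, `doc_name` and `parsed_data` are hypothetical, and the list of strings is only a stand-in for Fonduer's real parsed objects.

```python
import pickle


def message_size(message) -> int:
    """Return the number of bytes the message occupies once pickled."""
    return len(pickle.dumps(message))


# Hypothetical stand-ins: a document name and its (relatively large) parsed data.
doc_name = "doc_0001"
parsed_data = ["sentence %d" % i for i in range(10_000)]

old_message = (doc_name, parsed_data)  # previous design: name plus parsed data
new_message = doc_name                 # proposed design: name only

print("old message:", message_size(old_message), "bytes")
print("new message:", message_size(new_message), "bytes")
```

The absolute numbers depend on the stand-in data; the point is only that the name-only message stays small no matter how large the parsed document is.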

Test plan

Run the existing tests and monitor Python memory usage. In my case (3,000 HTML files, 12 MB total), Python memory usage dropped from 1.4 GB to 700 MB.
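The PR does not say which tool was used to monitor memory. One possible approach, shown here only as a sketch, is the standard-library `tracemalloc` module. The helper name `peak_python_memory` is hypothetical, and `tracemalloc` only sees allocations made by the Python interpreter in the current process, so it would not capture worker processes without extra work.

```python
import tracemalloc


def peak_python_memory(fn, *args, **kwargs):
    """Run fn and return (result, peak bytes traced by tracemalloc)."""
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


# Usage: measure a function that allocates roughly 10 MB.
data, peak = peak_python_memory(lambda: bytearray(10_000_000))
print("peak traced memory: %.1f MB" % (peak / 1e6))
```

For a whole multi-process run, an OS-level view of the processes (for example the resident set size) is likely closer to the figures quoted above.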


YasushiMiyata commented 3 years ago

Fix the add & commit step for multi-threaded runs. parser.py, labeler.py and featurizer.py generate data in multiple threads, but they added and committed that data in a single process via out_queue. So I updated the code
from

doc(html) -- in_queue -- th1: parser -- out_queue(doc name, parsed data) -- writer(parsed data)
                      |- th2: parser -|
                      |- th3: parser -|
                      |- th4: parser -|

to

doc(html) -- in_queue -- th1: parser -- writer(data) -- out_queue(doc name)
                      |- th2: parser -- writer(data) -|
                      |- th3: parser -- writer(data) -|
                      |- th4: parser -- writer(data) -|

This change reduces memory usage and prevents memory leaks because out_queue now holds less data.
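The second diagram can be sketched roughly as follows. This is not Fonduer's actual code: `parse`, `init_db`, `worker` and `run_pipeline` are hypothetical names, and `sqlite3` is used only as a self-contained stand-in for the PostgreSQL database. The key point is that each worker commits its own parsed data and then puts only the document name on out_queue.

```python
import multiprocessing as mp
import sqlite3

SENTINEL = None  # tells a worker that no more documents are coming


def parse(html: str) -> str:
    """Hypothetical stand-in for the real parser."""
    return html.upper()


def init_db(db_path: str) -> None:
    """Create the table the workers write to (stand-in for the real schema)."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS parsed (name TEXT PRIMARY KEY, data TEXT)")
    conn.commit()
    conn.close()


def worker(in_queue, out_queue, db_path: str) -> None:
    """Parse documents, commit them locally, and report only their names."""
    conn = sqlite3.connect(db_path)  # each worker owns its own connection
    try:
        while True:
            item = in_queue.get()
            if item is SENTINEL:
                break
            name, html = item
            conn.execute("INSERT INTO parsed (name, data) VALUES (?, ?)", (name, parse(html)))
            conn.commit()        # commit before announcing completion
            out_queue.put(name)  # small message: the document name only
    finally:
        conn.close()


def run_pipeline(docs: dict, db_path: str, n_workers: int = 4) -> list:
    """Fan documents out to worker processes and collect the finished names."""
    init_db(db_path)
    in_queue, out_queue = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(in_queue, out_queue, db_path))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    for item in docs.items():
        in_queue.put(item)
    for _ in workers:
        in_queue.put(SENTINEL)  # one sentinel per worker
    done = [out_queue.get() for _ in docs]  # drain out_queue before joining
    for w in workers:
        w.join()
    return done
```

When trying this out, call `run_pipeline` from under an `if __name__ == "__main__":` guard, since some platforms start worker processes by re-importing the main module. The sketch also omits error handling: if a worker crashes, the collecting loop would wait forever.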

codecov-commenter commented 3 years ago

Codecov Report

Merging #545 (3d05728) into master (b1d72be) will increase coverage by 0.00%. The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master     #545   +/-   ##
=======================================
  Coverage   86.07%   86.08%           
=======================================
  Files          92       92           
  Lines        4776     4779    +3     
  Branches      899      899           
=======================================
+ Hits         4111     4114    +3     
  Misses        475      475           
  Partials      190      190           
| Flag | Coverage Δ |
|---|---|
| unittests | 86.08% <100.00%> (+<0.01%) :arrow_up: |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|---|---|
| src/fonduer/features/featurizer.py | 86.02% <100.00%> (ø) |
| src/fonduer/parser/parser.py | 93.41% <100.00%> (ø) |
| src/fonduer/supervision/labeler.py | 70.37% <100.00%> (ø) |
| src/fonduer/utils/udf.py | 88.88% <100.00%> (+0.31%) :arrow_up: |