**Closed** — YasushiMiyata closed this pull request 3 years ago
Fix the add & commit process under multi-threading. `parser.py`, `labeler.py`, and `featurizer.py` generate data on multiple threads, but they added & committed data in a single process through `out_queue`. So I updated the code
from

```
doc(html) -- in_queue -- th1: parser -- out_queue(doc name, parsed data) -- writer(parsed data)
                      |- th2: parser -|
                      |- th3: parser -|
                      |- th4: parser -|
```

to

```
doc(html) -- in_queue -- th1: parser -- writer(data) -- out_queue(doc name)
                      |- th2: parser -- writer(data) -|
                      |- th3: parser -- writer(data) -|
                      |- th4: parser -- writer(data) -|
```
This change reduces memory usage and prevents memory leaks because `out_queue` holds far less data.
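The new flow can be sketched with Python's standard `queue` and `threading` modules. This is a minimal illustration, not Fonduer's actual code: `parse`, `write`, and the sentinel handling are hypothetical stand-ins for the real parser and DB commit.

```python
import queue
import threading

def parse(doc):
    # Stand-in for the real parser: returns (doc_name, parsed_data).
    return doc, f"parsed:{doc}"

def write(doc_name, data, store):
    # Stand-in for the DB add & commit step; here we just record the data.
    store[doc_name] = data

def worker(in_queue, out_queue, store):
    while True:
        doc = in_queue.get()
        if doc is None:               # sentinel: no more work for this thread
            break
        doc_name, data = parse(doc)
        write(doc_name, data, store)  # commit inside the worker thread...
        out_queue.put(doc_name)       # ...then forward only the small doc name

in_q, out_q, store = queue.Queue(), queue.Queue(), {}
docs = ["a.html", "b.html", "c.html"]
for doc in docs:
    in_q.put(doc)

threads = [threading.Thread(target=worker, args=(in_q, out_q, store))
           for _ in range(4)]
for t in threads:
    t.start()
for _ in threads:
    in_q.put(None)  # one sentinel per worker so every thread exits
for t in threads:
    t.join()

finished = sorted(out_q.get() for _ in docs)
print(finished)
```

The key difference from the old flow is that `out_queue` only ever carries short document names; the large parsed payloads never cross the shared queue.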
Merging #545 (3d05728) into master (b1d72be) will increase coverage by 0.00%. The diff coverage is 100.00%.
```
@@           Coverage Diff           @@
##           master     #545   +/-   ##
=======================================
  Coverage   86.07%   86.08%
=======================================
  Files          92       92
  Lines        4776     4779    +3
  Branches      899      899
=======================================
+ Hits         4111     4114    +3
  Misses        475      475
  Partials      190      190
```
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 86.08% <100.00%> (+<0.01%) | :arrow_up: |

Flags with carried forward coverage won't be shown.
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/fonduer/features/featurizer.py | 86.02% <100.00%> (ø) | |
| src/fonduer/parser/parser.py | 93.41% <100.00%> (ø) | |
| src/fonduer/supervision/labeler.py | 70.37% <100.00%> (ø) | |
| src/fonduer/utils/udf.py | 88.88% <100.00%> (+0.31%) | :arrow_up: |
Description of the problems or issues
Is your pull request related to a problem? Please describe. Fonduer accelerates document parsing with multi-processing. Each process gets documents from `in_queue` (shared memory) and puts the parsed data and document name into `out_queue` (shared memory). This is a well-known pattern, but it can hang due to memory buildup in the shared queue. The previous code put the (relatively large) parsed data into `out_queue`; another process then got the data from `out_queue` and committed it to the Postgres DB. See also #494.

Does your pull request fix any issue? See #494.
Description of the proposed changes
Change the `out_queue` payload to only the document name, excluding the parsed data. Instead of committing data received via `out_queue`, each worker thread commits its parsed data before putting the document name into `out_queue`.

Test plan
Run the existing tests and monitor Python memory usage. In my case (3000 HTML files, 12 MB total), Python memory usage dropped from 1.4 GB to 700 MB.
Checklist