torch.multiprocessing.spawn.ProcessExitedException: process 3 terminated with signal SIGKILL

cvg / glue-factory

Training library for local feature detection and matching

Apache License 2.0

torch.multiprocessing.spawn.ProcessExitedException: process 3 terminated with signal SIGKILL #19

Closed GONGXI1994 closed 1 year ago

GONGXI1994 commented 1 year ago

When I train LightGlue using

python -m gluefactory.train sp+lg_megadepth \
    --conf gluefactory/configs/superpoint-open+lightglue_megadepth.yaml \
    train.load_experiment=sp+lg_homography \
    data.load_features.do=True --distributed

the process is killed after this log line:

[10/17/2023 04:26:12 gluefactory INFO] [E 4 | it 1000] loss {total 1.731E+00, last 7.856E-01, assignment_nll 7.856E-01, nll_pos 1.262E+00, nll_neg 3.087E-01, num_matchable 4.165E+02, num_unmatchable 7.160E+02, confidence 2.601E-01, row_norm 8.259E-01}

Can you offer me some advice on how to solve this problem? Thanks!

sarlinpe commented 1 year ago

How does the RAM usage evolve throughout the training on MegaDepth?

GONGXI1994 commented 1 year ago

How does the RAM usage evolve throughout the training on MegaDepth?

About 50% (of 125 GB total) when I begin training on MegaDepth. I did not inspect the RAM usage at the moment the training crashed.

GONGXI1994 commented 1 year ago

Max RAM usage: 98% just before the training crashed.
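
For anyone who wants to track how RAM usage evolves during training, as asked above, here is a minimal sketch that computes the system-wide used-RAM percentage from /proc/meminfo. It assumes a Linux host; the helper names (parse_meminfo, used_ram_percent, current_ram_percent) are illustrative and not part of glue-factory.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def used_ram_percent(info):
    """Percentage of RAM in use, based on MemTotal and MemAvailable."""
    total = info["MemTotal"]
    available = info["MemAvailable"]
    return 100.0 * (total - available) / total


def current_ram_percent(path="/proc/meminfo"):
    """Read the live value on a Linux host (hypothetical helper)."""
    with open(path) as f:
        return used_ram_percent(parse_meminfo(f.read()))
```

Calling current_ram_percent() every few hundred iterations and writing the value to the training log would show whether memory grows steadily until the process is killed.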

GONGXI1994 commented 1 year ago

When I set conf.plot to None in the function do_evaluation(), everything works. The max RAM usage drops to 75%. Thanks for your great work!

sarlinpe commented 1 year ago

I have optimized how we handle figures during training in PR https://github.com/cvg/glue-factory/pull/30. Does this help?
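
The contents of that PR are not shown in this thread. As a general illustration only, one common way plotting during evaluation can inflate RAM is that matplotlib keeps every figure alive until it is explicitly closed. The sketch below assumes matplotlib is installed; make_eval_figures and close_figures are hypothetical helpers, not glue-factory functions.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend, suitable for training servers
import matplotlib.pyplot as plt


def make_eval_figures(n):
    """Create n small figures, standing in for evaluation plots."""
    figs = []
    for i in range(n):
        fig, ax = plt.subplots()
        ax.plot([0, 1], [0, i])
        figs.append(fig)
    return figs


def close_figures(figs):
    """Release the figures so pyplot no longer holds references to them."""
    for fig in figs:
        plt.close(fig)


figs = make_eval_figures(3)
open_before = len(plt.get_fignums())  # figures still tracked by pyplot
close_figures(figs)
open_after = len(plt.get_fignums())   # nothing tracked after closing
```

If figures are logged at every evaluation, closing them right after they are written keeps the number of live figures constant instead of growing with each epoch.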