KSP-CKAN / NetKAN-Infra

NetKAN Infrastructure Repo
MIT License
3 stars 6 forks source link

[Bug] Scheduler failing CKAN-meta checkout #279

Closed HebaruSan closed 1 year ago

HebaruSan commented 1 year ago

Problem

The Scheduler has been printing this to Discord every 30 minutes starting about 7 hour ago (which means the bot is no longer running):

Uncaught exception:
Traceback (most recent call last):
  File ".local/bin/netkan", line 8, in <module>
    sys.exit(netkan())
  File "/home/netkan/.local/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/netkan/.local/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/netkan/.local/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/netkan/.local/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/netkan/.local/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/netkan/.local/lib/python3.7/site-packages/click/decorators.py", line 84, in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
  File "/home/netkan/.local/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/netkan/.local/lib/python3.7/site-packages/netkan/cli/services.py", line 60, in scheduler
    common.netkan_repo, common.ckanmeta_repo, common.queue, common.token,
  File "/home/netkan/.local/lib/python3.7/site-packages/netkan/cli/common.py", line 68, in __getattribute__
    attr = super().__getattribute__(name)
  File "/home/netkan/.local/lib/python3.7/site-packages/netkan/cli/common.py", line 101, in ckanmeta_repo
    init_repo(self._ckanmeta_remote, '/tmp/CKAN-meta', self.deep_clone))
  File "/home/netkan/.local/lib/python3.7/site-packages/netkan/utils.py", line 14, in init_repo
    Repo.clone_from(metadata, clone_path, depth=1))
  File "/home/netkan/.local/lib/python3.7/site-packages/git/repo/base.py", line 1275, in clone_from
    return cls._clone(git, url, to_path, GitCmdObjectDB, progress, multi_options, **kwargs)
  File "/home/netkan/.local/lib/python3.7/site-packages/git/repo/base.py", line 1194, in _clone
    finalize_process(proc, stderr=stderr)
  File "/home/netkan/.local/lib/python3.7/site-packages/git/util.py", line 419, in finalize_process
    proc.wait(**kwargs)
  File "/home/netkan/.local/lib/python3.7/site-packages/git/cmd.py", line 559, in wait
    raise GitCommandError(remove_password_if_present(self.args), status, errstr)
git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
  cmdline: git clone -v --depth=1 git@github.com:KSP-CKAN/CKAN-meta.git /tmp/CKAN-meta
  stderr: 'Cloning into '/tmp/CKAN-meta'...
Updating files:  35% (9553/26646)
Updating files:  36% (9593/26646)
Updating files:  37% (9860/26646)
Updating files:  38% (10126/26646)
Updating files:  39% (10392/26646)
Updating files:  40% (10659/26646)
Updating files:  41% (10925/26646)
Updating files:  42% (11192/26646)
Updating files:  43% (11458/26646)
Updating files:  43% (11527/26646)
Updating files:  44% (11725/26646)
Updating files:  45% (11991/26646)
Updating files:  45% (11994/26646)
Updating files:  45% (12157/26646)
Updating files:  46% (12258/26646)
Updating files:  46% (12440/26646)
Updating files:  47% (12524/26646)
Updating files:  47% (12542/26646)
Updating files:  48% (12791/26646)
Updating files:  49% (13057/26646)
Updating files:  49% (13077/26646)
Updating files:  50% (13323/26646)
Updating files:  51% (13590/26646)
Updating files:  51% (13659/26646)
Updating files:  51% (13768/26646)
Updating files:  52% (13856/26646)
Updating files:  52% (14122/26646)
Updating files:  53% (14123/26646)
Updating files:  54% (14389/26646)
Updating files:  55% (14656/26646)
Updating files:  55% (14696/26646)
Updating files:  56% (14922/26646)
Updating files:  56% (14929/26646)
Updating files:  56% (15026/26646)
Updating files:  57% (15189/26646)
Updating files:  57% (15230/26646)
Updating files:  58% (15455/26646)
Updating files:  59% (15722/26646)
Updating files:  60% (15988/26646)
Updating files:  60% (16206/26646)
Updating files:  61% (16255/26646)
Updating files:  61% (16328/26646)
Updating files:  62% (16521/26646)
Updating files:  63% (16787/26646)
Updating files:  63% (16886/26646)
Updating files:  64% (17054/26646)
Updating files:  64% (17192/26646)
Updating files:  64% (17218/26646)
Updating files:  65% (17320/26646)
Updating files:  65% (17341/26646)
Updating files:  65% (17352/26646)
Updating files:  66% (17587/26646)
Updating files:  66% (17612/26646)
Updating files:  66% (17618/26646)
Updating files:  67% (17853/26646)
Updating files:  67% (17944/26646)
Updating files:  67% (17991/26646)
Updating files:  67% (18009/26646)
Updating files:  68% (18120/26646)
Updating files:  68% (18173/26646)
Updating files:  69% (18386/26646)
Updating files:  69% (18456/26646)
Updating files:  69% (18537/26646)
Updating files:  70% (18653/26646)
Updating files:  70% (18697/26646)
Updating files:  70% (18779/26646)
Updating files:  70% (18871/26646)
Updating files:  71% (18919/26646)
Updating files:  71% (19106/26646)
Updating files:  72% (19186/26646)
Updating files:  72% (19191/26646)
Updating files:  72% (19218/26646)
Updating files:  72% (19264/26646)
Updating files:  72% (19279/26646)
Updating files:  72% (19301/26646)
Updating files:  72% (19340/26646)
Updating files:  72% (19370/26646)
Updating files:  72% (19410/26646)
Updating files:  72% (19443/26646)
Updating files:  73% (19452/26646)
Updating files:  73% (19483/26646)
Updating files:  73% (19502/26646)
Updating files:  73% (19508/26646)
Updating files:  73% (19534/26646)
Updating files:  73% (19569/26646)
Updating files:  73% (19591/26646)
Updating files:  73% (19614/26646)
Updating files:  73% (19632/26646)
Updating files:  73% (19646/26646)
Updating files:  73% (19672/26646)
Updating files:  74% (19719/26646)
Updating files:  74% (19816/26646)
Updating files:  74% (19942/26646)
Updating files:  75% (19985/26646)
Updating files:  75% (20021/26646)
Updating files:  75% (20046/26646)
Updating files:  75% (20058/26646)
Updating files:  75% (20114/26646)
Updating files:  75% (20129/26646)
Updating files:  75% (20165/26646)
Updating files:  75% (20248/26646)
Updating files:  76% (20251/26646)
Updating files:  76% (20271/26646)
Updating files:  76% (20308/26646)
Updating files:  76% (20325/26646)
Updating files:  76% (20353/26646)
Updating files:  76% (20386/26646)
Updating files:  76% (20481/26646)
Updating files:  77% (20518/26646)
Updating files:  77% (20530/26646)
Updating files:  77% (20553/26646)
Updating files:  77% (20576/26646)
Updating files:  77% (20633/26646)
Updating files:  77% (20676/26646)
Updating files:  77% (20766/26646)
Updating files:  78% (20784/26646)
Updating files:  78% (20811/26646)
Updating files:  78% (20820/26646)
Updating files:  78% (20834/26646)
Updating files:  78% (20856/26646)
Updating files:  78% (20871/26646)
Updating files:  78% (20904/26646)
Updating files:  78% (20921/26646)
Updating files:  78% (20950/26646)
Updating files:  78% (21013/26646)
Updating files:  78% (21029/26646)
Updating files:  79% (21051/26646)
Updating files:  79% (21061/26646)
Updating files:  79% (21070/26646)
Updating files:  79% (21085/26646)
Updating files:  79% (21094/26646)
Updating files:  79% (21109/26646)
Updating files:  79% (21128/26646)
Updating files:  79% (21147/26646)
Updating files:  79% (21189/26646)
Updating files:  79% (21205/26646)
Updating files:  79% (21238/26646)
Updating files:  79% (21254/26646)
Updating files:  79% (21278/26646)
Updating files:  80% (21317/26646)
Updating files:  80% (21326/26646)
Updating files:  80% (21350/26646)
Updating files:  80% (21397/26646)
Updating files:  80% (21405/26646)
Updating files:  80% (21441/26646)
Updating files:  80% (21451/26646)
Updating files:  80% (21472/26646)
Updating files:  80% (21532/26646)
Updating files:  81% (21584/26646)
Updating files:  81% (21588/26646)
Updating files:  81% (21661/26646)
Updating files:  81% (21691/26646)
Updating files:  81% (21723/26646)
Updating files:  81% (21748/26646)
Updating files:  81% (21766/26646)
Updating files:  81% (21788/26646)
Updating files:  81% (21820/26646)
Updating files:  81% (21838/26646)
Updating files:  82% (21850/26646)
Updating files:  82% (21863/26646)
Updating files:  82% (21877/26646)
Updating files:  82% (21890/26646)
Updating files:  82% (21924/26646)
Updating files:  82% (21945/26646)
Updating files:  82% (22047/26646)
Updating files:  82% (22071/26646)
Updating files:  82% (22104/26646)
Updating files:  83% (22117/26646)
Updating files:  83% (22171/26646)
Updating files:  83% (22198/26646)
Updating files:  83% (22242/26646)
Updating files:  83% (22268/26646)
Updating files:  83% (22287/26646)
Updating files:  83% (22327/26646)
Updating files:  83% (22361/26646)
Updating files:  84% (22383/26646)
Updating files:  84% (22409/26646)
Updating files:  84% (22447/26646)
Updating files:  84% (22498/26646)
Updating files:  84% (22542/26646)
Updating files:  84% (22560/26646)
Updating files:  84% (22586/26646)
Updating files:  84% (22623/26646)
Updating files:  84% (22648/26646)
Updating files:  85% (22650/26646)
Updating files:  85% (22682/26646)
Updating files:  85% (22692/26646)
Updating files:  85% (22719/26646)
Updating files:  85% (22729/26646)
Updating files:  85% (22738/26646)
Updating files:  85% (22785/26646)
Updating files:  85% (22803/26646)
Updating files:  85% (22815/26646)
Updating files:  85% (22827/26646)
Updating files:  85% (22845/26646)
Updating files:  85% (22853/26646)
Updating files:  85% (22871/26646)
Updating files:  85% (22881/26646)
Updating files:  85% (22915/26646)
Updating files:  86% (22916/26646)
Updating files:  86% (22961/26646)
Updating files:  86% (22974/26646)
Updating files:  86% (23008/26646)
Updating files:  86% (23021/26646)
Updating files:  86% (23044/26646)
Updating files:  86% (23080/26646)
Updating files:  86% (23125/26646)
Updating files:  86% (23133/26646)
Updating files:  86% (23161/26646)
Updating files:  86% (23173/26646)
Updating files:  87% (23183/26646)
Updating files:  87% (23184/26646)
Updating files:  87% (23229/26646)
Updating files:  87% (23244/26646)
Updating files:  87% (23258/26646)
Updating files:  87% (23278/26646)
Updating files:  87% (23299/26646)
Updating files:  87% (23320/26646)
Updating files:  87% (23365/26646)
Updating files:  87% (23389/26646)
Updating files:  87% (23403/26646)
Updating files:  87% (23446/26646)
Updating files:  88% (23449/26646)
Updating files:  88% (23472/26646)
Updating files:  88% (23500/26646)
Updating files:  88% (23517/26646)
Updating files:  88% (23529/26646)
Updating files:  88% (23557/26646)
Updating files:  88% (23572/26646)
Updating files:  88% (23603/26646)
Updating files:  88% (23627/26646)
Updating files:  88% (23645/26646)
Updating files:  88% (23654/26646)
Updating files:  88% (23680/26646)
Updating files:  88% (23692/26646)
Updating files:  88% (23708/26646)
Updating files:  89% (23715/26646)
Updating files:  89% (23760/26646)
Updating files:  89% (23773/26646)
Updating files:  89% (23792/26646)
Updating files:  89% (23840/26646)
Updating files:  89% (23896/26646)
Updating files:  89% (23944/26646)
Updating files:  90% (23982/26646)
Updating files:  90% (23988/26646)
Updating files:  90% (24041/26646)
Updating files:  90% (24060/26646)
Updating files:  90% (24078/26646)
Updating files:  90% (24099/26646)
Updating files:  90% (24134/26646)
Updating files:  90% (24173/26646)
Updating files:  90% (24209/26646)
Updating files:  90% (24229/26646)
Updating files:  91% (24248/26646)
Updating files:  91% (24256/26646)
Updating files:  91% (24309/26646)
Updating files:  91% (24329/26646)
Updating files:  91% (24338/26646)
Updating files:  91% (24365/26646)
Updating files:  91% (24390/26646)
Updating files:  91% (24425/26646)
Updating files:  91% (24444/26646)
Updating files:  91% (24463/26646)
Updating files:  91% (24477/26646)
Updating files:  91% (24487/26646)
Updating files:  92% (24515/26646)
Updating files:  92% (24535/26646)
Updating files:  92% (24581/26646)
Updating files:  92% (24606/26646)
Updating files:  92% (24630/26646)
Updating files:  92% (24662/26646)
Updating files:  92% (24710/26646)
Updating files:  92% (24759/26646)
Updating files:  93% (24781/26646)
Updating files:  93% (24784/26646)
Updating files:  93% (24842/26646)
Updating files:  93% (24858/26646)
Updating files:  93% (24890/26646)
Updating files:  93% (24900/26646)
Updating files:  93% (24923/26646)
Updating files:  93% (24947/26646)
Updating files:  93% (24997/26646)
Updating files:  93% (25016/26646)
Updating files:  93% (25032/26646)
Updating files:  94% (25048/26646)
Updating files:  94% (25092/26646)
Updating files:  94% (25141/26646)
Updating files:  94% (25250/26646)
Updating files:  94% (25289/26646)
Updating files:  95% (25314/26646)
Updating files:  95% (25326/26646)
Updating files:  95% (25360/26646)
Updating files:  95% (25401/26646)
Updating files:  95% (25406/26646)
Updating files:  95% (25443/26646)
Updating files:  95% (25493/26646)
Updating files:  95% (25504/26646)
Updating files:  95% (25519/26646)
Updating files:  95% (25538/26646)
Updating files:  95% (25552/26646)
Updating files:  95% (25565/26646)
Updating files:  96% (25581/26646)
Updating files:  96% (25585/26646)
Updating files:  96% (25602/26646)
Updating files:  96% (25622/26646)
Updating files:  96% (25685/26646)
Updating files:  96% (25713/26646)
Updating files:  96% (25759/26646)
Updating files:  96% (25785/26646)
Updating files:  96% (25816/26646)
Updating files:  97% (25847/26646)
Updating files:  97% (25849/26646)
Updating files:  97% (25868/26646)
Updating files:  97% (25901/26646)
Updating files:  97% (25955/26646)
Updating files:  97% (25989/26646)
Updating files:  97% (26022/26646)
Updating files:  98% (26114/26646)
Updating files:  98% (26115/26646)
Updating files:  98% (26173/26646)
Updating files:  98% (26208/26646)
Updating files:  98% (26224/26646)
Updating files:  98% (26254/26646)
Updating files:  98% (26273/26646)
Updating files:  98% (26290/26646)
Updating files:  98% (26320/2664
Updating files:  98% (26336/26646)
Updating files:  98% (26351/26646)
Updating files:  98% (26374/26646)
Updating files:  99% (26380/26646)
Updating files:  99% (26449/26646)
Updating files:  99% (26475/26646)
Updating files:  99% (26552/26646)
Updating files:  99% (26590/26646)
Updating files: 100% (26646/26646)
Updating files: 100% (26646/26646), done.
fatal: unable to write new index file
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'

'

Maybe it's running out of disk space?

HebaruSan commented 1 year ago

Is this where the disk size is set?

https://github.com/KSP-CKAN/NetKAN-Infra/blob/4869086b9279b652d5eb21ecebabf4361e9a04d9/prod-stack.py#L662

du on my local disk indicates that CKAN-meta is larger than 100 MB. CKAN-meta:

$ du -sh
217M    .

NetKAN:

$ du -sh
126M    .
HebaruSan commented 1 year ago

Hmm, looks like that would be 100 GB, not MB:

https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_EbsBlockDevice.html

VolumeSize (request), volumeSize (response) The size of the volume, in GiBs. You must specify either a snapshot ID or a volume size. If you specify a snapshot, the default is the snapshot size. You can specify a volume size that is equal to or larger than the snapshot size.

HebaruSan commented 1 year ago

It has also been skipping some runs.

Run skipped, below volume credit target percentage (Current Avg: 0)

HebaruSan commented 1 year ago

The Mirrorer just came up with a new one:

Uncaught exception:
Traceback (most recent call last):
  File ".local/bin/netkan", line 8, in <module>
    sys.exit(netkan())
  File "/home/netkan/.local/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/netkan/.local/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/netkan/.local/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/netkan/.local/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/netkan/.local/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/netkan/.local/lib/python3.7/site-packages/click/decorators.py", line 84, in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
  File "/home/netkan/.local/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/netkan/.local/lib/python3.7/site-packages/netkan/cli/services.py", line 76, in mirrorer
    ).process_queue(common.queue, common.timeout)
  File "/home/netkan/.local/lib/python3.7/site-packages/netkan/mirrorer.py", line 321, in process_queue
    if self.try_mirror(CkanMirror(self.ia_collection, path)):
  File "/home/netkan/.local/lib/python3.7/site-packages/netkan/mirrorer.py", line 353, in try_mirror
    download_stream_to_file(source_url, tmp)
  File "/home/netkan/.local/lib/python3.7/site-packages/netkan/common.py", line 54, in download_stream_to_file
    dest_file.write(chunk)
  File "/usr/local/lib/python3.7/tempfile.py", line 481, in func_wrapper
    return func(*args, **kwargs)
OSError: [Errno 12] Cannot allocate memory
HebaruSan commented 1 year ago

Tried merging all the pending CKAN-meta PRs to see if that reduces the needed disk space.

After a local git pull and git gc:

$ du -sh NetKAN CKAN-meta
56M     NetKAN
209M    CKAN-meta

So maybe we're using some kind of default disk volume 256 MB in size?

... It's still failing, same error.

HebaruSan commented 1 year ago

Run skipped, below volume credit target percentage (Current Avg: 0)

Hmm, these seem to have happened on passes when we didn't get the error in the OP. Do we clone the repos before we check the credits? It would make more sense to check the credits first, since that check can stop the whole run.

... confirmed, the common.*_repo calls clone the repos, before can_schedule is called:

https://github.com/KSP-CKAN/NetKAN-Infra/blob/4869086b9279b652d5eb21ecebabf4361e9a04d9/netkan/netkan/cli/services.py#L58-L66

Cleaning that up might allow faster recovery of credits, since we'd be doing ~260 MB less network activity per aborted run.