ericaltendorf / plotman

Chia plotting manager
Apache License 2.0
911 stars 280 forks

Failed plots from Chia GUI crash plotman interactive #631

Open CrossBread opened 3 years ago

CrossBread commented 3 years ago

Describe the bug

I fired up interactive mode in a terminal last night and everything was looking good. I only have NVMe / SSD as temp and internal HDD as dest. Separately, in the Chia GUI, I decided to experiment with plotting to an external HDD, giving it just 1 thread. I was mainly curious how bin count affects throughput on a platter drive. I went to bed, and two hours later the external HDD unmounted (I think it overheated). That crashed the Plotman Interactive session I had left running in the terminal, and no new jobs were scheduled all night long.

Why did a chia plot kicked off outside Plotman, on a drive that doesn't appear in the yaml, crash it? Is that intended?

To Reproduce

Steps to reproduce the behavior, e.g.:

  1. Set up config with normal plotting parameters
  2. Start a plot from the Chia GUI on a drive not in the plotman.yml
  3. Unmount the drive that the additional plot was running on
  4. See error:
```
Traceback (most recent call last):
  File "/home/username/chia-blockchain/venv/bin/plotman", line 8, in <module>
    sys.exit(main())
  File "/home/username/chia-blockchain/venv/lib/python3.8/site-packages/plotman/plotman.py", line 173, in main
    interactive.run_interactive()
  File "/home/username/chia-blockchain/venv/lib/python3.8/site-packages/plotman/interactive.py", line 334, in run_interactive
    curses.wrapper(curses_main)
  File "/usr/lib/python3.8/curses/__init__.py", line 105, in wrapper
    return func(stdscr, *args, **kwds)
  File "/home/username/chia-blockchain/venv/lib/python3.8/site-packages/plotman/interactive.py", line 261, in curses_main
    jobs_win.addstr(0, 0, reporting.status_report(jobs, n_cols, jobs_h,
  File "/home/username/chia-blockchain/venv/lib/python3.8/site-packages/plotman/reporting.py", line 106, in status_report
    plot_util.human_format(j.get_tmp_usage(), 0),
  File "/home/username/chia-blockchain/venv/lib/python3.8/site-packages/plotman/job.py", line 351, in get_tmp_usage
    with os.scandir(self.tmpdir) as it:
FileNotFoundError: [Errno 2] No such file or directory: '/media/username/easystore/chia-plots'
```

Expected behavior

Errors with plots scheduled on drives unknown to plotman shouldn't halt scheduling.

One drive disconnecting shouldn't halt scheduling for plotman (if there are destination drives remaining).
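As a rough illustration of that expected behavior (a sketch only, not plotman's actual code): the traceback shows `os.scandir(self.tmpdir)` raising `FileNotFoundError`, so a usage helper could treat a vanished directory as zero bytes instead of letting the exception reach the curses loop. The function name `safe_tmp_usage` is hypothetical.

```python
import os


def safe_tmp_usage(tmpdir):
    """Sum the sizes of regular files directly inside tmpdir.

    Hypothetical helper: returns 0 when the directory cannot be read,
    for example because the drive was unmounted mid-plot.
    """
    total = 0
    try:
        with os.scandir(tmpdir) as it:
            for entry in it:
                try:
                    if entry.is_file():
                        total += entry.stat().st_size
                except OSError:
                    # The file vanished or the drive went away mid-scan; skip it.
                    continue
    except OSError:
        # FileNotFoundError is a subclass of OSError, so a missing
        # directory lands here and is reported as "no usage".
        return 0
    return total
```

With something along these lines, a job whose tmp dir disappeared would show zero usage in the status report while the remaining jobs keep being scheduled.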

System setup:

Config

full configuration

```yaml
# Default/example plotman.yaml configuration file

# Options for display and rendering
user_interface:
        # Call out to the `stty` program to determine terminal size, instead of
        # relying on what is reported by the curses library. In some cases,
        # the curses library fails to update on SIGWINCH signals. If the
        # `plotman interactive` curses interface does not properly adjust when
        # you resize the terminal window, you can try setting this to True.
        use_stty_size: True

# Where to plot and log.
directories:
        # One directory in which to store all plot job logs (the STDOUT/
        # STDERR of all plot jobs). In order to monitor progress, plotman
        # reads these logs on a regular basis, so using a fast drive is
        # recommended.
        # log: /home/username/chia/logs
        log: /home/username/.chia/mainnet/plotter

        # One or more directories to use as tmp dirs for plotting. The
        # scheduler will use all of them and distribute jobs among them.
        # It assumes that IO is independent for each one (i.e., that each
        # one is on a different physical device).
        #
        # If multiple directories share a common prefix, reports will
        # abbreviate and show just the uniquely identifying suffix.
        tmp:
                - /home/username/plotter-1/chia-plot-temp
                - /media/username/plotter-2/chia-plot-temp
                - /media/username/plotter-3/chia-plot-temp
                - /media/username/plotter-4/chia-plot-temp
                - /media/username/ssd-os/home/username/ssd-chia-plot-temp

        # Optional: Allows overriding some characteristics of certain tmp
        # directories. This contains a map of tmp directory names to
        # attributes. If a tmp directory and attribute is not listed here,
        # it uses the default attribute setting from the main configuration.
        #
        # Currently support override parameters:
        #     - tmpdir_max_jobs
        # tmp_overrides:
                # In this example, /mnt/tmp/00 is larger than the other tmp
                # dirs and it can hold more plots than the default.
                # "/mnt/tmp/00":
                #         tmpdir_max_jobs: 5

        # Optional: tmp2 directory. If specified, will be passed to
        # chia plots create as -2. Only one tmp2 directory is supported.
        # tmp2: /mnt/tmp/a

        # One or more directories; the scheduler will use all of them.
        # These again are presumed to be on independent physical devices,
        # so writes (plot jobs) and reads (archivals) can be scheduled
        # to minimize IO contention.
        dst:
                - /media/username/farmer-01/chia-plots
                - /media/username/farmer-02/chia-plots
                - /media/username/farmer-03/chia-plots
                - /media/username/farmer-04/chia-plots
                - /media/username/farmer-05/chia-plots
                - /media/username/farmer-06/chia-plots
                - /media/username/farmer-07/chia-plots
                - /media/username/farmer-08/chia-plots

        # Archival configuration. Optional; if you do not wish to run the
        # archiving operation, comment this section out.
        #
        # Currently archival depends on an rsync daemon running on the remote
        # host.
        # The archival also uses ssh to connect to the remote host and check
        # for available directories. Set up ssh keys on the remote host to
        # allow public key login from rsyncd_user.
        # Complete example: https://github.com/ericaltendorf/plotman/wiki/Archiving
        # archive:
                # rsyncd_module: plots   # Define this in remote rsyncd.conf.
                # rsyncd_path: /plots    # This is used via ssh. Should match path
                #                        # defined in the module referenced above.
                # rsyncd_bwlimit: 80000  # Bandwidth limit in KB/s
                # rsyncd_host: myfarmer
                # rsyncd_user: chia
                #
                # Optional index. If omitted or set to 0, plotman will archive
                # to the first archive dir with free space. If specified,
                # plotman will skip forward up to 'index' drives (if they exist).
                # This can be useful to reduce io contention on a drive on the
                # archive host if you have multiple plotters (simultaneous io
                # can still happen at the time a drive fills up.) E.g., if you
                # have four plotters, you could set this to 0, 1, 2, and 3, on
                # the 4 machines, or 0, 1, 0, 1.
                # index: 0

# Plotting scheduling parameters
scheduling:
        # Run a job on a particular temp dir only if the number of existing jobs
        # before [tmpdir_stagger_phase_major : tmpdir_stagger_phase_minor]
        # is less than tmpdir_stagger_phase_limit.
        # Phase major corresponds to the plot phase, phase minor corresponds to
        # the table or table pair in sequence, phase limit corresponds to
        # the number of plots allowed before [phase major : phase minor].
        # e.g, with default settings, a new plot will start only when your plot
        # reaches phase [2 : 1] on your temp drive. This setting takes precidence
        # over global_stagger_m
        tmpdir_stagger_phase_major: 2
        tmpdir_stagger_phase_minor: 1
        # Optional: default is 1
        tmpdir_stagger_phase_limit: 1

        # Don't run more than this many jobs at a time on a single temp dir.
        tmpdir_max_jobs: 3

        # Don't run more than this many jobs at a time in total.
        # Setting 6 because each plotting drive (2 currently) has room for 3, maybe 4 if optimized
        # global_max_jobs: 0
        global_max_jobs: 15

        # Don't run any jobs (across all temp dirs) more often than this, in minutes.
        # (default was 30)
        global_stagger_m: 10

        # How often the daemon wakes to consider starting a new plot job, in seconds.
        polling_time_s: 60

# Plotting parameters. These are pass-through parameters to chia plots create.
# See documentation at
# https://github.com/Chia-Network/chia-blockchain/wiki/CLI-Commands-Reference#create
plotting:
        k: 32
        e: False             # Use -e plotting option
        n_threads: 2         # Threads per job
        n_buckets: 128       # Number of buckets to split data into
        job_buffer: 3389     # Per job memory (default: 3389)
        # If specified, pass through to the -f and -p options. See CLI reference.
        # farmer_pk: ...
        # pool_pk: ...
```

Additional context & screenshots

CrossBread commented 3 years ago

I'm curious if I'm using Plotman Interactive wrong and maybe should be using Plotman Plot for unattended plotting?

altendky commented 3 years ago

I think it's just because you have a not-present directory listed in your dst config section. Though sure, this isn't a nice response to that.

CrossBread commented 3 years ago

@altendky Oops, scrolled up too far and sent the wrong log initially. That was from last night when I hadn't mounted the new drives when I tried to start. The drives under dst are all internal HDD (plotter-X). I've updated the issue with the correct log that was displayed in the terminal this morning where I had been running plotman last night.

CrossBread commented 3 years ago

@ericaltendorf You had asked for a stack trace on Keybase, so I'll follow up here and add that I wonder if the fact that I was logging to the .chia/mainnet/plotter dir had something to do with it.

I did that so plotman would be aware of other running jobs when counting global jobs, to keep under my RAM limit. But maybe it isn't distinguishing jobs it only observed through the logs, whose temp and destination drives don't match anything in the yaml, and so it fails fast on a problem with a job it shouldn't be responsible for.
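To make the distinction concrete, here is a minimal sketch (an assumption about how it could be done, not how plotman works) that splits observed jobs into those whose tmp dir falls under a configured directory and those that were started elsewhere. The names `is_under` and `split_jobs` are hypothetical, and the paths are assumed to be POSIX-style.

```python
import os


def is_under(path, parent):
    """True if path equals parent or lies somewhere beneath it."""
    path = os.path.abspath(path)
    parent = os.path.abspath(parent)
    return os.path.commonpath([path, parent]) == parent


def split_jobs(job_tmpdirs, configured_dirs):
    """Partition tmp dirs of observed jobs into (managed, external).

    'managed' jobs use a directory listed in the config; 'external'
    jobs (e.g. started from the Chia GUI) do not, so errors on their
    directories could be logged and ignored rather than raised.
    """
    managed, external = [], []
    for tmpdir in job_tmpdirs:
        if any(is_under(tmpdir, d) for d in configured_dirs):
            managed.append(tmpdir)
        else:
            external.append(tmpdir)
    return managed, external
```

External jobs could still count toward `global_max_jobs`, which was the reason for sharing the log directory in the first place, without their failures stopping the scheduler.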

ericaltendorf commented 3 years ago

Thanks for the stack trace and for filing an issue. Sounds like the basic problem is that if one of the dirs plotman depends on disappears (i.e., the drive gets unmounted), we die instead of cleanly recovering?

CrossBread commented 3 years ago

Hi sorry, I thought I subscribed to this issue, but I've somehow subscribed to all activity in the repo and lost your reply in the noise.

It's possible that is the case, but I think what I'm observing was slightly different.

I was using the Chia GUI to experiment with plotting to an external hard drive. That was logging to a default location in .chia/mainnet/plotter.

Separately, I had plotman configured to log into that same directory, because I noticed it would scan the logs of plots from other sources. That way I could keep an eye on the experimental plots and let plotman take them into account when scheduling.

Some kind of error happened with the external hard drive and the mount got really messed up. I eventually had to force unmount it. Any calls to stat that drive were locking up processes.

So it's possible that was related. Even just trying to list the root directory of the drive with `ls /dev/sdm` would hang forever. Trying to check the SMART data and grab the temperature, for instance, would also hang forever.

So if plotman is doing any of that under the hood, maybe it was stuck waiting on a hung process.
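If hung filesystem calls are the concern, one possible guard (purely a sketch, and I don't know whether plotman does anything like it) is to probe the path from a daemon thread and give up after a timeout, so the caller is not blocked by a stuck `stat`. The helper name `path_responds` and the 2-second default are made up for illustration.

```python
import os
import threading


def path_responds(path, timeout_s=2.0):
    """Return True only if os.stat(path) succeeds within timeout_s seconds.

    Hypothetical helper: the stat call runs in a daemon thread so that a
    hung mount blocks only that thread, not the caller.
    """
    result = {}

    def probe():
        try:
            os.stat(path)
            result["ok"] = True
        except OSError:
            # Path missing or otherwise unreadable.
            result["ok"] = False

    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout_s)
    # If the thread is still stuck, "ok" was never set: treat as unresponsive.
    return result.get("ok", False)
```

The stuck thread is not cleaned up here; it just stops holding up the caller, which could then skip that drive until it responds again.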