TeX-Live / texlive-source

source part of the TeX Live subversion repository - for issues please contact the tex-k mailing list at tug.org

Parallelize fmtutil #69

Closed marmitar closed 4 months ago

marmitar commented 4 months ago

fmtutil takes 168 seconds on a full system update, 160 of which are spent in the calls to system on line 810, which run other programs such as pdftex and xetex. This line is executed 54 times, averaging 2.85 seconds per call. That is not bad for a single call, but since each call to system blocks until the external program finishes, the total running time adds up.

Instead, I think fmtutil could be parallelized, resulting in a faster execution overall. I did a simple proof of concept using Parallel::ForkManager and the results look promising. Total running time went down to 40 seconds.

Proof of concept

I updated lines 472 to 503 of callback_build_formats with the following:

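  use Parallel::ForkManager;  # not a core module; must be installed separately
  my $nproc = 24;  # hard-coded for this proof of concept
  my $pm = Parallel::ForkManager->new($nproc);
  # Runs in the parent each time a child exits, and tallies the result.
  $pm->run_on_finish(sub {
    my ($pid, $exit_code, $ident, $exit_signal, $core_dump, $data_structure_reference) = @_;
    # The data reference is undef if the child died before calling finish().
    my ($fmt, $eng) = @{$data_structure_reference || []};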
    if ($exit_code == $FMT_DISABLED)    {
      log_to_status("DISABLED", $fmt, $eng, $what, $whatarg);
      $disabled++;
    } elsif ($exit_code == $FMT_NOTSELECTED) {
      log_to_status("NOTSELECTED", $fmt, $eng, $what, $whatarg);
      $nobuild++;
    } elsif ($exit_code == $FMT_FAILURE)  {
      log_to_status("FAILURE", $fmt, $eng, $what, $whatarg);
      $err++;
      push (@err, "$eng/$fmt");
    } elsif ($exit_code == $FMT_SUCCESS)  {
      log_to_status("SUCCESS", $fmt, $eng, $what, $whatarg);
      $suc++;
    } elsif ($exit_code == $FMT_NOTAVAIL) {
      log_to_status("NOTAVAIL", $fmt, $eng, $what, $whatarg);
      $notavail++;
    }
    else {
      log_to_status("UNKNOWN", $fmt, $eng, $what, $whatarg);
      print_error("callback_build_format (round 1): unknown return "
        . "from select_and_rebuild.\n");
    }
  });
  for my $swi (qw/format=engine format!=engine/) {
    for my $fmt (keys %{$alldata->{'merged'}}) {
      for my $eng (keys %{$alldata->{'merged'}{$fmt}}) {
        next if ($swi eq "format=engine" && $fmt ne $eng);
        next if ($swi eq "format!=engine" && $fmt eq $eng);
        $total++;
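        # start() forks: the parent gets the child's PID (true) and moves on,
        # while the child gets 0 and falls through to do the build.
        $pm->start("select_and_rebuild_format($fmt, $eng, $what, $whatarg)") and next;
        my $val = select_and_rebuild_format($fmt, $eng, $what, $whatarg);
        # Pass the result back as the exit code, plus the format/engine pair.
        my @array = ($fmt, $eng);
        $pm->finish($val, \@array);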
      }
    }
  }
  $pm->wait_all_children;

Possible problems

First off, I did not check the code thoroughly for data races and race conditions, although it does look okay.

Second, the output is garbled to the point of being useless. This could be fixed somehow, but it might be too hard for me. Another solution would be to enable or disable parallelism via the command line, for systems that don't care about the output (e.g., Arch Linux).
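A command-line switch of the kind suggested above could be sketched as follows. The option name --jobs and the function parse_jobs are hypothetical and used only for illustration; they are not taken from fmtutil. Getopt::Long is a core Perl module.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Parse a hypothetical --jobs=N option from a list of arguments.
# The default of 1 means "run serially", so parallelism is opt-in.
sub parse_jobs {
  my @args = @_;
  my $jobs = 1;
  GetOptionsFromArray(\@args, 'jobs=i' => \$jobs)
    or die "invalid command-line options\n";
  $jobs = 1 if $jobs < 1;   # treat zero or negative values as serial
  return $jobs;
}
```

With this sketch, parse_jobs('--jobs', '4') returns 4, and calling it with no arguments keeps the serial default.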

kberry commented 4 months ago

Thank you for the suggestion and draft patch. We can't assume Parallel::ForkManager is installed, but I suppose we could check if it is available and continue to operate serially if not, so that's not a big deal.
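The availability check with a serial fallback might look roughly like this. It is a minimal sketch: module_available and run_jobs are hypothetical helper names, not part of fmtutil, and the jobs are plain code references rather than real format builds.

```perl
use strict;
use warnings;

# Return 1 if the named module can be loaded, 0 otherwise.
# The module name is converted to a file path for require().
sub module_available {
  my ($module) = @_;
  (my $file = "$module.pm") =~ s{::}{/}g;
  return eval { require $file; 1 } ? 1 : 0;
}

# Hypothetical helper: run each job, in parallel when possible.
# $jobs is an array ref of code refs; $nproc is the maximum number of children.
sub run_jobs {
  my ($jobs, $nproc) = @_;
  if ($nproc > 1 && module_available('Parallel::ForkManager')) {
    my $pm = Parallel::ForkManager->new($nproc);
    for my $job (@$jobs) {
      $pm->start and next;   # parent: move on to the next job
      $job->();              # child: do the work
      $pm->finish(0);        # child: exit
    }
    $pm->wait_all_children;
    return 'parallel';
  }
  $_->() for @$jobs;         # serial fallback
  return 'serial';
}
```

Because the module is loaded with require at run time rather than with use at compile time, the script still starts on systems where Parallel::ForkManager is missing.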

However, garbling the output is, IMHO, a stopper. It's already difficult enough to debug format-generation problems, despite all our efforts at outputting the necessary information. With output intermixed between formats, it would be impossible, seems to me.

Also, it seems to me that no system should "not care" about this. As long as all the formats build, of course it doesn't matter, but as soon as there is a failure, it becomes critical. The rarity of failures makes it even more important that the output is useful, since it's usually nontrivial to reproduce failures.

Thanks, karl

marmitar commented 4 months ago

Okay, that makes sense. I'll try to get some ordering back in the output. My idea is to make the print_* functions "process buffered", instead of just line buffered as print is. This way, each child process collects its output while running and prints everything at once before exiting. Each format would still get its output printed together, but the order relative to other formats would not be guaranteed. Would that be okay?
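The "process buffered" idea can be sketched with plain fork and core modules only. The names buffered_print, flush_buffer and run_children are hypothetical stand-ins, not functions in fmtutil, and the three lines printed per child are placeholder output.

```perl
use strict;
use warnings;
use POSIX ();

my @buffer;   # per-process: after fork, each child has its own copy

# Hypothetical stand-in for the print_* routines: collect instead of printing.
sub buffered_print { push @buffer, join('', @_); }

# Emit everything collected so far with a single write, then clear the buffer.
sub flush_buffer {
  my ($fh) = @_;
  my $text = join('', @buffer);
  @buffer = ();
  syswrite($fh, $text) if length $text;
  return $text;
}

# Fork one child per name; each child buffers its lines and writes them at the end.
sub run_children {
  my ($out, @names) = @_;
  my @pids;
  for my $name (@names) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
      @buffer = ();                        # drop anything inherited from the parent
      buffered_print("[$name] line $_\n") for 1 .. 3;
      flush_buffer($out);
      POSIX::_exit(0);                     # skip END blocks inherited from the parent
    }
    push @pids, $pid;
  }
  waitpid($_, 0) for @pids;
}
```

A single write is not guaranteed to be atomic for large outputs or for every kind of destination, so a sturdier variant would have the parent read each child's output through a pipe and print it itself.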

Another thing: data races should not be possible, as Parallel::ForkManager does not share memory by default. This also means that modifications to global variables are only seen in the process that makes them, so the print_deferred_* functions don't do anything in my draft patch. I would need to fix that too.

About Arch Linux's package, I will open an issue on their GitLab to let them know their practice is not recommended. They usually prefer to do things like upstream does, without too much deviation. I just need to know where the recommended location for fmtutil logs is.
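The isolation between forked processes is easy to demonstrate with a plain fork. The function name counter_after_child is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;
use POSIX ();

# Fork a child that increments a counter, then report what the parent sees.
sub counter_after_child {
  my $counter = 0;
  my $pid = fork();
  die "fork failed: $!" unless defined $pid;
  if ($pid == 0) {
    $counter++;          # changes only the child's copy of the variable
    POSIX::_exit(0);     # leave without running the parent's END blocks
  }
  waitpid($pid, 0);
  return $counter;       # the parent's copy is unchanged
}
```

Any state a child needs to hand back therefore has to be passed explicitly, for example through the exit code and the data reference given to finish(), as the proof of concept does for the format and engine names.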

kberry commented 4 months ago

> This means that each format still gets its output printed together, but the order relative to other formats is not guaranteed. Would that be okay?

It sounds ok to me. Norbert, wdyt?

> I just need to know where the recommended location for fmtutil logs is.

See the print_info() and related routines. Your code should ultimately use those functions.

In short, in "mktexfmt" mode, messages go to stderr; otherwise, in "normal" mode, they go to stdout.
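The stdout/stderr rule can be sketched as below. This is a simplified stand-in, not the actual fmtutil code: message_handle and print_message are hypothetical names, and the mode is passed in as a plain boolean argument.

```perl
use strict;
use warnings;

# Pick the handle for messages: stderr in "mktexfmt" mode, stdout otherwise.
sub message_handle {
  my ($mktexfmt_mode) = @_;
  return $mktexfmt_mode ? \*STDERR : \*STDOUT;
}

# Simplified stand-in for a print_info-style routine.
sub print_message {
  my ($mktexfmt_mode, @text) = @_;
  my $fh = message_handle($mktexfmt_mode);
  print {$fh} @text;
}
```

Routing every message through one such function keeps the destination decision in a single place, so the calling program can capture the stream however it likes.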

fmtutil is called by both tlmgr and install-tl, and each of those programs saves the fmtutil output in its own location. So the logs should not be written directly to a file. --thanks, karl.

norbusan commented 4 months ago

> > This means that each format still gets its output printed together, but the order relative to other formats is not guaranteed. Would that be okay?
>
> It sounds ok to me. Norbert, wdyt?

Yes, the order is not important. Even a completely messed-up order would still work, since the output is mostly for human consumption.

norbusan commented 4 months ago

@TiagodePAlves I have committed your code with the module loading (for now unconditionally) to a development branch, see here: https://git.texlive.info/texlive/tree/Master/texmf-dist/scripts/texlive/fmtutil.pl?h=dev/parallel-fmtutil#n473

I don't suggest cloning this repo (50+ GB). I will probably set up a GitHub texlive-infra repo which mirrors most of those files.

norbusan commented 4 months ago

@TiagodePAlves here we go: I have created a repo, texlive-infra, that syncs the main files to the master branch every 15 minutes. For now I have added your code in a branch, see here: https://github.com/TeX-Live/texlive-infra/blob/parallel-infra/texlive-scripts/fmtutil.pl#L473-L514 We cannot merge here, but development and sharing of code should work.

norbusan commented 4 months ago

Please see https://github.com/TeX-Live/texlive-infra/pull/1/files and comment on the changes; I have reworked them into a workable layout.

kberry commented 4 months ago

Let's close this and consider it further in texlive-infra, since you went to all the trouble of setting it up.

marmitar commented 4 months ago

Yeah, sorry, I'm a bit short on time right now, but I'll try to test/review the changes on texlive-infra.