The slowest steps in training are generating the reservoir state sequence and updating the regressor matrices. The former is already parallelized over subdomains, while the latter currently uses a serial for loop over subdomains. This PR parallelizes each subdomain's regressor matrix update, cutting down one of the slowest parts of per-batch training. Using n_jobs=4 reduces the time by roughly half.
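A minimal sketch of the change, assuming a joblib-style backend: the per-subdomain loop that accumulated each regressor's normal-equation terms is replaced by a `Parallel(n_jobs=...)` dispatch. The `batch_update` helper, array shapes, and regressor representation below are illustrative assumptions, not the repository's actual API.

```python
# Hypothetical sketch of parallelizing the per-subdomain regressor update.
# batch_update, the array shapes, and n_jobs=4 are assumptions for illustration.
import numpy as np
from joblib import Parallel, delayed

def batch_update(XtX, XtY, states, targets):
    # Accumulate this subdomain's normal-equation terms for one batch.
    return XtX + states.T @ states, XtY + states.T @ targets

rng = np.random.default_rng(0)
n_subdomains, n_samples, n_features, n_targets = 8, 32, 16, 4
XtX = [np.zeros((n_features, n_features)) for _ in range(n_subdomains)]
XtY = [np.zeros((n_features, n_targets)) for _ in range(n_subdomains)]
states = [rng.normal(size=(n_samples, n_features)) for _ in range(n_subdomains)]
targets = [rng.normal(size=(n_samples, n_targets)) for _ in range(n_subdomains)]

# Before: a serial for loop over subdomains.
# After: each subdomain's update runs in its own worker.
results = Parallel(n_jobs=4)(
    delayed(batch_update)(XtX[i], XtY[i], states[i], targets[i])
    for i in range(n_subdomains)
)
XtX, XtY = map(list, zip(*results))
```

Because each subdomain's update touches only its own arrays, the loop is embarrassingly parallel and no synchronization is needed between workers.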