JMSLab / Template

Template for research repository using scons.

Test parallelizing scons run #72

Closed rcalvo12 closed 1 year ago

rcalvo12 commented 1 year ago

scons represents the build steps as a directed acyclic graph.

This means it can calculate which paths can be built in parallel and which must be built in sequence.

The -j option is designed to allow a user to run multiple simultaneous tasks.
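To illustrate the idea, here is a minimal sketch that groups the targets of a small dependency graph into "waves" that could be built at the same time. It uses Python's standard `graphlib` module rather than scons itself, and the file names are hypothetical placeholders, not files from this repository.

```python
from graphlib import TopologicalSorter

# Hypothetical build graph: each target maps to the set of targets it depends on.
graph = {
    "figure.eps": {"data.csv"},
    "table.tex": {"data.csv"},
    "paper.pdf": {"figure.eps", "table.tex"},
}


def parallel_waves(dependencies):
    """Group targets into waves; targets within one wave do not depend on each other."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose inputs are finished
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as built
    return waves


waves = parallel_waves(graph)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

In this toy graph, `figure.eps` and `table.tex` land in the same wave, so a scheduler with two job slots could build them simultaneously, while `paper.pdf` has to wait for both.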

We'd like to explore whether it's possible to use this option successfully on FASRC.

jmshapir commented 1 year ago

@MosesStewart welcome to Template! Once you accept the invitation to join this repository, please self-assign and review the issue description.

A first step is probably to make sure you can build the entire Template on FASRC. After that you can start testing parallel builds.

Please keep in mind that this repository is in the public domain.

Thanks!

MosesStewart commented 1 year ago

We'd like to explore whether it's possible to use this option successfully on FASRC.

To give an update, I'm able to use the -j option successfully without any warnings when running on my own system. I'm still trying to sort out the executable path for stata on the FASRC to test it there.

jmshapir commented 1 year ago

To give an update, I'm able to use the -j option successfully without any warnings when running on my own system.

@MosesStewart that's great! Do you get any speedup in runtime, or is the repository too fast to be able to tell?

In the latter case, we could create an issue branch and add some slow operations to the repository (that are not dependent on one another) so we can see if we're getting any improvement from parallelizing.

I'm still trying to sort out the executable path for stata on the FASRC to test it there.

Got it. Per here:

https://github.com/JMSLab/Template/blob/1bedc394b8898f3b4ea44c922c335a51ebc53974/source/lib/JMSLab/builders/executables.yml#L7

The builders are going to look for Stata under the name StataMP-64. If when you type StataMP-64 from the bash prompt Stata doesn't run, then the builders likely won't be able to run it, either.

In that case, you could look for the correct command to run Stata from the bash prompt. Suppose for the sake of argument that it's StataCommand.

Then I think we just need to tell Linux that when you type StataMP-64 you really mean StataCommand. I think (?) aliases will do this but I haven't tested it directly (at least not for a while).
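One way to check which command name is actually reachable is a small Python sketch like the one below. It only uses the standard library; `StataMP-64` is the name from executables.yml and `StataCommand` is the placeholder from the paragraph above, so neither is guaranteed to exist on a given machine.

```python
import shutil


def first_available(candidates):
    """Return the first command name found on PATH, or None if none is found."""
    for name in candidates:
        if shutil.which(name) is not None:
            return name
    return None


# "StataMP-64" comes from executables.yml; "StataCommand" is only a placeholder
# for whatever the real command turns out to be (an assumption to verify).
stata_names = ["StataMP-64", "StataCommand"]
found = first_available(stata_names)
print(found if found else "No Stata command found on PATH")
```

Note that `shutil.which` only searches `PATH`, so a shell alias would not show up in this check. That may be a hint that an alias is also not visible to programs that launch Stata without going through an interactive shell.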

Let me know if any of that is helpful, thanks.

MosesStewart commented 1 year ago

@jmshapir changing the stata executable in Template/source/lib/JMSLab/builders/executables.yml to stata-mp fixed the problem!

Thank you so much for the detailed explanation! I really appreciate it!

I didn't have permission to create a symlink in the FASRC applications folder, so I think I'll run it like so!

MosesStewart commented 1 year ago

Do you get any speedup in runtime, or is the repository too fast to be able to tell?

The repository is too fast for me to tell 😭

In the latter case, we could create an issue branch and add some slow operations

Would it be possible for me to gain write-access to edit the code?

jmshapir commented 1 year ago

@MosesStewart

@jmshapir changing the stata executable in Template/source/lib/JMSLab/builders/executables.yml to stata-mp fixed the problem!

Thank you so much for the detailed explanation! I really appreciate it!

I didn't have permission to create a symlink in the FASRC applications folder, so I think I'll run it like so!

Sounds good! Let's just be sure not to commit the revised executables.yml to the repository.

(And when @rcalvo12 @ew487 let's ask them what solution they use for this detail about the executables.)

Do you get any speedup in runtime, or is the repository too fast to be able to tell?

The repository is too fast for me to tell 😭

In the latter case, we could create an issue branch and add some slow operations

Would it be possible for me to gain write-access to edit the code?

Done! Use it wisely.

MosesStewart commented 1 year ago

Does anyone know how to force scons to build all target files, even if they're up-to-date?

I've tried scons . and scons /, which both should build all target files according to the link in https://github.com/JMSLab/Template/issues/72#issue-1476787870. I've also tried scons --config=force, but I haven't been able to get scons to re-build the entire repository when it's up to date already.

I'm trying to re-build the entire repository to compare times with and without the -j option.

jmshapir commented 1 year ago

@MosesStewart thanks!

Does anyone know how to force scons to build all target files, even if they're up-to-date?

I'm not sure but here are some ideas:

I've tried scons . and scons /, which both should build all target files according to the link in #72 (comment).

Those commands will only build targets that are not currently up to date, since scons is designed to make sure targets are up to date using the minimum amount of compute.
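As a rough illustration of what "up to date" means here, the sketch below rebuilds only when the content signature of an input differs from the one recorded at the last build. This is a simplified stand-in written with Python's standard library, not the actual scons implementation; the file name and the `stored` dictionary are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def signature(path):
    """Content signature of a file (SHA-256 of its bytes)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def needs_rebuild(source, stored_signatures):
    """True if source has no recorded signature or its content has changed."""
    return stored_signatures.get(str(source)) != signature(source)


with tempfile.TemporaryDirectory() as folder:
    source = Path(folder) / "input.txt"  # hypothetical input file
    source.write_text("first version")
    stored = {}  # stand-in for the signatures a build tool would persist

    first_check = needs_rebuild(source, stored)   # never built before
    stored[str(source)] = signature(source)       # record signature after "building"
    second_check = needs_rebuild(source, stored)  # nothing changed since the build
    source.write_text("second version")
    third_check = needs_rebuild(source, stored)   # content changed

print(first_check, second_check, third_check)
```

Under this logic, asking for a build twice in a row does no work the second time, which matches the behavior described above.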

I'm trying to re-build the entire repository to compare times with and without the -j option.

Great!

MosesStewart commented 1 year ago

I'm not sure but here are some ideas:

Thank you so much! The StackExchange solution worked! I'm getting very interesting results with the -j option!

# Cores | Time | Completion
1 | 3:27 | Completed
2 | 2:19 | Failed
4 | 0:50 | Failed

The 2 and 4 task parallelizations that failed both did so in paper/SConscript, which is almost the last step of the scons build and is almost instant. Therefore, I'm confident that the -j option does significantly increase speed at approximately a 1:1 ratio with the number of cores.

I'll try to dig into why the lyx environment is raising an ExecCallError when we use multiple cores.
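For reference, the speedups implied by the reported times can be computed with a short sketch like this one. It assumes the times are in minutes:seconds, and it is worth keeping in mind that the 2- and 4-core runs failed near the end, so their times slightly understate a complete build.

```python
def to_seconds(clock):
    """Convert a "minutes:seconds" string to seconds (format is an assumption)."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + int(seconds)


# Times as reported in the comment above, keyed by number of cores.
timings = {1: "3:27", 2: "2:19", 4: "0:50"}

baseline = to_seconds(timings[1])
speedups = {cores: baseline / to_seconds(clock) for cores, clock in timings.items()}

for cores, ratio in speedups.items():
    print(f"{cores} core(s): {ratio:.2f}x relative to 1 core")
```

By this calculation the 2-core run is about 1.5 times as fast as the serial run and the 4-core run about 4.1 times as fast.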

MosesStewart commented 1 year ago

@jmshapir I don't think the -j option can be used successfully in most cases.

From my testing, I hypothesize that when we ask scons to build 3 targets in parallel, it will always try to build 3 targets at the same time. In other words, even if the target scons needs to build next depends on a target currently being built, scons will try to start building it anyways and the program will fail.

In this repository, the builds of #output/paper/TemplateTex.pdf and #output/paper/TemplateLyx.pdf failed when building multiple targets, but only when they included #source/figures/gdp_educ.eps and #output/tables/top_gdp.tex, which are built by scons in the step before. The only way I could get the build to complete was by removing the dependencies on earlier steps. Therefore, I don't think that the -j option is optimized so that scons knows to wait for earlier dependencies to finish building.

I do think that if scons gets an update, this could easily be fixed and the -j option could become incredibly useful.

jmshapir commented 1 year ago

Thanks @MosesStewart!

In this repository, both the build #output/paper/TemplateTex.pdf and build #output/paper/TemplateLyx.pdf were failing when building multiple targets only when they included #output/tables/top_gdp.tex, which is built by scons the step before.

Can you point me to which SConscript controls the building of #output/tables/top_gdp.tex? I want to make sure I understand how the DAG is structured here in case there's something we're missing.

Therefore, I don't think that the -j option is optimized so that scons knows to wait for earlier dependencies to finish building.

Since parallelization is a key part of the scons design, I'd be surprised if it has this limitation. If it did, I'd expect it would have been reported. If you have bandwidth, it might be worth your spending a little time (say, not more than 1-2 hours) on the scons github repository to see if there are any open issues that relate to the problem we're having here.

MosesStewart commented 1 year ago

Can you point me to which SConscript controls the building of #output/tables/top_gdp.tex? I want to make sure I understand how the DAG is structured here in case there's something we're missing.

Sorry, I attached the wrong file name! It is #output/tables/top_gdp.lyx, not .tex, that is being built in this step:

https://github.com/JMSLab/Template/blob/1bedc394b8898f3b4ea44c922c335a51ebc53974/source/tables/SConscript#L3-L5

I believe that #output/tables/top_gdp.tex isn't being built by anything.

Since parallelization is a key part of the scons design, I'd be surprised if it has this limitation. If it did, I'd expect it would have been reported. If you have bandwidth, it might be worth your spending a little time (say, not more than 1-2 hours) on the scons github repository to see if there are any open issues that relate to the problem we're having here.

Yes! I will start this right away!

jmshapir commented 1 year ago

I believe that #output/tables/top_gdp.tex isn't being built by anything.

Thanks @MosesStewart! In that case, do we know why TemplateTex.pdf isn't building successfully when we parallelize? It seems like the dependency on top_gdp.tex can't be the culprit, since as you say that file isn't being built by anything.

MosesStewart commented 1 year ago

Thanks @MosesStewart! In that case, do we know why TemplateTex.pdf isn't building successfully when we parallelize? It seems like the dependency on top_gdp.tex can't be the culprit, since as you say that file isn't being built by anything.

No, I hadn't noticed before that top_gdp.tex wasn't dependent on anything, but I'm sure that editing source/paper/template.tex to not include output/tables/top_gdp allows scons to not fail on this step of the process. I'm at a loss for the cause ~ I'm trying to change different things and see what causes it to fail here during parallelization.

jmshapir commented 1 year ago

Thanks @MosesStewart!

I wonder if it might be a good idea to build a minimal working example of what we're trying to do, maybe in a separate repository? That way we could strip out some of the complexity of the Template and focus on the bare essentials.

If that sounds good to you here is a process I propose:

If we find that parallelization breaks here then we have a simple example we can use to diagnose potential explanations. We could even post an issue to the scons github repository pointing to the example.

If we find that parallelization works here then that might help us get to the bottom of why it's breaking in Template.

Let me know what you think. If you think it would be helpful to talk this through feel free to shoot me an e-mail and we can schedule a call or meeting.

Thanks!

MosesStewart commented 1 year ago

The test repository is here. I'm currently changing things in test.tex and important_figure.tex to see if I can break scons with the -j option.

jmshapir commented 1 year ago

Thanks @MosesStewart! That test repository looks good to me. A couple thoughts:

If we're finding that -j works in the test repo but not in Template, that could mean that something is off with the way we've specified the DAG in Template. In order to parallelize the build, it's crucial that scons has all the dependencies exactly right, since otherwise it might try to build targets before their inputs are ready.

To visualize the DAG, I think you can use scons --tree=type where type determines exactly what sort of tree is drawn. Maybe it's worth having a look at the DAG in case we're missing something?
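A small wrapper for printing the tree might look like the sketch below. The `--tree` option accepts types such as `all`, `derived`, `prune`, and `status`, and `-n` asks scons not to execute build commands. The function names and the default target `.` are illustrative choices, and the command is only run if you call `print_tree()` from a directory that contains an SConstruct.

```python
import shutil
import subprocess


def tree_command(tree_type="all", target="."):
    """Build the argument list for printing the dependency tree without building."""
    return ["scons", f"--tree={tree_type}", "-n", target]


def print_tree(tree_type="all", target="."):
    """Run scons and print its tree output, if scons is installed."""
    if shutil.which("scons") is None:
        print("scons was not found on PATH")
        return
    result = subprocess.run(
        tree_command(tree_type, target), capture_output=True, text=True
    )
    print(result.stdout)


# Show the command that would be run; call print_tree() to actually run it.
command = tree_command()
print(" ".join(command))
```

Using `prune` instead of `all` can make the output shorter, since subtrees that have already been printed are not repeated.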

MosesStewart commented 1 year ago

To visualize the DAG, I think you can use scons --tree=type where type determines exactly what sort of tree is drawn. Maybe it's worth having a look at the DAG in case we're missing something?

That's very helpful! I'll look into it!

  • Do we need important_figure.tex? I'd have thought that for a minimal example the pipeline test.tex-->test.pdf-->test.zip might be sufficient, since that would support -j with 2 simultaneous jobs?

I wanted to test whether anything might be breaking when trying to import a file. Currently -j is working in the test repo, and changing file locations and such isn't breaking it.

MosesStewart commented 1 year ago

@jmshapir nothing jumped out at me after looking at the DAG, but I narrowed down exactly what's causing the parallelization to fail.

Changing the repository by:

allows the parallelization to work.

I didn't have to change any of the dependencies in source/paper/SConscript for it to compile. So, I think something's happening when including the source figures in either document that's causing scons to fail only when using the -j option, but succeeding otherwise. I included what the repository looks like when it succeeds in a2c53bc

jmshapir commented 1 year ago

Thanks @MosesStewart!

Does the DAG reflect the fact that gdp_educ.lyx and gdp_educ.tex depend on gdp_educ.eps?

MosesStewart commented 1 year ago

@jmshapir they depend on gdp_educ.lyx and gdp_educ.tex respectively:

For source/paper/template.lyx:

https://github.com/JMSLab/Template/blob/227dcce1700e3c559e3f9a371df9857bfaddbf91/sconstruct.log#L85-L96

https://github.com/JMSLab/Template/blob/227dcce1700e3c559e3f9a371df9857bfaddbf91/sconstruct.log#L95

For source/paper/template.tex:

https://github.com/JMSLab/Template/blob/227dcce1700e3c559e3f9a371df9857bfaddbf91/sconstruct.log#L97-L101

https://github.com/JMSLab/Template/blob/227dcce1700e3c559e3f9a371df9857bfaddbf91/sconstruct.log#L100

jmshapir commented 1 year ago

@MosesStewart that makes sense. But if we look at gdp_educ.lyx, for example, it includes a call to gdp_educ.eps:

https://github.com/JMSLab/Template/blob/1bedc394b8898f3b4ea44c922c335a51ebc53974/source/figures/gdp_educ.lyx#L114

Does any SConscript currently declare that gdp_educ.lyx depends on gdp_educ.eps?

If not, I wonder if there's a hole in the DAG, such that we have represented to scons that we have

stuff-->gdp_educ.eps

gdp_educ.lyx-->Template.lyx

but really we have

stuff-->gdp_educ.eps-->gdp_educ.lyx-->Template.lyx

If that were the case, then the DAG would imply (incorrectly) that it is fine to build, say, gdp_educ.eps and Template.lyx at the same time.

Could that be the culprit here?
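The difference between the two graphs can be checked with a short sketch. It does not use scons; it just asks, for a given set of declared dependencies, whether two targets are independent enough that a parallel scheduler would be free to build them at the same time. The node names follow the diagrams above, with `stuff` as the same placeholder.

```python
def ancestors(dependencies, target):
    """All direct and indirect dependencies of target."""
    seen = set()
    stack = list(dependencies.get(target, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(dependencies.get(node, ()))
    return seen


def may_build_together(dependencies, first, second):
    """True if neither target depends (directly or indirectly) on the other."""
    return (first not in ancestors(dependencies, second)
            and second not in ancestors(dependencies, first))


# What we may have told scons: the .eps -> .lyx edge is missing.
declared = {
    "gdp_educ.eps": {"stuff"},
    "Template.lyx": {"gdp_educ.lyx"},
}
# What is really true: gdp_educ.lyx needs gdp_educ.eps.
actual = {
    "gdp_educ.eps": {"stuff"},
    "gdp_educ.lyx": {"gdp_educ.eps"},
    "Template.lyx": {"gdp_educ.lyx"},
}

declared_ok = may_build_together(declared, "gdp_educ.eps", "Template.lyx")
actual_ok = may_build_together(actual, "gdp_educ.eps", "Template.lyx")
print(declared_ok, actual_ok)
```

With the edge missing the two targets look independent, so nothing stops them from running concurrently. A serial build can hide this kind of gap as long as the steps happen to run in a workable order.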

MosesStewart commented 1 year ago

@jmshapir I didn't notice that ~ that seems very likely ~ I will test right away

MosesStewart commented 1 year ago

It works!!!!!!!!!!!!!!!!!!!!!!!!!!!!! MWAHAHAHAHAHAHAHAHAHAHAHA

jmshapir commented 1 year ago

@MosesStewart nice! One thing I think we've learned is that if we want to use this option we need to be quite meticulous about declaring dependencies, even those that aren't needed for a serial build.
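One way to catch undeclared dependencies might be to scan the source files for the files they pull in and compare against what is declared. The sketch below does this for LaTeX-style `\includegraphics`, `\input`, and `\include` commands only; LyX files use a different syntax and would need their own pattern. The sample text, file names, and the `declared` set are all hypothetical.

```python
import re

# Matches \includegraphics[...]{file}, \input{file}, and \include{file}.
REFERENCE_PATTERN = re.compile(
    r"\\(?:includegraphics|input|include)(?:\[[^\]]*\])?\{([^}]+)\}"
)


def referenced_files(tex_source):
    """Return the file names referenced by the TeX source, in order of appearance."""
    return REFERENCE_PATTERN.findall(tex_source)


# Hypothetical document text and hypothetical declared dependencies.
sample = r"""
\begin{figure}
  \includegraphics[width=\textwidth]{output/figures/example.eps}
\end{figure}
\input{output/tables/example_table}
"""
declared = {"output/tables/example_table"}

missing = [name for name in referenced_files(sample) if name not in declared]
print(missing)
```

Any name that ends up in `missing` is a candidate for an extra dependency declaration in the relevant SConscript.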

The other thing I'd ideally like to check before we leave the issue is that we are actually seeing some improvements in run time. Is it possible to look into that? As before, it might be necessary to add a "speedbump" to some of the steps in order to make the run time calculations meaningful.

@rcalvo12 @ew487

MosesStewart commented 1 year ago

@jmshapir currently I'm getting 3:18 without parallelization and 1:56 with parallelization (about a 41% reduction in run time, or roughly 1.7x faster). This is only with "speedbumps" in 2 of the files though ~ the more speedbumps I add, the larger the gain from parallelization becomes!
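The effect of independent "speedbumps" can be reproduced outside scons with a toy sketch: two artificial delays run one after the other, then side by side in a thread pool. The delay lengths and function names are arbitrary choices for illustration, and the exact timings will vary from machine to machine.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_step(seconds):
    """Stand-in for a build step with an artificial delay (a "speedbump")."""
    time.sleep(seconds)
    return seconds


def timed(run):
    """Return the wall-clock time in seconds taken by calling run()."""
    start = time.perf_counter()
    run()
    return time.perf_counter() - start


delays = [0.2, 0.2]  # two independent steps, 0.2 seconds each


def run_serial():
    for delay in delays:
        slow_step(delay)


def run_parallel():
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        list(pool.map(slow_step, delays))


serial_time = timed(run_serial)
parallel_time = timed(run_parallel)
reduction = 1 - parallel_time / serial_time
print(f"serial {serial_time:.2f}s, parallel {parallel_time:.2f}s, reduction {reduction:.0%}")
```

Because the two steps do not depend on each other, the parallel run should take roughly as long as the slowest single step. Adding more independent delays widens the gap between the serial and parallel times.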

jmshapir commented 1 year ago

@MosesStewart that's great! It sounds to me like this is working as intended.

The next steps I envision are the following:

Of course if you think I've missed something just say!

MosesStewart commented 1 year ago

Summary:

In this issue we tested the functionality of the -j option for scons in the repository.

After testing, we found that it is pivotal to carefully indicate all dependencies so the DAG is set up to prevent scons from building dependent files in parallel. We updated source/figures/scons to prevent scons from building gdp_educ.eps and template.pdf in parallel when using the -j option, opening up parallelization in the repository.

Final state of the issue branch is in f6297fe