Closed by HansalShah007 3 months ago
What we mean by splitting RunSpecs by source type is to create a separate RunSpec for each source type. The rest of the RunSpec can be identical--same time spans, same input database, same output database--but for a run with 13 source types, you'd end up with 13 RunSpecs. Your motorcycle RunSpec would have this <onroadvehicleselections> portion:
<onroadvehicleselections>
<onroadvehicleselection fueltypeid="1" fueltypedesc="Gasoline" sourcetypeid="11" sourcetypename="Motorcycle"/>
</onroadvehicleselections>
Your passenger car RunSpec would have the following <onroadvehicleselections> portion:
<onroadvehicleselections>
<onroadvehicleselection fueltypeid="2" fueltypedesc="Diesel Fuel" sourcetypeid="21" sourcetypename="Passenger Car"/>
<onroadvehicleselection fueltypeid="9" fueltypedesc="Electricity" sourcetypeid="21" sourcetypename="Passenger Car"/>
<onroadvehicleselection fueltypeid="5" fueltypedesc="Ethanol (E-85)" sourcetypeid="21" sourcetypename="Passenger Car"/>
<onroadvehicleselection fueltypeid="1" fueltypedesc="Gasoline" sourcetypeid="21" sourcetypename="Passenger Car"/>
</onroadvehicleselections>
And so on and so forth.
With these smaller RunSpecs, the intermediate MariaDB joins will be smaller. Smaller joins typically perform better than larger ones, which is why some users may see a performance improvement when running 13 RunSpecs sequentially instead of a single RunSpec.
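The split described above could be scripted roughly as follows. This is only a sketch, not an official MOVES tool: `BASE_RUNSPEC` is a trimmed, hypothetical RunSpec that contains nothing but the `<onroadvehicleselections>` section (a real file has many more sections, which the deep copy below would carry over unchanged), and `split_by_source_type` is a name made up for illustration.

```python
import copy
import xml.etree.ElementTree as ET

# Trimmed, hypothetical RunSpec used only for illustration.
BASE_RUNSPEC = """<runspec>
  <onroadvehicleselections>
    <onroadvehicleselection fueltypeid="1" fueltypedesc="Gasoline" sourcetypeid="11" sourcetypename="Motorcycle"/>
    <onroadvehicleselection fueltypeid="2" fueltypedesc="Diesel Fuel" sourcetypeid="21" sourcetypename="Passenger Car"/>
    <onroadvehicleselection fueltypeid="1" fueltypedesc="Gasoline" sourcetypeid="21" sourcetypename="Passenger Car"/>
  </onroadvehicleselections>
</runspec>"""


def split_by_source_type(runspec_xml):
    """Return {sourcetypeid: RunSpec XML string}, one entry per source type."""
    root = ET.fromstring(runspec_xml)
    selections = root.find("onroadvehicleselections")
    # Collect the distinct source type IDs present in the base RunSpec.
    source_type_ids = sorted({sel.get("sourcetypeid") for sel in selections}, key=int)

    splits = {}
    for source_type_id in source_type_ids:
        # Copy the whole RunSpec so every other section stays identical.
        split_root = copy.deepcopy(root)
        split_selections = split_root.find("onroadvehicleselections")
        # Drop every selection that belongs to a different source type.
        for sel in list(split_selections):
            if sel.get("sourcetypeid") != source_type_id:
                split_selections.remove(sel)
        splits[source_type_id] = ET.tostring(split_root, encoding="unicode")
    return splits


if __name__ == "__main__":
    for source_type_id, xml_text in split_by_source_type(BASE_RUNSPEC).items():
        count = xml_text.count("<onroadvehicleselection ")
        print(f"source type {source_type_id}: {count} fuel type selection(s)")
```

Each returned string could then be written to its own file and run as a separate RunSpec. Whether any other part of a real RunSpec should differ between splits is not covered here.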
@danielbizercox thanks for describing the strategy. Is it advisable to have separate output databases for each split of the runspec file? Can it improve performance if I run all the splits in parallel with worker partitioning?
Whether to use the same output database depends on your post-processing preferences, but we'd typically recommend using the same one. When doing so, the only difference in your output between doing 1 run and doing 13 runs is that each source type will also have a different MOVESRunID value in your movesoutput table.
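To illustrate the MOVESRunID point, here is a small sketch that uses an in-memory SQLite table as a stand-in for the shared MariaDB output database. The rows are invented, the table is cut down to four columns, and the column names (MOVESRunID, sourceTypeID, pollutantID, emissionQuant) reflect my understanding of the default movesoutput schema, so check them against your own database. `totals_by_source_type` is a hypothetical helper.

```python
import sqlite3

# Stand-in for the shared output database; real MOVES output lives in MariaDB.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movesoutput ("
    "MOVESRunID INTEGER, sourceTypeID INTEGER, "
    "pollutantID INTEGER, emissionQuant REAL)"
)

# Invented rows: run 1 was the motorcycle RunSpec, run 2 the passenger car one.
rows = [
    (1, 11, 2, 1.5),
    (2, 21, 2, 10.0),
    (2, 21, 2, 2.5),
]
conn.executemany("INSERT INTO movesoutput VALUES (?, ?, ?, ?)", rows)


def totals_by_source_type(connection):
    """Sum emissions per source type and pollutant, ignoring MOVESRunID."""
    query = (
        "SELECT sourceTypeID, pollutantID, SUM(emissionQuant) "
        "FROM movesoutput "
        "GROUP BY sourceTypeID, pollutantID "
        "ORDER BY sourceTypeID, pollutantID"
    )
    return connection.execute(query).fetchall()


if __name__ == "__main__":
    for source_type_id, pollutant_id, total in totals_by_source_type(conn):
        print(source_type_id, pollutant_id, total)
```

Because each source type lands in exactly one run, leaving MOVESRunID out of the GROUP BY should give the same per-source-type totals as a single combined run would.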
Regarding "worker partitioning", I'm not sure what you mean by that. You can only start one main MOVES process per computer. You can launch additional MOVES workers, which may potentially speed up each individual run, but this typically has a minor impact and we generally do not see much improvement beyond 3 workers.
If you have multiple computers with MOVES installed (e.g., a cluster of VMs), you can launch each RunSpec in parallel, and that will produce output significantly faster. Note that this will result in separate output databases on each computer. To facilitate post-processing in this use case, we have a MOVES Output Grouper tool that can stitch together multiple output databases into a single one: https://github.com/USEPA/EPA_MOVES_Model/blob/master/tools/MOVESOutputGrouper.md
@danielbizercox by "worker partitioning" I mean that I start multiple sets of workers on multiple command lines, each with a different shared folder configuration. It's a lot of manual work, but I wanted to test this out.
So, essentially:
1. I set the sharedDistributedFolderPath field in the manyworkers.txt and WorkerConfiguration.txt files and start a group of 3 workers using the command ant 3workers -Dnoshutdown=1.
2. I set the sharedDistributedFolderPath field in the MOVESConfiguration.txt file so that the TODO files are added to the same shared folder as the one configured for the 3 workers running above.
I store the output of the splits in different output databases.
Will this give me accurate results, or is it not possible to do this? I believe I can do this as long as I have enough logical processors on one computer to run 13 RunSpec files in parallel, where each one of them has a dedicated set of workers helping it.
It is not possible to run 13 RunSpec files in parallel; each call to ant run -Drunspec=... creates a MOVES main process, and you can only have one main process running per computer. See https://github.com/USEPA/EPA_MOVES_Model/issues/75#issuecomment-2231845461 for more details.
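Since only one main process can run per computer, the split RunSpecs have to be run one after another on a single machine. A minimal sketch of a sequential driver is below. It assumes `ant` is on the PATH and that the script is started from the MOVES installation directory; on Windows the executable name may need to be `ant.bat`. `build_command` and `run_sequentially` are hypothetical helper names, not part of MOVES.

```python
import subprocess


def build_command(runspec_path):
    """Build the `ant run -Drunspec=...` call for one RunSpec file."""
    return ["ant", "run", f"-Drunspec={runspec_path}"]


def run_sequentially(runspec_paths, runner=subprocess.run):
    """Run each RunSpec to completion before starting the next one.

    `runner` defaults to subprocess.run, which blocks until the MOVES main
    process exits, so two main processes never overlap.
    """
    for path in runspec_paths:
        # check=True stops the batch if a run exits with a non-zero status.
        runner(build_command(path), check=True)
```

For example, `run_sequentially(["runspec_sourcetype_11.xml", "runspec_sourcetype_21.xml"])` would start the second run only after the first has finished. Those file names are placeholders.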
Thanks for the information.
I was reading the document that mentions strategies for making a MOVES run faster: https://github.com/USEPA/EPA_MOVES_Model/blob/master/docs/TipsForFasterMOVESRuns.md
One of the strategies there talks about splitting the RunSpecs by source type. How can we do this? I am attaching a sample RunSpec file for reference. How do we split it on specific source types?
@danielbizercox can you help me out here? Thanks