Closed: unary-code closed this issue 1 year ago
What are the contents of BasicExampleOfRegressionConfigFile.json?
Yes, it’s generally a good idea to specify the log path in your config.
Does it work as expected with the same config when running outside a notebook? I don’t run from notebooks, at least not using DSO’s CLI.
Thanks for responding, and sorry for my slow reply. What confuses me is that the output printed by the command "python3 -m dso.run ./BasicExampleOfRegressionConfigFile.json" implies that files were generated under the log folder (for example, the line "Saving Hall of Fame plot to ./log/.d_lv1BETTER_2023-06-16-185510/dso.d_lv1BETTER_plot_hof.png"), yet the log folder is empty. What's more, running that command does create a log folder (I know this because I deleted the log folder beforehand, and the command then created an empty one), but that log folder is always empty (for me at least).
Question: Perhaps there is another parameter, besides logdir, that I need to specify that will tell the DSO.run program to actually save the files rather than not saving the files (Perhaps this parameter is called something like "do-save-files")? Or a parameter that will tell it where it should save the files (Perhaps this parameter is called something like "dir-to-save-files-in")?
Question (continued): Or perhaps there is some path that the code takes where the code will break and exit itself before it ever actually saves and puts the file (for example, the d_lv1BETTER_2023-06-16-185510/dso.d_lv1BETTER_plot_hof.png file) under the log folder.
Or perhaps, a simpler option would be to use the terminal directly to run the command "python3 -m dso.run ./BasicExampleOfRegressionConfigFile.json".
The contents of BasicExampleOfRegressionConfigFile.json are: { "task" : { "task_type" : "regression", "dataset" : "./d_lv1BETTER.csv", "function_set" : ["add", "sub", "mul", "div", "sin", "cos", "exp", "log", "poly"] } }
Where and how would I specify the log path? Like the below? From my understanding, the below would { "task" : { "task_type" : "regression", "dataset" : "./d_lv1BETTER.csv", "logdir": "../log", "function_set" : ["add", "sub", "mul", "div", "sin", "cos", "exp", "log", "poly"] } }
Hey Brenden Petersen,
This command "!python3 -m dso.run ./dso/dso/config/config_regression.json --b Nguyen-7" was able to generate files under the log folder. Specifically, it immediately generated a log folder, then a "Nguyen-7_2023-06-19-213754" folder under it, and under that folder an empty "checkpoint" folder, a "dso_Nguyen-7_0.csv" file, and a "config.json" file. According to my Jupyter notebook, the command took only about 20 seconds to run. However, 11 minutes after it had finished executing, 2 more csv files, 2 png files, and one "summary.csv" file were generated. Question: Is this expected behavior?
In addition, I am confused about where in the generated files I should look to find the final equation that the program found to describe the dataset. Question: Am I just supposed to look at "summary.csv" to see the equation that the program found?
Also, the summary.csv file shows that the command found the following expression: "log(x1 + (x1**3*(x1 + 1) + x1)/x1)". And the "success" column in summary.csv is True. This matches the Nguyen-7 dataset (whose true equation is "log(x + 1) + log(x^2 + 1)"), so it is obvious that the program ended up using the Nguyen-7 dataset. However, the config_regression.json file that I used as the configuration file, had a "task"->"dataset" argument of "Nguyen-1", which to me suggests that the program should use the Nguyen-1 dataset. Question: Do you think because the command I ran included the arguments "--b Nguyen-7", that that was why the program used the Nguyen-7 dataset, instead of the Nguyen-1 dataset?
I attached the generated files to this post (the "dso_Nguyen-7_0.csv" file, the "config.json" file, the "summary.csv" file, the two other csv files, and the two png files). I think the "summary.csv" file would be the most important to look at.
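A minimal sketch for pulling the relevant columns out of summary.csv without opening it by hand. The "success" column is mentioned above; the "expression" column name and the file path are assumptions, so adjust them to match the header row of your own file.

```python
import csv


def read_summary(path, columns=("expression", "success")):
    """Return the selected columns from each row of a summary CSV.

    The column names are assumptions, not a documented schema;
    a column that is missing from the file comes back as None.
    """
    with open(path, newline="") as f:
        return [{c: row.get(c) for c in columns} for row in csv.DictReader(f)]
```

Calling read_summary on the summary.csv inside the run folder returns one dictionary per row, which is usually enough to see the reported expression and whether the run was marked successful.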
config.txt (NOTE: the program generated a config.json file, not a config.txt, but GitHub does not let me upload a json file, so I changed the file extension to txt.) dso_Nguyen-7_0_pf.csv
Question: Perhaps there is another parameter, besides logdir, that I need to specify that will tell the DSO.run program to actually save the files rather than not saving the files (Perhaps this parameter is called something like "do-save-files")?
There are additional parameters to control whether or not something should be saved, but they are turned on by default. And from your follow-up it looks like these are working properly as you are generating output json, csv, and png files.
Question (continued): Or perhaps there is some path that the code takes where the code will break and exit itself before it ever actually saves and puts the file (for example, the d_lv1BETTER_2023-06-16-185510/dso.d_lv1BETTER_plot_hof.png file) under the log folder.
No. I think it's a Python notebook quirk. (See next answer.)
Question: Is this expected behavior?
No. I can't really speak to behavior in a Python notebook (e.g. files showing up a few minutes later). I would suggest running straight from the terminal/shell, not from a notebook.
Where and how would I specify the log path?
In your config, like this:
{
  "experiment" : {
    "logdir" : "./log" // here
  },
  "other stuff" : { ... }
}
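As a rough sketch of the same idea in code: the snippet below takes the task-only config posted earlier in this thread and adds an "experiment" section with "logdir". Resolving the path to an absolute one is my own choice here, made so the location does not depend on the working directory of the notebook or shell; the helper name with_logdir is hypothetical and not part of DSO.

```python
import json
import os


def with_logdir(config, logdir="./log"):
    """Return a copy of config with experiment.logdir set to an absolute path."""
    updated = dict(config)
    experiment = dict(updated.get("experiment", {}))
    experiment["logdir"] = os.path.abspath(logdir)
    updated["experiment"] = experiment
    return updated


# The task section is copied from the config posted in this thread.
config = {
    "task": {
        "task_type": "regression",
        "dataset": "./d_lv1BETTER.csv",
        "function_set": ["add", "sub", "mul", "div", "sin", "cos", "exp", "log", "poly"],
    }
}

print(json.dumps(with_logdir(config), indent=2))
```

The printed JSON can be saved as the config file passed to dso.run. Note that "logdir" sits under "experiment", not under "task".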
Question: Am I just supposed to look at "summary.csv" to see the equation that the program found?
Yes, looks like you found it.
In your attached summary.csv, it looks like everything worked and it found the correct expression.
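A quick numeric check of that claim, as a sketch. It assumes the expression reported in summary.csv is log(x1 + (x1**3*(x1 + 1) + x1)/x1), i.e. that the exponent operator was lost when the expression was pasted into this thread. Under that reading, the argument of the log simplifies to x1**3 + x1**2 + x1 + 1 = (x1 + 1)*(x1**2 + 1), which is exactly the Nguyen-7 target log(x + 1) + log(x^2 + 1).

```python
import math


def found_expression(x1):
    # Expression from summary.csv, with the exponent assumed to be x1**3.
    return math.log(x1 + (x1**3 * (x1 + 1) + x1) / x1)


def nguyen7(x):
    # True Nguyen-7 equation as quoted in this thread.
    return math.log(x + 1) + math.log(x**2 + 1)


# Compare the two at a few positive sample points (chosen arbitrarily).
for x in [0.1, 0.5, 1.0, 1.5, 2.0]:
    assert math.isclose(found_expression(x), nguyen7(x), rel_tol=1e-9)
```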
However, the config_regression.json file that I used as the configuration file, had a "task"->"dataset" argument of "Nguyen-1", which to me suggests that the program should use the Nguyen-1 dataset.
By adding the --b flag on the command line, you overrode the "Nguyen-1" benchmark to be "Nguyen-7". This is just a convenience option since it's common to use the same config but change the benchmark. If you look at the generated config.json (the one you attached as config.txt), you will see that it correctly lists "Nguyen-7" as the task name.
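To illustrate the general pattern (a command-line value taking precedence over the value in the config file), here is a small sketch. It is not DSO's actual implementation; the function name apply_benchmark_override is hypothetical and only mirrors the behavior described above.

```python
import copy


def apply_benchmark_override(config, benchmark=None):
    """Return a copy of config whose task.dataset is replaced when benchmark is given."""
    result = copy.deepcopy(config)
    if benchmark is not None:
        result.setdefault("task", {})["dataset"] = benchmark
    return result


# Value as it appears in the config file on disk.
file_config = {"task": {"task_type": "regression", "dataset": "Nguyen-1"}}

# Value actually used once the command-line benchmark is applied.
effective = apply_benchmark_override(file_config, benchmark="Nguyen-7")
print(effective["task"]["dataset"])  # Nguyen-7
```

The file on disk is left untouched, which is why the original config still says "Nguyen-1" while the generated config.json in the run folder says "Nguyen-7".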
Question: Do you think because the command I ran included the arguments "--b Nguyen-7", that that was why the program used the Nguyen-7 dataset, instead of the Nguyen-1 dataset?
Yes, see previous answer.
Or perhaps, a simpler option would be to use the terminal directly to run the command "python3 -m dso.run ./BasicExampleOfRegressionConfigFile.json".
Yes, this is what I was suggesting.
Thank you for your very fast response! It's greatly appreciated.
I have a dataset called "d_lv1BETTER.csv" (one of the Strogatz datasets), but with this config file (BasicExampleOfRegressionConfigFile.json, which I attached to this post as a TXT file), the command takes a long time to finish running.
Which parameters would you suggest changing (decreasing or increasing) in order to reduce the runtime from perhaps 30 minutes-ish to maybe 5 to 10 minutes? Perhaps I could decrease "n_samples" (the total number of samples to generate) from 2000000 to 2000? I do not really know which parameter in particular is causing my command to take so long, and I thought you would know.
Is there any more documentation about what each particular parameter does? When I looked at the README, it did not go over every single parameter, for example when it comes to the "training" key, the README only explained the meaning of "n_samples" and "epsilon", but the README did not explain what the other parameters of the "training" key mean.
Also, this d_lv1BETTER.csv dataset file is formatted correctly, right? It should be numbers, rather than strings, right? Which column in the csv dataset file is supposed to correspond to the dependent variable y? The first column, or the last column?
Attached below are my dataset file (I want the command on this particular dataset file to finish finding an equation within 10 minutes or less) and my current config file (which is the EXACT same as your given "deep-symbolic-optimization/dso/dso/config/config_regression.json" file, except for the dataset path argument).
By the way in case you are wondering, the d_lv1BETTER.csv is just the same file as d_lv1.csv (From the Github https://github.com/lacava/ode-strogatz/blob/master/d_lv1.txt) but without the header row.
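For anyone doing the same preprocessing, here is a minimal sketch that drops a header row and checks that every remaining cell parses as a number. It assumes a comma-delimited file; the function names are hypothetical. Which column DSO treats as the dependent variable y is not settled in this thread, so the sketch does not reorder columns.

```python
import csv


def is_number(text):
    """Return True if text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def strip_header(src_path, dst_path):
    """Copy a CSV, dropping the first row if it contains a non-numeric cell.

    Returns the number of data rows written. Raises ValueError if any
    remaining row still contains a non-numeric cell.
    """
    with open(src_path, newline="") as src:
        rows = [row for row in csv.reader(src) if row]
    if rows and not all(is_number(cell) for cell in rows[0]):
        rows = rows[1:]
    for index, row in enumerate(rows, start=1):
        if not all(is_number(cell) for cell in row):
            raise ValueError(f"non-numeric value in data row {index}: {row}")
    with open(dst_path, "w", newline="") as dst:
        csv.writer(dst).writerows(rows)
    return len(rows)
```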
UPDATE: I was just able to get the DSO program to stop sooner by increasing the "threshold" parameter of "task" from 1e-20 to 4e-1, so that the Reward only had to be 0.6 or higher for the program to stop running and exit itself.
Also, I tried running the "python3 -m dso.run ./BasicExampleOfRegressionConfigFile.json" command from my terminal (technically, a terminal connected to a server at the university I attend), and it printed the same output as when I ran it in the Jupyter notebook. However, just as with the notebook, running the command from the terminal immediately created the empty log folder but did not create any files, even though the command had already terminated (I can tell because the terminal gave control back to me). Maybe if I wait 15 minutes or some other amount of time, some files will pop up (just like when I ran the Nguyen-7 benchmark, where it took 11 minutes for the files to pop up).
For more info: after I ran the "python3 -m dso.run ./BasicExampleOfRegressionConfigFile.json" command in my terminal, the final few lines of output were the following. This is very confusing to me, because the output says "Saving Pareto Front plot to ./log/._d_lv1BETTER2023-06-19-225352/dso._d_lv1BETTER_plot_pf.png" but the log folder is empty even after the command has finished printing its output to the terminal:
"
-- RUNNING ITERATIONS START -------------
-- RUNNING ITERATIONS START -------------
[00:00:00:01.57] Training iteration 8, current best R: 0.6543
** New best
Reward: 0.6542962267887297
Count Off-policy: 0
Count On-policy: 1
Originally on Policy: True
Invalid: False
Traversal: exp,sub,sin,add,x3,exp,x3,mul,x3,sin,exp,x2
Expression:
         ⎛ x₂⎞      ⎛      x₃⎞
 - x₃⋅sin⎝ℯ  ⎠ + sin⎝x₃ + ℯ  ⎠
ℯ
[00:00:00:01.59] Early stopping criteria met; breaking early.
Saving Hall of Fame to ./log/._d_lv1BETTER2023-06-19-225352/dso._d_lv1BETTER_0_hof.csv
Saving Pareto Front to ./log/._d_lv1BETTER2023-06-19-225352/dso._d_lv1BETTER_0_pf.csv
Invalid expressions: 6416 of 8000 (80.2%).
Error type counts:
  true_divide: 1589 (24.8%)
  log: 3803 (59.3%)
  exp: 1018 (15.9%)
  multiply: 6 (0.1%)
Error node counts:
  divide: 2716 (42.3%)
  invalid: 2676 (41.7%)
  overflow: 959 (14.9%)
  underflow: 65 (1.0%)
== TRAINING SEED 0 END ==============
INFO: Completed run 1 of 1 in 17 s
== POST-PROCESS START =================
-- LOADING LOGS START ----------------
Successfully loaded summary data
Successfully loaded Hall of Fame data
Successfully loaded Pareto Front data
-- LOADING LOGS END ------------------
-- ANALYZING LOG START --------------
Task_____regression
Source path__./log/._d_lv1BETTER_2023-06-19-225352
Runs_____1
Max Samples/run2000000
Success_rate___1.0
Hall of Fame (Top 5 of 100)
  0: S=000 R=0.654296 <-- exp(-x3*sin(exp(x2)) + sin(x3 + exp(x3)))
  1: S=000 R=0.573033 <-- exp(-x3 + exp(x2))
  2: S=000 R=0.567493 <-- exp(-x3 + exp(x1*exp(-x1)*exp(-x2)/sin(x2)))
  3: S=000 R=0.518194 <-- -sin(2*x3/x2) + cos(x2)
  4: S=000 R=0.511777 <-- cos(exp(x2 + x3)) + sin(x2)/x2
Saving Hall of Fame plot to ./log/._d_lv1BETTER2023-06-19-225352/dso._d_lv1BETTER_plot_hof.png
Pareto Front (5 of 5)
  0: S=000 R=0.654296 C=25.00 <-- exp(-x3*sin(exp(x2)) + sin(x3 + exp(x3)))
  1: S=000 R=0.573033 C=11.00 <-- exp(-x3 + exp(x2))
  2: S=000 R=0.464841 C=10.00 <-- cos(x2*x3*(x2 + x3))
  3: S=000 R=0.464384 C=6.00 <-- 1
  4: S=000 R=0.166995 C=5.00 <-- -x2
Saving Pareto Front plot to ./log/._d_lv1BETTER2023-06-19-225352/dso._d_lv1BETTER_plot_pf.png
-- ANALYZING LOG END ----------------
== POST-PROCESS END ===================
"
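One thing that may be worth checking, offered as an assumption and not as a confirmed diagnosis: the run directory printed in the output above starts with a dot (./log/._d_lv1BETTER...), and names that begin with a dot are hidden by default by ls and by most file browsers. The sketch below lists a folder including dot-prefixed entries, similar to running ls -a; the helper names are hypothetical.

```python
import os


def list_all(path):
    """Return every entry in path, including names that start with a dot."""
    return sorted(os.listdir(path))


def hidden_entries(path):
    """Return only the entries in path whose names start with a dot."""
    return [name for name in list_all(path) if name.startswith(".")]


# Only inspect ./log if it exists in the current working directory.
if os.path.isdir("./log"):
    print(hidden_entries("./log"))
```

If this prints a folder name, the output files may be inside it even though the log folder looks empty in a default listing.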
The example configs like config_common.json and config_regression.json should describe the parameters.
DSO runtime is roughly linear in n_samples, so I'd adjust that (over threshold) to your desired runtime if runtime is an issue.
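Taking the "roughly linear" statement at face value, a back-of-the-envelope estimate looks like the sketch below. The 30-minute figure is the rough runtime reported in this thread for n_samples = 2000000, and the helper name scale_n_samples is hypothetical. Treat the result as a starting point, since fixed startup and post-processing costs do not scale with n_samples.

```python
def scale_n_samples(current_n_samples, current_minutes, target_minutes):
    """Scale n_samples in proportion to the desired runtime (linear assumption)."""
    return int(current_n_samples * target_minutes / current_minutes)


print(scale_n_samples(2_000_000, 30, 10))  # 666666
print(scale_n_samples(2_000_000, 30, 5))   # 333333
```

So a target of 5 to 10 minutes suggests an n_samples somewhere around 330,000 to 670,000, which is far above the 2000 floated earlier.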
NOTE: The ipynb file that contains my code is called HelloWorld.ipynb and is located under the "/home/myusername" directory. Inside the ipynb file, I ran "%cd ./deep-symbolic-optimization", which made the working directory "/home/myusername/deep-symbolic-optimization". So the "!pip install -e ./dso" command was able to find a dso folder, because the "deep-symbolic-optimization" folder contains one.
Main question below: I typed the command "!python3 -m dso.run ./BasicExampleOfRegressionConfigFile.json" in my Jupyter Notebook ipynb file and ran it. This command did create a log folder in my current working directory, but the log folder did not contain any files or folders (it was completely empty).
For your reference, this command ends up printing the following under the cell block in my Jupyter Notebook:
"
== EXPERIMENT SETUP START ===========
Task type : regression
Dataset : ./d_lv1BETTER.csv
Starting seed : 0
Runs : 1
== EXPERIMENT SETUP END =============
== TRAINING SEED 0 START ============
-- BUILDING PRIOR START -------------
WARNING: Skipping invalid 'RelationalConstraint' with arguments {'targets': [], 'effectors': [], 'relationship': None}. Reason: Prior disabled.
WARNING: Skipping invalid 'RepeatConstraint' with arguments {'tokens': 'const', 'min': None, 'max': 3}. Reason: Uses Tokens not in the Library.
WARNING: Skipping invalid 'ConstConstraint' with arguments {}. Reason: Uses Tokens not in the Library.
WARNING: Skipping invalid 'DomainRangeConstraint' with arguments {}. Reason: Prior disabled.
WARNING: Skipping invalid 'LanguageModelPrior' with arguments {'weight': None}. Reason: Prior disabled.
WARNING: Skipping invalid 'MultiDiscreteConstraint' with arguments {'dense': False, 'ordered': False}. Reason: Prior disabled.
LengthConstraint: Sequences have minimum length 4. Sequences have maximum length 64.
RelationalConstraint: [exp] cannot be a child of [log].
InverseUnaryConstraint: RelationalConstraint: [log] cannot be a child of [exp].
TrigConstraint: [sin, cos] cannot be a descendant of [sin, cos].
NoInputsConstraint: Sequences contain at least one input variable Token.
UniformArityPrior: Activated.
SoftLengthPrior: No description available.
RepeatConstraint: [poly] cannot occur more than 1 times.
RelationalConstraint: [poly] cannot be a descendant of [cos, sin].
-- BUILDING PRIOR END ---------------
-- RUNNING ITERATIONS START -------------
[00:00:00:02.29] Training iteration 1, current best R: 0.8948
-- RUNNING ITERATIONS START -------------
-- RUNNING ITERATIONS START -------------
-- RUNNING ITERATIONS START -------------
-- RUNNING ITERATIONS START -------------
[00:00:00:00.40] Training iteration 5, current best R: 1.0000
[00:00:00:00.43] Early stopping criteria met; breaking early.
Saving Hall of Fame to ./log/._d_lv1BETTER2023-06-16-185510/dso._d_lv1BETTER_0_hof.csv
Saving Pareto Front to ./log/._d_lv1BETTER2023-06-16-185510/dso._d_lv1BETTER_0_pf.csv
Invalid expressions: 4169 of 5000 (83.4%).
Error type counts:
  true_divide: 978 (23.5%)
  log: 2505 (60.1%)
  exp: 683 (16.4%)
  multiply: 3 (0.1%)
Error node counts:
  invalid: 1684 (40.4%)
  divide: 1799 (43.2%)
  overflow: 646 (15.5%)
  underflow: 40 (1.0%)
== TRAINING SEED 0 END ==============
INFO: Completed run 1 of 1 in 8 s
== POST-PROCESS START =================
-- LOADING LOGS START ----------------
Successfully loaded summary data
Successfully loaded Hall of Fame data
Successfully loaded Pareto Front data
-- LOADING LOGS END ------------------
-- ANALYZING LOG START --------------
Task_____regression
Source path__./log/._d_lv1BETTER_2023-06-16-185510
Runs_____1
Max Samples/run2000000
Success_rate___1.0
Hall of Fame (Top 5 of 100)
  0: S=000 R=1.000000 <-- (-0.5*x2**2 - 0.5*x2*x3**2 + 1.5*x2*x3 + 0.5*x2 + 0.5*x3**2 - 1.5*x3)/(x2*x3 - x3)
  1: S=000 R=0.894786 <-- 1.42456e-6*x1**2*x2 + 8.14836e-6*x1**2*x3 - 1.4799e-6*x1**2 + 0.00199591*x1*x2**2 + 0.00898762*x1*x2*x3 - 0.00327487*x1*x2 + 0.000407518*x1*x3**2 - 0.00400568*x1*x3 + 1.49652e-6*x1 + 0.0157843*x2**3 + 0.102194*x2**2*x3 - 0.319182*x2**2 + 0.230496*x2*x3**2 - 1.13702*x2*x3 - 0.605621*x2 - 0.86704*x3**3 + 1.22169*x3**2 - 2.64657*x3 + exp(x3) + 1.00125
  2: S=000 R=0.888423 <-- -9.43977e-6*x1**2*x2 - 2.38065e-5*x1**2*x3 + 3.69106e-6*x1**2 - 0.000312859*x1*x2**2 - 0.00654212*x1*x2*x3 + 0.00987218*x1*x2 - 0.00246847*x1*x3**2 + 0.0153959*x1*x3 - 0.000995742*x1 - 0.00191899*x2**3 - 0.0437459*x2**2*x3 - 0.0255437*x2**2 - 0.480404*x2*x3**2 + 2.51962*x2*x3 - 3.60197*x2 - 0.43249*x3**3 + 2.77857*x3**2 - 5.80736*x3 + sin(x3) + cos(x2) + 1.04708
  3: S=000 R=0.872060 <-- (9.48347e-5*x1**2*x2 + 9.74389e-5*x1**2*x3 - 1.07346e-5*x1**2 - 0.0101637*x1*x2**2 - 0.0369205*x1*x2*x3 - 0.515662*x1*x2 - 0.502299*x1*x3**2 + 1.4711*x1*x3 + 0.00173305*x1 - 0.0638882*x2**3 - 0.826748*x2**2*x3 + 3.49093*x2**2 - 0.246445*x2*x3**2 + 3.35957*x2*x3 + 2.784*x2 - 0.663574*x3**3 + 3.25044*x3**2 - 2.2956*x3 + 5.34059)/(-x3*(-x1 - x3 + log(sin(1))) + exp(exp(sin(x2))))
  4: S=000 R=0.825266 <-- (-5.01322e-5*x1**2*x2 - 1.35739e-5*x1**2*x3 - 7.28942e-6*x1**2 + 0.000234641*x1*x2**2 - 0.00243223*x1*x2*x3 + 0.0198567*x1*x2 - 0.000188668*x1*x3**2 + 0.00577324*x1*x3 + 0.00102435*x1 + 0.00576275*x2**3 + 0.0763305*x2**2*x3 - 0.298609*x2**2 - 0.0874824*x2*x3**2 + 0.648787*x2*x3 - 2.60374*x2 - 0.240229*x3**3 + 1.2891*x3**2 - 3.42536*x3 + 3.65288)*cos(1)
Saving Hall of Fame plot to ./log/._d_lv1BETTER2023-06-16-185510/dso._d_lv1BETTER_plot_hof.png
Pareto Front (3 of 3)
  0: S=000 R=1.000000 C=13.00 <-- (-0.5*x2**2 - 0.5*x2*x3**2 + 1.5*x2*x3 + 0.5*x2 + 0.5*x3**2 - 1.5*x3)/(x2*x3 - x3)
  1: S=000 R=0.464384 C=6.00 <-- 1
  2: S=000 R=0.323619 C=5.00 <-- 0
Saving Pareto Front plot to ./log/._d_lv1BETTER2023-06-16-185510/dso._d_lv1BETTER_plot_pf.png
-- ANALYZING LOG END ----------------
== POST-PROCESS END ===================
"
Do I need to specify the logdir parameter, either in the config file or in the command that I used to run dso.run?