Open ArashBayatDev opened 6 years ago
Work towards updating the VariantSpark code locally so that it frequently (every -rbs trees) dumps models (built trees) to disk, creating a test dataset that can be used to test the above hypothesis. The test results for this dataset will then be posted on this thread so that everyone involved can agree on whether or not to move forward with this feature.
I recommend the following improvements to the VariantSpark Random Forest importance analysis.
Compute and write the importance scores to a file after every 1000 trees built.
Automatically identify when enough trees have been built. If the first suggestion is implemented, we can compare the importance scores at each step (every 1000 trees built) with the importance scores computed in the previous step. If little has changed, we can stop building more trees.
Frequently (every -rbs trees) dump models (built trees) to disk, and allow previously built models to be integrated into a new run. If the process crashes halfway, the models produced so far can be used in the next run.
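To make the three suggestions concrete, here is a rough Python sketch of the control loop. It is not VariantSpark code. `run_in_batches`, `build_batch`, `tol`, and the JSON checkpoint layout are all hypothetical names made up for illustration, and the checkpoint stores running per-feature importance totals as a stand-in for serialized trees. The stopping rule (largest absolute change in normalised importance below `tol`) is just one possible definition of "little change".

```python
import json
import os


def normalise(totals):
    """Scale importance totals so they sum to 1 (left unchanged if all zero)."""
    s = sum(totals)
    return [t / s for t in totals] if s else list(totals)


def max_abs_change(prev, curr):
    """Largest absolute per-feature difference between two importance vectors."""
    return max(abs(a - b) for a, b in zip(prev, curr))


def run_in_batches(build_batch, n_features, batch_size, max_trees, tol, checkpoint_path):
    """Build trees batch by batch, checkpointing and stopping early on convergence.

    build_batch(start, n) is a hypothetical callback that builds n trees and
    returns the per-feature importance summed over those trees.
    """
    state = {"trees_built": 0, "totals": [0.0] * n_features}

    # Resume from a previous (possibly crashed) run if a checkpoint exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    prev = normalise(state["totals"]) if state["trees_built"] else None

    while state["trees_built"] < max_trees:
        batch = build_batch(state["trees_built"], batch_size)
        state["totals"] = [t + b for t, b in zip(state["totals"], batch)]
        state["trees_built"] += batch_size

        # Write to a temporary file first, then rename, so a crash during the
        # write cannot leave a truncated checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)

        # Compare with the previous step and stop if little has changed.
        curr = normalise(state["totals"])
        if prev is not None and max_abs_change(prev, curr) < tol:
            break
        prev = curr

    return state
```

Here `batch_size` plays the role of the -rbs value (or the 1000 trees mentioned above). A real implementation would need to persist the trees themselves, and probably the random seed state, rather than only the importance totals.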