I believe the assumptions here are overly simplistic. ToT (Tree of Thoughts) is one possible approach, but what truly matters is RL's exploration of unknown domains. See the article OpenAI published on this: "Weak-to-Strong Generalization." In that paper, when GPT-4 is supervised by a GPT-2-level model on NLP tasks, the resulting model typically performs somewhere between GPT-3 and GPT-3.5.

This is how the approach can keep scaling: a single model becomes a logical-reasoning enhancer, letting the knowledge from a smaller model boost the capabilities of a larger one. You can think of such a model as no longer just a text transformer but a transformer over problem-solving formulas and strategies. As the model extends outward with new capabilities, it becomes impossible to encapsulate all of those formulas in a single model, so the model grows, embedding each newly discovered formula and strategy into the next one. The approach is therefore, in principle, indefinitely scalable, although the computational cost is extremely high.

To summarize: the new model's chain of thought (CoT) is trained through RL. When the CoT is effective, it solves the problem correctly, but the strategy for decomposing the problem comes from RL training.
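For concreteness, the metric that paper uses to quantify "somewhere between GPT-3 and GPT-3.5" is the performance gap recovered (PGR): how much of the gap between the weak supervisor and the strong model's ground-truth ceiling the weakly-supervised student closes. Here is a minimal sketch of that calculation; the accuracy numbers are hypothetical placeholders of my own, not figures from the paper:

```python
def performance_gap_recovered(weak_acc: float, student_acc: float, ceiling_acc: float) -> float:
    """PGR: fraction of the (ceiling - weak) gap that the
    weakly-supervised strong student recovers. 0 means the student
    only matches its weak supervisor; 1 means it reaches the
    performance it would have had with ground-truth supervision."""
    return (student_acc - weak_acc) / (ceiling_acc - weak_acc)

# Hypothetical numbers for illustration: a GPT-2-level supervisor
# scoring 0.60, a weakly-supervised GPT-4 student at 0.75, and a
# ground-truth-trained GPT-4 ceiling at 0.80.
print(performance_gap_recovered(0.60, 0.75, 0.80))  # 0.75 of the gap recovered
```

A PGR of 0 would mean the strong model merely imitates its weak supervisor, errors included; the paper's point is that with the right training tricks PGR stays well above 0, which is what makes the "smaller model supervising a larger one" loop worth iterating.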