comet-ml / issue-tracking

Questions, Help, and Issues for Comet ML
https://www.comet.ml

Run will not be logged #526

Closed fardinabbasi closed 10 months ago

fardinabbasi commented 11 months ago

What is your question related to?

What is your question?

I am running an RLlib experiment on a remote server from a bash shell, with Comet used as a Ray Tune logging callback. However, I am encountering an error and the run will not be logged. Here are the details of the issue:

Error output

COMET WARNING: Failed to check backend version at URL: 'https://www.comet.com/clientlib/isAlive/ver'
COMET ERROR: Run will not be logged
For more details, please refer to: https://www.comet.com/docs/v2/api-and-sdk/python-sdk/warnings-errors/

Traceback (most recent call last):
  File "/mainfs/scratch/sb5e19/.conda/envs/py39/lib/python3.9/site-packages/urllib3/connection.py", line 203, in _new_conn
    sock = connection.create_connection(
  File "/mainfs/scratch/sb5e19/.conda/envs/py39/lib/python3.9/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/mainfs/scratch/sb5e19/.conda/envs/py39/lib/python3.9/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
OSError: [Errno 101] Network is unreachable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mainfs/scratch/sb5e19/.conda/envs/py39/lib/python3.9/site-packages/urllib3/connectionpool.py", line 790, in urlopen
    response = self._make_request(
  File "/mainfs/scratch/sb5e19/.conda/envs/py39/lib/python3.9/site-packages/urllib3/connectionpool.py", line 491, in _make_request
    raise new_e
  File "/mainfs/scratch/sb5e19/.conda/envs/py39/lib/python3.9/site-packages/urllib3/connectionpool.py", line 467, in _make_request
    self._validate_conn(conn)
  File "/mainfs/scratch/sb5e19/.conda/envs/py39/lib/python3.9/site-packages/urllib3/connectionpool.py", line 1092, in _validate_conn
    conn.connect()
  File "/mainfs/scratch/sb5e19/.conda/envs/py39/lib/python3.9/site-packages/urllib3/connection.py", line 611, in connect
    self.sock = sock = self._new_conn()
  File "/mainfs/scratch/sb5e19/.conda/envs/py39/lib/python3.9/site-packages/urllib3/connection.py", line 218, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x2b4b3692b580>: Failed to establish a new connection: [Errno 101] Network is unreachable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mainfs/scratch/sb5e19/.conda/envs/py39/lib/python3.9/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/mainfs/scratch/sb5e19/.conda/envs/py39/lib/python3.9/site-packages/urllib3/connectionpool.py", line 874, in urlopen
    return self.urlopen(
  File "/mainfs/scratch/sb5e19/.conda/envs/py39/lib/python3.9/site-packages/urllib3/connectionpool.py", line 874, in urlopen
    return self.urlopen(
  File "/mainfs/scratch/sb5e19/.conda/envs/py39/lib/python3.9/site-packages/urllib3/connectionpool.py", line 874, in urlopen
    return self.urlopen(
  File "/mainfs/scratch/sb5e19/.conda/envs/py39/lib/python3.9/site-packages/urllib3/connectionpool.py", line 844, in urlopen
    retries = retries.increment(
  File "/mainfs/scratch/sb5e19/.conda/envs/py39/lib/python3.9/site-packages/urllib3/util/retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.comet.com', port=443): Max retries exceeded with url: /clientlib/logger/add/run (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x2b4b3692b580>: Failed to establish a new connection: [Errno 101] Network is unreachable'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mainfs/scratch/sb5e19/.conda/envs/py39/lib/python3.9/site-packages/comet_ml/experiment.py", line 1004, in _start
    self.alive = self._setup_streamer()
  File "/mainfs/scratch/sb5e19/.conda/envs/py39/lib/python3.9/site-packages/comet_ml/_online.py", line 312, in _setup_streamer
    results = self._authenticate()
  File "/mainfs/scratch/sb5e19/.conda/envs/py39/lib/python3.9/site-packages/comet_ml/_online.py", line 398, in _authenticate
    run_id_response = self._get_run_id()
  File "/mainfs/scratch/sb5e19/.conda/envs/py39/lib/python3.9/site-packages/comet_ml/_online.py", line 435, in _get_run_id
    return self.connection.get_run_id(self.project_name, self.workspace)
  File "/mainfs/scratch/sb5e19/.conda/envs/py39/lib/python3.9/site-packages/comet_ml/connection.py", line 868, in get_run_id
    r = self._low_level_http_client.post(
  File "/mainfs/scratch/sb5e19/.conda/envs/py39/lib/python3.9/site-packages/comet_ml/connection.py", line 571, in post
    return self.do(
  File "/mainfs/scratch/sb5e19/.conda/envs/py39/lib/python3.9/site-packages/comet_ml/connection.py", line 677, in do
    response = session.request(
  File "/mainfs/scratch/sb5e19/.conda/envs/py39/lib/python3.9/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/mainfs/scratch/sb5e19/.conda/envs/py39/lib/python3.9/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/mainfs/scratch/sb5e19/.conda/envs/py39/lib/python3.9/site-packages/requests/adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.comet.com', port=443): Max retries exceeded with url: /clientlib/logger/add/run (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x2b4b3692b580>: Failed to establish a new connection: [Errno 101] Network is unreachable'))
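For context, the same failure should be reproducible outside of RLlib/Comet with a plain HTTPS request to the URL from the warning above (a minimal check, not part of the original run):

# Minimal connectivity check, independent of RLlib/Comet; the URL is the one from the
# COMET WARNING above. If the compute node has no outbound internet access, this is
# expected to fail with the same [Errno 101] Network is unreachable error.
import requests

try:
    r = requests.get("https://www.comet.com/clientlib/isAlive/ver", timeout=10)
    print("Comet backend reachable, status:", r.status_code)
except requests.exceptions.ConnectionError as exc:
    print("Comet backend unreachable:", exc)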

What have you tried?

# Imports assumed by the snippet below; exact module paths can differ slightly between Ray versions (Ray 2.x shown)
from __future__ import annotations  # needed on Python 3.9 for the `str | Any` / `list[str]` hints below

import os
from pathlib import Path
from typing import Any

import matplotlib as mpl
import matplotlib.pyplot as plt

from comet_ml import Experiment

from ray import tune
from ray.tune import TuneConfig
from ray.tune.registry import register_env
from ray.tune.search import ConcurrencyLimiter
from ray.air import RunConfig, FailureConfig, CheckpointConfig
from ray.air.integrations.comet import CometLoggerCallback
from ray.rllib.algorithms.algorithm import Algorithm

# comet_key holds the Comet API key (defined elsewhere)
experiment = Experiment(project_name="TD3_TRAIN", api_key=comet_key)
class DRLlibv2:
    def __init__(
        self,
        api_key,
        trainable: str | Any,
        params: dict,
        train_env=None,
        run_name: str = "tune_run",
        local_dir: str = "tune_results",
        search_alg=None,
        concurrent_trials: int = 0,
        num_samples: int = 0,
        scheduler_=None,
        # num_cpus: float | int = 2,
        dataframe_save: str = "tune.csv",
        metric: str = "episode_reward_mean",
        mode: str | list[str] = "max",
        max_failures: int = 0,
        training_iterations: int = 100,
        checkpoint_num_to_keep: None | int = None,
        checkpoint_freq: int = 0,
        reuse_actors: bool = True
    ):
        self.params = params
        self.api_key = api_key
        # if train_env is not None:
        #     register_env(self.params['env'], lambda env_config: train_env(env_config))

        self.train_env = train_env
        self.run_name = run_name
        self.local_dir = local_dir
        self.search_alg = search_alg
        if concurrent_trials != 0:
            self.search_alg = ConcurrencyLimiter(
                self.search_alg, max_concurrent=concurrent_trials
            )
        self.scheduler_ = scheduler_
        self.num_samples = num_samples
        self.trainable = trainable
        if isinstance(self.trainable, str):
            self.trainable = self.trainable.upper()
        # self.num_cpus = num_cpus
        self.dataframe_save = dataframe_save
        self.metric = metric
        self.mode = mode
        self.max_failures = max_failures
        self.training_iterations = training_iterations
        self.checkpoint_freq = checkpoint_freq
        self.checkpoint_num_to_keep = checkpoint_num_to_keep
        self.reuse_actors = reuse_actors

    def train_tune_model(self):
        """
        Tuning and training the model
        Returns the results object
        """
        # if ray.is_initialized():
        #   ray.shutdown()

        # ray.init(num_cpus=self.num_cpus, num_gpus=self.params['num_gpus'], ignore_reinit_error=True)

        if self.train_env is not None:
            register_env(self.params['env'], lambda env_config: self.train_env)

        os.environ["TUNE_RESULT_DIR"] = self.local_dir.as_posix()
        tuner = tune.Tuner(
            self.trainable,
            param_space=self.params,
            tune_config=TuneConfig(
                search_alg=self.search_alg,
                scheduler=self.scheduler_,
                num_samples=self.num_samples,
                # metric=self.metric,
                # mode=self.mode,
                **({'metric': self.metric, 'mode': self.mode} if self.scheduler_ is None else {}),
                reuse_actors=self.reuse_actors,

            ),
            run_config=RunConfig(
                name=self.run_name,
                storage_path=self.local_dir,
                failure_config=FailureConfig(
                    max_failures=self.max_failures, fail_fast=False
                ),
                callbacks=[CometLoggerCallback(project_name=self.run_name,
                api_key=self.api_key, tags=["Second Run","TD3","Box"], save_checkpoints=True)],
                stop={"training_iteration": self.training_iterations},
                checkpoint_config=CheckpointConfig(
                    num_to_keep=self.checkpoint_num_to_keep,
                    checkpoint_score_attribute=self.metric,
                    checkpoint_score_order=self.mode,
                    checkpoint_frequency=self.checkpoint_freq,
                    checkpoint_at_end=True,
                ),
                verbose=3,  # Verbosity mode: 0 = silent, 1 = default, 2 = verbose, 3 = detailed
            ),
        )
        self.results = tuner.fit()
        if self.search_alg is not None:
            self.search_alg.save_to_dir(self.local_dir)
        # ray.shutdown()
        return self.results

    def infer_results(self, to_dataframe: str = None, mode: str = "a"):
        """
        Get tune results in a dataframe and best results object
        """
        results_df = self.results.get_dataframe()

        if to_dataframe is None:
            to_dataframe = self.dataframe_save

        results_df.to_csv(to_dataframe, mode=mode)

        best_result = self.results.get_best_result()
        # best_result = self.results.get_best_result()
        # best_metric = best_result.metrics
        # best_checkpoint = best_result.checkpoint
        # best_trial_dir = best_result.log_dir
        # results_df = self.results.get_dataframe()

        return results_df, best_result

    def restore_agent(
        self,
        checkpoint_path: str = "",
        restore_search: bool = False,
        resume_unfinished: bool = True,
        resume_errored: bool = False,
        restart_errored: bool = False,
    ):
        """
        Restore errored or stopped trials
        """
        # if restore_search:
        # self.search_alg = self.search_alg.restore_from_dir(self.local_dir)
        if checkpoint_path == "":
            checkpoint_path = self.results.get_best_result().checkpoint._local_path

        restored_agent = tune.Tuner.restore(
            checkpoint_path,
            trainable=self.trainable,
            param_space=self.params,
            restart_errored=restart_errored,
            resume_unfinished=resume_unfinished,
            resume_errored=resume_errored,
        )
        print(restored_agent)
        self.results = restored_agent.get_results()

        if self.search_alg is not None:
            self.search_alg.save_to_dir(self.local_dir)
        return self.results

    def get_test_agent(self, test_env_name: str=None, test_env=None, checkpoint=None):
        """
        Get test agent
        """
        # if test_env is not None:
        #     register_env(test_env_name, lambda config: [test_env])

        if checkpoint is None:
            checkpoint = self.results.get_best_result().checkpoint

        testing_agent = Algorithm.from_checkpoint(checkpoint)
        # testing_agent.config['env'] = test_env_name

        return testing_agent
local_dir = Path.cwd()/"TD3_TRAIN"
drl_agent = DRLlibv2(
    api_key = comet_key,
    trainable="TD3",
    # train_env = RankingEnv,
    # num_cpus = num_cpus,
    run_name = "TD3_TRAIN",
    local_dir = local_dir,
    params = train_config.to_dict(),
    num_samples = 10,  # Number of hyperparameter configurations to sample
    # training_iterations=5,
    checkpoint_freq=5,
    # scheduler_=scheduler_,
    search_alg=search_alg,
    metric = "episode_reward_mean",
    mode = "max")
res = drl_agent.train_tune_model()
test_env = RankingEnv(test_env_config)
test_agent = drl_agent.get_test_agent()
obs, _ = test_env.reset()
step_rewards = []

terminated = False
while not terminated:
    action  = test_agent.compute_single_action(observation=obs)
    obs, reward, terminated, truncated, info = test_env.step(action)
    print(info)
    step_rewards.append(reward)

mpl.use('TkAgg')
plt.plot(step_rewards)
plt.xlabel('Step')
plt.ylabel('Step Reward')
plt.title('Step Rewards Over Time')
plt.show()
experiment.log_figure(figure=plt)
dsblank commented 11 months ago

@fardinabbasi It looks like the remote server doesn't have access to the comet.com backend. If that is the case, you can still run your code by changing one line:

Change:

from comet_ml import Experiment

to:

from comet_ml import OfflineExperiment as Experiment

That will create a zip file on the remote machine that you can later retrieve and upload to Comet. You will lose live updates, however.

For more information, see: https://www.comet.com/docs/v2/api-and-sdk/python-sdk/advanced/running-offline/
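
For example, here is a minimal sketch of the offline workflow (the directory path and metric name are just placeholders, and the exact upload command is documented at the link above):

from comet_ml import OfflineExperiment as Experiment

# Logs to a local .zip archive instead of the Comet backend, so no network access is needed.
experiment = Experiment(
    project_name="TD3_TRAIN",
    offline_directory="./comet_offline",  # placeholder path; the archive is written here
)
experiment.log_metric("step_reward", 1.0)  # same logging API as the online Experiment
experiment.end()

# Later, from a machine that can reach comet.com:
#   comet upload ./comet_offline/<archive>.zip

Note that the tuning trials themselves log through Ray's CometLoggerCallback rather than this top-level Experiment; if your Ray version supports it, passing online=False to that callback gives the equivalent offline behavior for the trials.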

Does that help?