Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Lightning CLI fails to start in virtual environment following examples #17799

Closed Kaszanas closed 1 month ago

Kaszanas commented 1 year ago

Bug description

This issue appears when attempting to run the following tutorial: https://lightning.ai/pages/community/tutorial/how-to-train-reinforcement-learning-model-to-play-game-using-proximal-policy-optimization-ppo-algorithm/, but it might be a sign of a more serious problem where the lightning command fails to run from within a virtual environment.

Unfortunately, the code snippets in the tutorial alone do not lead to a final, running solution. I had to dig into the repository code to find out what is happening. My replication of this tutorial is attached in the code section below.

When attempting to run the final tutorial command, the following issues occur:

lightning run model \
    --accelerator=gpu \
    --strategy=ddp \
    --devices=2 \
    main.py \
    --capture-video \
    --env-id CartPole-v1 \
    --total-timesteps 100000 \
    --num-envs 2 \
    --num-steps 512

What version are you seeing the problem on?

v2.0

How to reproduce the bug

Please note that most of the code below comes from the Lightning repository itself.

main.py file contents:

```python
import argparse
import os
import time  # used for time.time() below; datetime.time has no time() function
from datetime import datetime
from typing import Dict, Optional

import gymnasium as gym
import torch
from lightning.fabric import Fabric
from lightning.fabric.loggers import TensorBoardLogger
from torch.utils.data import BatchSampler, RandomSampler

from agent import PPOLightningAgent

def train(
    fabric: Fabric,
    agent: PPOLightningAgent,
    optimizer: torch.optim.Optimizer,
    data: Dict[str, torch.Tensor],
    global_step: int,
    args: argparse.Namespace,
):
    sampler = RandomSampler(list(range(data["obs"].shape[0])))
    sampler = BatchSampler(
        sampler, batch_size=args.per_rank_batch_size, drop_last=False
    )

    for _ in range(args.update_epochs):
        for batch_idxes in sampler:
            loss = agent.training_step({k: v[batch_idxes] for k, v in data.items()})
            optimizer.zero_grad(set_to_none=True)
            fabric.backward(loss)
            fabric.clip_gradients(agent, optimizer, max_norm=args.max_grad_norm)
            optimizer.step()
        agent.on_train_epoch_end(global_step)

def make_env(
    env_id: str,
    seed: int,
    idx: int,
    capture_video: bool,
    run_name: Optional[str] = None,
    prefix: str = "",
):
    def thunk():
        env = gym.make(env_id, render_mode="rgb_array")
        env = gym.wrappers.RecordEpisodeStatistics(env)
        if capture_video:
            if idx == 0 and run_name is not None:
                env = gym.wrappers.RecordVideo(
                    env,
                    os.path.join(run_name, prefix + "_videos" if prefix else "videos"),
                    disable_logger=True,
                )
        env.action_space.seed(seed)
        env.observation_space.seed(seed)
        return env

    return thunk

def linear_annealing(
    optimizer: torch.optim.Optimizer, update: int, num_updates: int, initial_lr: float
):
    frac = 1.0 - (update - 1.0) / num_updates
    lrnow = frac * initial_lr
    for pg in optimizer.param_groups:
        pg["lr"] = lrnow

def main(args):
    run_name = f"{args.env_id}_{args.exp_name}_{args.seed}_{int(time.time())}"
    logger = TensorBoardLogger(
        root_dir=os.path.join(
            "logs", "fabric_logs", datetime.today().strftime("%Y-%m-%d_%H-%M-%S")
        ),
        name=run_name,
    )

    # Initialize Fabric
    fabric = Fabric()
    rank = fabric.global_rank  # The rank of the current process
    world_size = fabric.world_size  # Number of processes spawned
    device = fabric.device
    fabric.seed_everything(42)  # We seed everything for reproducibility purposes

    # given an initial seed of 42 and 4 environments per rank, then
    # rank-0 will seed the environments with --> 42, 43, 44, 45
    # rank-1 will seed the environments with --> 46, 47, 48, 49
    # and so on
    envs = gym.vector.SyncVectorEnv(
        [
            make_env(
                args.env_id,
                args.seed + rank * args.num_envs + i,
                rank,
                args.capture_video,
                logger.log_dir,
                "train",
            )
            for i in range(args.num_envs)
        ]
    )

    agent = PPOLightningAgent(
        envs=envs,
        act_fun=args.activation_function,
        vf_coef=args.vf_coef,
        ent_coef=args.ent_coef,
        clip_coef=args.clip_coef,
        clip_vloss=args.clip_vloss,
        ortho_init=args.ortho_init,
        normalize_advantages=args.normalize_advantages,
    )
    optimizer = agent.configure_optimizers(args.learning_rate)

    # accelerated training with Fabric
    agent, optimizer = fabric.setup(agent, optimizer)

    with fabric.device:
        # with fabric.device is only supported in PyTorch 2.x+
        obs = torch.zeros(
            (args.num_steps, args.num_envs) + envs.single_observation_space.shape
        )
        actions = torch.zeros(
            (args.num_steps, args.num_envs) + envs.single_action_space.shape
        )
        rewards = torch.zeros((args.num_steps, args.num_envs))
        dones = torch.zeros((args.num_steps, args.num_envs))

        # Log-probabilities of the action played are needed later on during the training phase
        logprobs = torch.zeros((args.num_steps, args.num_envs))

        # The same happens for the critic values
        values = torch.zeros((args.num_steps, args.num_envs))

    # Global variables
    global_step = 0
    single_global_rollout = int(args.num_envs * args.num_steps * world_size)
    num_updates = args.total_timesteps // single_global_rollout

    with fabric.device:
        # Get the first environment observation and start the optimization
        next_obs = torch.tensor(envs.reset(seed=args.seed)[0])
        next_done = torch.zeros(args.num_envs)

    # Collect `num_steps` experiences `num_updates` times
    for update in range(1, num_updates + 1):
        # Learning rate annealing
        if args.anneal_lr:
            linear_annealing(optimizer, update, num_updates, args.learning_rate)

        for step in range(0, args.num_steps):
            global_step += args.num_envs * world_size
            obs[step] = next_obs
            dones[step] = next_done

            # Sample an action given the observation received by the environment
            with torch.no_grad():
                action, logprob, _, value = agent.get_action_and_value(next_obs)
                values[step] = value.flatten()
            actions[step] = action
            logprobs[step] = logprob

            # Single environment step
            next_obs, reward, done, truncated, info = envs.step(action.cpu().numpy())

            # Check whether the game has finished or not
            done = torch.logical_or(torch.tensor(done), torch.tensor(truncated))

            with fabric.device:
                rewards[step] = torch.tensor(reward).view(-1)
                next_obs, next_done = torch.tensor(next_obs), done

    # Estimate advantages and returns with GAE (Generalized Advantage Estimation)

    returns, advantages = agent.estimate_returns_and_advantages(
        rewards,
        values,
        dones,
        next_obs,
        next_done,
        args.num_steps,
        args.gamma,
        args.gae_lambda,
    )

    # Flatten the batch
    local_data = {
        "obs": obs.reshape((-1,) + envs.single_observation_space.shape),
        "logprobs": logprobs.reshape(-1),
        "actions": actions.reshape((-1,) + envs.single_action_space.shape),
        "advantages": advantages.reshape(-1),
        "returns": returns.reshape(-1),
        "values": values.reshape(-1),
    }

    # Train the agent
    train(fabric, agent, optimizer, local_data, global_step, args)

```

Naturally, some utility functions were also needed.

agent.py file contents:

```python

import math
from typing import Dict, Tuple

import gymnasium as gym
import torch
import torch.nn.functional as F

from torch import Tensor
from torch.distributions import Categorical
from torchmetrics import MeanMetric

from lightning.pytorch import LightningModule

def policy_loss(
    advantages: torch.Tensor, ratio: torch.Tensor, clip_coef: float
) -> torch.Tensor:
    pg_loss1 = -advantages * ratio
    pg_loss2 = -advantages * torch.clamp(ratio, 1 - clip_coef, 1 + clip_coef)
    return torch.max(pg_loss1, pg_loss2).mean()

def value_loss(
    new_values: Tensor,
    old_values: Tensor,
    returns: Tensor,
    clip_coef: float,
    clip_vloss: bool,
    vf_coef: float,
) -> Tensor:
    new_values = new_values.view(-1)
    if not clip_vloss:
        values_pred = new_values
    else:
        values_pred = old_values + torch.clamp(
            new_values - old_values, -clip_coef, clip_coef
        )
    return vf_coef * F.mse_loss(values_pred, returns)

def entropy_loss(entropy: Tensor, ent_coef: float) -> Tensor:
    return -entropy.mean() * ent_coef

def layer_init(
    layer: torch.nn.Module,
    std: float = math.sqrt(2),
    bias_const: float = 0.0,
    ortho_init: bool = True,
):
    if ortho_init:
        torch.nn.init.orthogonal_(layer.weight, std)
        torch.nn.init.constant_(layer.bias, bias_const)
    return layer

class PPOAgent(torch.nn.Module):
    def __init__(
        self,
        envs: gym.vector.SyncVectorEnv,
        act_fun: str = "relu",
        ortho_init: bool = False,
    ) -> None:
        super().__init__()
        if act_fun.lower() == "relu":
            act_fun = torch.nn.ReLU()
        elif act_fun.lower() == "tanh":
            act_fun = torch.nn.Tanh()
        else:
            raise ValueError(
                "Unrecognized activation function: `act_fun` must be either `relu` or `tanh`"
            )
        self.critic = torch.nn.Sequential(
            layer_init(
                torch.nn.Linear(math.prod(envs.single_observation_space.shape), 64),
                ortho_init=ortho_init,
            ),
            act_fun,
            layer_init(torch.nn.Linear(64, 64), ortho_init=ortho_init),
            act_fun,
            layer_init(torch.nn.Linear(64, 1), std=1.0, ortho_init=ortho_init),
        )
        self.actor = torch.nn.Sequential(
            layer_init(
                torch.nn.Linear(math.prod(envs.single_observation_space.shape), 64),
                ortho_init=ortho_init,
            ),
            act_fun,
            layer_init(torch.nn.Linear(64, 64), ortho_init=ortho_init),
            act_fun,
            layer_init(
                torch.nn.Linear(64, envs.single_action_space.n),
                std=0.01,
                ortho_init=ortho_init,
            ),
        )

    def get_action(
        self, x: Tensor, action: Tensor = None
    ) -> Tuple[Tensor, Tensor, Tensor]:
        logits = self.actor(x)
        distribution = Categorical(logits=logits)
        if action is None:
            action = distribution.sample()
        return action, distribution.log_prob(action), distribution.entropy()

    def get_greedy_action(self, x: Tensor) -> Tensor:
        logits = self.actor(x)
        probs = F.softmax(logits, dim=-1)
        return torch.argmax(probs, dim=-1)

    def get_value(self, x: Tensor) -> Tensor:
        return self.critic(x)

    def get_action_and_value(
        self, x: Tensor, action: Tensor = None
    ) -> Tuple[Tensor, Tensor, Tensor, Tensor]:
        action, log_prob, entropy = self.get_action(x, action)
        value = self.get_value(x)
        return action, log_prob, entropy, value

    def forward(
        self, x: Tensor, action: Tensor = None
    ) -> Tuple[Tensor, Tensor, Tensor, Tensor]:
        return self.get_action_and_value(x, action)

    @torch.no_grad()
    def estimate_returns_and_advantages(
        self,
        rewards: Tensor,
        values: Tensor,
        dones: Tensor,
        next_obs: Tensor,
        next_done: Tensor,
        num_steps: int,
        gamma: float,
        gae_lambda: float,
    ) -> Tuple[Tensor, Tensor]:
        next_value = self.get_value(next_obs).reshape(1, -1)
        advantages = torch.zeros_like(rewards)
        lastgaelam = 0
        for t in reversed(range(num_steps)):
            if t == num_steps - 1:
                nextnonterminal = torch.logical_not(next_done)
                nextvalues = next_value
            else:
                nextnonterminal = torch.logical_not(dones[t + 1])
                nextvalues = values[t + 1]
            delta = rewards[t] + gamma * nextvalues * nextnonterminal - values[t]
            advantages[t] = lastgaelam = (
                delta + gamma * gae_lambda * nextnonterminal * lastgaelam
            )
        returns = advantages + values
        return returns, advantages

class PPOLightningAgent(LightningModule):
    def __init__(
        self,
        envs: gym.vector.SyncVectorEnv,
        act_fun: str = "relu",
        ortho_init: bool = False,
        vf_coef: float = 1.0,
        ent_coef: float = 0.0,
        clip_coef: float = 0.2,
        clip_vloss: bool = False,
        normalize_advantages: bool = False,
        **torchmetrics_kwargs,
    ):
        super().__init__()
        if act_fun.lower() == "relu":
            act_fun = torch.nn.ReLU()
        elif act_fun.lower() == "tanh":
            act_fun = torch.nn.Tanh()
        else:
            raise ValueError(
                "Unrecognized activation function: `act_fun` must be either `relu` or `tanh`"
            )
        self.vf_coef = vf_coef
        self.ent_coef = ent_coef
        self.clip_coef = clip_coef
        self.clip_vloss = clip_vloss
        self.normalize_advantages = normalize_advantages
        self.critic = torch.nn.Sequential(
            layer_init(
                torch.nn.Linear(math.prod(envs.single_observation_space.shape), 64),
                ortho_init=ortho_init,
            ),
            act_fun,
            layer_init(torch.nn.Linear(64, 64), ortho_init=ortho_init),
            act_fun,
            layer_init(torch.nn.Linear(64, 1), std=1.0, ortho_init=ortho_init),
        )
        self.actor = torch.nn.Sequential(
            layer_init(
                torch.nn.Linear(math.prod(envs.single_observation_space.shape), 64),
                ortho_init=ortho_init,
            ),
            act_fun,
            layer_init(torch.nn.Linear(64, 64), ortho_init=ortho_init),
            act_fun,
            layer_init(
                torch.nn.Linear(64, envs.single_action_space.n),
                std=0.01,
                ortho_init=ortho_init,
            ),
        )
        self.avg_pg_loss = MeanMetric(**torchmetrics_kwargs)
        self.avg_value_loss = MeanMetric(**torchmetrics_kwargs)
        self.avg_ent_loss = MeanMetric(**torchmetrics_kwargs)

    def get_action(
        self, x: Tensor, action: Tensor = None
    ) -> Tuple[Tensor, Tensor, Tensor]:
        logits = self.actor(x)
        distribution = Categorical(logits=logits)
        if action is None:
            action = distribution.sample()
        return action, distribution.log_prob(action), distribution.entropy()

    def get_greedy_action(self, x: Tensor) -> Tensor:
        logits = self.actor(x)
        probs = F.softmax(logits, dim=-1)
        return torch.argmax(probs, dim=-1)

    def get_value(self, x: Tensor) -> Tensor:
        return self.critic(x)

    def get_action_and_value(
        self, x: Tensor, action: Tensor = None
    ) -> Tuple[Tensor, Tensor, Tensor, Tensor]:
        action, log_prob, entropy = self.get_action(x, action)
        value = self.get_value(x)
        return action, log_prob, entropy, value

    def forward(
        self, x: Tensor, action: Tensor = None
    ) -> Tuple[Tensor, Tensor, Tensor, Tensor]:
        return self.get_action_and_value(x, action)

    @torch.no_grad()
    def estimate_returns_and_advantages(
        self,
        rewards: Tensor,
        values: Tensor,
        dones: Tensor,
        next_obs: Tensor,
        next_done: Tensor,
        num_steps: int,
        gamma: float,
        gae_lambda: float,
    ) -> Tuple[Tensor, Tensor]:
        next_value = self.get_value(next_obs).reshape(1, -1)
        advantages = torch.zeros_like(rewards)
        lastgaelam = 0
        for t in reversed(range(num_steps)):
            if t == num_steps - 1:
                nextnonterminal = torch.logical_not(next_done)
                nextvalues = next_value
            else:
                nextnonterminal = torch.logical_not(dones[t + 1])
                nextvalues = values[t + 1]
            delta = rewards[t] + gamma * nextvalues * nextnonterminal - values[t]
            advantages[t] = lastgaelam = (
                delta + gamma * gae_lambda * nextnonterminal * lastgaelam
            )
        returns = advantages + values
        return returns, advantages

    def training_step(self, batch: Dict[str, Tensor]):
        # Get actions and values given the current observations
        _, newlogprob, entropy, newvalue = self(batch["obs"], batch["actions"].long())
        logratio = newlogprob - batch["logprobs"]
        ratio = logratio.exp()

        # Policy loss
        advantages = batch["advantages"]
        if self.normalize_advantages:
            advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        pg_loss = policy_loss(advantages, ratio, self.clip_coef)

        # Value loss
        v_loss = value_loss(
            newvalue,
            batch["values"],
            batch["returns"],
            self.clip_coef,
            self.clip_vloss,
            self.vf_coef,
        )

        # Entropy loss
        ent_loss = entropy_loss(entropy, self.ent_coef)

        # Update metrics
        self.avg_pg_loss(pg_loss)
        self.avg_value_loss(v_loss)
        self.avg_ent_loss(ent_loss)

        # Overall loss
        return pg_loss + ent_loss + v_loss

    def on_train_epoch_end(self, global_step: int) -> None:
        # Log metrics and reset their internal state
        self.logger.log_metrics(
            {
                "Loss/policy_loss": self.avg_pg_loss.compute(),
                "Loss/value_loss": self.avg_value_loss.compute(),
                "Loss/entropy_loss": self.avg_ent_loss.compute(),
            },
            global_step,
        )
        self.reset_metrics()

    def reset_metrics(self):
        self.avg_pg_loss.reset()
        self.avg_value_loss.reset()
        self.avg_ent_loss.reset()

    def configure_optimizers(self, lr: float):
        return torch.optim.Adam(self.parameters(), lr=lr, eps=1e-4)

```

Error messages and logs

Running the command highlighted above from a virtual environment with lightning and the CUDA build of torch installed ends in the following error:

INFO: Lightning is running from outside your current environment. Switching to your current environment.
The lightning package is not installed. Would you like to install it? [Y/n (exit)]:

When going through with the installation, lightning does get installed (somewhere) and the code then attempts to run, but it fails because the GPU is (allegedly) not available.

Environment

Current environment

* CUDA:
  - GPU: NVIDIA GeForce GTX 1660 SUPER
  - available: True
  - version: 11.8
* Lightning:
  - lightning: 2.0.3
  - lightning-cloud: 0.5.36
  - lightning-utilities: 0.8.0
  - pytorch-lightning: 2.0.3
  - torch: 2.0.1+cu118
  - torchaudio: 2.0.2+cu118
  - torchmetrics: 0.11.4
  - torchvision: 0.15.2+cu118
* Packages: absl-py: 1.4.0 - aiohttp: 3.8.4 - aiosignal: 1.3.1 - ansicon: 1.89.0 - anyio: 3.7.0 - arrow: 1.2.3 - async-timeout: 4.0.2 - attrs: 23.1.0 - beautifulsoup4: 4.12.2 - black: 23.3.0 - blessed: 1.20.0 - box2d-py: 2.3.5 - cachetools: 5.3.1 - certifi: 2023.5.7 - charset-normalizer: 3.1.0 - click: 8.1.3 - cloudpickle: 2.2.1 - colorama: 0.4.6 - croniter: 1.3.15 - dateutils: 0.6.12 - decorator: 4.4.2 - deepdiff: 6.3.0 - farama-notifications: 0.0.4 - fastapi: 0.88.0 - filelock: 3.12.0 - frozenlist: 1.3.3 - fsspec: 2023.5.0 - google-auth: 2.19.1 - google-auth-oauthlib: 1.0.0 - grpcio: 1.54.2 - gymnasium: 0.28.1 - h11: 0.14.0 - idna: 3.4 - imageio: 2.31.0 - imageio-ffmpeg: 0.4.8 - inquirer: 3.1.3 - itsdangerous: 2.1.2 - jax-jumpy: 1.0.0 - jinja2: 3.1.2 - jinxed: 1.2.0 - lightning: 2.0.3 - lightning-cloud: 0.5.36 - lightning-utilities: 0.8.0 - markdown: 3.4.3 - markdown-it-py: 2.2.0 - markupsafe: 2.1.3 - mdurl: 0.1.2 - moviepy: 1.0.3 - mpmath: 1.3.0 - multidict: 6.0.4 - mypy-extensions: 1.0.0 - networkx: 3.1 - numpy: 1.24.3 - oauthlib: 3.2.2 - ordered-set: 4.1.0 - packaging: 23.1 - pathspec: 0.11.1 - pillow: 9.5.0 - pip: 22.3 - platformdirs: 3.5.1 - proglog: 0.1.10 - protobuf: 4.23.2 - psutil: 5.9.5 - pyasn1: 0.5.0 - pyasn1-modules: 0.3.0 - pydantic: 1.10.9 - pygame: 2.1.3 - pygments: 2.15.1 - pyjwt: 2.7.0 - python-dateutil: 2.8.2 - python-editor: 1.0.4 - python-multipart: 0.0.6 - pytorch-lightning: 2.0.3 - pytz: 2023.3 - pyyaml: 6.0 - readchar: 4.0.5 - requests: 2.31.0 - requests-oauthlib: 1.3.1 - rich: 13.4.1 - rsa: 4.9 - setuptools: 65.5.0 - six: 1.16.0 - sniffio: 1.3.0 - soupsieve: 2.4.1 - starlette: 0.22.0 - starsessions: 1.3.0 - swig: 4.1.1 - sympy: 1.12 - tensorboard: 2.13.0 - tensorboard-data-server: 0.7.0 - torch: 2.0.1+cu118 - torchaudio: 2.0.2+cu118 - torchmetrics: 0.11.4 - torchvision: 0.15.2+cu118 - tqdm: 4.65.0 - traitlets: 5.9.0 - typing-extensions: 4.6.3 - urllib3: 1.26.16 - uvicorn: 0.22.0 - wcwidth: 0.2.6 - websocket-client: 1.5.2 - websockets: 11.0.3 - werkzeug: 2.3.6 - wheel: 0.40.0 - yarl: 1.9.2
* System:
  - OS: Windows
  - architecture: 64bit, WindowsPE
  - processor: AMD64 Family 23 Model 113 Stepping 0, AuthenticAMD
  - python: 3.11.0
  - release: 10
  - version: 10.0.19041

More info

If possible, it would be great to be able to run the examples locally from within a virtual environment.

cc @carmocca @justusschock @awaelchli

awaelchli commented 1 year ago

@Kaszanas I think the blog post could have pointed to the full implementation here. Hopefully this has all the code you were missing.

Regarding the environment issue: make sure to create a fresh virtual environment with virtualenv or conda. Make sure that, after activating the environment, which python and which pip return the Python executable from your environment, not the system Python. You want to make sure that when you pip install lightning, it installs into your environment, not into the system-wide packages. The warning message we print, "INFO: Lightning is running from outside your current environment. Switching to your current environment.", is basically complaining about that: it sees that you are in an environment, but it recognizes that the python command runs from outside of it.
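A minimal sketch of the same check from within Python (run it with the interpreter you use for pip install lightning):

```python
# Sketch: confirm the active interpreter is the one inside the virtual environment.
import sys

print("executable :", sys.executable)               # should live under venv/Scripts (or bin)
print("prefix     :", sys.prefix)                   # venv root while the environment is active
print("base prefix:", sys.base_prefix)              # base interpreter the venv was created from
print("inside venv:", sys.prefix != sys.base_prefix)
```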

Kaszanas commented 1 year ago

I will let you know later if I am able to fix it. I create my virtual environments using python -m venv venv and activate them manually with venv/Scripts/activate.

In the environment dump you can see all of the installed packages, and this doesn't even take into account that pip install lightning doesn't come with CUDA support out of the box; torch with CUDA needs to be installed separately on top of that.
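A small sketch for verifying that the separately installed CUDA build of torch is actually the one in the environment:

```python
# Sketch: verify that the CUDA wheel of torch (installed separately) is in use.
import torch

print("torch         :", torch.__version__)         # e.g. 2.0.1+cu118 for a CUDA 11.8 wheel
print("CUDA available:", torch.cuda.is_available())
print("CUDA version  :", torch.version.cuda)        # None for a CPU-only wheel
if torch.cuda.is_available():
    print("device        :", torch.cuda.get_device_name(0))
```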

Additionally, the full implementation imports parts of code that are not exposed when installing lightning, so these examples are not accessible to anyone who would rather not clone the repository. Even after cloning the repository, anyone who wants to work with the examples needs to get to know the entire lightning repository setup. There is no good way to work with that, in my opinion.

Kaszanas commented 1 year ago

Okay, I went back to diagnose this problem.

lightning is in fact installed as an .exe file in my virtual environment (notice it at the bottom):

(screenshot of the venv Scripts directory, showing lightning.exe at the bottom)

I am sure I am running with my virtual environment activated.

See the following:

(venv) PS G:\SomePath\SomeProject> .\venv\Scripts\lightning.exe run model --accelerator=gpu --strategy=ddp --devices=1 src/main.py --capture-video --env-id CartPole-v1 --total-timesteps 100000 --num-envs 2 --num-steps 512  
INFO: Lightning is running from outside your current environment. Switching to your current environment.
The lightning package is not installed. Would you like to install it? [Y/n (exit)]: 

If you need any more information, let me know.

awaelchli commented 1 year ago

Where do where lightning, where pip, and where python point to?

Kaszanas commented 1 year ago

Results of the commands (with paths anonymized). lightning:

G:\SomePath\SomeProject\venv\Scripts\lightning.exe

pip:

G:\SomePath\SomeProject\venv\Scripts\pip.exe
... Other paths

python:

G:\SomePath\SomeProject\venv\Scripts\python.exe
... Other paths

In that regard, the first path should take priority, which is why I did not include all of the other paths; the first returned path for each command points to the correct place in my directory.
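A small sketch that double-checks the same resolution from within Python, using only the standard library:

```python
# Sketch: check which executables resolve first on PATH from within Python.
import shutil

for exe in ("lightning", "pip", "python"):
    print(f"{exe} -> {shutil.which(exe)}")
```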

awaelchli commented 1 year ago

This looks good, so I am surprised. This must mean that

python -c "import sys; print(sys.executable)"

is probably returning a different path (to a different python?). Can you verify this?

And what happens if you put this line into an empty file test_import.py

import lightning

and run python test_import.py?
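A slightly extended sketch of that test file, which also reports where lightning resolves from:

```python
# test_import.py -- sketch: import lightning and report which installation was picked up.
import sys

import lightning

print("python     :", sys.executable)
print("lightning  :", lightning.__version__)
print("imported at:", lightning.__file__)
```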

Kaszanas commented 1 year ago

Running the command:

python -c "import sys; print(sys.executable)"

Returns:

G:\SomePath\SomeProject\venv\Scripts\python.exe

Running a test script from any path:

python .\src\test_import.py
cd src
python test_import.py

finishes execution without errors. That is why I was confused as to why the lightning command from the CLI fails.

shuxiaobo commented 1 month ago

same problem, is there any solution? @awaelchli

awaelchli commented 1 month ago

The lightning command no longer does this in recent versions, so the solution is to simply upgrade the lightning version. To run Fabric scripts, simply run them with Python, or use the fabric run command.
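A hedged sketch of the pure-Python launch path (assuming lightning 2.x Fabric; accelerator, devices, and strategy are set in code rather than on the command line):

```python
# Sketch: launch Fabric processes from plain `python script.py`, without the
# `lightning run model` CLI. Fabric spawns/relaunches the remaining processes itself.
from lightning.fabric import Fabric

def main():
    fabric = Fabric(accelerator="cpu", devices=2, strategy="ddp")
    fabric.launch()  # starts the other processes for DDP
    print(f"rank {fabric.global_rank} of {fabric.world_size} on {fabric.device}")

if __name__ == "__main__":
    main()
```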

Kaszanas commented 1 month ago

I will try to verify if this is resolved.

Hopefully I can get it to work 😉