aditya-grover / climate-learn

Source code for ClimateLearn
MIT License
302 stars · 49 forks

TypeError: DataModule.__init__() got an unexpected keyword argument 'dataset' #99

Closed linustws closed 1 year ago

linustws commented 1 year ago

Describe the bug Got this error: TypeError: DataModule.__init__() got an unexpected keyword argument 'dataset', even though the docs state that dataset is one of the accepted parameters.

To Reproduce Steps to reproduce the behavior: I just ran this:

from climate_learn.utils.datetime import Year, Days, Hours
from climate_learn.data import DataModule

data_module = DataModule(
    dataset = "ERA5",
    task = "forecasting",
    root_dir = "/content/drive/MyDrive/Climate/.climate_tutorial/data/weatherbench/era5/5.625/",
    in_vars = ["2m_temperature"],
    out_vars = ["2m_temperature"],
    train_start_year = Year(1979),
    val_start_year = Year(2015),
    test_start_year = Year(2017),
    end_year = Year(2018),
    pred_range = Days(3),
    subsample = Hours(6),
    batch_size = 128,
    num_workers = 1
)

Error traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
[<ipython-input-11-6ff523560500>](https://localhost:8080/#) in <cell line: 4>()
      2 from climate_learn.data import DataModule
      3 
----> 4 data_module = DataModule(
      5     dataset = "ERA5",
      6     task = "forecasting",

TypeError: DataModule.__init__() got an unexpected keyword argument 'dataset'

Expected behavior The code runs without error.


jasonjewik commented 1 year ago

Hi @linustws, thanks for using our package. I apologize that the code is not working as you expect. The docs are out of date with the version currently on the main branch. We are refactoring and updating them now, aiming to be done in the next week. I'll update you here when the work is complete. Please don't hesitate to let us know if you have any other questions.

linustws commented 1 year ago

Hi @jasonjewik, is there any quick fix for the bug right now? Do I need to change the parameter names? I'd like to finish going through the tutorial ASAP as I have a deadline this weekend. 🙏

jasonjewik commented 1 year ago

This quickstart script for the forecasting task should work. Let me know if it doesn't.

https://gist.github.com/jasonjewik/b611e836f7fd8dbcb485a4eefb09035b

jasonjewik commented 1 year ago

Paging @prakhar6sharma to also take a look at this, I might have made some errors with the data loading code.

CristiFati commented 1 year ago

Argument explanations most likely come from another function / method (initializer), as they have nothing to do with https://github.com/aditya-grover/climate-learn/blob/28fa4a5abea0b37e392135f6e85709dc81bb88e4/src/climate_learn/data/module.py#LL72C13-L72C13.

prakhar6sharma commented 1 year ago

https://gist.github.com/jasonjewik/b611e836f7fd8dbcb485a4eefb09035b

On line 15, years=range(1979, 2017) should be changed to years=range(1979, 2015), as we don't want to train on validation data.
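That kind of split mistake is easy to catch programmatically. A minimal pure-Python sketch (using the corrected split years from this thread) that asserts the train/val/test year ranges are disjoint:

```python
# Year ranges from the snippets in this thread
# (train stops in the year validation starts).
train_years = range(1979, 2015)
val_years = range(2015, 2017)
test_years = range(2017, 2019)

# Fail loudly if any split leaks into another.
assert set(train_years).isdisjoint(val_years), "train/val overlap"
assert set(val_years).isdisjoint(test_years), "val/test overlap"

# The original gist's range(1979, 2017) would have overlapped:
assert not set(range(1979, 2017)).isdisjoint(val_years)
```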

prakhar6sharma commented 1 year ago

Argument explanations most likely come from another function / method (initializer), as they have nothing to do with https://github.com/aditya-grover/climate-learn/blob/28fa4a5abea0b37e392135f6e85709dc81bb88e4/src/climate_learn/data/module.py#LL72C13-L72C13.

Hi @CristiFati, apologies for the confusion. The docstrings in that file were written for an older version of the code and haven't been updated since our refactoring.

prakhar6sharma commented 1 year ago

Until we update the documentation, please feel free to comment here with any questions about the code, and I'll do my best to get back to you ASAP.

You can also take a look at notebooks 1, 2 and 3; last time I checked, they were working perfectly.

linustws commented 1 year ago

This quickstart script for the forecasting task should work. Let me know if it doesn't.

https://gist.github.com/jasonjewik/b611e836f7fd8dbcb485a4eefb09035b

The code you provided only has variables, whereas the one I'm referring to separates them into in and out variables, with both being '2m_temperature':

from climate_learn.utils.datetime import Year, Days, Hours
from climate_learn.data import DataModule

data_module = DataModule(
    dataset = "ERA5",
    task = "forecasting",
    root_dir = "/content/drive/MyDrive/Climate/.climate_tutorial/data/weatherbench/era5/5.625/",
    in_vars = ["2m_temperature"],
    out_vars = ["2m_temperature"],
    train_start_year = Year(1979),
    val_start_year = Year(2015),
    test_start_year = Year(2017),
    end_year = Year(2018),
    pred_range = Days(3),
    subsample = Hours(6),
    batch_size = 128,
    num_workers = 1
)

The model loading also uses in and out variables, whereas the code you provided only has in variables (in_channels):

model_kwargs = {
    "in_channels": len(data_module.hparams.in_vars),
    "out_channels": len(data_module.hparams.out_vars),
    "n_blocks": 4
}

optim_kwargs = {
    "lr": 1e-4,
    "weight_decay": 1e-5,
    "warmup_epochs": 1,
    "max_epochs": 5,
}

# model_module = load_model(name = "vit", task = "forecasting", model_kwargs = model_kwargs, optim_kwargs = optim_kwargs)
model_module = load_model(name = "resnet", task = "forecasting", model_kwargs = model_kwargs, optim_kwargs = optim_kwargs)
# model_module = load_model(name = "unet", task = "forecasting", model_kwargs = model_kwargs, optim_kwargs = optim_kwargs)

What's the correct way to define the in and out variables/channels with your current version?

prakhar6sharma commented 1 year ago

Attaching some snippets of code along with comments that should help you here.

# location of the folder where the data is stored
data_dir = "/data/weatherbench/era5/5.625deg/"
# list with the name of input variables
variables = [
        "geopotential",
        "u_component_of_wind",
        "v_component_of_wind",
        "temperature",
        "specific_humidity",
        "toa_incident_solar_radiation",
        "2m_temperature"
    ]
# list with the name of constants fields
constants = ["land_sea_mask", "orography", "lattitude"]
# list with the name of output variables
out_vars = ["geopotential_500", "temperature_850", "2m_temperature"] 
# lead time for prediction in number of hours
pred_range = 3 * 24 
# subsample data for every 6 hours
subsample = 6 
# concatenate the input data at the current timestamp with the previous 2 timestamps
history = 3
# hour gap between consecutive history timestamps
window = 6
# training years
train_years = range(1979, 2016)
# validation years
val_years = range(2016, 2017)
# testing years
test_years = range(2017, 2019)
# number of chunks to divide the dataset in for the purpose of sharding
n_chunks = 5

climate_dataset_args = ERA5Args(
    root_dir=data_dir,
    variables=variables,
    constants=constants,
    years=train_years,
)

forecasting_args = ForecastingArgs(
    in_vars=variables,
    out_vars=out_vars,
    constants=constants,
    pred_range=pred_range,
    subsample=subsample,
    history=history,
    window=window,
)
train_dataset_args = ShardDatasetArgs(climate_dataset_args, forecasting_args, n_chunks)

climate_dataset_args = ERA5Args(
    root_dir=data_dir,
    variables=variables,
    constants=constants,
    years=val_years,
)

# we are not sharding the validation and test data, since all of it fits in memory
val_dataset_args = MapDatasetArgs(climate_dataset_args, forecasting_args)

# the test data uses the same arguments as the validation data, so we create
# a copy of them and specify only the arguments that change
modified_args_for_test_dataset = {
    "climate_dataset_args": {"years": test_years, "split": "test"}
}
test_dataset_args = val_dataset_args.create_copy(modified_args_for_test_dataset)

# currently if you are using ShardDataset, please use num_workers as 0
# if you are using MapDataset, then go ahead with any number of workers
data_module = DataModule(
    train_dataset_args,
    val_dataset_args,
    test_dataset_args,
    batch_size=64,
    num_workers=0,
)

The above snippet would correspond to the data loading.
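The history/window comments in the snippet can be made concrete. A minimal sketch of the offset arithmetic as I read those comments (this is an illustration, not the library's exact internal implementation):

```python
def history_offsets(history: int, window: int) -> list[int]:
    """Hour offsets, relative to the current timestamp, of the
    inputs that get concatenated when using `history`/`window`."""
    return [-i * window for i in range(history)]

# history=3, window=6: the current step plus the two previous
# steps, spaced 6 hours apart.
print(history_offsets(3, 6))  # → [0, -6, -12]
```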

model_kwargs = {
    "in_channels": 40,
    "out_channels": 3,
    "n_blocks": 28,
    "history": 3,
}

# example values; warmup_epochs and max_epochs must be defined before the dict
warmup_epochs = 1
max_epochs = 5

optim_kwargs = {
    "lr": 1e-5,
    "weight_decay": 1e-5,
    "warmup_epochs": warmup_epochs,
    "max_epochs": max_epochs,
}
model_module = load_model(
    name="resnet",
    task="forecasting",
    model_kwargs=model_kwargs,
    optim_kwargs=optim_kwargs,
)

This snippet would load the model for you.

prakhar6sharma commented 1 year ago

Again, the notebooks [LINK] should help.

linustws commented 1 year ago

Thanks @prakhar6sharma. When I try to train the model using this:

from climate_learn.training import Trainer, WandbLogger

trainer = Trainer(
    seed = 0,
    accelerator = "gpu",
    precision = 16,
    max_epochs = 5,
    # logger = WandbLogger(project = "climate_tutorial", name = "forecast-vit")
)

I get this error:

INFO:lightning_fabric.utilities.seed:Global seed set to 0
---------------------------------------------------------------------------
MisconfigurationException                 Traceback (most recent call last)
[<ipython-input-20-28e1dfec6530>](https://localhost:8080/#) in <cell line: 3>()
      1 from climate_learn.training import Trainer, WandbLogger
      2 
----> 3 trainer = Trainer(
      4     seed = 0,
      5     accelerator = "gpu",

4 frames
[/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py](https://localhost:8080/#) in _lazy_init_strategy(self)
    580 
    581         if _IS_INTERACTIVE and self.strategy.launcher and not self.strategy.launcher.is_interactive_compatible:
--> 582             raise MisconfigurationException(
    583                 f"`Trainer(strategy={self._strategy_flag!r})` is not compatible with an interactive"
    584                 " environment. Run your code as a script, or choose one of the compatible strategies:"

MisconfigurationException: `Trainer(strategy='ddp_spawn')` is not compatible with an interactive environment. Run your code as a script, or choose one of the compatible strategies: `Fabric(strategy='dp'|'ddp_notebook')`. In case you are spawning processes yourself, make sure to include the Trainer creation inside the worker function.

jasonjewik commented 1 year ago

It seems like you are trying to run the code from a Python REPL. If you run the code as a script, this error should vanish.
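For context: Lightning's ddp_spawn strategy launches worker processes that re-import the main module, which a notebook or REPL cannot provide. A generic script skeleton showing the shape such a script takes (the build_trainer body is a placeholder, not ClimateLearn's API; in practice it would construct the Trainer from earlier in this thread):

```python
# train.py -- run with `python train.py`, not from a notebook cell.
# Spawned DDP workers re-import this module, so everything that
# starts training must live behind the __main__ guard.

def build_trainer():
    # placeholder for:
    # Trainer(seed=0, accelerator="gpu", precision=16, max_epochs=5)
    return "trainer"

def main():
    trainer = build_trainer()
    # trainer.fit(model_module, data_module)
    return trainer

if __name__ == "__main__":
    main()
```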

CristiFati commented 1 year ago

Argument explanations most likely come from another function / method (initializer), as they have nothing to do with https://github.com/aditya-grover/climate-learn/blob/28fa4a5abea0b37e392135f6e85709dc81bb88e4/src/climate_learn/data/module.py#LL72C13-L72C13.

Hi @CristiFati, apologies for the confusion. The docstrings in that file were written for an older version of the code and haven't been updated since our refactoring.

@prakhar6sharma: no need to apologize. I offered an explanation without thoroughly investigating. I later saw the commit (~8 months ago) that changed the API.

From my PoV, when the docs are out of date, there's always the source code to look at. That's the beauty of open source.
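Short of reading the source, the accepted signature can also be checked directly from the standard library. A minimal sketch using a stand-in class (with the real package, `inspect.signature(DataModule.__init__)` on the imported class works the same way; the stand-in's parameter names follow the refactored snippet earlier in this thread):

```python
import inspect

class DataModule:  # stand-in for climate_learn.data.DataModule
    def __init__(self, train_dataset_args, val_dataset_args,
                 test_dataset_args, batch_size=64, num_workers=0):
        pass

# List the parameters the initializer actually accepts.
params = inspect.signature(DataModule.__init__).parameters
print(list(params))
# → ['self', 'train_dataset_args', 'val_dataset_args',
#    'test_dataset_args', 'batch_size', 'num_workers']

# The keyword the outdated docs advertised is gone:
assert "dataset" not in params
```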

linustws commented 1 year ago

It seems like you are trying to run the code from a Python REPL. If you run the code as a script, this error should vanish.

Can I ask what you mean by this? Must I run it using a script.py file? Does this mean I can't run it using Google Colab?

jasonjewik commented 1 year ago

You understand me correctly. However, this is a bug in the current implementation; I will open a PR to resolve it shortly. Currently, I'm awaiting review on a model/metrics overhaul we're doing.