Open Erotemic opened 1 year ago
@cjrd @RitwikGupta I'm trying to get a MWE of this running. With the latest changes you can do something like this:
# Create demo train / vali data
DATA_PATH=$(python -m scalemae.demo)
echo "
data:
type: ImageList
length: 10
img_dir: '$DATA_PATH'
mean: [0.46921533, 0.46026663, 0.41329921]
std: [0.1927, 0.1373, 0.1203]
vis_factor: 1.0
" > $DATA_PATH/demo.yaml
cat $DATA_PATH/demo.yaml
DEFAULT_ROOT_DIR=$HOME/exps/scalemae_demo
echo "
DEFAULT_ROOT_DIR = $DEFAULT_ROOT_DIR
DATA_PATH = $DATA_PATH
"
mkdir -p $DEFAULT_ROOT_DIR
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=1 --master_port=11085 -m scalemae.main_pretrain \
--output_dir $DEFAULT_ROOT_DIR \
--log_dir $DEFAULT_ROOT_DIR \
--config $DATA_PATH/demo.yaml \
--eval_path "$DATA_PATH" \
--batch_size 4 \
--model mae_vit_base_patch16 \
--mask_ratio 0.75 \
--num_workers 0 \
--epochs 300 \
    --target_size 224 \
    --input_size 224 \
    --self_attention \
--scale_min 0.2 \
--scale_max 1.0 \
--warmup_epochs 40 \
--blr 1.5e-4 --weight_decay 0.05 \
    --decoder_aux_loss_layers 1 \
    --target_size_scheduler constant \
--decoder_depth 8 \
--no_autoresume \
--use_mask_token \
--skip_knn_eval \
    --fixed_output_size_min 224 \
    --fixed_output_size_max 336 \
--absolute_scale
This generates a small dataset with kwcoco, which can grow larger if needed. I was able to write an ImageFolder that should correspond to one of the dataloaders. I thought the above would run, but I got:
RuntimeError: Unexpected error from cudaGetDeviceCount().
This could just be a hardware problem (can this not run on 2x 3090s?). Is there anything obviously wrong with my config?
Are there recommended settings for attempting to reproduce the pipeline on a small dataset (for testing)?
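For reference, the kwcoco toy data underlying `python -m scalemae.demo` can be sketched like this (a minimal sketch assuming the kwcoco package; `'vidshapes8'` is one of kwcoco's built-in demo keys, but whether scalemae.demo uses that exact key is an assumption):

```python
def make_demo_dataset():
    """Generate a small synthetic kwcoco dataset, similar in spirit to
    what `python -m scalemae.demo` produces.

    Returns None if kwcoco is unavailable or demo generation fails, so
    this is safe to call in any environment.
    """
    try:
        import kwcoco
        # Builds a small on-disk dataset of synthetic video frames
        return kwcoco.CocoDataset.demo('vidshapes8')
    except Exception:
        return None  # kwcoco not installed / demo generation failed


dset = make_demo_dataset()
print(dset)
```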
Jon, your config looks ok, but the issue seems to be with your environment. It seems that PyTorch is unable to see your GPUs. Can you verify everything is set up correctly?
Yes, I'm currently training a geowatch network with 2 GPUs using LightningCLI.
An extended version of python -m torch.utils.collect_env with more relevant package output is:
@Erotemic I was able to take a look at this again. The environment was set up for me properly. Can you install packages in your environment step-by-step and see where your env breaks?
@RitwikGupta I've made a MWE in a docker image, and I was able to get farther. It's likely something on my host system is weird.
To that end, I've added a dockerfile and instructions that walk through my MWE. It still is giving me an error, but it has to do with not having a CRS for the dataset. This makes sense because kwcoco demo data doesn't contain geo-metadata. However, geowatch demodata does have CRS information, so I'll see if I can get farther by using that.
Hmm, it looks like I still get an error:
File "/opt/conda/envs/scalemae/lib/python3.9/site-packages/torchgeo/datasets/geo.py", line 83, in GeoDataset
_crs = CRS.from_epsg(4326)
File "rasterio/crs.pyx", line 590, in rasterio.crs.CRS.from_epsg
rasterio.errors.CRSError: The EPSG code is unknown.
PROJ: internal_proj_create_from_database:
/opt/conda/envs/scalemae/share/proj/proj.db lacks
DATABASE.LAYOUT.VERSION.MAJOR / DATABASE.LAYOUT.VERSION.MINOR metadata.
It comes from another PROJ installation.
This docker env is:
This is a common env issue with rasterio. You should conda-install rasterio instead of pip-installing it.
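One stdlib-only way to confirm the mismatch is to look for competing proj.db copies (the PROJ_LIB/PROJ_DATA environment variables and the conda `share/proj` location are standard PROJ conventions, though whether they apply to a given env is an assumption):

```python
import os
from pathlib import Path


def find_proj_dbs():
    """Locate proj.db copies that PROJ/rasterio might pick up.

    The "comes from another PROJ installation" error usually means a
    pip-installed rasterio is reading a conda-provided proj.db (or vice
    versa); finding more than one candidate confirms the mismatch.
    """
    candidates = []
    env_dir = os.environ.get("PROJ_LIB") or os.environ.get("PROJ_DATA")
    if env_dir:
        candidates.append(Path(env_dir) / "proj.db")
    prefix = os.environ.get("CONDA_PREFIX")
    if prefix:
        candidates.append(Path(prefix) / "share" / "proj" / "proj.db")
    return [p for p in candidates if p.exists()]


print(find_proj_dbs())
```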
The conda variant of rasterio works (I do hope to get this working where conda is no longer necessary, but that's for after I get the basic case working).
Unfortunately, I'm still getting errors:
Root Cause (first observed failure):
[0]:
time : 2023-11-28_14:45:34
host : 168b53aa1722
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 998)
error_file: /tmp/torchelastic_m66l_uvu/none_i0_vmh1r/attempt_0/0/error.json
traceback : Traceback (most recent call last):
File "/opt/conda/envs/scalemae/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/root/code/scalemae/scalemae/main_pretrain.py", line 409, in main
misc.init_distributed_mode(args)
File "/root/code/scalemae/scalemae/util/misc.py", line 264, in init_distributed_mode
torch.cuda.set_device(args.gpu)
File "/opt/conda/envs/scalemae/lib/python3.9/site-packages/torch/cuda/__init__.py", line 404, in set_device
torch._C._cuda_setDevice(device)
AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'
Do you have the details for the environment where you've gotten it to work? Torch versions / etc...?
EDIT: I'm getting farther (I've got versions sorted out - although still would be nice to know exactly which version you had in your env to make it work). Currently running into an issue that I think is due to the hard-coded datasets:
traceback : Traceback (most recent call last):
File "/opt/conda/envs/scalemae/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/root/code/scale-mae/scalemae/main_pretrain.py", line 717, in main
train_stats = train_one_epoch(
File "/root/code/scale-mae/scalemae/engine_pretrain.py", line 57, in train_one_epoch
for data_iter_step, ((samples, res, targets, target_res), metadata) in enumerate(
File "/root/code/scale-mae/scalemae/util/misc.py", line 144, in log_every
for obj in iterable:
File "/opt/conda/envs/scalemae/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
data = self._next_data()
File "/opt/conda/envs/scalemae/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/opt/conda/envs/scalemae/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
return self.collate_fn(data)
File "/root/code/scale-mae/scalemae/dataloaders/utils.py", line 148, in __call__
imgs = torch.stack(list(zip(*samples))[0])
TypeError: expected Tensor as element 0 in argument 0, but got Image
I may be able to work through this one. But if you'll allow me to rant for a moment: this is the reason why I've built kwcoco and the dataloader in geowatch. The fact that you can't just swap datasets in and out as modules in research repos makes them far harder to use, reproduce, and extend than they should be. Torchgeo doesn't solve this problem: it makes it worse by having a specific dataset class for each specific dataset. There should be a generic dataset that points to a metadata manifest file. The process of dataloading should be entirely abstracted away from the ML research. The current practice of hard-coding everything leads to too many frustrations. There needs to be a standardized vision dataset interchange that's expressive enough to capture the nuances of different vision problems. I'm attempting to make kwcoco that format, but really I'd be happy if anything standard and easy-to-use existed. In any case, if I do get this working you should expect that the updated code will be able to point to a kwcoco dataset and just run on it </rant>
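For what it's worth, a minimal sketch of the collate workaround (numpy stands in for torch here, and the tuple layout mirrors the `(samples, res, targets, target_res)` shape in engine_pretrain.py; in the real pipeline the cleaner fix is a ToTensor-style transform upstream of the dataloader):

```python
import numpy as np


def collate_images(samples):
    """Sketch of a fix for the 'expected Tensor ... but got Image' error.

    The collate_fn in scale-mae stacks whatever the dataset returns, so
    it crashes when the transform pipeline yields PIL Images instead of
    tensors. Coercing each image to an array first avoids the crash;
    np.asarray also accepts PIL Images directly.
    """
    imgs = [np.asarray(s[0], dtype=np.float32) for s in samples]
    return np.stack(imgs)  # analogous to torch.stack in dataloaders/utils.py
```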
PyTorch 1.13.1 should work, try that out.
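A minimal diagnostic sketch for checking whether the installed torch build can see the GPUs at all; both the cudaGetDeviceCount failure and the missing torch._C._cuda_setDevice attribute earlier in this thread are consistent with a build that cannot (e.g. a CPU-only wheel):

```python
def cuda_report():
    """Summarize what torch can see of the CUDA environment.

    Returns a small dict so it degrades gracefully when torch is not
    installed or was built without CUDA support.
    """
    try:
        import torch
    except ImportError:
        return {"torch": None, "cuda_available": False, "device_count": 0}
    available = torch.cuda.is_available()
    return {
        "torch": torch.__version__,
        "cuda_available": available,
        "device_count": torch.cuda.device_count() if available else 0,
    }


print(cuda_report())
```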
I'm looking into integrating ScaleMAE into geowatch. I've made this branch to track modifications to make it work. Currently this involves:
Setting up proper package namespaces: Everything should be referenced under the "scalemae" namespace to allow for integrations with other libraries. Having a module named "lib" is a common anti-pattern in repos, as it leads to conflicts, and simply putting everything into a top-level namespace fixes this issue. It also means all imports are now referenced explicitly in the code itself.
Finding minimum versions of required and optional dependencies. Still working on this, but there doesn't seem to be a comprehensive list of requirements to make the repo work. I'm working on gathering those while also deconflicting with requirements of geowatch.
Linting to remove unused code.
This should not be merged yet. I'm just ensuring the work is pushed as it is developed for comments and visibility.