Open Erotemic opened 1 year ago
@cjrd @RitwikGupta I'm trying to get a MWE of this running. With the latest changes you can do something like this:
# Create demo train / vali data
DATA_PATH=$(python -m scalemae.demo)
echo "
data:
type: ImageList
length: 10
img_dir: '$DATA_PATH'
mean: [0.46921533, 0.46026663, 0.41329921]
std: [0.1927, 0.1373, 0.1203]
vis_factor: 1.0
" > $DATA_PATH/demo.yaml
cat $DATA_PATH/demo.yaml
DEFAULT_ROOT_DIR=$HOME/exps/scalemae_demo
echo "
DEFAULT_ROOT_DIR = $DEFAULT_ROOT_DIR
DATA_PATH = $DATA_PATH
"
mkdir -p $DEFAULT_ROOT_DIR
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=1 --master_port=11085 -m scalemae.main_pretrain \
--output_dir $DEFAULT_ROOT_DIR \
--log_dir $DEFAULT_ROOT_DIR \
--config $DATA_PATH/demo.yaml \
--eval_path "$DATA_PATH" \
--batch_size 4 \
--model mae_vit_base_patch16 \
--mask_ratio 0.75 \
--num_workers 0 \
--epochs 300 \
    --target_size 224 \
    --input_size 224 \
    --self_attention \
--scale_min 0.2 \
--scale_max 1.0 \
--warmup_epochs 40 \
--blr 1.5e-4 --weight_decay 0.05 \
    --decoder_aux_loss_layers 1 \
    --target_size_scheduler constant \
--decoder_depth 8 \
--no_autoresume \
--use_mask_token \
--skip_knn_eval \
    --fixed_output_size_min 224 \
    --fixed_output_size_max 336 \
--absolute_scale
This generates a small dataset with kwcoco, which can grow larger if needed. I was able to write an ImageFolder that should correspond to one of the dataloaders. I thought the above would run, but I got:
RuntimeError: Unexpected error from cudaGetDeviceCount().
This could just be a hardware problem (can this not run on 2x 3090s?). Is there anything obviously wrong with my config?
Are there recommended settings for attempting to reproduce the pipeline on a small dataset (for testing)?
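For reference, the kwcoco toy data underlying `python -m scalemae.demo` can be sketched like this (a minimal sketch assuming the kwcoco package; `'vidshapes8'` is one of kwcoco's built-in demo keys, but whether scalemae.demo uses that exact key is an assumption):

```python
def make_demo_dataset():
    """Generate a small synthetic kwcoco dataset, similar in spirit to
    what `python -m scalemae.demo` produces.

    Returns None if kwcoco is unavailable or demo generation fails, so
    this is safe to call in any environment.
    """
    try:
        import kwcoco
        # Builds a small on-disk dataset of synthetic video frames
        return kwcoco.CocoDataset.demo('vidshapes8')
    except Exception:
        return None  # kwcoco not installed / demo generation failed


dset = make_demo_dataset()
print(dset)
```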
Jon, your config looks ok, but the issue seems to be with your environment. It seems that PyTorch is unable to see your GPUs. Can you verify everything is set up correctly?
Yes, I'm currently training a geowatch network with 2 GPUs using LightningCLI.
An extended version of python -m torch.utils.collect_env with more relevant package output is:
@Erotemic I was able to take a look at this again. The environment was set up for me properly. Can you install packages in your environment step-by-step and see where your env breaks?
@RitwikGupta I've made a MWE in a docker image, and I was able to get farther. It's likely something on my host system is weird.
To that end, I've added a dockerfile and instructions that walk through my MWE. It still is giving me an error, but it has to do with not having a CRS for the dataset. This makes sense because kwcoco demo data doesn't contain geo-metadata. However, geowatch demodata does have CRS information, so I'll see if I can get farther by using that.
Hmm, it looks like I still get an error:
File "/opt/conda/envs/scalemae/lib/python3.9/site-packages/torchgeo/datasets/geo.py", line 83, in GeoDataset
_crs = CRS.from_epsg(4326)
File "rasterio/crs.pyx", line 590, in rasterio.crs.CRS.from_epsg
rasterio.errors.CRSError: The EPSG code is unknown.
PROJ: internal_proj_create_from_database:
/opt/conda/envs/scalemae/share/proj/proj.db lacks
DATABASE.LAYOUT.VERSION.MAJOR / DATABASE.LAYOUT.VERSION.MINOR metadata.
It comes from another PROJ installation.
This docker env is:
This is a common env issue with rasterio. You should conda-install rasterio instead of pip-installing it.
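One stdlib-only way to confirm the mismatch is to look for competing proj.db copies (the PROJ_LIB/PROJ_DATA environment variables and the conda `share/proj` location are standard PROJ conventions, though whether they apply to a given env is an assumption):

```python
import os
from pathlib import Path


def find_proj_dbs():
    """Locate proj.db copies that PROJ/rasterio might pick up.

    The "comes from another PROJ installation" error usually means a
    pip-installed rasterio is reading a conda-provided proj.db (or vice
    versa); finding more than one candidate confirms the mismatch.
    """
    candidates = []
    env_dir = os.environ.get("PROJ_LIB") or os.environ.get("PROJ_DATA")
    if env_dir:
        candidates.append(Path(env_dir) / "proj.db")
    prefix = os.environ.get("CONDA_PREFIX")
    if prefix:
        candidates.append(Path(prefix) / "share" / "proj" / "proj.db")
    return [p for p in candidates if p.exists()]


print(find_proj_dbs())
```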
The conda variant of rasterio works (I do hope to get this working where conda is no longer necessary, but that's for after I get the basic case working).
Unfortunately, I'm still getting errors:
Root Cause (first observed failure):
[0]:
time : 2023-11-28_14:45:34
host : 168b53aa1722
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 998)
error_file: /tmp/torchelastic_m66l_uvu/none_i0_vmh1r/attempt_0/0/error.json
traceback : Traceback (most recent call last):
File "/opt/conda/envs/scalemae/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/root/code/scalemae/scalemae/main_pretrain.py", line 409, in main
misc.init_distributed_mode(args)
File "/root/code/scalemae/scalemae/util/misc.py", line 264, in init_distributed_mode
torch.cuda.set_device(args.gpu)
File "/opt/conda/envs/scalemae/lib/python3.9/site-packages/torch/cuda/__init__.py", line 404, in set_device
torch._C._cuda_setDevice(device)
AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'
Do you have the details for the environment where you've gotten it to work? Torch versions / etc...?
EDIT: I'm getting farther (I've got versions sorted out - although still would be nice to know exactly which version you had in your env to make it work). Currently running into an issue that I think is due to the hard-coded datasets:
traceback : Traceback (most recent call last):
File "/opt/conda/envs/scalemae/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/root/code/scale-mae/scalemae/main_pretrain.py", line 717, in main
train_stats = train_one_epoch(
File "/root/code/scale-mae/scalemae/engine_pretrain.py", line 57, in train_one_epoch
for data_iter_step, ((samples, res, targets, target_res), metadata) in enumerate(
File "/root/code/scale-mae/scalemae/util/misc.py", line 144, in log_every
for obj in iterable:
File "/opt/conda/envs/scalemae/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
data = self._next_data()
File "/opt/conda/envs/scalemae/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/opt/conda/envs/scalemae/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
return self.collate_fn(data)
File "/root/code/scale-mae/scalemae/dataloaders/utils.py", line 148, in __call__
imgs = torch.stack(list(zip(*samples))[0])
TypeError: expected Tensor as element 0 in argument 0, but got Image
I may be able to work through this one. But if you'll allow me to rant for a moment: this is the reason why I've built kwcoco and the dataloader in geowatch. The fact that you can't just swap datasets in and out as modules in research repos makes them far harder to use, reproduce, and extend than they should be. Torchgeo doesn't solve this problem: it makes it worse by having a specific dataset class for each specific dataset. There should be a generic dataset that points to a metadata manifest file. The process of dataloading should be entirely abstracted away from the ML research. The current practice of hard-coding everything leads to too many frustrations. There needs to be a standardized vision dataset interchange that's expressive enough to capture the nuances of different vision problems. I'm attempting to make kwcoco that format, but really I'd be happy if anything standard and easy-to-use existed. In any case, if I do get this working you should expect that the updated code will be able to point to a kwcoco dataset and just run on it </rant>
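For what it's worth, a minimal sketch of the collate workaround (numpy stands in for torch here, and the tuple layout mirrors the `(samples, res, targets, target_res)` shape in engine_pretrain.py; in the real pipeline the cleaner fix is a ToTensor-style transform upstream of the dataloader):

```python
import numpy as np


def collate_images(samples):
    """Sketch of a fix for the 'expected Tensor ... but got Image' error.

    The collate_fn in scale-mae stacks whatever the dataset returns, so
    it crashes when the transform pipeline yields PIL Images instead of
    tensors. Coercing each image to an array first avoids the crash;
    np.asarray also accepts PIL Images directly.
    """
    imgs = [np.asarray(s[0], dtype=np.float32) for s in samples]
    return np.stack(imgs)  # analogous to torch.stack in dataloaders/utils.py
```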
PyTorch 1.13.1 should work, try that out.
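A minimal diagnostic sketch for checking whether the installed torch build can see the GPUs at all; both the cudaGetDeviceCount failure and the missing torch._C._cuda_setDevice attribute earlier in this thread are consistent with a build that cannot (e.g. a CPU-only wheel):

```python
def cuda_report():
    """Summarize what torch can see of the CUDA environment.

    Returns a small dict so it degrades gracefully when torch is not
    installed or was built without CUDA support.
    """
    try:
        import torch
    except ImportError:
        return {"torch": None, "cuda_available": False, "device_count": 0}
    available = torch.cuda.is_available()
    return {
        "torch": torch.__version__,
        "cuda_available": available,
        "device_count": torch.cuda.device_count() if available else 0,
    }


print(cuda_report())
```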
I'm looking into integrating ScaleMAE into geowatch. I've made this branch to track modifications to make it work. Currently this involves:
Setting up proper package namespaces: Everything should be referenced under the "scalemae" namespace to allow for integrations with other libraries. Having a module named "lib" is a common anti-pattern in repos, as it leads to conflicts, and simply putting everything into a top-level namespace fixes this issue. It also means all imports are now referenced explicitly in the code itself.
Finding minimum versions of required and optional dependencies. Still working on this, but there doesn't seem to be a comprehensive list of requirements to make the repo work. I'm working on gathering those while also deconflicting with requirements of geowatch.
Linting to remove unused code.
This should not be merged yet. I'm just ensuring the work is pushed as it is developed for comments and visibility.