Danfoa / MorphoSymm

Tools for exploiting Morphological Symmetries in robotics
https://danfoa.github.io/MorphoSymm/
MIT License

Paper code doesn't run on current commit. Which commit of training code would allow me to replicate results of "On discrete symmetries of robotics systems"? #9

Closed DanielChaseButterfield closed 3 weeks ago

DanielChaseButterfield commented 3 months ago

I'm trying to replicate the results of "On discrete symmetries of robotics systems: A group-theoretic and data-driven analysis (RSS-2023)". Unfortunately, the provided code on the main branch has multiple issues.

I was able to fix many of them in my own fork (changed path names, typos in files, etc.), but I've recently run into an error involving core functionality of the algorithm that I don't fully understand.

After running the following command:

python train_supervised.py --multirun robot=mini_cheetah-c2 dataset=contact dataset.data_folder=training_splitted dataset.train_ratio=0.85 dataset.augment=True,False exp_name=contact_sample_eff_splitted model=contact_cnn model.lr=1e-4

I get this result:

pybullet build time: Nov 28 2023 23:52:03
[2024-07-17 11:44:53,692][HYDRA] Joblib.Parallel(n_jobs=-1,backend=loky,prefer=processes,require=None,verbose=0,timeout=None,pre_dispatch=2*n_jobs,batch_size=auto,temp_folder=None,max_nbytes=None,mmap_mode=r) is launching 2 jobs
[2024-07-17 11:44:53,692][HYDRA] Launching jobs, sweep output dir : morpho_symm/experiments/contact_sample_eff_splitted_mini_cheetah/model=CNN_train_ratio=0.85
[2024-07-17 11:44:53,692][HYDRA]        #0 : robot=mini_cheetah-c2 dataset=contact dataset.data_folder=training_splitted dataset.train_ratio=0.85 dataset.augment=True exp_name=contact_sample_eff_splitted model=contact_cnn model.lr=0.0001
[2024-07-17 11:44:53,692][HYDRA]        #1 : robot=mini_cheetah-c2 dataset=contact dataset.data_folder=training_splitted dataset.train_ratio=0.85 dataset.augment=False exp_name=contact_sample_eff_splitted model=contact_cnn model.lr=0.0001
pybullet build time: Nov 28 2023 23:52:03
pybullet build time: Nov 28 2023 23:52:03
[INFO][__main__] 

 NEW RUN 

Seed set to 476
[INFO][__main__] 

 NEW RUN 

Seed set to 403
Contact Dataset path: 
        - Data: /home/dbutterfield3/miniconda3/envs/morph/lib/python3.9/site-packages/morpho_symm/data/contact_dataset/training_splitted/numpy_train_ratio=0.850/train.npy 
        - Labels: /home/dbutterfield3/miniconda3/envs/morph/lib/python3.9/site-packages/morpho_symm/data/contact_dataset/training_splitted/numpy_train_ratio=0.850/train_label.npy
Contact Dataset path: 
        - Data: /home/dbutterfield3/miniconda3/envs/morph/lib/python3.9/site-packages/morpho_symm/data/contact_dataset/training_splitted/numpy_train_ratio=0.850/train.npy 
        - Labels: /home/dbutterfield3/miniconda3/envs/morph/lib/python3.9/site-packages/morpho_symm/data/contact_dataset/training_splitted/numpy_train_ratio=0.850/train_label.npy
[INFO][MorphoSymm] Loaded robot mini_cheetah, with defined group representations:
[INFO][MorphoSymm]       irrep_0: dimension: 1
[INFO][MorphoSymm]       irrep_1: dimension: 1
[INFO][MorphoSymm]       regular: dimension: 2
[INFO][MorphoSymm]       Q_js: dimension: 24
[INFO][MorphoSymm]       TqQ_js: dimension: 12
[INFO][MorphoSymm]       euler_xyz: dimension: 3
[INFO][MorphoSymm]       kin_chain: dimension: 4
[INFO][MorphoSymm]       R3: dimension: 3
[INFO][MorphoSymm]       E3: dimension: 4
[INFO][MorphoSymm]       R3_pseudo: dimension: 3
[INFO][MorphoSymm]       E3_pseudo: dimension: 4
[INFO][MorphoSymm]       SO3_flat: dimension: 9
[INFO][MorphoSymm] Loaded robot mini_cheetah, with defined group representations:
[INFO][MorphoSymm]       irrep_0: dimension: 1
[INFO][MorphoSymm]       irrep_1: dimension: 1
[INFO][MorphoSymm]       regular: dimension: 2
[INFO][MorphoSymm]       Q_js: dimension: 24
[INFO][MorphoSymm]       TqQ_js: dimension: 12
[INFO][MorphoSymm]       euler_xyz: dimension: 3
[INFO][MorphoSymm]       kin_chain: dimension: 4
[INFO][MorphoSymm]       R3: dimension: 3
[INFO][MorphoSymm]       E3: dimension: 4
[INFO][MorphoSymm]       R3_pseudo: dimension: 3
[INFO][MorphoSymm]       E3_pseudo: dimension: 4
[INFO][MorphoSymm]       SO3_flat: dimension: 9
Error executing job with overrides: ['robot=mini_cheetah-c2', 'dataset=contact', 'dataset.data_folder=training_splitted', 'dataset.train_ratio=0.85', 'dataset.augment=True', 'exp_name=contact_sample_eff_splitted', 'model=contact_cnn', 'model.lr=0.0001']
Traceback (most recent call last):
  File "/home/dbutterfield3/miniconda3/envs/morph/lib/python3.9/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/home/dbutterfield3/miniconda3/envs/morph/lib/python3.9/site-packages/hydra/_internal/utils.py", line 466, in <lambda>
    lambda: hydra.multirun(
  File "/home/dbutterfield3/miniconda3/envs/morph/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 162, in multirun
    ret = sweeper.sweep(arguments=task_overrides)
  File "/home/dbutterfield3/miniconda3/envs/morph/lib/python3.9/site-packages/hydra/_internal/core_plugins/basic_sweeper.py", line 181, in sweep
    _ = r.return_value
  File "/home/dbutterfield3/miniconda3/envs/morph/lib/python3.9/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
ValueError: not enough values to unpack (expected 3, got 2)

The error is ultimately due to a mismatch between the values returned by load_symmetric_system() in utils/robot_utils.py and the values that get_in_out_symmetry_groups_reps() in data/contact_dataset/umich_contact_dataset.py expects to unpack:

# utils/robot_utils.py
def load_symmetric_system(
        robot_cfg: Optional[DictConfig] = None,
        robot_name: Optional[str] = None,
        debug=False
        ) -> [PinBulletWrapper, escnn.group.Group]:
    ...

# data/contact_dataset/umich_contact_dataset.py
    @staticmethod
    def get_in_out_symmetry_groups_reps(robot_cfg: DictConfig):
        from morpho_symm.groups.SparseRepresentation import SparseRep
        robot, rep_E3, rep_QJ = load_symmetric_system(robot_cfg)

I don't fully understand the intricacies of this code, so I don't want to just remove expected return values.
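For readers unfamiliar with this failure mode, the same error can be reproduced in isolation. The sketch below uses a hypothetical stand-in function (load_two_values is not part of MorphoSymm); it only mimics a loader that returns two values while the call site still unpacks three.

```python
def load_two_values():
    """Hypothetical stand-in for a loader that now returns (robot, group)."""
    return "robot", "symmetry_group"


def unpack_three():
    # Mirrors an older call site that still expects three return values.
    robot, rep_E3, rep_QJ = load_two_values()
    return robot, rep_E3, rep_QJ


message = None
try:
    unpack_three()
except ValueError as err:
    # Python reports how many values were expected and how many it received.
    message = str(err)
    print(message)
```

Simply deleting one of the three names would silence the error, but the call site would then lack a representation it presumably needs later, which is why the mismatch has to be resolved on the side of the loader or by rebuilding the missing value.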

This leads me to the main request for this issue. My goal in editing the code was simply to replicate the results of the paper "On discrete symmetries of robotics systems: A group-theoretic and data-driven analysis (RSS-2023)", but clearly the main branch has long since diverged from the code that was run for the paper. Could you provide the commit that contains the code that was run for this paper?

Danfoa commented 3 months ago

Hi @DanielChaseButterfield,

Indeed, the current state of the repository has diverged considerably from its state at the time of the RSS publication. I apologize for not paying attention to the reproducibility of the experiments. If you give me a couple of days, I will try to bring them back into operation.

DanielChaseButterfield commented 3 months ago

@Danfoa No worries; thanks so much for the help!

If it's any use, I have some trivial fixes completed in my own fork (https://github.com/lunarlab-gatech/MorphoSymm), and I can open a pull request to a development branch or a new branch if you think that would save you some time.

Danfoa commented 3 months ago

Hi @DanielChaseButterfield,

Question: Are you interested in the contact estimation or the CoM-momentum regression experiment?

Since the RSS publication, I migrated the Equiv-NN backend from EMLP to ESCNN. The CoM-momentum regression is quite simple to adapt to the new backend, but for the contact experiment I would have to define the equivariant version of the contact-CNN in the new backend, which might take some time.

Let me know which experiment is of interest or if both interest you.

DanielChaseButterfield commented 3 months ago

We're only comparing against the contact-CNN, which unfortunately sounds like the more difficult of the two; we aren't planning on comparing against the CoM-momentum regression.

I'm sure redefining the contact-CNN for the new code would be difficult; is it possible that one of the commits from before the time of RSS publication would work? I feel like that might save you some time.

DanielChaseButterfield commented 2 months ago

@Danfoa Here is another option that could potentially reduce your workload. My main purpose in replicating the experiment was to do the following two things:

image

If it's difficult to reimplement the contact estimation experiment, then directly providing the trained model metrics (that were used to generate these figures) and the number of parameters in the ECNN model would be enough for our purposes.

The repository seems to provide the trained model metrics for the COM experiment in "paper/experiments/com_sample_eff_Solo-K4-C2", so I figured that you might have the trained model metrics for the contact experiment saved somewhere else.

DanielChaseButterfield commented 2 months ago

I've stepped backwards through the commit history of this repository and found certain important commits that added back features that the contact estimation experiment depended on:

Operating on commit e702fac, and by changing a few deprecated numpy values to their corresponding python versions (np.int to int, for example), I was able to get a new error output by the code:

python train_supervised.py --multirun robot=mini_cheetah-c2 dataset=contact dataset.data_folder=training_splitted dataset.train_ratio=0.85 dataset.augment=False exp_name=contact_sample_eff_splitted model=contact_ecnn model.lr=1e-5 
pybullet build time: Nov 28 2023 23:52:03
/home/dbutterfield3/Research/MorphoSymm/train_supervised.py:205: UserWarning: 
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
  @hydra.main(config_path='cfg/supervised', config_name='config')
[2024-07-30 12:09:34,643][HYDRA] Launching 1 jobs locally
[2024-07-30 12:09:34,643][HYDRA]        #0 : robot=mini_cheetah-c2 dataset=contact dataset.data_folder=training_splitted dataset.train_ratio=0.85 dataset.augment=False exp_name=contact_sample_eff_splitted model=contact_ecnn model.lr=1e-05
/home/dbutterfield3/miniconda3/envs/morph_training/lib/python3.9/site-packages/hydra/_internal/core_plugins/basic_launcher.py:74: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
[INFO][__main__] 

 NEW RUN 

Seed set to 309
Contact Dataset path: 
        - Data: /home/dbutterfield3/Research/MorphoSymm/datasets/contact_dataset/training_splitted/numpy_train_ratio=0.850/train.npy 
        - Labels: /home/dbutterfield3/Research/MorphoSymm/datasets/contact_dataset/training_splitted/numpy_train_ratio=0.850/train_label.npy
Contact Dataset path: 
        - Data: /home/dbutterfield3/Research/MorphoSymm/datasets/contact_dataset/training_splitted/numpy_train_ratio=0.850/val.npy 
        - Labels: /home/dbutterfield3/Research/MorphoSymm/datasets/contact_dataset/training_splitted/numpy_train_ratio=0.850/val_label.npy
Contact Dataset path: 
        - Data: /home/dbutterfield3/Research/MorphoSymm/datasets/contact_dataset/training_splitted/numpy_train_ratio=0.850/test.npy 
        - Labels: /home/dbutterfield3/Research/MorphoSymm/datasets/contact_dataset/training_splitted/numpy_train_ratio=0.850/test_label.npy
[WARNING][nn.EquivariantModules] No cache directory provided. Nothing will be saved
[INFO][nn.EquivariantModules] Cache Loading Failed: No cache directory provided
[INFO][root] ρ(C2[d:54|inv:3] ⋊ C2[d:64|inv:6]) cache miss
[INFO][root] Solving basis for ρ(C2[d:54|inv:3] ⋊ C2[d:64|inv:6]), for G=C2[d:54|inv:3] ⋊ C2[d:64|inv:6]
[INFO][groups.SparseRepresentation] Solving equivariant basis using single generalized permutation matrix (3456, 3456)
3456 eigenvectors found: 100%|█████████████| 3456/3456 [00:00<00:00, 208313.78it/s]
[INFO][root] ρ(C2[d:64|inv:6]) cache miss
[INFO][root] Solving basis for ρ(C2[d:64|inv:6]), for G=C2[d:64|inv:6]
[INFO][groups.SparseRepresentation] Solving equivariant basis using single generalized permutation matrix (64, 64)
64 eigenvectors found: 100%|███████████████████| 64/64 [00:00<00:00, 160932.53it/s]
Error executing job with overrides: ['robot=mini_cheetah-c2', 'dataset=contact', 'dataset.data_folder=training_splitted', 'dataset.train_ratio=0.85', 'dataset.augment=False', 'exp_name=contact_sample_eff_splitted', 'model=contact_ecnn', 'model.lr=1e-05']
Traceback (most recent call last):
  File "/home/dbutterfield3/Research/MorphoSymm/train_supervised.py", line 248, in main
    model = get_model(cfg.model, rep_in=train_dataset.rep_in, rep_out=train_dataset.rep_out, cache_dir=cache_dir)
  File "/home/dbutterfield3/Research/MorphoSymm/train_supervised.py", line 43, in get_model
    model = ContactECNN(rep_in, rep_out, cache_dir=cache_dir, dropout=cfg.dropout,
  File "/home/dbutterfield3/Research/MorphoSymm/nn/ContactECNN.py", line 60, in __init__
    BasisConv1d(rep_in=self.rep_in, rep_out=rep_ch_64_1, kernel_size=3, stride=1, padding=1, bias=bias),
  File "/home/dbutterfield3/Research/MorphoSymm/nn/EquivariantModules.py", line 208, in __init__
    EquivariantModel.test_module_equivariance(module=self, rep_in=self.rep_in, rep_out=self.rep_out,
  File "/home/dbutterfield3/Research/MorphoSymm/nn/EquivariantModules.py", line 387, in test_module_equivariance
    raise RuntimeError(f"{module}\nis not equivariant to in/out group generators\n"
RuntimeError: E-Conv1D G[C2[d:54|inv:3] ⋊ C2[d:64|inv:6]]-W3456-Wtrain:3456=100.0%-init_std:0.239
is not equivariant to in/out group generators
max(f(g·x) - g·y) = 9.143681526184082

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Currently working on figuring out if I should step further back in the commit history, or if I should try and debug this on this commit.

@Danfoa Do you know what could be causing the RuntimeError: E-Conv1D G[C2[d:54|inv:3] ⋊ C2[d:64|inv:6]]-W3456-Wtrain:3456=100.0%-init_std:0.239 is not equivariant to in/out group generators error?

DanielChaseButterfield commented 2 months ago

Looks like, as could probably be expected, the RuntimeError: E-Conv1D G[C2[d:54|inv:3] ⋊ C2[d:64|inv:6]]-W3456-Wtrain:3456=100.0%-init_std:0.239 is not equivariant to in/out group generators error does indeed mean that the convolutional layer is not equivariant with respect to the input and output representations.
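To make the meaning of that check concrete, here is a minimal sketch of a numerical equivariance test for a linear map under a C2 action. It is not the test in nn/EquivariantModules.py; the helper names, the 2x2 matrices, and the choice of a swap representation are all illustrative assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def equivariance_error(weights, rho_in, rho_out, x):
    """Return max |f(g.x) - g.f(x)| for the linear map f(x) = weights @ x."""
    y = matvec(weights, x)                          # y = f(x)
    f_of_gx = matvec(weights, matvec(rho_in, x))    # f(g.x)
    g_of_y = matvec(rho_out, y)                     # g.y
    return max(abs(a - b) for a, b in zip(f_of_gx, g_of_y))


# Generator of C2 acting on R^2 by swapping the two coordinates.
swap = [[0.0, 1.0], [1.0, 0.0]]
x = [1.0, -2.0]

w_equivariant = [[2.0, 1.0], [1.0, 2.0]]   # commutes with the swap
w_broken = [[2.0, 1.0], [0.0, 3.0]]        # does not commute with the swap

err_ok = equivariance_error(w_equivariant, swap, swap, x)
err_bad = equivariance_error(w_broken, swap, swap, x)
print(err_ok, err_bad)
```

A nonzero value plays the same role as the max(f(g·x) - g·y) figure in the traceback: it says the layer's weights no longer commute with the group action, whatever the underlying cause.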

I'm wondering if this is because I have different versions of dependencies, which is silently causing my mathematical calculations to be off. Or, I wonder if I'm simply at a commit halfway between the time of the experimental evaluation and the updated code, where some of the internal test cases don't pass. I'm planning on stepping back further to see if that might be the case.
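One low-effort way to test the dependency hypothesis is to record the installed versions in each environment and compare them. The sketch below uses only the standard library; the package list is an assumption about which dependencies matter here, not something taken from the repository.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed list of packages worth comparing between environments.
for name, version in installed_versions(["numpy", "scipy", "escnn", "emlp"]).items():
    print(f"{name}: {version}")
```

Running this in both a working and a failing environment and diffing the output narrows down which library to pin.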

DanielChaseButterfield commented 2 months ago

Okay, I've stepped back further, and it seems that the above error carries all the way to commit d2c6505, if not further. So it seems like the issue must be some sort of silent dependency error, like perhaps an updated version of some library that I installed behaves differently than a couple of years ago.

I know my numpy version isn't the one that was used at the time of development, as I've needed to replace references to np.int with int and so on (these aliases were deprecated in numpy 1.20 and removed in numpy 1.24). However, I was unable to get the python library to build with an older version of numpy, so I manually changed those references. I wonder if something similar is silently failing, like maybe emlp or escnn have changed.
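The manual replacement described above can be sketched as a small source rewrite. This is an illustrative helper, not a tool from the repository; it assumes numpy is imported under the alias np and only touches the bare aliases, leaving sized types such as np.int32 or np.float64 alone.

```python
import re

# Bare aliases of Python builtins that newer numpy releases no longer provide.
_REMOVED_ALIAS = re.compile(r"\bnp\.(int|float|bool|complex|object|str)\b")


def replace_removed_aliases(source):
    """Rewrite np.int -> int, np.float -> float, etc. in a source string."""
    return _REMOVED_ALIAS.sub(r"\1", source)


before = "a = np.int(3); b = np.int32(4); c = np.zeros(2, dtype=np.float)"
after = replace_removed_aliases(before)
print(after)
```

The trailing word boundary is what keeps np.int32 and np.bool_ untouched, since the character following the alias is still part of the identifier.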

Danfoa commented 2 months ago

Hi @DanielChaseButterfield,

So I think I found a solution. Please check out the new branch rss2023, in which I set up the old version of the code to work by pinning the appropriate conda env dependencies, specifically by rolling back SciPy's version.

I have:

PS: There is a computer on which I might have the folder with the results; I will get back to you with this info this week. Again, sorry for all the mess. In case I cannot find the files, the only option is to rerun the experiments, which will generate the output .csv files used by the scripts that generate the plots of the paper.

Let me know if this helps.

DanielChaseButterfield commented 2 months ago

Sorry for the delay, I wanted to make sure that I could run your changes on my computer. I took your new rss2023 branch and made a few changes to resolve pip dependency conflicts, add a couple missing libraries, and fix import errors. We now have our fork here (https://github.com/lunarlab-gatech/MorphoSymm).

@Danfoa The Scipy rollback was a lifesaver; I am no longer getting the RuntimeError: E-Conv1D G[C2[d:54|inv:3] ⋊ C2[d:64|inv:6]]-W3456-Wtrain:3456=100.0%-init_std:0.239 is not equivariant to in/out group generators error! Additionally, the conda_env.yml file was quite useful for installing everything quickly. Thanks so much for your update; I'm now able to train your models for our paper! Additionally, I can generate Figure 4-Right from On discrete symmetries of robotics systems: A group-theoretic and data-driven analysis.

DanielChaseButterfield commented 2 months ago

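However, I do have a couple more issues. There seem to be multiple files for generating Figure 4-Left & Center:

sample_efficiency_figures_contact_CNN-ECNN.py
sample_efficiency_figures_contact_ECNN.py
sample_efficiency_figures.py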

@Danfoa Do you know which one of these I should use?

Additionally, when I run sample_efficiency_figures_contact_CNN-ECNN.py, I get an empty graph with no data. Note that when I ran it, I had six models trained using the debug tools (two of each type), so I have some .csv files from which I can plot results.

I looked into the train_supervised.py file, and although the COM_Momentum dataset appears to have a way of specifying different sample numbers, the Contact Estimation experiment doesn't seem to have this capability currently.

Screenshot from 2024-08-11 00-41-18

Screenshot from 2024-08-11 00-41-32

@Danfoa Did you simply manually edit the dataset partition files based on how many samples you needed, and if so, how does the plotting file know how many samples each run had?

Danfoa commented 2 months ago

Hi @DanielChaseButterfield,

First of all, huge thanks for the PR :). I am very happy to welcome you as a collaborator of the repo.

Answering some of your questions:

> I looked into the train_supervised.py file, and although the COM_Momentum dataset appears to have a way of specifying different sample numbers, the Contact Estimation experiment doesn't seem to have this capability currently.

The CoM dataset has a num_samples attribute because it is a synthetic dataset, so we can control the number of data points (num_samples = num_train_samples + num_test_samples + num_val_samples). The Umich contact dataset, in contrast, is a real-world dataset, for which the number of data points is fixed.

> @Danfoa Did you simply manually edit the dataset partition files based on how many samples you needed, and if so, how does the plotting file know how many samples each run had?

No, these are all generated "automatically". The total number of samples in the dataset is partitioned according to train_ratio, test_ratio, and val_ratio (all in %). As far as I recall and can see from the code, the validation and test ratios are always set to 15% of the dataset samples, while train_ratio is a parameter of the training script, so that model performance can be tested under different numbers of training samples. For example, to test the model using TP=85,70,50,30,10 [%] of the train+val samples for training, you would do something like:

python train_supervised.py --multirun dataset=contact dataset.train_ratio=0.85,0.7,0.5,0.3,0.1 model.lr=1e-5 dataset.augment=False [... other params]
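As a rough illustration of what the --multirun flag does with comma-separated values, the sketch below expands overrides into one job per combination. It is a simplified model of the sweep, not Hydra's implementation: expand_sweep is a hypothetical helper, and it ignores Hydra's quoting and list syntax.

```python
from itertools import product


def expand_sweep(overrides):
    """Expand 'key=a,b' style overrides into one override list per job."""
    keys, choices = [], []
    for item in overrides:
        key, _, value = item.partition("=")
        keys.append(key)
        choices.append(value.split(","))
    # Each job takes exactly one value per key (Cartesian product of choices).
    return [[f"{k}={v}" for k, v in zip(keys, combo)] for combo in product(*choices)]


jobs = expand_sweep([
    "dataset=contact",
    "dataset.train_ratio=0.85,0.7,0.5,0.3,0.1",
    "model.lr=1e-5",
    "dataset.augment=False",
])
for index, job in enumerate(jobs):
    print(f"#{index} : {' '.join(job)}")
```

Sweeping a second key, for example dataset.augment=True,False, multiplies the job count, which matches the two jobs launched in the first log of this thread.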

We keep the same samples for validation and testing for all models, since we want to compare to the same test/val "data".
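The idea of varying the training fraction while holding the validation samples fixed can be sketched as follows. This is only an assumption-laden illustration, not the partition logic in umich_contact_dataset.py: the function name, the choice to take validation samples from the end of the recording, and the rounding are all hypothetical.

```python
def split_indices(n_samples, train_ratio, val_ratio=0.15):
    """Return (train, val) index ranges over one train/val recording.

    The validation block is always the last val_ratio fraction, so it is
    identical for every train_ratio; training uses the first train_ratio
    fraction. Both placement choices are assumptions of this sketch.
    """
    n_val = int(round(n_samples * val_ratio))
    n_train = int(round(n_samples * train_ratio))
    if n_train <= 0 or n_train + n_val > n_samples:
        raise ValueError("train_ratio and val_ratio must fit in the recording")
    return range(0, n_train), range(n_samples - n_val, n_samples)


for ratio in (0.85, 0.7, 0.5, 0.3, 0.1):
    train, val = split_indices(1000, ratio)
    print(ratio, len(train), (val.start, val.stop))
```

Because the validation range does not depend on train_ratio, every model in the sweep is scored on the same held-out samples, and the plotting script only needs the ratio recorded with each run to know how many training samples were used.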

The logic for the partitions is given here:

https://github.com/Danfoa/MorphoSymm/blob/09cd187201e28fb3eb2c68512d3c1322db1a6858/datasets/umich_contact_dataset.py#L409-L460

This function is used here:

https://github.com/Danfoa/MorphoSymm/blob/09cd187201e28fb3eb2c68512d3c1322db1a6858/datasets/umich_contact_dataset.py#L389-L398

It generates the training and validation splits from the same "trajectory data", while the testing set is generated from "different trajectory data", as explained in Appendix D2 of the paper.

You can check the partitioning here: MorphoSymm/datasets/umich_contact /training_splitted/

The mat folder contains the original Umich dataset recordings used for training/validation in my paper, while the mat_test folder contains the recording used for testing.

> However, I do have a couple more issues. There seem to be multiple files for generating Figure 4-Left & Center:
>
> sample_efficiency_figures_contact_CNN-ECNN.py sample_efficiency_figures_contact_ECNN.py sample_efficiency_figures.py

I believe the script is sample_efficiency_figures_contact_CNN-ECNN.py. If you get an empty plot, it is because the filtering of metrics is excluding some of the results. Beware that this code is quite hacky; for each plot, it requires you to change the filtered metrics. However, since it's simply a matplotlib plot, the only thing you need in practice is the .csv files with the metrics.
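Since only the metrics files matter, a small standalone aggregation can help diagnose an empty plot. The sketch below is not the repository's plotting script; the column names (model, train_ratio, test_f1) and the sample rows are placeholders, so adjust them to whatever the generated .csv files actually contain.

```python
import csv
import io
from collections import defaultdict


def mean_metric(csv_text, model, metric):
    """Average `metric` per train_ratio for one model; fail loudly if empty."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["model"] != model or row.get(metric) in (None, ""):
            continue  # the filter that can silently drop every row
        groups[float(row["train_ratio"])].append(float(row[metric]))
    if not groups:
        raise ValueError(f"no rows left for model={model!r}, metric={metric!r}")
    return {ratio: sum(vals) / len(vals) for ratio, vals in sorted(groups.items())}


# Placeholder rows with made-up numbers, only to exercise the function.
SAMPLE = "model,train_ratio,test_f1\nCNN,0.1,0.5\nCNN,0.1,0.75\nECNN,0.1,0.8\n"
print(mean_metric(SAMPLE, "CNN", "test_f1"))
```

Raising an error when the filter matches nothing makes a mismatch between the filter and the logged metric names visible immediately, instead of producing an empty figure.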

Again, sorry for the messy/hacky code, and thanks for the contributions; much appreciated. I have improved a lot since those days.

Danfoa commented 3 weeks ago

Hi @DanielChaseButterfield,

I saw the ICRA submission :). Congratulations!

After the ICLR deadline, I will update the repository, improve the overall reproducibility of the experiments, and add a list of works built using MorphoSymm. If it is OK with you, I will list your work as well.

That being said, I will proceed to close this issue.