Difficulties recreating the dataset

akselaase commented 1 year ago

Hi, first of all thank you for all the great work on this project.

I'm currently trying to recreate your dataset locally using the instructions in the README and dataset/README, but I have some questions that I'm hoping you can answer. I'm basing all this on the assumption that the dataset is recreatable (not pixel-perfect, but same frame count and folder structure) using only the scenarios and routes in leaderboard/data/training and with the scripts in tools/dataset/, so please let me know if I'm missing something there.

1. In the 2022 TransFuser dataset, the folders have names such as "dirt", "coke", "int_l", and "cycl". What is the significance of these names, and do they map 1:1 to the scenarios in leaderboard/data/training/scenarios?
2. The subdirectories have names such as "clipped", "10mshortroutes" and "non-straight-junction". Do these names map to specific scenarios?
3. How did the names emerge anyway? I see no mention of "clipped" or "10mshortroutes" anywhere else in the repo.
4. When using leaderboard/scripts/datagen.sh, it is required to set both ROUTES and SCENARIOS. However, the ll, lr, rl, rr routes in leaderboard/data/training have no corresponding scenario. How do you run the script to generate data using those routes?

Thanks again!

Kait0 commented 1 year ago

1. The names refer to the scenarios, e.g. dirt is scenario 1, where a dirt patch spawns on the road. If you look at the names of the folders inside these folders, you will see the corresponding scenario file.
2. I don't think they mean anything.
3. They were hardcoded by some assistant in the dataset generation script. We didn't release it, as it is specific to our compute cluster.
4. These are lane change routes; we don't use scenarios for them. We use a no_scenario.json file, which looks like this:
```
{
    "available_scenarios": [
        {}
    ]
}
```
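A minimal sketch of creating such a placeholder file from Python. Only the JSON content comes from the snippet above; the helper name `write_no_scenario_file` and the output location are hypothetical, so point SCENARIOS at wherever the file actually ends up.

```python
import json
from pathlib import Path


def write_no_scenario_file(path):
    """Write a scenario file that defines no scenarios (hypothetical helper)."""
    # Same structure as the snippet above: a list holding one empty entry.
    content = {"available_scenarios": [{}]}
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(content, indent=4) + "\n")
    return path


if __name__ == "__main__":
    # The file location is an assumption, not something the repo prescribes.
    print(write_no_scenario_file("no_scenario.json"))
```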

> I'm currently trying to recreate your dataset locally

Trying to recreate the dataset on a single computer might be too slow. When we run these 3000 routes, we parallelize them across ~100 GPUs in our compute cluster. This gives an almost linear speedup in dataset generation, as the routes are all independent.
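Because the routes are independent, one way to spread them over workers is a simple round-robin split. This is only an illustrative sketch, not the unreleased cluster script: `shard_routes` and the route file names are hypothetical, and each resulting shard would still have to be passed to a data generation job by whatever scheduler is available.

```python
def shard_routes(route_files, num_workers):
    """Split route files round-robin into one list per worker (hypothetical helper)."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    # Sort first so that the assignment is the same on every run.
    ordered = sorted(route_files)
    return [ordered[i::num_workers] for i in range(num_workers)]


if __name__ == "__main__":
    # Made-up route names standing in for the real route files.
    routes = [f"route_{i:04d}.xml" for i in range(3000)]
    shards = shard_routes(routes, 100)
    print(len(shards), len(shards[0]))  # prints: 100 30
```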

> I'm basing all this on the assumption that the dataset is recreatable

It is, up to the specific routes that crashed during parallelization. The CARLA traffic manager is also nondeterministic (and can't be seeded in this version of the simulator), so you will get a similar dataset but not the same one.
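Since a regenerated dataset will be similar rather than identical, a rough sanity check is to compare folder structure and file counts instead of file contents. The sketch below assumes both copies use the same folder names, which may not hold for every subfolder; `count_files_per_dir` and `diff_counts` are hypothetical helpers, not part of the repo.

```python
import os


def count_files_per_dir(root):
    """Map each directory (relative to root) to the number of files directly in it."""
    counts = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        counts[os.path.relpath(dirpath, root)] = len(filenames)
    return counts


def diff_counts(reference, recreated):
    """Return {relative_dir: (reference_count, recreated_count)} where the two differ.

    A count of None means the directory is missing on that side.
    """
    ref = count_files_per_dir(reference)
    new = count_files_per_dir(recreated)
    return {
        d: (ref.get(d), new.get(d))
        for d in sorted(set(ref) | set(new))
        if ref.get(d) != new.get(d)
    }
```

An empty result means both trees have the same directories with the same number of files in each; routes that crashed on one side would show up as missing directories or lower counts.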

akselaase commented 1 year ago

That clears up the confusion, thank you very much! And yes, my poor laptop wouldn't be too happy about creating this dataset itself; this all happens on our cluster.

Really appreciate the quick response!

autonomousvision / transfuser

Difficulties recreating the dataset #145