Introduces the Model Parallel option: the model is split into 4 main components (stages) that can be distributed over multiple GPUs. The worker takes care of moving tensors from one device to the next. This makes it possible to run the model on multiple "small" GPUs, none of which has to hold the full model.
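To make the stage placement concrete, here is a minimal sketch of how 4 stages could be mapped onto the available GPUs, and where the worker would have to move tensors between devices. It is not the code from this PR: the stage names, `assign_stages`, and `transfer_points` are hypothetical, and the contiguous-block placement is an assumption. Devices are plain strings so the sketch runs without a GPU.

```python
from typing import Dict, List, Tuple

# Hypothetical stage names: the PR only states that there are 4 main stages.
STAGES: List[str] = ["stage_0", "stage_1", "stage_2", "stage_3"]


def assign_stages(stages: List[str], num_gpus: int) -> Dict[str, str]:
    """Map each stage to a device string, keeping neighbouring stages together.

    Assumption: stages are placed in contiguous blocks, so tensors cross
    devices as rarely as possible. GPUs beyond len(stages) stay unused.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    used = min(num_gpus, len(stages))
    return {
        stage: f"cuda:{index * used // len(stages)}"
        for index, stage in enumerate(stages)
    }


def transfer_points(stages: List[str], placement: Dict[str, str]) -> List[Tuple[str, str]]:
    """List consecutive stage pairs that sit on different devices.

    These are the points where a worker would move the output tensor
    to the next stage's device before calling it.
    """
    return [
        (current, following)
        for current, following in zip(stages, stages[1:])
        if placement[current] != placement[following]
    ]


if __name__ == "__main__":
    placement = assign_stages(STAGES, num_gpus=2)
    print(placement)
    print(transfer_points(STAGES, placement))
```

With 2 GPUs this placement puts the first two stages on `cuda:0` and the last two on `cuda:1`, so tensors change device only once per forward pass.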
Improves the NSFW filter removal, reducing memory use and speeding up inference.
Fixes seed setting for data parallel and the reading of environment variables.
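As an illustration of the kind of logic these fixes touch, the sketch below reads an integer setting from the environment and derives a reproducible seed per data-parallel replica. Everything here is an assumption rather than the PR's code: the variable name `DEMO_SEED`, the helper names, and the "base seed plus replica index" scheme are hypothetical, and the standard-library `random` module stands in for the framework's own generator.

```python
import os
import random


def read_int_env(name: str, default: int) -> int:
    """Return the integer value of environment variable `name`.

    Falls back to `default` when the variable is unset or empty, and
    raises a clear error when it is set to something that is not an integer.
    """
    raw = os.environ.get(name, "").strip()
    if not raw:
        return default
    try:
        return int(raw)
    except ValueError as exc:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from exc


def seed_for_replica(base_seed: int, replica_index: int) -> int:
    """Derive a per-replica seed (assumed scheme: base seed plus replica index)."""
    return base_seed + replica_index


def make_rng(base_seed: int, replica_index: int) -> random.Random:
    """Build an independent generator so replicas do not share global state."""
    return random.Random(seed_for_replica(base_seed, replica_index))


if __name__ == "__main__":
    base = read_int_env("DEMO_SEED", default=0)  # hypothetical variable name
    for replica in range(2):
        print(replica, make_rng(base, replica).random())
```

Giving each replica its own generator keeps runs reproducible while still letting replicas produce different samples; whether replicas should share or offset the seed is a design choice the description above does not specify.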