MIC-DKFZ / batchgenerators

A framework for data augmentation for 2D and 3D image classification and segmentation
Apache License 2.0

Except multiprocess, all things work well on Windows #23

Closed JunMa11 closed 5 years ago

JunMa11 commented 5 years ago

Dear DKFZ,

Thanks for the great repo.

I want to use this tool for offline augmentation on Win10, and I followed the code in examples/brats2017. Everything works well except the multiprocessing.

I have pasted the error information below. Would it be possible for you to tell me how to solve the problem? My goal is offline augmentation; I am not pursuing efficiency and only want it to work.

import numpy as np
from time import time

# assumes the BraTS2017 example helpers (get_list_of_patients,
# get_split_deterministic, BraTS2017DataLoader3D, get_train_transform)
# are defined earlier in this file, as in examples/brats2017
from batchgenerators.dataloading.multi_threaded_augmenter import MultiThreadedAugmenter

def main():
    brats_preprocessed_folder = r"Pathto\BraTS2017_preprocessed"

    num_threads_for_brats_example = 1        
    patients = get_list_of_patients(brats_preprocessed_folder)    
    train, val = get_split_deterministic(patients, fold=0, num_splits=2, random_state=12345)

    patch_size = (128, 128, 128)
    batch_size = 2

    dataloader = BraTS2017DataLoader3D(train, batch_size, patch_size, 1)

    batch = next(dataloader)

    # first let's collect all shapes, you will see why later
    shapes = [BraTS2017DataLoader3D.load_patient(i)[0].shape[1:] for i in patients] 
    max_shape = np.max(shapes, 0) 
    max_shape = np.max((max_shape, patch_size), 0)

    # artifacts
    dataloader_train = BraTS2017DataLoader3D(train, batch_size, max_shape, 1)

    tr_transforms = get_train_transform(patch_size)

    tr_gen = MultiThreadedAugmenter(dataloader_train, tr_transforms, num_processes=num_threads_for_brats_example,
                                    num_cached_per_queue=3,
                                    seeds=None, pin_memory=False)

    tr_gen.restart()

    num_batches_per_epoch = 2
    num_epochs = 1
    # let's run this to get a time on how long it takes
    time_per_epoch = []
    start = time()
    for epoch in range(num_epochs):
        start_epoch = time()
        for b in range(num_batches_per_epoch):
            batch = next(tr_gen)
            print(batch['data'][0].shape)
            # do network training here with this batch

        end_epoch = time()
        time_per_epoch.append(end_epoch - start_epoch)
    end = time()
    total_time = end - start
    print("Running %d epochs took a total of %.2f seconds with time per epoch being %s" %
          (num_epochs, total_time, str(time_per_epoch)))

if __name__ == '__main__':
    from multiprocessing import freeze_support
    freeze_support()
    main()

The following error occurred:

runfile('E:/Data/DataAug/BatchGenerator/brats2017_dataloader_3D.py')
Traceback (most recent call last):

  File "<ipython-input-1-4598568443c2>", line 1, in <module>
    runfile('E:/Data/DataAug/BatchGenerator/brats2017_dataloader_3D.py')

  File "D:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)

  File "D:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "E:/Data/DataAug/BatchGenerator/brats2017_dataloader_3D.py", line 229, in <module>
    main()

  File "E:/Data/DataAug/BatchGenerator/brats2017_dataloader_3D.py", line 198, in main
    tr_gen.restart()

  File "E:\Data\DataAug\BatchGenerator\batchgenerators\dataloading\multi_threaded_augmenter.py", line 254, in restart
    self._start()

  File "E:\Data\DataAug\BatchGenerator\batchgenerators\dataloading\multi_threaded_augmenter.py", line 224, in _start
    self._processes[-1].start()

  File "D:\ProgramData\Anaconda3\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)

  File "D:\ProgramData\Anaconda3\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)

  File "D:\ProgramData\Anaconda3\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)

  File "D:\ProgramData\Anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 33, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)

  File "D:\ProgramData\Anaconda3\lib\multiprocessing\spawn.py", line 172, in get_preparation_data
    main_mod_name = getattr(main_module.__spec__, "name", None)

AttributeError: module '__main__' has no attribute '__spec__'

I am looking forward to your reply.
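For context, the traceback points at Python's "spawn" start method, which Windows uses for multiprocessing: it re-imports the main module in every worker process and reads `__main__.__spec__`, which Spyder's runfile() leaves unset. A minimal guarded sketch of the pattern (toy names, not the batchgenerators API):

```python
import multiprocessing as mp

def worker(q):
    # runs in a child process; on Windows the "spawn" start method
    # re-imports this module there, so it must import cleanly
    q.put("ok")

def main():
    q = mp.Queue()
    p = mp.Process(target=worker, args=(q,))
    p.start()
    result = q.get()
    p.join()
    return result

if __name__ == '__main__':
    # Without this guard, spawn would re-execute the process-creating
    # code in every child. Spyder users hitting the __spec__ error above
    # may also need to run the script from a terminal (python script.py)
    # instead of runfile().
    print(main())
```

This does not fix the Spyder-specific `__spec__` issue, but running the same guarded script from a plain terminal avoids it.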

FabianIsensee commented 5 years ago

Mhm difficult. I don't have any experience with Windows. @justusschock maybe knows what to do?

Irrespective of this particular issue, I think data augmentation should not be done offline unless absolutely necessary. You can get a lot more variability if you do it online. BraTS is tough due to the four modalities, but you can train a 3D UNet with maybe ~5 CPU cores, no problem, when running data augmentation on the fly.

justusschock commented 5 years ago

That's strange. I tested this on two separate Windows machines and it works on both of them. I sometimes get a broken pipe and/or lock due to race conditions if I try to access the same hdf5 file at the same time. That said, I only got this issue with hdf5, meaning it should not be a general batchgenerators issue.

Can you maybe check if the same applies to you (and maybe create a gist containing a minimum working example to check on this)?

Are you maybe using additional custom multiprocessing within your loader/dataset?

EDIT: Why do you restart the generator directly after creating it?

FabianIsensee commented 5 years ago

Hi,

Why do you restart the generator directly after creating it?

That is on me; the code is something I wrote. See, if you initialize the MTA it will not start generating batches right away. It will only do so after you request the first batch OR if you restart it. It is just a habit, but I usually initialize the data augmentation pipeline, then initialize the network and so on. If I restart the MTA, it will already start generating batches while the main process is busy with other things, which is a little more efficient. But at the end of the day it doesn't matter, because a training can run for days, so a few seconds at the start won't change much.
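The lazy-start behavior described above can be sketched with a toy class (not the real MultiThreadedAugmenter, just the pattern):

```python
class LazyAugmenter:
    """Toy stand-in for MultiThreadedAugmenter's start semantics."""

    def __init__(self, generator):
        self.generator = generator
        self._started = False  # no workers launched yet

    def restart(self):
        # the real class (re)launches worker processes here; calling it
        # right after __init__ lets batch production overlap with other
        # setup work (building the network, etc.) in the main process
        self._started = True

    def __iter__(self):
        return self

    def __next__(self):
        if not self._started:
            self.restart()  # otherwise start lazily on first request
        return next(self.generator)
```

Either way the first batch arrives; an explicit restart() just moves the startup cost earlier.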

@JunMa11 how did you install your python environment? Conda?

justusschock commented 5 years ago

@FabianIsensee maybe you might want to include a restart at the end of the initialization? I think this would match the expected behavior better than having to restart it manually.

I also think the only thing that matters is how you installed the package itself, as conda simply provides an encapsulated environment as long as you don't install packages via conda (and I can confirm it works well with conda).

FabianIsensee commented 5 years ago

I've had people with random issues caused by conda environments. That's why I was asking =)

JunMa11 commented 5 years ago

@FabianIsensee Yes, I installed the python environment with conda.

JunMa11 commented 5 years ago

Hi, @justusschock thanks for your reply very much.

I only got this issue with hdf5 meaning it should not be a general batchgenerators issue. Can you maybe check if the same applies to you (and maybe create a gist containing a minimum working example to check on this)?

Sorry, I do not know what hdf5 means. I provide an ErrorDemo to reproduce my error. The environment is Win10, Python 3.6, Anaconda3-5.1.0-Windows-x86_64.

Are you maybe using additional custom multiprocessing within your loader/dataset?

I do not have experience with multiprocessing. Would it be possible for you to give me more insight?

JunMa11 commented 5 years ago

Hi, @FabianIsensee thanks for your comment on offline data augmentation.

Irrespective of this particular issue I think data augmentation should not be done offline unless absolutely necessary. You can get a lot more variability if you do it online.

I agree with you that online augmentation can obtain more variability. My motivation for offline data augmentation is the following.

Looking at my whole-tumor segmentation results on BraTS 2018, most cases get good results (Dice > 0.88), but a few "hard" cases get a very low Dice (0.6-0.7). I want to do some offline data augmentation for the cases with low Dice scores and also do online data augmentation during training. In this way, I hope the network can learn these "hard" cases better. Could you share your comments on this idea?

justusschock commented 5 years ago

I'll test this on Monday. Unfortunately, my local machine is running Linux.

If you are not familiar with multiprocessing, you most likely don't have a custom one. I thought you might have additional multiprocessing inside your BraTS2017DataLoader3D which may have caused the problem, but this does not seem to be the case, so never mind :)

What else do you have installed inside your environment?

Regarding the augmentation: maybe it would be worth considering weighted sampling together with online augmentation, to present hard cases more frequently?

JunMa11 commented 5 years ago

Hi @justusschock Thanks for your quick reply. These screenshots show the python packages in my environment.

Weighted sampling is a good idea that I missed. Thank you very much.

FabianIsensee commented 5 years ago

I agree with @justusschock . You should probably sample difficult cases more often rather than augmenting them offline.
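A minimal sketch of such weighted sampling, assuming per-case validation Dice scores are available (all names and numbers here are hypothetical):

```python
import numpy as np

# hypothetical per-patient validation Dice scores
patients = ["case_0", "case_1", "case_2", "case_3"]
dice = np.array([0.92, 0.65, 0.90, 0.70])

# harder cases (lower Dice) get a proportionally larger sampling weight
weights = 1.0 - dice
probs = weights / weights.sum()

# draw patient ids for a batch; hard cases appear more often on average
rng = np.random.RandomState(12345)
batch_ids = rng.choice(patients, size=8, p=probs, replace=True)
print(batch_ids)
```

A custom DataLoader's generate_train_batch could use such a draw instead of uniform sampling, so the online augmentation pipeline stays unchanged.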

justusschock commented 5 years ago

So I just got the time to test this, and I absolutely can't reproduce the error.

I tried the script you provided (which should be similar to the one by @FabianIsensee). The only thing I noticed: I had to clone the repo again manually, since you mixed the setup code with the actual implementation (probably you just copied it there to get the imports working without an install). After a clean clone and a clean install, everything worked like a charm (even with multiple epochs).

The steps I did are:

  1. Create a new conda environment: conda create -n batchgen_test python=3.6
  2. Activate the environment: conda activate batchgen_test
  3. Clone the repo: git clone https://github.com/MIC-DKFZ/batchgenerators (maybe this has to be executed in a git bash)
  4. Cd into the repo: cd batchgenerators
  5. Install the repo locally: pip install -e .
  6. Cd to the script to execute: cd YOUR/PATH/HERE
  7. Execute the script: python brats2017_dataloader_3D.py

and the output was:

python brats2017_dataloader_3D.py
(4, 128, 128, 128)
(4, 128, 128, 128)
(4, 128, 128, 128)
(4, 128, 128, 128)
(4, 128, 128, 128)
(4, 128, 128, 128)
Running 3 epochs took a total of 38.92 seconds with time per epoch being [25.76951551437378, 3.648458957672119, 9.5059654712677]

The time is not representative since I'm running some heavily CPU-consuming tasks in parallel.

Can you maybe try this and confirm if this works?

JunMa11 commented 5 years ago

Hi, @justusschock. I really appreciate your time and valuable help. Following your guidance, batchgenerators works well now. Thank you very much.