elephant-track / elephant-server

A server implementation of ELEPHANT
BSD 2-Clause "Simplified" License

Server crashes when training model before resetting it #4

Open tischi opened 3 years ago

tischi commented 3 years ago

(see title)

Here is the error:

GPU is available
auto_bg_thresh: 0
c_ratio: 0.3
class_weights: [1, 10, 10]
crop_box: None
crop_size: [16, 128, 128]
dataset_name: elephant-demo
debug: False
device: cuda
false_weight: 3
is_livemode: False
is_pad: False
keep_axials: (True, True, True, False)
log_dir: /workspace/logs/seg_log
lr: 5e-05
model_path: /workspace/models/seg.pth
n_crops: 3
n_epochs: 3
output_prediction: False
p_thresh: None
patch_size: None
r_max: None
r_min: None
rotation_angle: 0
scale_factor_base: 0
scales: [2.48, 0.3119629, 0.3119629]
timepoint: 0
use_median: None
zpath_input: /workspace/datasets/elephant-demo/imgs.zarr
zpath_seg_label: /workspace/datasets/elephant-demo/seg_labels.zarr
zpath_seg_label_vis: /workspace/datasets/elephant-demo/seg_labels_vis.zarr
zpath_seg_output: /workspace/datasets/elephant-demo/seg_outputs.zarr
[2021-05-11 18:00:15,663] ERROR in app: Exception on /train/seg [POST]
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "./main.py", line 544, in train_seg
    config.device)
  File "/usr/local/lib/python3.7/site-packages/elephant/models.py", line 313, in load_seg_models
    checkpoint = torch.load(model_path)
  File "/usr/local/lib/python3.7/site-packages/torch/serialization.py", line 525, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/usr/local/lib/python3.7/site-packages/torch/serialization.py", line 212, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/usr/local/lib/python3.7/site-packages/torch/serialization.py", line 193, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/workspace/models/seg.pth'
[pid: 10865|app: 0|req: 1/1] 127.0.0.1 () {40 vars in 519 bytes} [Tue May 11 18:00:15 2021] POST /train/seg => generated 290 bytes in 46 msecs (HTTP/1.1 500) 2 headers in 99 bytes (1 switches on core 0)
127.0.0.1 - - [11/May/2021:18:00:15 +0000] "POST /train/seg HTTP/1.1" 500 290 "-" "unirest-java/3.1.00" "-"
[pid: 10864|app: 0|req: 1/2] 127.0.0.1 () {40 vars in 517 bytes} [Tue May 11 18:00:38 2021] POST /reset/seg => generated 40 bytes in 4 msecs (HTTP/1.1 500) 2 headers in 90 bytes (1 switches on core 0)
127.0.0.1 - - [11/May/2021:18:00:38 +0000] "POST /reset/seg HTTP/1.1" 500 40 "-" "unirest-java/3.1.00" "-"
[pid: 10864|app: 0|req: 2/3] 127.0.0.1 () {40 vars in 517 bytes} [Tue May 11 18:00:47 2021] POST /reset/seg => generated 40 bytes in 0 msecs (HTTP/1.1 500) 2 headers in 90 bytes (1 switches on core 0)
127.0.0.1 - - [11/May/2021:18:00:47 +0000] "POST /reset/seg HTTP/1.1" 500 40 "-" "unirest-java/3.1.00" "-"

It would be nice if this were handled in a way that does not crash the server (the ELEPHANT client now says that training is in progress, and one cannot do anything).
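To illustrate the kind of handling meant here, a minimal sketch follows. It assumes that the server keeps some "training in progress" status that is never cleared when the request fails, which is only a guess based on what the client shows. `TrainingState`, `TrainingBusy`, and `handle_train_request` are hypothetical names, not part of ELEPHANT, and the HTTP status codes are one possible choice.

```python
import threading
from contextlib import contextmanager


class TrainingBusy(Exception):
    """Raised when a training run is requested while another one is active."""


class TrainingState:
    """Hypothetical stand-in for the server's 'training in progress' status."""

    def __init__(self):
        self._lock = threading.Lock()
        self.in_progress = False

    @contextmanager
    def running(self):
        # Refuse to start a second run while one is active.
        with self._lock:
            if self.in_progress:
                raise TrainingBusy("training is already in progress")
            self.in_progress = True
        try:
            yield
        finally:
            # Runs on success and on failure, so the flag never stays stuck.
            with self._lock:
                self.in_progress = False


def handle_train_request(state, train_fn):
    """Run train_fn and return a (body, http_status) pair instead of raising."""
    try:
        with state.running():
            train_fn()
    except TrainingBusy as e:
        return {"error": str(e)}, 409
    except FileNotFoundError as e:
        return {"error": f"model file not found: {e.filename}"}, 400
    return {"status": "completed"}, 200
```

With something like this, a missing model file would produce an error message for the client, and the next request would not be blocked by a stale status.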

tischi commented 3 years ago

When I tried restarting the server I got this error message:

zpath_input: /workspace/datasets/elephant-demo/imgs.zarr
Traceback (most recent call last):
  File "./main.py", line 662, in reset_seg_models
    init_seg_models(config)
  File "/usr/local/lib/python3.7/site-packages/elephant/common.py", line 670, in init_seg_models
    input_shape = zarr.open(config.zpath_input, mode='r').shape[-3:]
  File "/usr/local/lib/python3.7/site-packages/zarr/convenience.py", line 102, in open
    err_path_not_found(path)
  File "/usr/local/lib/python3.7/site-packages/zarr/errors.py", line 29, in err_path_not_found
    raise ValueError('nothing found at path %r' % path)
ValueError: nothing found at path ''

[pid: 12268|app: 0|req: 2/2] 127.0.0.1 () {40 vars in 517 bytes} [Tue May 11 18:07:30 2021] POST /reset/seg => generated 48 bytes in 120 msecs (HTTP/1.1 500) 2 headers in 90 bytes (1 switches on core 0)
127.0.0.1 - - [11/May/2021:18:07:30 +0000] "POST /reset/seg HTTP/1.1" 500 48 "-" "unirest-java/3.1.00" "-"

ksugar commented 3 years ago

It would be nice if this were handled in a way that does not crash the server (the ELEPHANT client now says that training is in progress, and one cannot do anything).

I will work on it, thanks! If the model file is not found, the ELEPHANT client will show a dialog along the lines of "Model file xxx.pth not found. Would you like to create it?".

ksugar commented 3 years ago

When I tried restarting the server I got this error message:

zpath_input: /workspace/datasets/elephant-demo/imgs.zarr
...

It seems that the dataset is missing after restarting. I have not prepared the docs for it but you can make the data persistent by mounting Google Drive.
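A rough sketch of what mounting Google Drive could look like in the Colab notebook. `drive.mount` is the standard Colab call. The `MyDrive/elephant-workspace` folder name and the idea of symlinking `/workspace` to it are assumptions for illustration, not something the ELEPHANT notebook is confirmed to do.

```python
import os


def mount_drive_if_colab(mount_point="/content/drive"):
    """Mount Google Drive when running inside Colab; return True if mounted."""
    try:
        from google.colab import drive  # only importable inside a Colab runtime
    except ImportError:
        return False
    drive.mount(mount_point)
    return True


def link_workspace(persistent_dir, workspace="/workspace"):
    """Point `workspace` at `persistent_dir` via a symlink, unless it already exists."""
    os.makedirs(persistent_dir, exist_ok=True)
    if not os.path.lexists(workspace):
        os.symlink(persistent_dir, workspace)
    return os.path.realpath(workspace)


if __name__ == "__main__":
    # Hypothetical Drive folder; adjust to taste.
    if mount_drive_if_colab():
        link_workspace("/content/drive/MyDrive/elephant-workspace")
```

This would have to run before the cell that creates `/workspace`, since an existing directory is left untouched. Reading and writing through Drive may also be slower than local disk.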

tischi commented 3 years ago

If the model file is not found, the ELEPHANT client will show a dialog along the lines of "Model file xxx.pth not found. Would you like to create it?".

In which case would I answer this question with "No"? If the answer should always be "Yes", why not just do it?

It seems that the dataset is missing after restarting. I have not prepared the docs for it but you can make the data persistent by mounting Google Drive.

It is already quite tedious to start the server. I think we should avoid any extra step. Could the code be changed such that the data set is not missing?

ksugar commented 3 years ago

In which case would I answer this question with "No"? If the answer should always be "Yes", why not just do it?

Although it does not happen very often, there is a possibility that the user may enter an incorrect file name. However, as you mentioned, a workflow that automatically initializes the file if it does not exist would be sufficient.
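As a sketch of that "initialize if missing" workflow, something like the following could work. `load_or_init_checkpoint` is a hypothetical helper, not existing ELEPHANT code. The pickle functions are stand-ins so the example runs without PyTorch; in the server, `torch.load` and `torch.save` would presumably take their place, since they use the same argument order.

```python
import os
import pickle


def pickle_load(path):
    # Stand-in loader (assumption: the server would use torch.load here).
    with open(path, "rb") as f:
        return pickle.load(f)


def pickle_save(obj, path):
    # Stand-in saver (assumption: the server would use torch.save here).
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_or_init_checkpoint(model_path, init_fn, load_fn=pickle_load, save_fn=pickle_save):
    """Return (checkpoint, created); create the file via init_fn if it is missing."""
    if os.path.isfile(model_path):
        return load_fn(model_path), False
    # Make sure the target directory exists before writing the new file.
    os.makedirs(os.path.dirname(model_path) or ".", exist_ok=True)
    checkpoint = init_fn()
    save_fn(checkpoint, model_path)
    return checkpoint, True
```

The `created` flag lets the caller tell the client that a fresh model was initialized, which covers the mistyped file name case without needing a dialog.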

It is already quite tedious to start the server. I think we should avoid any extra step. Could the code be changed such that the data set is not missing?

I think that this is a problem on Google Colab, not a problem that can be controlled by ELEPHANT's code. If you interrupt the last cell and run it again, the data set should be retained. However, once the connection has been lost, you might need to explicitly run the following cell again. (As far as I can confirm, this is not necessary unless you explicitly run [Runtime > Factory reset runtime].)

# Set up dirs
!mkdir -p /workspace/models
!mkdir -p /workspace/datasets/elephant-demo
# Download files
!curl -L "https://github.com/elephant-track/elephant-server/releases/download/v0.1.0/elephant-demo_seg.pth" \
  -o /workspace/models/elephant-demo_seg.pth
!curl -L "https://zenodo.org/record/4549193/files/elephant-demo.h5?download=1" \
  -o /workspace/datasets/elephant-demo/elephant-demo.h5
!curl -L "https://zenodo.org/record/4549193/files/elephant-demo.xml?download=1" \
  -o /workspace/datasets/elephant-demo/elephant-demo.xml
# Run script
!python /opt/elephant/script/dataset_generator.py \
  --uint16 /workspace/datasets/elephant-demo/elephant-demo.h5 /workspace/datasets/elephant-demo

ksugar commented 3 years ago

I'm checking the error again. The following error looks a little weird.

ValueError: nothing found at path ''

It implies that config.zpath_input is empty. Could you reproduce the error? It would be helpful if you could paste the whole error text.
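For reference, a check along these lines before opening the dataset would make the two cases (empty path vs. missing dataset) easier to tell apart. This is only a sketch; `require_dataset_path` is a hypothetical helper and does not exist in the ELEPHANT code.

```python
import os


def require_dataset_path(zpath_input):
    """Return the path unchanged, or raise ValueError naming the actual problem."""
    if not zpath_input:
        # Covers both None and '' (the case seen in the traceback).
        raise ValueError("zpath_input is empty; the request did not carry a dataset path")
    if not os.path.exists(zpath_input):
        raise ValueError(f"dataset not found at {zpath_input!r}; it may need to be regenerated")
    return zpath_input
```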

tischi commented 3 years ago

OK! If the error occurs again I will post the whole message!

However, as you mentioned, a workflow that automatically initializes the file if it does not exist would be sufficient.

I cannot judge all the details here! Whatever you think makes more sense! The most important thing is that it does not crash :)

I think that this is a problem on Google Colab, not a problem that can be controlled by ELEPHANT's code.

It's really funky that Google loses the data in that way, because I would imagine that it's simply stored on the machine where the code is running. But maybe they have some fancier setup...

In fact, if that's the case I'd be curious to learn how to host the data on Google Drive. Could you let me know (or add it to the documentation)?