AllenCellModeling / pytorch_fnet

Three dimensional cross-modal image inference

Problem training fnet in google colab #145

Closed lucpaul closed 4 years ago

lucpaul commented 4 years ago

Hi, as requested I am posting this as an issue, although I am aware that the problem is already being dealt with. I am stuck at the training step of fnet in Google Colab. As a test, I downloaded only the tubulin dataset by editing download_all.sh, which worked without issues. The problem arises when I run the command to train a model (train_model.sh), see below. Training does not appear to start, and after the cell has run I cannot locate the saved model anywhere, even when I change the save directory in train_model.sh. The cell also finishes in seconds, although I would expect it to take at least a few minutes or hours, yet I don't get an error message. In addition, split_dataset.py and train_model.py are not in the directories that the original train_model.sh calls, so I changed the paths to the folders I found them in: fnet/utils/ for split_dataset.py and fnet/cli/ for train_model.py. Could that cause issues? When trying to run prediction I could not locate the output directory either, but again I did not get an error message. Below is a screenshot of the cell I tried to run and the output from Colab. Thanks for your support with this issue.

[Screenshot: colab_screenshot_fnet_training]

linhuawang commented 4 years ago

Same issue here... I cannot find the corresponding scripts in the train_model.sh folder. The scripts finish in seconds, without any error message and obviously without any training. I figured out that there is no main function in train_model.py in the master/release_1 branch. I don't know if I used the scripts from the wrong branch. I also followed the instructions in the README to run the "fnet train" command and tried the examples provided, but neither worked for me.
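Editor's note: the symptom described above (a script that finishes in seconds, prints nothing, and raises no error) is what Python does when a file only defines functions and never calls them. The sketch below reproduces that behaviour with a toy script. It is an illustration only: `toy_train.py` and its contents are hypothetical and are not taken from the fnet repository, and the thread does not confirm that a missing entry point was the cause here.

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical toy script: defines main() but never calls it.
SCRIPT_WITHOUT_ENTRY_POINT = textwrap.dedent("""
    def main():
        print("training started")
""")

# Same toy script with the usual entry-point guard appended.
SCRIPT_WITH_ENTRY_POINT = SCRIPT_WITHOUT_ENTRY_POINT + textwrap.dedent("""
    if __name__ == "__main__":
        main()
""")


def run_script(source):
    """Write source to a temporary file, run it, and return (exit code, stdout)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "toy_train.py"
        path.write_text(source)
        result = subprocess.run(
            [sys.executable, str(path)], capture_output=True, text=True
        )
    return result.returncode, result.stdout


silent = run_script(SCRIPT_WITHOUT_ENTRY_POINT)
working = run_script(SCRIPT_WITH_ENTRY_POINT)
print("without entry point:", silent)
print("with entry point:", working)
```

In both cases the exit code is 0, so a shell script or notebook cell wrapping the call reports success either way. Only the captured output shows whether anything actually ran.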

gregjohnso commented 4 years ago

Getting a functioning demo is #1 priority. Expect something before the end of the week.

gregjohnso commented 4 years ago

I have a pull request for a functioning demo. If you'd like to look at it before it's merged into master, please see examples/download_and_train.py in this branch: https://github.com/AllenCellModeling/pytorch_fnet/tree/BUGFIX/download_and_train

linhuawang commented 4 years ago

Figured out the issue is that I git cloned the wrong branch. release_1 works for me.

gregjohnso commented 4 years ago

The download_and_train Bugfix has been pushed into master, @lucpaul, could you give it a look?

lucpaul commented 4 years ago

Hi, I gave it another go, but I seem to have the same problem as before. If I use the unmodified train_model.sh script, it fails to find train.py and split_dataset.py. After I change train_model.sh to point to the directories these files are actually in, it gives me the same short output, neither starting training nor throwing an error. Here is the link to the notebook I'm trying to run: https://colab.research.google.com/drive/1ZJCI2p66noTaLCnVUQJkTR16ig6GAqAx

If it's in the master branch now, I should be able to just git clone it in the standard way, right? I hope if I made a trivial error it should be easy to spot in the colab. Thanks for the support.

gregjohnso commented 4 years ago

@lucpaul could you try to run the download_and_train.py file in the examples directory?

lucpaul commented 4 years ago

Thank you so much, training seems to work now. I trained it for fewer iterations (10000) as a test today, but it certainly looks promising. The output of predict.py still doesn't look right, though: I can't see any images. Attached is a screenshot. Does this have to do with the shorter training, or perhaps with the way I am trying to display the images?

[Attached screenshot]

gregjohnso commented 4 years ago

I'd recommend inspecting the pixel values in the image to confirm that they are within the display range of your viewer.
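Editor's note: one way to follow this suggestion is to print the dtype and value range of the loaded array, then stretch it to [0, 1] before displaying it. The sketch below assumes NumPy is available and uses a synthetic array as a stand-in for a prediction. The helper names `describe_pixels` and `rescale_for_display` are hypothetical, and the values are not fnet output.

```python
import numpy as np


def describe_pixels(img):
    """Return dtype, shape, min, max and mean of an image array."""
    arr = np.asarray(img)
    return {
        "dtype": str(arr.dtype),
        "shape": arr.shape,
        "min": float(arr.min()),
        "max": float(arr.max()),
        "mean": float(arr.mean()),
    }


def rescale_for_display(img, low_pct=1.0, high_pct=99.0):
    """Linearly map the [low_pct, high_pct] percentile range to [0, 1]."""
    arr = np.asarray(img, dtype=np.float64)
    lo, hi = np.percentile(arr, [low_pct, high_pct])
    if hi <= lo:  # constant image: nothing to stretch
        return np.zeros_like(arr)
    return np.clip((arr - lo) / (hi - lo), 0.0, 1.0)


# Synthetic stand-in for a predicted volume (assumption: float data with
# values outside the 0..1 or 0..255 range that many viewers expect).
rng = np.random.default_rng(0)
prediction = rng.normal(loc=0.0, scale=1.0, size=(4, 64, 64)).astype(np.float32)

stats = describe_pixels(prediction)
print(stats)

display_ready = rescale_for_display(prediction)
print(display_ready.min(), display_ready.max())
```

If the printed minimum and maximum fall far outside what the viewer expects, a blank or saturated image is likely a display problem and not a training problem. The rescaled array can then be passed to the viewer one 2D slice at a time.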

gregjohnso commented 4 years ago

@lucpaul were you able to solve your display issue?

lucpaul commented 4 years ago

Yes. I attempted it with release_1 though. I have yet to go back to the current master branch. But it worked fine with io.imread instead of plt.imread, so just a minor thing. Fnet release_1 works fine now on my own data, too.

gregjohnso commented 4 years ago

Ok cool. I'm going to close the issue. Please reopen it if you need to.