Error saving model at the end of training epoch

SydneyBioX / BIDCell

Biologically-informed deep learning for cell segmentation of subcellular spatial transcriptomics data


Error saving model at the end of training epoch #4

Closed pakiessling closed 8 months ago

pakiessling commented 9 months ago

Hi,

I am testing BIDCell on a small subset of my MERFISH data using the provided merfish.yaml.

When I set total_steps for training to 60 everything works as expected, but when I set it to 2000 or 4000 I receive the error:

line 127, in predict
    checkpoint = torch.load(load_path)
No such file or directory: '/model_outputs/2023_09_30_14_37_00/models/epoch_1_step_4000.pth'

and the output directory only contains epoch_1_step_0.pth.

I also tried training for 2 epochs of 2000 steps each, but again only epoch_1_step_0.pth and epoch_2_step_0.pth were saved.

Any idea?

xhelenfu commented 8 months ago

Hi, the number of training patches in your dataset may be fewer than 2000. You can try to set the number of steps to be no larger than the number of training patches available.
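A minimal sketch of that check, assuming you already know how many training patches your dataset yields. The names `cap_total_steps` and `n_train_patches` are hypothetical and not part of BIDCell; this only illustrates the rule "steps should not exceed available training patches".

```python
def cap_total_steps(total_steps: int, n_train_patches: int) -> int:
    """Return a step count that does not exceed the number of training patches.

    Both arguments are illustrative inputs supplied by the caller (assumption);
    nothing here reads BIDCell's config or data.
    """
    if n_train_patches <= 0:
        raise ValueError("no training patches available")
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Cap the request so the final step still corresponds to a real patch.
    return min(total_steps, n_train_patches)
```

For example, requesting 4000 steps with only 1500 training patches would be capped at 1500, while a request for 60 steps is left unchanged.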

pakiessling commented 8 months ago

Hi @xhelenfu,

My image is 12,000 x 12,000 and my patch size is 64.

Does that mean I should set my training parameters to 1 epoch of (12,000 / 64) steps? And the same for testing?

Thank you!

xhelenfu commented 8 months ago

For the datasets we have used, we found that 4000 total steps gave good results. It may be worth starting from a similar subset of available patches. By default, up to 80% of the available patches can currently be used for training. test_step can be set to a value that corresponds to a model saved during training (under model_outputs/{timestamp}/models).
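To see which values are valid for test_step, one option is to list the checkpoints that were actually written. The sketch below assumes the file naming seen earlier in this thread (epoch_{epoch}_step_{step}.pth); `list_checkpoints` and `has_checkpoint` are hypothetical helpers, not BIDCell functions.

```python
import re
from pathlib import Path

# Assumed naming scheme, taken from the file names reported in this issue.
CKPT_PATTERN = re.compile(r"epoch_(\d+)_step_(\d+)\.pth")


def list_checkpoints(models_dir):
    """Return sorted (epoch, step) pairs parsed from checkpoint file names."""
    directory = Path(models_dir)
    if not directory.is_dir():
        return []
    found = []
    for path in directory.glob("*.pth"):
        match = CKPT_PATTERN.fullmatch(path.name)
        if match:
            found.append((int(match.group(1)), int(match.group(2))))
    return sorted(found)


def has_checkpoint(models_dir, epoch, step):
    """Check whether a checkpoint for the given epoch and step exists."""
    return (epoch, step) in list_checkpoints(models_dir)
```

Calling `list_checkpoints` on the models folder of a run (model_outputs/{timestamp}/models) returns the (epoch, step) pairs that exist, so you can pick a test_step from that list instead of guessing.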

pakiessling commented 8 months ago

Thanks, that was indeed the issue. Runs fine now :)

Last question: the critical output file is /results/cell_gene_matrices/2023_10_31_09_32_29/expr_mat.csv, correct?

Ah it is in the README <.<