Closed. JLrumberger closed this pull request 1 year ago.
In terms of checking that this didn't inadvertently break anything, here are some options:
Let me know what you think, or if you have other ideas for good sanity checks.
There seems to be a considerable performance gap between baseline models trained with the old and the new codebase, and I am trying to find the root cause. It seems to be a problem with the newly generated tfrecord dataset that includes the membrane and nuclei channels. I ran two identical models with the new codebase on the new and the old dataset, and they produced pretty different results.
Checked the datasets and they are identical except for the newly introduced nuclei and membrane channels.
So could it be a problem with how those channels are constructed? You’re still seeing a difference in performance for the same model on the old and new dataset?
All these models use the same training hyperparameters and these are my observations:
I am a bit suspicious now that it could be due to multi-GPU training. All models with the new codebase are trained on 4 GPUs.
I agree, we should try and change as few things as possible given that this PR is already so big.
Were the models trained on the new dataset also using the nuclei/membrane channel? Or it was generated but not used during training?
The models were trained without the nuclei/membrane channel.
Okay, so to summarize:
- The code was changed to generate a nuclei and membrane channel, rather than just the target channel. However, the nuclei and membrane channels weren't used during training, and the rest of the dataset looks identical.
- The code was changed to allow tracking of multiple different datasets. Even with only a single dataset, the output now sometimes looks much worse.
- The models were trained on multiple GPUs, whereas before they were only trained on a single GPU.
What's your hypothesis for why new codebase + new dataset looks different from new codebase + old dataset, given that the datasets are identical? Could this just be random noise from one bad training run?
Yep, I hope it's because of the initialization. I restarted all setups on a single GPU, so tomorrow I'll know more.
Sounds good!
Oh god, it was the multi-GPU training. The loss for the baseline models looks normal when trained on a single GPU. PromixNaive models look as good as before. I just have one small change to commit, and then you can review and we can merge it in.
In addition to the training loss, will you run the test metrics and make sure there aren't any differences in performance?
Which of the above scenarios did you test out? Keeping the TONIC dataset the same, but training with the multi-dataset code?
I tested the following scenarios:
- train baseline on tonic only with old dataset, f1=0.694
- train baseline on tonic only with new dataset, f1=0.688
- train baseline on tonic and decidua, sampling=[0.9999, 0.0001], f1=0.668
- train baseline on tonic and tonic, sampling=[0.5, 0.5], f1=0.7214
- train promix on tonic only with new dataset, f1=0.656
- train promix on tonic and decidua, sampling=[0.9999, 0.0001], f1=0.633
- train promix on tonic, decidua, msk_colon, sampling=[0.34, 0.33, 0.33], f1=0.634
So far I have only looked at the validation loss and compared it to older models. I can calculate validation metrics and look for differences there, but it should look alright given that the validation loss is similar to before.
Okay great! Sounds like you rooted out the problem. I think it'll just be better to be super sure that there aren't any lingering issues before we move forward.
@ngreenwald I am happy with this PR now.
What is the purpose of this PR?
This PR closes #39 by adding multi-dataset training for `PromixNaive` and `ModelBuilder`. This PR also makes it possible to define the constituents of the input data explicitly (i.e. `[marker_channel, binary_mask]` as input, or `[marker_channel, binary_mask, nuclei_img, membrane_img]`), so that we can do experiments on what works best here.

How did you implement your changes
Multi dataset training

I changed many small parts in `ModelBuilder` and `PromixNaive` regarding dataset preparation. Before, all this code worked on single datasets, whereas now it works on lists of datasets. The training datasets are in the end collapsed via `tf.data.Dataset.sample_from_datasets` into a single `train_dataset` that samples from all its constituent datasets. In addition, I made `PromixNaive.class_wise_loss_selection` calculate percentile thresholds for every marker and every dataset individually. Finally, I added multi-dataset capabilities to `evaluation_scrip.py`.
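For reference, here is a minimal sketch of that collapsing step; the dataset contents, weights, and batch size are placeholders for illustration, not the PR's actual pipeline code.

```python
import tensorflow as tf

# Stand-ins for the per-dataset pipelines (e.g. tonic, decidua); the real
# constituent datasets yield example dicts rather than plain tensors.
tonic = tf.data.Dataset.from_tensor_slices(tf.zeros([100, 4])).repeat()
decidua = tf.data.Dataset.from_tensor_slices(tf.ones([100, 4])).repeat()

# Collapse the list of datasets into one train_dataset; the weights play the
# same role as the sampling=[...] values in the scenarios listed above.
train_dataset = tf.data.Dataset.sample_from_datasets(
    [tonic, decidua], weights=[0.5, 0.5], seed=42
)
train_dataset = train_dataset.batch(8).prefetch(tf.data.AUTOTUNE)
```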
Training with additional input channels

The static class functions `ModelBuilder.prep_batches` and `PromixNaive.prep_batches_promix` are now defined in the class functions `ModelBuilder.gen_prep_batches_fn` and `PromixNaive.gen_prep_batches_promix_fn`. The latter functions take in a list of constituent channel names (i.e. "mplex_img", "binary_mask", "nuclei_img", "membrane_img") and return a function that takes an `example` dict and returns batches where the input data consists of the constituent channels.
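A rough sketch of that closure pattern follows; the channel keys and the target key (`"activity"`) are assumptions for illustration and may not match the actual implementation.

```python
import tensorflow as tf

def gen_prep_batches_fn(keys=("mplex_img", "binary_mask")):
    """Return a prep_batches function that assembles inputs from the given channels."""
    def prep_batches(example):
        # Concatenate the requested constituent channels along the channel axis
        # to build the model input, and pull the training target out of the example.
        inputs = tf.concat([example[key] for key in keys], axis=-1)
        targets = example["activity"]  # assumed target key, illustration only
        return inputs, targets
    return prep_batches

# Usage: build the mapping function once, then apply it to the dataset.
# prep_fn = gen_prep_batches_fn(["mplex_img", "binary_mask", "nuclei_img", "membrane_img"])
# train_dataset = train_dataset.map(prep_fn, num_parallel_calls=tf.data.AUTOTUNE)
```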
Fixed some bugs in metrics.py

The function `calc_metrics` threw errors when some of the cells had label == 2 in the ground-truth activity. I added code that excludes these cells from metric calculation (see the sketch at the end of this description).

Remaining issues

There could remain some issues when using this code in production, since I touched many things for this PR.
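To illustrate the exclusion described under "Fixed some bugs in metrics.py", the sketch below shows one way such cells could be dropped before metrics are computed; the function name and data layout are assumptions, not the actual `calc_metrics` code.

```python
import numpy as np
from sklearn.metrics import f1_score

def drop_excluded_cells(y_true, y_pred, exclude_label=2):
    """Remove cells whose ground-truth activity equals exclude_label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    keep = y_true != exclude_label
    return y_true[keep], y_pred[keep]

# Cells with ground-truth label 2 are excluded before computing the metric.
y_true, y_pred = drop_excluded_cells([0, 1, 2, 1], [0, 1, 1, 0])
print(f1_score(y_true, y_pred))
```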