Closed: kirbytran12324 closed this issue 4 months ago.
Hey, I had a chance to glance over your code. I did not execute it though. I have some follow-up questions:
`regression_lightning.py`: why the `squeeze()` in this line of the script?

`loss = self.loss_func(y_pred.squeeze(), y)`
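For context, this is the kind of shape mismatch a `squeeze()` like that usually papers over; the sizes below are made up for illustration:

```python
import torch

# Made-up sizes: the model output keeps a singleton channel dimension,
# the target does not, so the loss would see (4, 1, 64, 64) vs (4, 64, 64).
y_pred = torch.zeros(4, 1, 64, 64)
y = torch.zeros(4, 64, 64)

squeezed = y_pred.squeeze()  # drops the size-1 dim -> (4, 64, 64)
# Caveat: a bare squeeze() also drops a batch dimension of size 1,
# so squeeze(1) would be the safer choice here.
```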
No problem using Copilot, but make sure that you also understand it and verify that what it said was true ;)
`unet_precip_regression_lightning.py`

Since this is my first project doing machine learning, I would rather have a bit of guidance as to which parts are required; this is also what I presumed SmaAt-UNet uses. I do touch upon `regression_lightning.py`, but more to test on my own data, at

`loss = loss_func(model(xb.to(dev)), yb.to(dev))`

and the value set to it is

`loss_func = nn.CrossEntropyLoss().to(dev)`

I see that in `regression_lightning.py` the loss function is shared between all UNet-based models, is that right? My initial impression was that those parts were only for training U-Net models that aren't SmaAt, so I paid less attention to them than to the rest.

`tiff_to_h5.py`
and then used `create_dataset.py` to apply the necessary modifications. I then used the site https://myhdf5.hdfgroup.org/ to view the result, mostly to check if anything was going wrong. When I composed the original hdf file `demo.h5`, everything looked fine, but when I made the child hdf files I kept seeing "memory access out of bounds" or read-permission problems: https://drive.google.com/file/d/1VhEnEbh1TvxUVeHxfORJ-BVjPCQ4uZEX/view?usp=sharing (I linked my whole dataset here since it's not that big, but still too large to put in a zip file for GitHub). However, the main question I have is whether `create_dataset.py` or `tiff_to_h5.py` is responsible, or if it's just a limitation of the viewing programme.

Thanks for the answers.
`unet_precip_regression_lightning.py`. There, SmaAt-UNet is `UNetDS_Attention` (this was the name describing what the model is composed of before I came up with SmaAt-UNet ;) ).

About using `.to(dev)` as much as possible: this is old-school and not necessary anymore. Lightning takes care of that when it is done once at the beginning of training. Yes, the loss is the same for all models. Note that the `test_step` function has the hard-coded denormalizing value 47.83 in there. This is something you would need to adapt when using a different dataset.

`README.md`. Why do you think it is inefficient? This could be done either in the pre-processing step or on-the-fly in the dataloader.

`.h5` file: have a look at it with the python package https://docs.h5py.org/en/stable/index.html. Just note that I used this data format since the images already were in hdf format; before this, I was used to normal images such as jpg, png or similar. You may be better off using your tiff files instead of converting them to a single `.h5` file. Here is a good start for writing a data loader: https://lightning.ai/docs/pytorch/stable/data/datamodule.html

Hope this helps!
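To make the data-loader suggestion concrete, here is a rough, untested sketch of the `Dataset` half of such a data module. Every name, shape, and parameter here is hypothetical, and random tensors stand in for real tiff files (in practice you would load them with e.g. `tifffile` or PIL):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class PrecipTiffDataset(Dataset):
    """Sketch: yields (input_frames, target_frame) pairs from a stack
    of precipitation maps of shape (time, H, W). All names are made up."""

    def __init__(self, frames, num_input_frames=12, max_value=260.0):
        self.frames = frames / max_value  # min-max normalize (min is 0)
        self.num_input = num_input_frames

    def __len__(self):
        # each sample needs num_input inputs plus one target frame
        return self.frames.shape[0] - self.num_input

    def __getitem__(self, idx):
        x = self.frames[idx : idx + self.num_input]  # (num_input, H, W)
        y = self.frames[idx + self.num_input]        # (H, W)
        return x, y

# Toy usage with random data standing in for real tiffs
frames = torch.rand(20, 80, 240)
ds = PrecipTiffDataset(frames, num_input_frames=12)
x, y = ds[0]
loader = DataLoader(ds, batch_size=2, shuffle=True)
xb, yb = next(iter(loader))
```

The same `__getitem__` logic would then go inside a `LightningDataModule` as described in the linked docs.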
So, a bit of an update on what I've done so far. I cropped the dataset to 240x80, which should be adequate for the whole downscaling process, and testing the forward function seems to work well, so I think I'm good on this front. However, I still find myself unable to train the model, with `train_loss_step`, `val_loss` and `train_loss_epoch` all shown as `nan.0`, and everything stops at epoch 0. Either I'm still loading the dataset incorrectly or there's still something wrong with the dataset itself. I am still thinking about using `.h5`, as I figure this is the better way to store all the data in one place.

As for the denormalizing value, I know that I should use min-max normalization to find it, but honestly, with the min being 0 and the max being 260, I don't know how to apply this in the formula. Also, a small note: since my dataset is measured in mm/h and not mm/5min, I assume I can just remove `* 12` from the return step. I couldn't find any other sections related to this yet, though, so there might be some spots that I missed.

SmaAt-UNet.zip: https://drive.google.com/file/d/1VHpOYi7kU3AJa1Zu4s6TGHJtc5juccPb/view?usp=sharing (public dataset)
I cropped the dataset to 240x80, which should be adequate for the whole downscaling process, and testing the forward function seems to work well, so I think I'm good on this front.
Great!
However, I still find myself unable to train the model, with train_loss_step, val_loss and train_loss_epoch all shown as nan.0, and everything stops at epoch 0
Hmm, I can't open the zip file. Wouldn't it be easier to just create a git repository and send me a link to it? You can even fork from this repository.
As for the denormalizing value, I know that I should use min-max normalization to find it, but honestly, with the min being 0 and the max being 260, I don't know how to apply this in the formula
The formula from Wikipedia is quite easy to read, in my opinion. With your min being 0 and max being 260, you just need to divide all your values by 260. Then all values are between 0 and 1.
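In code, that normalization and its inverse look like this (the data values here are made up):

```python
import numpy as np

# Min-max normalization: x' = (x - min) / (max - min).
# With min = 0 and max = 260 this is just division by 260;
# multiplying by 260 denormalizes again. Values below are made up.
data = np.array([0.0, 130.0, 260.0])  # precipitation in mm/h
max_value = 260.0

normalized = data / max_value          # now in [0, 1]
denormalized = normalized * max_value  # back to mm/h
```

The hard-coded 47.83 in `test_step` plays exactly the role of `max_value` here, just for the original dataset.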
Also, a small note: since my dataset is measured in mm/h and not mm/5min, I assume I can just remove * 12 from the return step
Sure. Just make sure that you are comparing the same values later on.
I'm pretty sure I made a fork and even made it public, but I guess it doesn't appear on the fork list. Anyways, here's the repo: https://github.com/kirbytran12324/SmaAt-UNet. The only things missing are the h5 files in the dataset folder, since they are over the 100 MB limit of GitHub, but I linked the dataset zip file on Google Drive above. As for the denormalizing value, since I saw 47.83 being hard-coded in your example, I thought I would need to find a general value for the whole dataset; otherwise, I understand the formula just fine (actually learned about this from a video about fall-off damage in video games ;) )
Oh, you're right. Thanks for the link.
That is indeed a bit strange. I have never seen nan.0 before, and googling gives no results either. Have you tried setting a breakpoint in the loss function calculation and debugging? I assume there is an issue there.
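If a full debugging session is too much, a cheap first step is to assert on the tensors entering the loss. `check_finite` below is just a hypothetical helper, not something from the repo; you would call it on inputs, predictions and the loss inside `training_step`:

```python
import torch

def check_finite(name, tensor):
    # Hypothetical helper: raises as soon as a NaN/inf shows up,
    # so you can tell whether the data or the model produces it.
    if not torch.isfinite(tensor).all():
        raise ValueError(f"{name} contains NaN or inf values")

# Example: a NaN sneaking into the target would be caught like this
y = torch.tensor([1.0, float("nan")])
try:
    check_finite("target", y)
    caught = None
except ValueError as err:
    caught = str(err)
```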
Btw, I think the `loss = loss.squeeze()` is not needed.
actually learned about this on a video about fall-off damage in video games ;)
Interesting, do you have a link? :D
I'm not super familiar with debugging, unfortunately. I'm the type that usually goes for the trial-and-error approach: keep running, occasionally change things that seem to make sense, and hopefully it works. I tried setting breakpoints and using debug mode in PyCharm, but when I ran it, nothing came up; the only options available were stop and rerun. Just a question to make sure I wasn't being dumb this whole time: I tried to train using `train_precip_lightning.py` and was going to test using `test_precip_lightning.py`; I'll look at the metrics part later.
This is probably unrelated, but how long does the training process take for you? I know your dataset is way larger than mine, but mine finished way too quickly imo (about 2 min max). Maybe some steps were skipped due to the nan.0, but I believe the training would still have gone for a few more loops.
Btw, I think the loss = loss.squeeze() is not needed.
Yeah, that was from back when I was trial-and-erroring the tensor issue, since I thought it would remove the excess 1 in the target. That part is probably good now.
Interesting, do you have a link? :D
I don't play games as much anymore, but sometimes it's easier to learn when the presentation is in an interesting format, don't you think? https://youtu.be/VL2VnkNJPpE?si=lSIIRESALBVfleK6
I'm the type that usually goes for the trial-and-error approach: keep running, occasionally change things that seem to make sense, and hopefully it works.
That works to a certain extent, but when the problem has too many knobs and dials it becomes increasingly harder to find the root cause of a bug. I very much recommend debugging the problem using a debugger. It also forces you to think about what you expect a value to be and to compare it with the actual value you received. It boosted my understanding of the underlying algorithms and code by a lot!
I tried setting breakpoints and using debug mode in PyCharm, but when I ran it, nothing came up; the only options available were stop and rerun.
Did you start the program in debug mode? It could be that breakpoints do not work when you use CUDA; for debugging purposes you should use the CPU.
This is probably unrelated, but how long does the training process take for you? I know your dataset is way larger than mine, but mine finished way too quickly imo (about 2 min max)
One epoch takes me roughly 1.5-2 min, but it then runs for around 60 epochs. It definitely should not stop after just one; something is going wrong there.
it's easier to learn when presentation is in an interesting format
100%!
I think I found what went wrong, not 100% certain though. I noticed that in your `train_precip_lightning.py`, in the `__main__` block, you have different dataset folders for `parser` and `args`; what exactly is the purpose of those two? I set the two to be the same since I thought it wouldn't affect anything, but now I'm reconsidering that decision.
I know this is probably a lot to ask, but can you try compiling the dataset from the tiff files? I still have some doubts about it and I'm hoping to verify its validity, especially since I see quite a few -inf values in the dataset and I'm not quite sure how to deal with them. The files in my dataset are still 250x90 px, but I cropped them to 240x80 in my `demo.h5` and then modified it further using your `create_dataset.py`. And if you're wondering whether I used the correct dataset with the settings in `regression_lightning.py`: yes, I checked multiple times every time I changed the dataset.
Update: So I tried to remove the -inf values like this:

```python
import numpy as np

def replace_inf_with_finite_image(image):
    """Replace -inf values with the minimum finite value in the image."""
    min_finite = np.min(image[np.isfinite(image)])
    image[np.isneginf(image)] = min_finite
    return image
```
It kinda worked: the nan.0 doesn't appear anymore, and it runs multiple rounds now, but the values shown are way too large.
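For reference, restating the function so the snippet is self-contained, here is how it behaves on a tiny made-up array:

```python
import numpy as np

def replace_inf_with_finite_image(image):
    """Replace -inf values with the minimum finite value in the image."""
    min_finite = np.min(image[np.isfinite(image)])
    image[np.isneginf(image)] = min_finite
    return image

# Tiny made-up image: the -inf pixel becomes 0.5,
# the smallest finite value present in the array.
image = np.array([[-np.inf, 2.0], [5.0, 0.5]])
cleaned = replace_inf_with_finite_image(image)
```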
Progress! Great. Did you normalize the data after replacing the infinite values? That could be the reason for those high values.
I won't have time to run it myself as I'm going on holiday tomorrow.
Oh yeah, I totally forgot about it for a second there. So what's the optimal approach here? Do I create a separate group in the .h5 that contains the normalizing value for each of the images, in order? I mean, isn't that the way to denormalize the values of each map separately? Coming back to your value of 47.83, I'm still not 100% certain about its purpose ;-; I'm making a wild guess: is this the highest rain amount in your dataset? And I think it's related to the bins as well; since my dataset is already in mm/h, I'm not sure how to apply that.
I'm going on holiday tomorrow.
Well, I wish you a wonderful vacation :3
I'm making a wild guess: is this the highest rain amount in your dataset?
Correct. You said above that the highest value in your dataset is 260, so you divide all values by that.
Do I create a separate group in the .h5 that contains the normalizing value for each of the images, in order?
No, not for each image; normalize by the highest value in your whole training set. For me that highest value was 47.83.
Do you know why you normalize your data in general, i.e. why the values are supposed to be between 0 and 1? If not, then I recommend looking up why ;)
So... I went ahead and did it anyway: I normalized by 260, but even then the loss values are still up in the e+4 range, which sucks since I don't know whether that's normal or not at this point. `train_loss_step` seems very random, bouncing between e+3 and e+5; `val_loss` and `train_loss_epoch` start at 1.79e+4 and 4.9e+4 and slowly go down. The whole process stops at epoch 9.
Update: So, I was an idiot; I forgot to change the name of the input file in `create_dataset.py`, so the results were the same. Now that that's fixed, the scripts seem to work properly. I say "seem" since I'm obviously running under suboptimal conditions, with the dataset at a 1 h interval instead of the 5 min interval that U-Net-based models like this normally use.
The results were less than exciting, unfortunately; SmaAt-UNet seems to inherit the worse qualities of its two parents. The only positive aspect is the smaller model size. I didn't expect the MSE to be so high, though; I'll probably try to figure out how to help the model manage long-range prediction better.
Thanks for the update! Good to hear (that you got it to work, not the bad performance ;) )!
I can't say what the reason for the performance is though. Could be multiple factors, the dataset most of the time being one of the main ones.
I think it's time I close this issue. I have a few other projects that need attention right now, and I'm going to need a more suitable dataset for this anyway. I'm definitely coming back to this; the premise is too good to ignore. I'll reopen the issue if anything new comes up. Thank you so much for the help :3
Hello! Sorry in advance if this question sounds a bit dumb. I was trying to use my custom dataset, compiled from a number of tiff files. I made a utility class to help me convert it to hdf5 and then tried to use `create_dataset.py` to make it as close as possible to the original dataset, since mine was pre-processed. However, I ran into numerous issues, one being the loss function requiring the target tensor to be 3D while I found mine to be 4D instead. Even after I got rid of that error, the results after training weren't much better, with most of the stats showing NaN or not fluctuating at all. I was wondering whether the problem is my dataset or whether I'm just not applying it correctly. demo.zip

I put in most of the files that I found myself modifying. (P.S. I used Copilot to help me understand the code, so it probably contributed to me failing so hard ;-;)