HansBambel / SmaAt-UNet

PyTorch-Code for the Paper "SmaAt-UNet: Precipitation Nowcasting using a Small, Attentive UNet-Architecture"

The format of the input dataset #30

Closed kirbytran12324 closed 4 months ago

kirbytran12324 commented 5 months ago

Hello! Sorry in advance if this question sounds a bit dumb. I was trying to use my own custom dataset, compiled from a number of TIFF files. I made a utility class to help me convert it to HDF5 and then tried to use create_dataset.py to make it as close as possible to the original dataset, since mine was pre-processed. However, I ran into numerous issues, one being the loss function requiring the target tensor to be 3D while mine turned out to be 4D. Even after I got rid of that error, the results after training weren't much better, with most of the stats showing NaN or not fluctuating at all. I was wondering whether the problem is my dataset or whether I'm just not applying it correctly. demo.zip contains most of the files that I found myself modifying. (P.S. I used Copilot to help me understand the code, so it probably contributed to me failing so hard ;-;)

HansBambel commented 5 months ago

Hey, I had a chance to glance over your code. I did not execute it though. I have some follow-up questions:

  1. Do you have your own Python script for training? Or are you using the regression_lightning.py script?
  2. Having a 4D tensor where a 3D one is required is often due to an extra dimension of size 1. Most of the time you can simply squeeze it away. That's also what I do in regression_lightning.py: `loss = self.loss_func(y_pred.squeeze(), y)`
  3. I noticed that your image size is 250x90. Note that multiple down- and upsampling steps will not reproduce the same dimensions. That is why in our experiments we use images of dimension 288x288, which can be downsampled cleanly: 288x288 --> 144x144 --> 72x72 --> 36x36 --> 18x18 (and then up again) --> 36x36 --> 72x72 --> 144x144 --> 288x288. With your dimensions it would be harder: 250x90 --> 125x45 (this would likely round down) --> 62x22 --> 31x11 --> 15x5, and then upsampling --> 30x10 --> 60x20 --> 120x40 --> 240x80 (see the small sketch after this list).
  4. Concerning your bad results: It could be that the model was not able to learn properly due to the aforementioned issues.
  5. I assume you had a look at the created dataset in HDF format?
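
To illustrate point 3, here is a minimal sketch (not code from the repository; it just mimics the floor division of 2x max-pooling and the doubling of 2x upsampling, assuming four down/up levels):

```python
# Minimal sketch: how H and W change through four rounds of 2x pooling
# followed by four rounds of 2x upsampling.
def roundtrip_dims(h, w, levels=4):
    for _ in range(levels):
        h, w = h // 2, w // 2   # MaxPool2d(2) floors odd sizes
    for _ in range(levels):
        h, w = h * 2, w * 2     # Upsample(scale_factor=2) doubles them again
    return h, w

print(roundtrip_dims(288, 288))  # (288, 288) -> matches the input
print(roundtrip_dims(250, 90))   # (240, 80)  -> no longer matches 250x90
```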

No problem using Copilot, but make sure that you also understand the code and verify that what it said is true ;)

kirbytran12324 commented 5 months ago
  1. I'm trying to use unet_precip_regression_lightning.py. Since this is my first machine learning project, I would rather have a bit of a skeleton showing which parts are required, and this is also what I presumed SmaAt-UNet uses. I do touch regression_lightning.py, but mostly to test with my own data.
  2. I also got a suggestion to do that, but I'm not quite sure where I would need to implement it. I originally assumed it to be in `train_SmaAt_UNet` at `loss = loss_func(model(xb.to(dev)), yb.to(dev))`, with the loss set to `loss_func = nn.CrossEntropyLoss().to(dev)`. I see that in `regression_lightning.py` the loss function is shared between all UNet-based models, is that right? My initial impression was that those were only for training the other U-Net models that aren't SmaAt, so I paid less attention to them.
  3. This is actually the one I'm stumped at the most. So essentially, since the dimensions likely won't match at the end of the whole process, it might be producing NaN values for the missing pixels, and those NaN values don't give back any valid metrics for comparison, which leads to the training process always ending early? (This is how I see it; I might be totally wrong.) I have little idea how to handle this. I could crop the data to 240x80 so everything is uniform, but that seems inefficient and not really a permanent solution if some important parts get cut off, and then the training results wouldn't be accurate.
  4. The tensor problem isn't really the main issue: when I modified things to fit whatever criteria I saw, everything still ended up as NaN values. I'm still leaning towards the dataset not being accurate or just not being downsampled correctly.
  5. Yes, I composed it using tiff_to_h5.py and then used create_dataset.py to apply the necessary modifications. I used https://myhdf5.hdfgroup.org/ to view the result, mostly to check whether anything was going wrong. The original HDF file demo.h5 looks fine, but when I made the child HDF files I kept seeing "memory access out of bounds" or read permission errors: https://drive.google.com/file/d/1VhEnEbh1TvxUVeHxfORJ-BVjPCQ4uZEX/view?usp=sharing (I linked my whole dataset here since it's not that big, but still too large to put in a zip file for GitHub). The main question I have is whether create_dataset.py or tiff_to_h5.py is responsible, or whether it's just a limitation of the viewing programme.

HansBambel commented 5 months ago

Thanks for the answers.

  1. Since it is your first ML project, I would actually advise using PyTorch Lightning. It takes care of quite some overhead. For this, I put all models in unet_precip_regression_lightning.py. There, SmaAt-UNet is `UNetDS_Attention` (this was the name describing what the model is composed of before I came up with SmaAt-UNet ;) )
  2. Try to use as little `.to(dev)` as possible. This is old-school and not necessary anymore; Lightning takes care of device placement when it is set up once at the beginning of training. Yes, the loss is the same for all models. Note that the `test_step` function has the hard-coded denormalizing value 47.83 in there. This is something that you would need to adapt when using a different dataset.
  3. Cropping is definitely something that you should consider. It is also something that I did with the original data (I even have some examples in the README.md). Why do you think it is inefficient? It could be done either in the pre-processing step or on-the-fly in the dataloader.
  4. Yes, it definitely could be that the downsampling is the issue here. Try to fix that first and then tackle the next problem. Debugging is part of the process ;)
  5. I used https://www.hdfgroup.org/downloads/hdfview/ for inspecting my dataset. Something you could also try is loading the created .h5 file and having a look at it with the Python package https://docs.h5py.org/en/stable/index.html (see the short snippet after this list). Just note that I used this data format because the images already were in HDF format; before that I was used to normal images such as jpg, png or similar. You may be better off using your TIFF files directly instead of converting them to a single .h5 file. Here is a good starting point for writing a data loader: https://lightning.ai/docs/pytorch/stable/data/datamodule.html
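
For point 5, a minimal h5py sketch could look like this (the file name and the dataset path are placeholders, not the actual layout produced by create_dataset.py; adapt them to your file):

```python
import h5py

# List every group/dataset in the file, then inspect one dataset.
with h5py.File("demo.h5", "r") as f:
    f.visititems(lambda name, obj: print(name, obj))
    # For a specific dataset, check shape, dtype and value range, e.g.:
    # data = f["train/images"][:]            # placeholder path
    # print(data.shape, data.dtype, data.min(), data.max())
```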

Hope this helps!

kirbytran12324 commented 4 months ago

So, a bit of an update on what I've done so far. I cropped the dataset to 240x80, which should be adequate for the whole downscaling process, and testing the forward function seems to work well, so I think I'm good on this front. However, I still can't train the model: train_loss_step, val_loss and train_loss_epoch are all shown as nan.0 and everything stops at epoch 0. Either I'm still loading the dataset incorrectly or there's still something wrong with the dataset. I'm still planning to use .h5, as I figure this is the better way to store all the data in one place. As for the denormalizing value, I know that I should use min-max normalization to find it, but honestly, with the min being 0 and the max being 260, I don't know how to apply this in the formula. Also a small note: since my dataset is measured in mm/h and not mm/5min, I assume I can just remove the `* 12` from the return; I couldn't find any other sections related to this yet, though, so there might be some spots that I missed. SmaAt-UNet.zip https://drive.google.com/file/d/1VHpOYi7kU3AJa1Zu4s6TGHJtc5juccPb/view?usp=sharing (public dataset) [screenshot attached]

HansBambel commented 4 months ago

> I cropped the dataset to 240x80, which should be adequate for the whole downscaling process, and testing the forward function seems to work well, so I think I'm good on this front.

Great!

> However, I still can't train the model: train_loss_step, val_loss and train_loss_epoch are all shown as nan.0 and everything stops at epoch 0.

Hmm, I can't open the zip file. Wouldn't it be easier to just create a git repository and send me a link to it? You can even fork from this repository.

> As for the denormalizing value, I know that I should use min-max normalization to find it, but honestly, with the min being 0 and the max being 260, I don't know how to apply this in the formula.

The formula from Wikipedia is quite easy to read in my opinion. With your min being 0 and max being 260, you just need to divide all your values by 260. Then all values are between 0 and 1.
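
As a rough sketch (the function names and example values are placeholders, not code from the repository):

```python
import numpy as np

MAX_RAIN = 260.0  # highest value in *your* training set (47.83 for the original dataset)

def normalize(x):
    # min is 0 here, so min-max scaling reduces to dividing by the max
    return x / MAX_RAIN

def denormalize(x):
    # invert the scaling before computing metrics in mm/h
    return x * MAX_RAIN

example = np.array([0.0, 13.0, 260.0])
print(normalize(example))               # [0.   0.05 1.  ]
print(denormalize(normalize(example)))  # [  0.  13. 260.]
```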

> Also a small note: since my dataset is measured in mm/h and not mm/5min, I assume I can just remove the `* 12` from the return.

Sure. Just make sure that you are comparing the same values later on.

kirbytran12324 commented 4 months ago

I'm pretty sure I made a fork and even made it public, but I guess it doesn't appear in the fork list. Anyways, here's the repo: https://github.com/kirbytran12324/SmaAt-UNet. The only things missing are the h5 files in the dataset folder, since they are over the 100 MB limit of GitHub, but I linked the dataset zip file on Google Drive above. As for the denormalizing value: since I saw 47.83 being hard-coded in your example, I thought I would need to find a general value for the whole dataset. Otherwise, I understand the formula just fine (actually learned about this from a video about fall-off damage in video games ;) )

HansBambel commented 4 months ago

Oh, you're right. Thanks for the link.

That is indeed a bit strange. I have never seen nan.0 before, and googling it doesn't give any results either. Have you tried setting a breakpoint in the loss function calculation and debugging? I assume there is an issue there.

Btw, I think the `loss = loss.squeeze()` is not needed.

> actually learned about this from a video about fall-off damage in video games ;)

Interesting, do you have a link? :D

kirbytran12324 commented 4 months ago

I'm not super familiar with debugging, unfortunately. I'm the type that usually goes for the trial-and-error approach: keep running, occasionally change things that I figure make sense, and hopefully it'll work. I tried setting breakpoints and using debug mode in PyCharm, but when I tried to run, nothing came up; the only options available were stop and rerun. Just a question to make sure I wasn't being dumb this whole time: I tried to train using train_precip_lightning.py and was going to test using test_precip_lightning.py; I'll look at the metrics part later. This is probably unrelated, but how long does the training process take for you? I know your dataset is way larger than mine, but mine finished way too quickly imo (about 2 min max). Maybe some steps were skipped due to the nan.0, but I believe the training process would still have gone for a few more loops or something.

> Btw, I think the `loss = loss.squeeze()` is not needed.

Yeah, that was back when I was trial-and-erroring the tensor issue, since I thought it would remove the excess 1 in the target. That part is probably good now.

> Interesting, do you have a link? :D

I don't play games as much anymore, but sometimes it's easier to learn when the presentation is in an interesting format, don't you think? https://youtu.be/VL2VnkNJPpE?si=lSIIRESALBVfleK6

HansBambel commented 4 months ago

> I'm the type that usually goes for the trial-and-error approach: keep running, occasionally change things that I figure make sense, and hopefully it'll work.

That works to a certain extent, but when the problem has too many knobs and dials it becomes increasingly harder to find the root cause of the bug. I very much recommend debugging the problem using a debugger. It also forces you to understand the code, to think about what you expect a value to be, and to compare that with the actual value you get. It boosted my understanding of the underlying algorithms and code by a lot!

> I tried setting breakpoints and using debug mode in PyCharm, but when I tried to run, nothing came up; the only options available were stop and rerun.

Did you start the program in debug mode? It could be that breakpoints do not work when you use CUDA; for debugging you should use the CPU.
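
For example, something along these lines (a minimal sketch assuming PyTorch Lightning; not the exact setup of train_precip_lightning.py):

```python
import pytorch_lightning as pl

# Run a single batch of train/val on the CPU so breakpoints in the
# loss calculation are hit reliably while debugging.
trainer = pl.Trainer(accelerator="cpu", fast_dev_run=True)
# trainer.fit(model, datamodule=dm)  # plug in your model and data here
```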

> This is probably unrelated, but how long does the training process take for you? I know your dataset is way larger than mine, but mine finished way too quickly imo (about 2 min max).

One epoch takes me roughly 1.5-2 min, but it then runs for around 60 epochs. It should definitely not stop after just one; something is going wrong there.

> it's easier to learn when the presentation is in an interesting format

100%!

kirbytran12324 commented 4 months ago

I think I found what went wrong, though I'm not 100% certain. I noticed that in your train_precip_lightning.py, in `__main__`, you have different dataset folders for the parser and for args; what exactly is the purpose of those two? I set them to be the same since I thought it wouldn't affect anything, but now I'm reconsidering that decision. I know this is probably a lot to ask, but can you try compiling the dataset from the TIFF files? I still have some doubts about it and I'm hoping to verify its validity, especially since I see quite a few -inf values in the dataset and I'm not quite sure how to deal with them. The files in my dataset are still 250x90 px, but I cropped them to 240x80 in my demo.h5 and then modified it further using your create_dataset.py. And if you're wondering whether I used the correct dataset with the settings in regression_lightning.py: yes, I checked that multiple times every time I changed the dataset.

kirbytran12324 commented 4 months ago

Update: So I tried to remove the -inf values like this:

```python
import numpy as np

def replace_inf_with_finite_image(image):
    """Replace -inf values with the minimum finite value in the image."""
    min_finite = np.min(image[np.isfinite(image)])
    image[np.isneginf(image)] = min_finite
    return image
```

and it kind of worked: the nan.0 doesn't appear anymore and it runs multiple rounds now, but the values shown are way too large. [screenshot attached]

HansBambel commented 4 months ago

Progress! Great. Did you normalize the data after replacing the infinite values? That could be the reason for those high values.

I won't have time to run it myself as I'm going on holiday tomorrow.

kirbytran12324 commented 4 months ago

Oh yeah, I totally forgot about that for a second there. So what's the optimal approach here? Do I create a separate group in the .h5 that contains the normalizing value for each of the images, in order? I mean, isn't that the way to denormalize the values of each map separately? Coming back to your value of 47.83, I'm still not 100% certain about its purpose ;-; I'm making a wild guess and saying this is the highest rain amount in your dataset? And I think it's related to the bins as well; since my dataset is already in mm/h, I'm not sure how to apply that.

> I'm going on holiday tomorrow.

Well, I hope you have a wonderful vacation :3

HansBambel commented 4 months ago

> I'm making a wild guess and saying this is the highest rain amount in your dataset?

Correct. You said above that the highest value you have in your dataset is 260, so you divide all values by that.

> Do I create a separate group in the .h5 that contains the normalizing value for each of the images, in order?

No, not for each image. By the highest value in your training set. For me the highest value was 47.83.
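
A rough sketch of computing that single constant once from the training split (the file name and dataset path are placeholders; adapt them to your layout):

```python
import h5py
import numpy as np

# Take the maximum over the *training* split only and reuse it everywhere:
# for normalizing inputs and for denormalizing predictions.
with h5py.File("train.h5", "r") as f:       # placeholder file name
    train_images = f["train/images"][:]     # placeholder dataset path
max_rain = float(np.nanmax(train_images))
print(max_rain)  # e.g. 260.0 for your data, 47.83 for the original dataset
```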

Do you know the reason why you normalize your data in general? Why the values are supposed to be between 0 and 1? If not, then I recommend having a look for why ;)

kirbytran12324 commented 4 months ago

So... I went ahead and did it anyway. I normalized by 260, but even then the loss values are still up at e+4, which sucks because I don't know whether that's normal or not at this point. train_loss_step seems very random, jumping between e+3 and e+5; val_loss and train_loss_epoch start at 1.79e+4 and 4.9e+4 and slowly go down. The whole process stops at epoch 9. [screenshot attached]

kirbytran12324 commented 4 months ago

Update: So, I was an idiot and forgot to change the name of the input file in create_dataset.py, so the results were the same as before. Now that that's fixed, the scripts seem to work properly. I say "seem" because I'm obviously running under suboptimal conditions, with the dataset at a 1 h interval instead of the 5 min interval that U-Net-based models normally use. The results were less than exciting, unfortunately; SmaAt-UNet seems to inherit the worse qualities of its two parents, the only positive being the smaller model size. I didn't expect the MSE to be so high, though; I'll probably try to figure out how to help the model manage long-range prediction better. [screenshot attached]

HansBambel commented 4 months ago

Thanks for the update! Good to hear (that you got it to work, not the bad performance ;) )!

I can't say what the reason for the performance is though. Could be multiple factors, the dataset most of the time being one of the main ones.

kirbytran12324 commented 4 months ago

I think it's time I close this issue. I have a few other projects that need attention right now, and I'll need a more suitable dataset for this anyway. I'm definitely coming back to this; the premise is too good to ignore. I'll reopen the issue if anything new comes up. Thank you so much for the help :3