CH-Earth / summa

Structure for Unifying Multiple Modeling Alternatives:
http://www.ral.ucar.edu/projects/summa
GNU General Public License v3.0

SUMMA interacts with more forcing files than needed for specified simulation length #501

Open wknoben opened 2 years ago

wknoben commented 2 years ago

Bug Reports

KyleKlenk commented 1 year ago

I addressed this in SUMMA-Actors by modifying the file manager to include two more options, which let SUMMA know how many files it should read.

The extra options were:

```
forcingFreq   'month'        ! the frequency of the forcing files (month, year)
forcingStart  '1979-01-01'   ! starting date of the forcing file list
```

If this is acceptable I can write out a broader plan for changes.

andywood commented 1 year ago

Hi Kyle, A few thoughts on this. I'm not sure there's a strong motivation to make this change: there's value in avoiding extra entries in the file manager and other input files, mainly to limit clutter in both the inputs and the code.

It's also probably best to avoid having the user provide information that SUMMA can determine internally (e.g., forcingStart or the forcing frequency, both of which could be derived as SUMMA reads the forcing file list).

Perhaps a halfway solution is to add a line or two of code so that SUMMA stops checking forcing files once it has confirmed that the desired run period is covered. This may already be there now; I forget, and it's been a while since I looked at that part of the code.

But I'm not sure the current behavior of SUMMA is problematic enough to justify adding new entries to the fileManager. If they are added, they should be backward compatible so that anyone wishing to ignore the entries can do so; handling backward compatibility can also lead to code bloat.

Cheers, Andy

KyleKlenk commented 1 year ago

Hi Andy,

Thanks for the feedback. I have been experimenting with SUMMA for some time, and it is good to get feedback on the places in the code, like the fileManager, where changes should be avoided.

I ran some tests that are hopefully helpful to this open issue.

  • Dataset: North America - 492 forcing files - 517,315 GRUs - chunked at 1,000 GRUs per chunk
  • Machine configuration: 64 CPUs - 2 TB RAM - connected to network-attached storage over a 10 Gbit link
  • netCDF version: 4.5.2

Running 1 GRU for one month with 1 forcing file in forcingFileList.txt:

  • Run time: 1 second

Running 1 GRU for one month with all 492 forcing files listed in forcingFileList.txt:

  • Run time: 30 seconds for the first run
  • Run time: 1-3 seconds for subsequent runs (caching appears to take effect)

I think this is the worst-case scenario. Once more than one GRU is run at a time, the GRU computations start to take up the majority of the runtime.

Hopefully this information is helpful. I did some extra testing on another machine that is closer to a typical consumer PC, with 4 CPUs, 16 GB of RAM, and a 1 Gbit connection. It benefited from caching on the NAS, which made loading one file vs. multiple files take essentially the same time. I can share those results once the cache is cleared, but I expect they will be similar to the above, with the difference coming from the network connection speed to our NAS.

Thanks again for your comments, Kyle

Some extra information on why I made those changes in my experimental code, and what I should have investigated on my side before changing the code:

  • When I was testing my experimental version of SUMMA, I was using a cluster with Slurm and requesting 1 CPU and 1 GB of RAM in an interactive session. This configuration may have made the problem look worse than it actually was, since there was little benefit from caching. During that testing I was usually running one GRU at a time with different configurations of forcing files, again making the problem look worse than it may be for the typical user.

andywood commented 1 year ago

Hi Kyle, Thanks, this is excellent testing information. It does appear that there is an edge case for which loading all the files can be costly. That said, in that particular test (running 1 GRU for a massive domain for 1 month), a user could probably be expected to take the additional step of using an alternate forcing file list. Even so, I'm surprised it's that costly, and I think it's worth looking at the code around loading those files to see if anything repetitive is going on. There may be inefficiencies -- we have been rooting these out of the SUMMA i/o since we started applying it to full-scale applications, but I'm sure we're not done.

Based on this, however, I'd still lean against adding extra fileManager entries. It's a key input file, used in many different ways across applications, so anything added there should be something that cannot be done any other way. I don't think this rises to that level, but others are of course free to weigh in. If it is added, it should be backward compatible so that current usage is not required to change.

Thanks for doing the benchmarking on this ... that was really informative.

Best, Andy

wknoben commented 1 year ago

Hi Kyle,

Just dropping a few thoughts here.

  1. I agree with Andy that changing the fileManager is probably best left as a last resort. If we can get the same improvements by making the code cleverer, then that's probably the way to go.

  2. The caching solution is not something that can be relied on in an HPC environment, because cached memory is ephemeral. There's no guarantee that the data persists in cache, especially if you request a memory allocation that's just about enough to simply run the model. I had a long conversation with our HPC team about this that you might find interesting (see: https://jira.usask.ca/servicedesk/customer/portal/2/ISD-364972).

  3. I agree with Andy that it might be worthwhile to have a thorough check of the forcing reading code, to see if it is as optimal as it can be.

Cheers, Wouter

KyleKlenk commented 1 year ago

Thanks for the comments,

I can take a look at the source code, report what I find, and propose changes here. The reasoning for not changing the fileManager makes sense to me. I applied this change naively in Summa-Actors a while ago, and it looks like it needs to be reverted. Since I need to do that revert for our experimental version anyway, looking into other solutions and proposing them here should be no problem.

That is a good point about caching on HPC systems too. I remember you sharing that conversation a while back; it was a good refresher, and I agree it is not something to rely on.

Thanks, Kyle

KyleKlenk commented 1 year ago

Hi Andy and Wouter,

I went through the source code and this is what I found inside ffile_info.f90.

forcingFileList.txt is read in, and the number of lines it contains determines the length of the array of forcing files.

Then, inside a do loop, each file in that array is opened, and descriptive information such as variable names and the number of time steps in that netCDF forcing file is used to populate the forcFileInfo data structure. When the user lists too many files in forcingFileList.txt, the loop will not exit until it has opened and populated information for every netCDF file in the list.

It seems to me a check could be implemented to break out of the do loop here, so that once the needed forcing files have been processed we do not continue on to the ones that are not needed. This is, I think, what you had in mind.

The situation where the user has extra forcing files at the beginning of the list, i.e. the simulation start date falls within a forcing file that is not the first in forcingFileList.txt, could be covered by the same check, but instead of breaking out of the loop we would move on to the next iteration. The catch is that read_force.f90 will still open those beginning files unless iFile is set to the file that corresponds to the start of the simulation rather than the start of forcingFileList.txt.
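As a minimal, self-contained sketch of that loop control (the variable names are hypothetical and times are expressed as days since a reference date; these are not the actual ffile_info.f90 variables):

```fortran
program skip_unneeded_files
  ! Hypothetical sketch of the loop control proposed above for ffile_info.f90.
  ! All names and values here are illustrative, not actual SUMMA code.
  implicit none
  integer, parameter :: nFiles = 5
  ! start/end time of each forcing file (five consecutive months, in days)
  real :: fileStart(nFiles) = [  0., 31., 59.,  90., 120. ]
  real :: fileEnd(nFiles)   = [ 31., 59., 90., 120., 151. ]
  real :: simStart = 35., simEnd = 95.   ! requested simulation window
  integer :: iFile

  do iFile = 1, nFiles
    if (fileEnd(iFile)   <= simStart) cycle  ! ends before the run starts: skip it
    if (fileStart(iFile) >= simEnd)   exit   ! starts after the run ends: stop scanning
    print *, 'would read metadata for file', iFile   ! prints for files 2, 3 and 4
  end do
end program skip_unneeded_files
```

The same two checks cover both cases: cycle skips leading files that end before the simulation starts, and exit stops the scan once a file starts after the simulation ends.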

I can clarify as needed if something doesn't make sense; I am still learning how best to communicate these issues. In the end, none of this is a problem if the user takes responsibility for ensuring the number of forcing files matches their simulation.

Best, Kyle

andywood commented 1 year ago

Hi Kyle, Thanks for looking into this. I'd guess it could probably be more tailored. We can't assume that the forcing files are named in a way that conveys their temporal range, so there's no avoiding opening them and checking the time variable. However, the time variable should be the only one read before deciding whether it's worth reading the other variables. If a file's end time is before the simulation start time, that file can be skipped, and once the simulation end time is found in a file, the forcing read can be escaped. This is actually quite common practice for models. I don't think we need to control it through additional user input, but if you see inefficiencies such as reading all the variables before checking the time limits, we could refine that behavior. And then move on ... it does seem to be a rare case where this would significantly impact run time, and there are a lot of big SUMMA fish to catch and fry (see the issues list).
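For illustration, a sketch of that time-first check using the netCDF-Fortran API (the helper name and calling convention are hypothetical; this is not existing SUMMA code):

```fortran
! Hypothetical helper sketching the "time variable first" idea with the
! netCDF-Fortran API; not existing SUMMA code.
subroutine get_file_time_bounds(fname, tFirst, tLast, err)
  use netcdf
  implicit none
  character(*), intent(in)  :: fname
  real(8),      intent(out) :: tFirst, tLast   ! first/last values of 'time'
  integer,      intent(out) :: err
  integer :: ncid, dimid, varid, nTime, ierr

  tFirst = 0.d0; tLast = 0.d0
  err = nf90_open(trim(fname), nf90_nowrite, ncid)
  if (err /= nf90_noerr) return
  ! look up the time dimension and variable only
  err = nf90_inq_dimid(ncid, 'time', dimid)
  if (err == nf90_noerr) err = nf90_inquire_dimension(ncid, dimid, len=nTime)
  if (err == nf90_noerr) err = nf90_inq_varid(ncid, 'time', varid)
  ! read just the first and last time steps, not the full forcing record
  if (err == nf90_noerr) err = nf90_get_var(ncid, varid, tFirst, start=[1])
  if (err == nf90_noerr) err = nf90_get_var(ncid, varid, tLast, start=[nTime])
  ierr = nf90_close(ncid)
  if (err == nf90_noerr) err = ierr
end subroutine get_file_time_bounds
```

A caller could then cycle past files whose tLast precedes the simulation start and exit the scan once tFirst passes the simulation end, as in the earlier loop sketch.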

In general, I find that as you're working on models, you come across numerous things that you can improve or fix. The challenge is not finding imperfections but rather figuring out which to spend time on, and in what order :)

Cheers, Andy

wknoben commented 1 year ago

Reading Andy's comment, it seems we can agree that implementing these procedures has the potential to make debugging large-domain, long-term runs a bit more convenient. I would add that making the model's behaviour more intuitive has an extra benefit for new users, who might otherwise take a while to realize that shrinking the forcing file list can speed up debugging runs.

Since this seems like a fairly low time investment, it might be worth implementing these suggestions.