geo-journal / geo-journal.github.io

A preprint/journal club centered on data science, scientific computing, and their geoscience applications.

Discussion 01 - June 2018 - Evan Goldstein leading - Kratzert et al. 2018 EarthArXiv Preprint #3

narock opened this issue 6 years ago

ebgoldstein commented 6 years ago

Link to the paper (via EarthArXiv)

ebgoldstein commented 6 years ago

Hi Everyone!

Here are four discussion questions to get the ball rolling. Feel free to chime in with your own discussion points/comments/questions.

1) What are some important 'take-aways' for you as a researcher from this paper (I don't necessarily mean the most objectively important scientific 'point', but the parts of the paper that you found particularly interesting or useful)? Personally, I think LSTM offers some clear advantages in being able to glean insight from ML, in addition to its predictive ability: the clear example is p20 (L18-27) and Figure 14, where the cell state actually has physical meaning (related to snow water storage). That is a neat result. Additionally, I really liked Figure 4, seeing the parameter tuning at each epoch. And I thought the mapping of terminology between ML and earth sci modeling was a super nice touch.

2) A goal of this paper was to demonstrate the possibility of using an LSTM artificial neural network architecture as a rainfall-runoff model. Are the methods or results generalizable? Why or why not?

3) How, if at all, does this paper (the results and/or the method) relate to your own work? Could it inform future work? I think that geomorphic systems, which often exhibit 'memory', could benefit from data-driven prediction approaches such as LSTM, since it was specifically designed to account for memory/storage effects.

4) I was drawn to this paper because it's currently a preprint on EarthArXiv. Do you have any suggestions for the author to make the paper stronger? We can pass along feedback to the corresponding author. I could not find the code, which I would be interested in seeing (i.e., the Python calls to Keras/TensorFlow).

ebgoldstein commented 6 years ago

@kratzert alerted me (via Twitter) that the paper is also now in (open) review at HESSD. (I personally always learn something from reading the open reviews on EGU journals...)

kratzert commented 6 years ago

Hi everybody, I'm happy/proud that my first paper was actually chosen here for discussion, and I'm willing to answer any questions you have.

Regarding the code: I already explained this on Twitter but will add the explanation here as well. In principle I'm willing to publish the entire code needed to reproduce my results. The only reason I'm waiting is that the code used to produce the results in the manuscript uses TensorFlow, and I recently switched to PyTorch. If any revision (e.g. from the HESS reviewers) forces me to re-run a lot of experiments, I guess I would rewrite my entire code base in PyTorch and then polish that code for publication with the final manuscript. If not, I'll clean up the original code (maybe add some more documentation) and publish the TensorFlow version. Anyway, if you already have any questions concerning code snippets, feel free to ask. There is really no magic involved, especially in the rather standard LSTM model. Most of the code I wrote deals with data handling, pre-processing, etc., which is quite nasty for the CAMELS data.

narock commented 6 years ago

Thanks @kratzert for a very interesting and well-written paper. I really enjoyed reading it and found it easy to follow despite my minimal understanding of hydrology. I wanted to first ask about the LSTM itself. I don't have any practical experience with these networks. What type of hardware resources are needed to train a model like this? Are the computational requirements similar to RNNs and other neural networks?

I found it particularly interesting how you concluded with a discussion of the "black-box-ness" of neural networks. This is a criticism of neural networks that, in my opinion, has not really been addressed in the ML community - although the issue does seem to be gaining more attention lately. I liked how you addressed it head on. Could you elaborate on your final paragraph? I was interested in how else you might investigate the cell states to reveal dependencies and, potentially, new science. To me, this relates to @ebgoldstein's question 2. As the paper shows, LSTMs are comparable to, and sometimes better than, traditional models. The methods appear to be generalizable and have very good predictive power, although I could imagine that not being able to answer how and why a specific model works could hinder uptake. I'm curious to see how explainability evolves in ML. Thanks for touching on this issue and thanks for an enjoyable paper.

kratzert commented 6 years ago

Hi @narock, thank you very much for your kind words; it means a lot to me.

Regarding your questions:

Computational requirements: The computational requirements of an LSTM are higher than those of a traditional RNN, because of the extra gates and the cell state that are calculated at each time step. The traditional RNN cell only needs to calculate Equation 1 (from the manuscript) at each time step, while the LSTM cell has to calculate Equations 2-7. The resources you need depend heavily on the size of the network, which, as for other NNs, you specify via the number of layers and the number of hidden units per layer. I can be more specific regarding the architecture I used (5 input features, 2 LSTM layers with 20 hidden units each, plus a fully connected layer mapping from the last layer's hidden state to the network output, with an input sequence length of 365 time steps). This is a relatively small architecture and thus can potentially be trained on CPU only, although it is a lot slower. On my current laptop with an Intel Core i7-7700HQ @ 2.8GHz, one epoch of training for the models of experiment 1 (one model per basin) needs approx. 30s with PyTorch. I think Keras/TensorFlow is faster on CPU but I can't test it at the moment. However, using the GeForce GTX 1060 in my laptop, training time per epoch drops to 0.25s. Therefore, e.g. for hyperparameter search, a CUDA-supported GPU is highly recommended. But if you just want to start and play around a bit with comparable network sizes, CPU works as well.
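For illustration, here is a minimal PyTorch sketch of an architecture like the one described above (dummy code I wrote for this comment, not the code used for the paper):

```python
import torch
import torch.nn as nn

class RainfallRunoffLSTM(nn.Module):
    """Illustrative 2-layer LSTM: 5 meteorological inputs -> 1 discharge value."""
    def __init__(self, n_inputs=5, hidden_size=20, n_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(n_inputs, hidden_size, num_layers=n_layers,
                            batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # x: (batch, 365, 5); hidden and cell states default to zeros
        out, _ = self.lstm(x)
        # predict the discharge at the last day of the input sequence
        return self.head(out[:, -1, :])

model = RainfallRunoffLSTM()
x = torch.randn(32, 365, 5)   # dummy batch of 32 one-year sequences
print(model(x).shape)         # torch.Size([32, 1])
```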

Regarding the black-box-ness: I agree with you that this issue is not often addressed (for LSTMs I know one paper by Karpathy et al. (2015) in which they interpret cell states in the text/language domain). For one of my co-authors, this was maybe the biggest finding in our research (it first came to my mind when the manuscript was essentially done, which is why we added it as a kind of outlook and did no further analysis). He was skeptical at the beginning of my research because of the same argument (NNs are black-box models). I got his attention when he saw the first results, but with this finding he got really enthusiastic. Regarding your actual question about further analysis: I found this particular cell state by plotting the evolution of all cell states over the time of one input sequence. This one stood out immediately because it has such a clear pattern (continuous accumulation, then sharper depletion), but I think it was not the most scientific finding (albeit an effective one!). Another thing I did was to look at the correlation of cell states and input variables. This analysis revealed e.g. the cell shown in the figure below, which I loosely named the "water-deficit cell". The value of the cell state rises continuously, but as soon as there is a (large-enough?) precipitation event, it resets the value drastically. In conceptual hydrological models this would map to maybe an inverse soil water storage?! But both examples are a rather one-input-variable-to-one-cell-state mapping, and they are based solely on the cell state time series and the input time series.

[Figure: time series of the "water-deficit cell" cell state]
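To make the correlation analysis concrete, here is a sketch of how one could record the cell states and correlate them with an input. This assumes a trained single-layer torch.nn.LSTM called `lstm` and one input sequence `x` of shape (365, 5); treating column 0 as precipitation is just an assumption for the example:

```python
import torch
import numpy as np

h = torch.zeros(1, 1, lstm.hidden_size)
c = torch.zeros(1, 1, lstm.hidden_size)
cell_states = []
for t in range(x.shape[0]):
    # feed one day at a time so the cell state c_t can be recorded each step
    _, (h, c) = lstm(x[t].view(1, 1, -1), (h, c))
    cell_states.append(c.squeeze().detach().numpy())
cell_states = np.stack(cell_states)        # shape (365, hidden_size)

# correlation of every cell state with the precipitation input
for i in range(cell_states.shape[1]):
    r = np.corrcoef(cell_states[:, i], x[:, 0].numpy())[0, 1]
    print(f"cell {i}: r = {r:+.2f}")
```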

A possible different approach could be to look directly at the weight matrices of e.g. the input gate and the new cell candidate for the input variables (W_i and W_c in Equations 3 and 4). From these matrices it would be possible to inspect the exact influence of a certain input variable on a certain cell state. I never looked into it (rather just thought of it now), but e.g. for the cell state in the image above one would expect a high (negative) weight for the precipitation variable and an influence of maybe the temperature or solar radiation (which may influence the gradient in the non-precipitation periods). However, what I did so far was to look at whether concepts we know and implement in traditional hydrological models emerge in the LSTM as well. This is easy, because we know what to look for. But finding something new that could then potentially be adapted into traditional hydrological models is a harder task, and I'm not sure how soon this will happen ;) With the code, I also plan to release the weights of all trained models for all experiments. Maybe I should add a Jupyter Notebook that shows how to extract the cell state time series, so everybody could play around a bit. Maybe the swarm intelligence will help to reveal something new!
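For anyone who wants to try the weight-matrix idea: PyTorch stacks the input-to-hidden weights of torch.nn.LSTM gate-wise as (W_ii|W_if|W_ig|W_io), so the relevant blocks can be sliced out directly. The cell index and the input column below are made up for illustration:

```python
import torch.nn as nn

lstm = nn.LSTM(input_size=5, hidden_size=20)   # stands in for a trained model
H = lstm.hidden_size

# gate order in weight_ih_l0: input gate, forget gate, cell candidate, output gate
W_i = lstm.weight_ih_l0[0 * H:1 * H]   # input gate      (W_i, Eq. 3)
W_c = lstm.weight_ih_l0[2 * H:3 * H]   # cell candidate  (W_c, Eq. 4)

cell, precip = 7, 0   # hypothetical cell index and precipitation column
print("input-gate weight:    ", W_i[cell, precip].item())
print("cell-candidate weight:", W_c[cell, precip].item())
```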

I hope I could answer your questions; if not, please let me know.

P.S.: Can I maybe also ask a question? Did you (everybody who reads this) understand the concept of an epoch? To my surprise this seems to be a hard-to-get concept, and it is now also asked about by the first reviewer. It was also hard at first for my non-ML co-authors, and over many iterations I thought I had found a good explanation (Section 2.2, Page 7, Line 29ff). But it seems not, and I would like to improve it. Maybe a concrete example with numbers would help? E.g.: if the training set has 1000 different training samples and we choose a batch size of 10, this would mean that one epoch consists of 100 iterations (1000 / 10), and in each iteration we draw 10 random samples without replacement from the 1000 samples we have in total. Is this helpful?
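Or the same bookkeeping as a tiny plain-Python sketch:

```python
import random

samples = list(range(1000))   # the 1000 training samples
batch_size = 10

random.shuffle(samples)       # drawing without replacement = shuffle once ...
batches = [samples[i:i + batch_size]
           for i in range(0, len(samples), batch_size)]   # ... then slice
print(len(batches))           # -> 100 iterations, which together form 1 epoch
# after these 100 iterations every sample has been seen exactly once
```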

Update: Daniel Klotz, my co-author, contacted me yesterday and told me that he recently found another paper dealing with the topic of understanding the internals of RNNs/LSTMs. It is Strobelt et al. (2016), and in their paper they mention another publication by Li et al. (2016). I didn't read them, so I can't comment on the content; I just saw that both focus on the text domain as well.


narock commented 6 years ago

Thanks @kratzert! This is very helpful and answers my questions. And thanks for the additional references. I had not seen those papers. Will definitely take a look.

I thought the notion of an epoch was well defined, although it does (for me at least) become even clearer with your concrete example above. I think that would be helpful if you're able to fit it into the paper. I was surprised to read that the samples do not need to be ordered chronologically. I had not realized that LSTMs could operate in that manner.

kratzert commented 6 years ago

Glad that I could answer your question, @narock, and yes, I'll try to get this passage concerning epochs into the adapted manuscript after the review. It shouldn't be a problem, because the first referee actually commented that the concept of epochs is not clear to him.

Concerning the shuffled training: just to make sure there is no misunderstanding (please tell me if you understood it differently from the manuscript), having random samples in a batch is possible because a) each sample (consisting of a 365 x 5 matrix containing the 365 days with 5 meteorological variables each day) contains all the information needed to predict the discharge at the last day of the input sequence, and b) every prediction starts with empty hidden state and cell state vectors (vectors of zeros). Therefore, in a forward and backward pass it is possible to process e.g. 512 non-continuous days, because they can be processed in parallel, sharing the same network parameters but otherwise completely independent of each other. The gradient that is used to update the parameters is calculated by averaging over the 512 samples in one batch.
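To make the independence concrete, a sketch with dummy tensors (not the real data pipeline): one parallel forward pass over 512 randomly drawn samples, each starting from its own zero state, with one averaged loss driving the update.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=5, hidden_size=20, num_layers=2, batch_first=True)
head = nn.Linear(20, 1)

x = torch.randn(512, 365, 5)   # 512 non-continuous samples, 365 days x 5 inputs
y = torch.randn(512, 1)        # observed discharge at each sample's last day

# every sample starts from h_0 = c_0 = 0 (PyTorch's default when no state is
# passed), so the 512 sequences are independent and share only the weights
out, _ = lstm(x)
pred = head(out[:, -1, :])

# the MSE loss averages over the batch, so the gradient used for the
# parameter update is the batch average as well
loss = nn.functional.mse_loss(pred, y)
loss.backward()
```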

And maybe a more intuitive explanation of why it helps to shuffle the training data: take the models of experiment 2. If we did not shuffle, the model would, for some iteration steps, see data from only one basin A. In these steps the network would adapt its parameters to function well for this specific catchment. Then it would see the data of the next catchment B and adapt to that catchment, eventually forgetting what it had learned before about predicting basin A. And so on. If we shuffle the data so that one batch contains random data points from all basins, the network learns to predict the discharge well for all basins at the same time and thus converges faster to a better overall model.
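In PyTorch, this shuffling is just a flag on the data loader; a sketch with made-up arrays, where the samples of all basins are pooled first:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# pooled (dummy) samples from all basins; shuffle=True re-shuffles every
# epoch, so each batch mixes basins instead of streaming basin A, then B, ...
x = torch.randn(10000, 365, 5)
y = torch.randn(10000, 1)
loader = DataLoader(TensorDataset(x, y), batch_size=512, shuffle=True)

for batch_x, batch_y in loader:
    pass   # one optimization step per mixed batch
```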

danklotz commented 6 years ago

Hi everyone,

I am merely a co-author, but wanted to chime in with an alternative view, which might be useful if you come from an environmental-modelling perspective. It is a bit hand-wavy, but here is the take:

In section 2.1.1 we gave a "hydrological interpretation" of the LSTM. One can view the calibration process similarly. For the sake of the argument, let us ignore the parallel aspect for now and assume that the given model has a short enough warm-up period (the time it takes the model to become independent of the states at start-time).

We could then, in theory, calibrate any model by letting it run for a given period (e.g. a year) and just evaluate the last time steps (e.g. the last day of the year). Then, the evaluation points could be chosen randomly from the available data.

Alternatively, a "low-cost approximation" of the process might be applied by letting the (hydrological) model run for the entire period and choosing a set of evaluation points randomly. Then remove these points for the next iteration of the optimization and choose another set of points randomly from the remaining time series. This process is then repeated until every point of the entire time series has been sampled. Once this is done, the epoch is complete and the process can begin anew.
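As rough pseudocode (purely illustrative; the model run and the parameter update are deliberately left abstract):

```python
import random

def epoch_style_calibration(n_points, points_per_iter=10):
    """Sketch of the sampling schedule only; the model run and the
    parameter update are left abstract on purpose."""
    remaining = list(range(n_points))
    random.shuffle(remaining)
    while remaining:
        eval_points = remaining[:points_per_iter]   # random, no replacement
        remaining = remaining[points_per_iter:]
        # here: run the (hydrological) model over the entire period, compute
        # the error only at eval_points, and update the model parameters
        print("evaluate at", sorted(eval_points))
    # once every point of the time series has been used, one 'epoch' is done

epoch_style_calibration(n_points=365)
```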

That might not be exactly the same, but it would approximate the process somehow. I can think of several reasons why this is not done in practice, e.g. practical aspects such as computational costs (in general much higher for environmental models) and the lack of an efficient way to learn from specific error points (no backpropagation). But I believe there are also philosophical reasons for this, as path-dependency is usually not acknowledged within the calibration process for hydrological models.

P.S.: Seeing your enthusiasm and engagement is very nice!

ebgoldstein commented 6 years ago

@kratzert: I thought your description of an epoch in the paper was good, and the concrete example in this discussion is very helpful (though I admit I am already familiar with the concept). Thanks a zillion for providing links to the papers on evaluating a time series of cell states. I would be super grateful and interested to see a Jupyter nb showing how to do it. To me, this is a clear way to show people that the network is successfully emulating aspects of a traditional mechanistic model. And it's interesting that the network is being trained in a way that gives such a large role to a single cell (vs. distributing across multiple cells).

Also re: the ML vs. traditional model debate, I just saw this paper: Baker RE, Peña J-M, Jayamohan J, Jérusalem A. 2018 Mechanistic models versus machine learning, a fight worth fighting for the biological community? Biol. Lett. 14: 20170660. http://dx.doi.org/10.1098/rsbl.2017.0660

kratzert commented 6 years ago

@ebgoldstein Thanks for the link. The title is catchy, and I think it is a question that will be asked in many fields in the near future. Regarding interpretability: another two methods I was introduced to during the last weeks are layer-wise relevance propagation (LRP) [1] and integrated gradients [2]. And because things have moved fast for me since I switched university, I'm already in the middle of writing a manuscript on the interpretability of LSTMs in the context of rainfall-runoff modelling. I'll let you know as soon as I have something to publish, but since it will probably be part of a book, I don't know what their policy is regarding pre-prints :/ (I'll ask them!) Anyway, there will be some more work from me in this direction, and possibly something for the AGU as well, and at the latest for that I would publish everything as a reproducible notebook!

[1] Arras, Leila, et al. "Explaining recurrent neural network predictions in sentiment analysis." arXiv preprint arXiv:1706.07206 (2017).

[2] Sundararajan, Mukund, Ankur Taly, and Qiqi Yan. "Axiomatic attribution for deep networks." arXiv preprint arXiv:1703.01365 (2017).
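If anyone wants to play with integrated gradients [2] right away, the method itself is only a few lines. Below is a back-of-the-envelope sketch (a plain Riemann-sum approximation I wrote for this comment, assuming `model` maps a (batch, 365, 5) tensor to one value per sample; not the code from our manuscript):

```python
import torch

def integrated_gradients(model, x, baseline=None, steps=50):
    """Riemann-sum approximation of integrated gradients (illustrative only)."""
    if baseline is None:
        baseline = torch.zeros_like(x)      # all-zero input as the reference
    total_grads = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        # interpolate between the baseline and the actual input ...
        point = (baseline + alpha * (x - baseline)).requires_grad_(True)
        # ... and accumulate the gradient of the output w.r.t. that point
        model(point).sum().backward()
        total_grads += point.grad
    return (x - baseline) * total_grads / steps   # per-day, per-input attribution
```

E.g. with a model like the sketch earlier in this thread, `integrated_gradients(model, torch.randn(1, 365, 5))` returns an attribution of the same shape as the input, one value per day and input variable.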

narock commented 6 years ago

Thanks again @kratzert for a great discussion. Please do let us know how your research progresses. I would also be interested in seeing a Jupyter nb if that is an option for you.

kratzert commented 6 years ago

Thanks from my side as well. Although I had hoped for more people to ask questions, it was fun to answer yours and talk about my project. I'll update you as soon as there is anything relevant to tell. Regarding the notebook: what exactly are you wishing for? A notebook on how to train such an LSTM? Or something more related to interpretability?

Edit: In case of further questions (also from others), feel free to post them here or contact me somewhere else, and I'll do my best to answer ;)

narock commented 6 years ago

I was hoping for more discussion as well. This is the start of our discussion group and we're still sorting out the logistics. Yours was the very first paper we've reviewed. It's also the summer, and I imagine several people are away. Thanks for helping us get started and for getting this discussion group off the ground.

Personally, I'm interested in learning more about LSTMs as well as exploring explainability/interpretability. I'd be interested in whichever notebooks you're able to share. Seeing how you trained the LSTM would be a helpful validation of the tests I'm doing.

SimonTopp commented 6 years ago

Hi @kratzert, I'm a little late to the discussion, but @ebgoldstein just forwarded your paper along to me and told me about the journal club. Overall I found the paper super interesting, especially after getting some more contextual information through your responses to questions here. I'm not sure where you're at with the manuscript, but I think some form of the blurb you posted regarding shuffled training would be a nice addition. One quick question regarding the application of LSTM models to geoscience problems with strong memory components: my understanding is that using 365 days will capture seasonality and the seasonal (i.e. memory) impacts of the meteorological variables on runoff. How feasible would it be to incorporate slower-moving memory signals like aquifer depletion and recharge, or ENSO? How much do the computational requirements increase as you increase the duration of memory you're trying to capture? Overall, really well done! I look forward to reading some of your future work on translating ML models into the text domain.

kratzert commented 6 years ago

Hi @SimonTopp

I'm glad to hear that you found our work interesting, and I hope I can help you with the following explanation. I'll maybe start more generally with computational costs. Adding additional time steps of input would increase the processing time roughly linearly, because each time step just repeats the same 5-6 equations. Adding more input features should not increase the processing time linearly (I say this without much testing and purely from intuition), since the number of operations stays the same; only the rows/columns of the weight matrices grow, and the mathematical operations (e.g. matrix-vector multiplication) are heavily optimized by the underlying hardware accelerator. Not sure if this answers part of your question. In general you are of course right: if we only provide 365 days of meteorological signals as input, only trends derived from these data can be learned and captured by the network. But if you have more data that influences the runoff generation, you can of course add this information as an additional input feature at each time step. (I don't know what ENSO stands for, but e.g. aquifer level/depletion could be a good input feature.)
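If you want to check the roughly-linear-scaling claim yourself, here is a toy benchmark (dummy network and data; absolute timings will vary by machine):

```python
import time
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=5, hidden_size=20, num_layers=2, batch_first=True)

for seq_len in (365, 730, 1460):          # 1, 2 and 4 years of daily input
    x = torch.randn(64, seq_len, 5)
    start = time.perf_counter()
    with torch.no_grad():
        lstm(x)
    print(f"{seq_len:5d} steps: {time.perf_counter() - start:.3f}s")
    # the forward-pass time grows roughly linearly with the sequence length
```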

I'm currently on parental leave because I became a father a month ago, but I now have to start working on the reviews. I'm still open to any comments, and I will see if I can adapt the passage about shuffled training.

If you have any further questions, or if I completely misunderstood what you asked, please feel free to contact me at any time.

Best, Freddy