I will handle the editing of this submission.
@MehdiKhamassi @degoldschmidt would you be interested to review this submission?
@benoit-girard Thank you for the invitation! Unfortunately I am currently too busy. So I have to decline. Best wishes
@benoit-girard if you don't mind, I'd be happy to jump in and review this submission. The intersection of recurrent networks and reward-modulated/reinforcement learning matches my interests, and I'm fluent in Python (and Matlab) as well.
@MehdiKhamassi : OK, no problem. @schmidDan : Thanks for proposing, as there is no obvious conflict of interest, and as you seem to have the right expertise, you're welcome as reviewer 1!
@piero-le-fou @ChristophMetzner @stephanmg @junpenglao @thmosqueiro @rgutzen Is one of you interested in reviewing this submission?
Sorry, I don't have time to do a review at the moment.
@benoit-girard What is the expected deadline for a review to be finished? Right now I've little time, but would still be interested.
@ChristophMetzner Thanks for answering! @stephanmg We have no general policy (so far) at ReScience C. Do you think you could do it within 3 weeks? i.e. aiming at the 15th of April.
@benoit-girard thanks for the kind words. I'm right now a bit swamped with other tasks. I could start reviewing the submission starting from 1st of May. (I guess this might be too late)
@stephanmg Well, if I can find another reviewer who can handle this submission earlier, I will rely on her. Otherwise, I will solicit you.
No worries, I'd be happy to do it, just the time concern right now. So let's see how it goes.
@benoit-girard, I'd also be happy to review this submission, and likely could get to it within the next week(s).
Perfect! Thanks a lot @rgutzen you are reviewer 2, then. @stephanmg we won't need your help (this time).
@benoit-girard @rsankar9 Here is my review. I hope you find it helpful, and I'm happy to discuss if any points are unclear.
Overall, this is a very clearly structured paper and replication. Especially the investigation of the model limitations is very valuable. It's easy to read and to understand. The reimplementation is understandable and runs out of the box. I have some comments, mainly concerning how the agreement of the reproduction/replication is evaluated and presented.
@rsankar9 I enjoyed reading and reviewing your submission. Manuscript and code are well written! I had some things to point out in both, code and manuscript. I'd consider them as minor improvements to the quality of the contribution, as the validity of your main claims are assured nevertheless. Please find my review below @rsankar9 @benoit-girard .
@rsankar9: did you have the opportunity to update your submission based on the reviewers' comments? Moreover, as asked by @schmidDan, do you have the authorization of the authors of the original paper to publish their MATLAB code here?
@schmidDan sorry for being so late to answer your questions:
Thank you for the very thorough and constructive reviews. I do apologise for my delayed response. I have made several changes to the manuscript, based on your reviews.
More specific questions have been addressed inline for each review. The code and manuscript have been updated in the repository.
@benoit-girard : I do have the permission of the authors to publish their MATLAB code here. However, I do not have the permission to use the images from their paper (the trajectory traces). I shall contact them regarding this and keep you posted.
@rgutzen : Thank you for your review! Please find below my responses to your comments (inline).
@benoit-girard @rsankar9 Here is my review. I hope you find it helpful, and I'm happy to discuss if any points are unclear.
Overall, this is a very clearly structured paper and replication. Especially the investigation of the model limitations is very valuable. It's easy to read and to understand. The reimplementation is understandable and runs out of the box. I have some comments, mainly concerning how the agreement of the reproduction/replication is evaluated and presented.
Replication/Reproduction
* It is great that you perform both a reproduction with the original code and a replication with your own implementation and separate clearly between them. Just in the discussion, their evaluation could be described more precisely.
* The main issue I see in the submission is the lack of explicit criteria used to judge how successful the reproduction and replication are. You use vague qualitative assessments such as "successfully closely", "function as presented in the paper", and "model does reproduce the results"; however, it is unclear how you arrive at this evaluation and what it entails in detail. Which features of the trace deviation are relevant, or just the average deviation from the target? Are some deviations from the target trace worse than others? You should preferably aim for a quantitative evaluation of the similarity of traces that incorporates the relevant performance measures of the model, or, if you provide a qualitative evaluation 'by eye', it is helpful to explain which features of the model behavior you focus on. The figures also do not show a 'distance from target' measure as in the original paper, which would already be very helpful in that regard.
This is a very valid point and I understand the concern regarding the evaluation criteria. The replication was considered successful by visually inspecting the plot of the MSE of the trajectory, i.e. the deviation of the simulated trajectory from the target trajectory. To address this in a more quantitative manner, I have included two metrics in the manuscript now.
Hopefully, these two criteria should improve the quantitative assessment of the performance of the different implementations.
I've supplemented the vague terms "successfully closely" with these two metrics, and substituted them with satisfactory/unsatisfactory wherever applicable.
I've included the MSE plots for the simulations, along with the mean deviation over the test phase. This should address whether some anomalies in the test phase are better or worse than others.
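For concreteness, here is a minimal sketch of this kind of deviation metric, assuming the simulated and target trajectories are available as equally sampled NumPy arrays (names and shapes are illustrative, not taken from the repository):

```python
import numpy as np

def mean_deviation(trajectory, target):
    """Mean squared deviation of a simulated 2-D trajectory from the target,
    averaged over the test phase; both arrays have shape (T, 2)."""
    return np.mean(np.sum((trajectory - target) ** 2, axis=1))
```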
Shouldn't the traces in the paper be reproduced exactly by using the same Matlab code and seed?
This is a valid point, which we had also initially expected. However, this is not the case due to the differences between the random number generators in MATLAB and Python. While we were able to replicate the exact same behaviour for most MATLAB functions, we were unable to do so for one particular function, `sprandn` (also mentioned in the paper). This results in the initialisation of the J matrix being different and, hence, in the differences in the final trajectories. To address this query better, I've included the results of one simulation where I've plugged the J matrix produced in MATLAB into the Python adaptation, with the same random seed. You will see that in doing so, we obtain the exact same results.
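As a rough illustration of this cross-check, a J matrix saved from the MATLAB session can be loaded on the Python side roughly as follows (the file and variable names here are assumptions for illustration, not the repository's actual ones):

```python
import numpy as np
from scipy.io import loadmat

# Load the recurrent matrix J produced by the MATLAB script
# (assumes it was saved there with `save('J_from_matlab.mat', 'J')`;
# MATLAB sparse arrays come back as scipy.sparse matrices).
J = loadmat("J_from_matlab.mat")["J"]
if hasattr(J, "toarray"):
    J = J.toarray()

# J is then passed to the Python model in place of its own initialisation,
# keeping the seed and all other parameters identical.
```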
* Using such a more quantitative evaluation the comparison of paper, Matlab, and Python version (+ modified) could be more differentiated. For example, from the figures alone the Python version seems to have more variation (be less stable/ more chaotic) than the Matlab version, nearly throughout the tasks (especially 1&2). This is however not addressed or explained.
* Another point, that is mentioned but apparently not really factored into the evaluation of the model is its dependency on the initial seed. A model that deviates from its intended behavior in 50% of cases just because of the choice of seeds seems jarring, especially as it is not reported in the original paper. It lets the reader wonder to what degree the original results were accidental and cherry-picked. To better grasp the influence of the seed and robustness of the model I'm missing the figures of the runs with different seeds in the paper (e.g. as supplements) and in the best case a quantification of the variation due to the chosen seed.
Text
* It would be helpful for the reader to state the level of reproduction and replication already early in the intro. Something in the realm of 'successful with minor/moderate limitations and modifications'.
* Missing an explanation of 'e' in equation 2.
* "verify performance of the model": This formulation is somewhat confusing to me. Evaluating a model's output against another would rather be a validation instead of a verification, and performance may also refer to the usage of computational resources instead of accuracy. I would suggest a reformulation explaining more precisely what you are doing, e.g. evaluate the accuracy/robustness/behavior of the model / validate the model against its Python re-implementation / verify whether the original script produces the published results.
* 2.1, l.7: "Post the training phase". When is training stopped? Is the training time fixed? What are the criteria?
* Figure 4 is not referenced in the text.
* Where the agreement of the model outputs is described as "successfully closely" or similar, a more precise statement based on previously defined criteria would be suited (as described above).
* Similarly, the discussion needs to be more differentiated and transparent in how you come to your conclusions, e.g. "For Task 1 and 2, we can now confirm that the three algorithms function as presented in the paper. For Task 3, the SUPERTREX model's behaviour is also reproducible, ..."
Typos/Grammar/Suggestions
* l.5: fail -> fails
* l.6: utilise -> use
* In the first sentence of the task descriptions, e.g. "Here, we compare the simulations of the author scripts and our re‐implementation for Task 1 using FORCE, RMHL and SUPERTREX, with ...", the last comma is not needed.
* to initialise -> initializing in "To do so, we run the simulations with the default seed and repeat it ten times with different seeds to initialise the random number generator." Otherwise it may sound like you run the simulations to initialize the RNG.
* demarcates -> marks the separation of
* In 3. Modification: use numbered bullets instead of "* One, ..."
* In 4. Discussion, l.3: out -> our
Figures
* This might not be technically possible, but it would help with the visual inspection of the butterfly traces if all three instances (Matlab, original, Python) would have more similar linewidths and colors (and same butterfly height/width ratio).
* As already mentioned above, the figures would benefit from also including a distance-from-target trace, similar to the MSE in figure 5.
* Why don't you also show the time series from the paper for comparison?
* Fig. 2 caption: "shows the actual output". Aren't both rows the actual output, just in different representations? Maybe reformulate as 'time series'.
* Fig. 3, a) & c): missing time series. Is there a particular reason for that?
* Fig. 5: Scale of MSE is very hard to read.
* And again as a reader I would be very interested to also see the runs with other seeds or some aggregate evaluation of them.
Code
* Regarding the code and the repository, there is very little to object to from my side. It is well structured, documented, easy to reuse, and reproduces all figures.
* I wouldn't mind having a requirements.txt or environment.yaml file instead of the specification in the Readme.
Included a requirements file.
* The ten arbitrary seeds for the RNG are chosen separately for the different models, so the comparisons are both between models and seeds when they are not fixed (which can have a major influence, as you showed). This, however, doesn't become clear from the description in the paper.
* Just for curiosity, why did you choose a custom Python implementation of the model, what kept you from using a simulator engine like Brian, Nest, or even PyTorch?
@schmidDan: Thank you for your review! Please find below my responses to your comments (inline).
@rsankar9 I enjoyed reading and reviewing your submission. Manuscript and code are well written! I had some things to point out in both, code and manuscript. I'd consider them as minor improvements to the quality of the contribution, as the validity of your main claims are assured nevertheless. Please find my review below @rsankar9 @benoit-girard .
Review Summary
* Full replication | Partial replication | Failed replication | Reproduction: I believe it's a "Partial replication", but I have two things to note here @benoit-girard:
  * The authors mention in their manuscript's section "2 Comparison with Python Re-implementation" that they had access to the Matlab code of the original publication's authors. This essentially introduces a bias to the replication, which became evident in that the authors made modifications to their Python code based on how the Matlab code differs from the methods described in the original publication. So I believe that without the Matlab code at hand it would have been a "Failed replication" due to the seemingly missing information in the original publication. The authors consider their partial replication a "Re-implementation", which sort of reflects this circumstance, and they very well point out the differences they discovered between the original paper and the Matlab code. I would therefore consider it a successful partial replication, but wanted to point out this circumstance nevertheless.
  * Is there a way to denote the difference between full and partial replications in the article's tag (i.e. [Re] vs. [Re\] or something)? So far the manuscript is tagged as a "Replication" despite not replicating all experiments (i.e. "Partial replication").
I agree with this assessment. It's a partial replication for two reasons.
* Licensing: BSD-2-clause license for both the Matlab code (by Pyle and Roberts) and the Python code (by the contributors of this submission).
* Reproducibility of the replication: I was able to run the Python code without any problems. The Matlab code did not work under GNU Octave 6.2.0. As Matlab is closed-source, I'm not sure what the directive would be here either way @benoit-girard?
* Clarity of the code: Overall well structured and documented code.
* Clarity and completeness of the accompanying article: Concise and easy to read, captures the main claims and contributions of the original publication, and points out robustness issues encountered in certain experiments as well as a possible solution. Some points could be a little bit clearer (see my comments below).
Remarks w.r.t. the manuscript
* Layout: Reading through the references is a little bit complicated, since LaTeX obviously cluttered some graphics in between the list of references (Figs. 5, 6). The figures should be placed before the "References" section starts.
* References: Reference 1 is missing the journal name and is formatted differently than refs 2-6 (title in bold instead of journal, no quotation marks around title, etc.). Some journal names are in title case, others in sentence case.
Fixed.
* Framework: * "This allows for partially unsupervised learning" (p.2) - what is meant by "partially unspervised learning"? * The article should make for a self-contained read without reiterating every detail from the original publication. For my taste, this would include stating the complete model equations, i.e. reservior equations, computation of `z_1`, `z_2` from `r`, and mentioning the numerical solver used to simulate the system of differential equations, so that the reader knows about the meaning of the weights updated by eq.1-3. * While FORCE and RMHL acronyms are epxplained, this was omitted for SUPERTREX (p.2).
* Task: * "each tasks uses [...] multi-segmented arm" (p.2) - some tasks are based solely on the pen's position. Admittedly, one could view that as a 0-segment arm condition, though. * Concise language: "A non-linear inverse transformation would be required" (p.2) - transformation of what? having read the original work this is clear to me, but might not be clear for someone who doesn't know/forgot about the details in the original work. * Reading this section as well as "Task 1" and "Task 2" it might seem to the reader as if RMHL and SUPERTREX are learned with non-scalar error values for Task 1, as for Task 1 it is solely noted that "known target" output is used for training while for Task 2 it is explicitely stated that the error is a scalar. Indeed, for Task 1 the error (to the reward-modulated learner) is as well of scalar type. Critically, though the distinction comes from how the scalar is computed: output and target domain match in Task 1 - distance of the pen's position is indeed the target value of the output - the error couples linearly to the correct solution's output values -, while for Task 2 and 3 the output domain is angular values, but the target domain is the end effector's position - the error is non-linearly coupled to the correct solution's output values.
* Task 1,2,3:
  * The explanation should note that during testing the feedback is provided via teacher forcing (cf. the original publication), since it was shown to have quite an important impact on the model performance.
* Task 3: * "We compare the performance of the MATLAB scripts and our Python adaptation for the three algorithms on Task 2, with the results presented in the article." - This sentence isn't clear to me. I believe it should say "Task 3" instead of "Task 2" here? Also describing your Python (re-)implementation as an "adaptation" might be a little bit confusing in the light that you later on indeed modify (i.e. adapt) your Python implementation to increase robustness. * I don't believe the text accurately reflects the degree to which the results are in line with the original publication here. While the later sentences acknowledge a deviation from the target contour, the summarizing introduction ("The MATLAB scripts and the Python re‐implementation are able to successfully reproduce the results presented in the paper with the default seed as well as with the 10 arbitrary seeds for the RMHL algorithm.") raises the impression, that the results in the orginial publication look about the same. Indeed, they don't, which is also why you changed the seeds (which nevertheless do not make up for the complete discrepancy w.r.t. the original publication). See also my comments w.r.t. "Results" in "Remarks w.r.t. the implementation".
* Modification:
  * Figure 4 is never referenced, but I feel like it should be noted somewhere within this section?
  * "We observe that RMHL performance is comparable to the original Task 2, [...]" - this is nowhere shown, right? So maybe add a "(not shown)".
  * The "two minor alterations" you stated refer to Tasks 1-3 throughout the section. To me it is not entirely clear whether the results of the previous sections have been generated with or without these alterations. From the structure of the manuscript and the labelling of the figures I would expect that the alterations mentioned are only applied for experiments conducted within "3 Modification". But, on the other hand, why would one state then how these alterations relate to Tasks 1-3?
  * It is great to have the two alterations stated as a bullet point list. What might be a little confusing is the reiteration of the two points (with an added interpretation of them) in the subsequent paragraph, as by using "also" in "We also increase the error threshold [...]" it seems, on a first look, like here comes an additional alteration on top of the ones from the bullet point list, while it is indeed the second item of the list.
  * "[...] the model is able to perform well with up to 50 arm segments (Figure 5)." - Again, I believe "well" is a little bit optimistic for some of the results. For e.g. 10 arm segments (Fig. 5) I wouldn't use the word "well", while for e.g. 6 arm segments this seems to be justified.
* Figures: * No indication for the temporal extent of the plots is given (x-axis) (cf. original publication).
I have added this information in the caption now.
* I believe the "Original" butterfly plots are taken from the original publication. If so, this needs to be stated (and potential permission issues would have to be clarified with the holder of their copyrights).
It's true. I will contact the authors and keep you posted. Meanwhile, I have added a footnote indicating these plots have been re-used from the original paper.
* The "Original" butterfly plot looks squeezed w.r.t. the horizontal extent. Is this just a visual feat, or as well corresponding to differencees in the underlying implementation?
I use an aspect ratio of 1:1. Also, the scripts of the authors seem to use an aspect ratio of 1:1. So the plots from the paper perhaps look squeezed due to the way the journal prints them.
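For illustration, the 1:1 data aspect ratio can be enforced in Matplotlib along these lines (a minimal sketch with an arbitrary closed curve standing in for the butterfly trace, not code from the repository):

```python
import matplotlib.pyplot as plt
import numpy as np

theta = np.linspace(0, 2 * np.pi, 1000)
r = 2 + np.cos(4 * theta)                # placeholder closed curve
fig, ax = plt.subplots()
ax.plot(r * np.cos(theta), r * np.sin(theta))
ax.set_aspect("equal")                   # 1:1 data aspect ratio, so the shape is not squeezed
plt.show()
```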
* For viewing the document digitally, it might be beneficial to use vector graphic plots instead of pixelated information. Note the difference when zooming in into the figures. Just in case this is easily realizable and doesn't require too much effort.
All the figures are in "eps" format now.
* The different plots within a figure aren't drawn to the same scale (note the difference in the "MATLAB" and "Python" red butterfly outlines, or the vertical scaling of e.g. x and y axes), which makes it harder to visually judge relative performance. Same goes for the horizontal scaling of the temporal axes where weight changes are depicted (i.e. the gray vertical bars don't align with those of the angular evolutions).
I have now aligned the scales for corresponding plots.
* No plots for the evolution of the performance metric (distance from target) and weights (norm of the weight matrix) are shown. This would aid the comparison with the original publication and, I believe, would make the case for a successful replication even stronger.
I have added the MSE subfigures to all tasks, and the W_norm plot to all tasks except task 3, due to space/legibility constraints.
* Some plots (in Figs. 4,5) have scales for their vertical axis plotted, which is great. Unfortunately, the ticks of these scales are too small and squeezed to be interpretable.
Improved.
* For some of the temporal plots a slightly increased linewidth would be helpful.
Fixed for all plots.
* Up to taste: Using the same color tone (same blue) for "MATLAB" and "Python" temporal evolution plots would be pleasant.
Fixed for all plots.
* Figure 1: It seems like the "MATLAB" and "Original" butterfly plots consist of one period of drawing the shape (blue line), while "Python" seems to have multiple periods visualized. Is this the case, or are the different periods just aligning perfectly?
* Figure 2: * For the plots from the original article there seems to be some vertical gray artifact line to the left of each plot (it vanishes and appears w.r.t. my reader's zoom factor).
I was unable to reproduce this. I have re-uploaded new figures, in any case. Could you let me know if the issue persists?
* The caption seems to have a copy-paste error: "The second row shows [...] (x and y coordinates, in this case) [...]" - actually, the angles are depicted (according to the vertical axes' tags).
Fixed.
* Figure 3: * For the plots from the original article there seems to be some vertical gray artifact line to the left of each plot (it vanishes and appears w.r.t. my reader's zoom factor).
I was unable to reproduce this. I have re-uploaded new figures, in any case. Could you let me know if the issue persists?
* Suplot "(a)"/"(c)": "The target time-series is imitated well by the model during the training phase [...]" - this information is not shown and should be denoted as such (i.e. "(not shown)").
Fixed.
* Subplot "(b)": The y-axes of the plots of evolutions of the angles are somehow having an additional, squashed tag for each angle on to their left. These are probably due to how the plots have been generated.
Fixed.
* Subplot "(c)": I'm not sure whether "[...] with slight divergences [...]" does the subplots do justice. Qualitatively, the results are still of butterfly shape, but quantitatively, I believe, the performance metric will be quite impacted during some phases of each period (not shown).
I've included a deviation metric now, both in the caption and as an aggregate summary in a table. This should hopefully justify the statement.
* Figure 4: * The differently colored lines of the evolution of the matrices' norms aren't described (i.e. green = exploration, purple = mastery pathways / `W_1` and `W_2`).
Fixed.
* Discussion: * "Only two necessary details were missing, [...]" - if they have been provided by the original authors upon request, it would be worth to mention it. Or have they been found by trial-and-error?
* "[...] the inclusion of a crucial learning rate of 0.0005 [...]" - can you explain which learning rate this refers to? If I read the original publication's method section correctly, then they state a learning rate of k=0.5 for Tasks 1 and 2 and k=0.9 for Task 3 (p. 1457).
* "[...] and a compensatory factor of 0.5 within the update of the readout weights of the exploratory pathway [...]" - I'm not sure what this exactly means wihtout an equation. I believe it's just another mulitplicative factor to the update. If so, why wouldn't one state it as part of the learning rate then? I.e. "[...] the inclusion of a crucial learning rate of 0.0005 for Task 1 and 0.00025 for Tasks 2 and 3". [edit: ...reading your code as well I understood that you treat it as a separate factor, since it is encapsulated in a "compensation" method.]
* "For Task 3, the SUPERTREX model's behaviour is also reproducible, [...]" - question to clarify: reproducible or replicable?
* Smaller comments:
  * fully supervised vs. fully-supervised: inconsistent usage of a hyphen throughout the text.
  * Inconsistent usage of language: sometimes you're referring to the Python code as "re-implementation", sometimes as "adaptation". Or are these two referring to different parts of your code? If yes, which ones?
  * Discussion: "provided by the author" should probably be plural ("authors") or "by the corresponding author"?
  * Task 1: "The article claims that :" - a whitespace between "that" and ":".
  * Task 3: "The article claim that" - should be "claims"; ":" missing at the end.
  * Discussion: "[...] with out modular and user-friendly Python replication." - should be "our".
  * Discussion: "[...] for task 3" - throughout the text "Task ..." was used as a proper noun, so it should be "Task 3" here as well.
Remarks w.r.t. the implementation
* I understood that the Matlab implementation was provided by the authors of the original publication. But did they as well agree to have it hosted on GitHub as well as being considered part of your submission to ReScience C?
* `README.md`:
  * The file mentions "The Python re-implementation has been submitted to ReScience C", but if I understand correctly I need to review the Matlab part as well to check whether the results in your manuscript hold. So, if the Matlab part has to be part of my review, isn't it then as well part of the ReScience C submission? And if so, as it is provided by the original authors, how does this relate to tagging this submission as Replication vs Reproduction (@benoit-girard)?
  * Requirements are stated in e.g. `Code/Python implementation/Reimplementation/README.md`. I would welcome having them as a proper `requirements.txt` (with versions enforced by "==") in the respective folders, as it would be the common way to tell a Python user which packages to install.
* Comparison of code with original publication:
  * Butterfly generator functions: Comparing with the original publication (Sec. 4.5 Tasks) it seems like there is a discrepancy between what was described and what is implemented in Matlab (respectively re-implemented in Python). They say `x(t) = r(t)*cos(t)` with `t` in 0 to 10^4 ms, while every generator method I checked in the implementation does `x(theta) = r(theta)*cos(theta)` with `theta` in 0 to 2pi. Maybe that would be worth a comment in the code or similar.
  * For the psi function (Sec. 4.5 Tasks) it is written that `Psi(x) = 0.025 * (10.0 * x)^0.25` is used for Task 3, while in the code (Matlab and Python) it is set to `Psi(x) = 0.005 * (10.0 * x)^0.25`. Maybe it would be good to point that out.
  * The distance from target was said to be computed from "sqrt(\bar{MSE})" (p. 1455 of the original publication). Comparing with the Python code (e.g. ModelFORCE.py:272,379), the "sqrt" operation seems to be commented out(?).
  * Smoothing time constant for `z` (\bar{z}) seems to be 10 according to the paper (p. 1455), while in the Python parameter files (e.g. "simulation_parameter_file_Task1_FORCE.json") it is `tau_z: 1`.
  * Comments: At some parts in the code there are comments like "can be removed" (e.g. ModelRMHL.py:165,174) or "Possibly wrong to use for" (ModelSUPERTREX.py:230), or unused variables like "unnec" in e.g. ModelSUPERTREX.py:117, or "smooth noise?" in Task2_ST_Seg2.py:134, which overwrites the result from Task2_ST_Seg2.py:133. Or "weird stuff in code" in Task.py:132. Are all/some of them meant to make it into the published version of the code, or should they be removed beforehand?
Regarding the butterfly generator functions: the original publication states `x(t) = r(t)*sin(qt)` with `t` in 0 to 10^4 ms, though. Would you agree?

* Results: While the results are as described in the accompanying article, taking a look at all the results in the different "Results" folders, it seems like the Matlab implementation yields higher performance (lower MSE) compared to the Python implementation (e.g. comparing the "MSE.png" files for "FORCE_Task1" it seems like Matlab reaches as low as 10^-6, while Python reaches just 10^-4). Do you have any idea why?

This is primarily due to the teacher forcing used in Task 1. I was not using the same feedback in the Python scripts, which led to this discrepancy. I have fixed that now, and added the deviation metric, which should provide further insight into this.

* The way of initialising "J" (e.g. ModelFORCE.py:106-113) seems a little bit complicated. I'm just wondering whether [`scipy.sparse.random`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.random.html#scipy.sparse.random) wouldn't be an alternative (using the respective `data_rvs` argument)?

It's true that it's convoluted, and technically it is definitely not necessary. However, there's also no shortcoming besides readability. The only reason I implemented it this way is that I was attempting a close adaptation, and this is the closest method I could find to the MATLAB implementation. It handles redundancy the same way as MATLAB, which is not the case with the SciPy function. Also, this way I'm able to monitor the calls to the random generator and align them better with the MATLAB code. So, if I replace just the initialisation of the J matrix with the one from the MATLAB simulation, the rest of the simulation progresses in a way that is identical to MATLAB. But, of course, it's definitely not necessary for the replication, and the SciPy function can be used too.
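For reference, a rough sketch of the alternative mentioned above, using `scipy.sparse.random` with `data_rvs`; the network size, density, gain, and variance-scaling convention here are assumptions for illustration, not the repository's parameters:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)      # illustrative seed
N, density, g = 1000, 0.1, 1.5      # illustrative size, sparsity, and gain

# Sparse recurrent matrix with standard-normal non-zero entries, scaled to
# variance g^2 / (density * N), a common convention for rate reservoirs.
J = sparse.random(N, N, density=density, random_state=rng,
                  data_rvs=rng.standard_normal)
J = (g / np.sqrt(density * N)) * J.toarray()
```

As noted above, this would not reproduce MATLAB's `sprandn` call sequence bit-for-bit, which is why the repository keeps the more verbose initialisation.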
Thanks a lot @rsankar9 for this revised version of the submitted paper. @schmidDan and @rgutzen are you satisfied with the new version of the paper and the answers of the authors, or do you require an additional round of review?
The revised article looks very good to me, and I'm satisfied with how all the review comments were addressed. Especially, the addition of the error traces is very helpful as well as the quantification of the results in the overview tables. For me, there is no additional review round required. I just noted down a few minor things for the revised version, whose fixing I'd consider however optional.
Fig. 1 caption: I don't quite understand "X‐axis scale:‐ 1s : 1 period"
In several figure captions you write "The horizontal grey line, in the test phase, indicates the deviation metric." which may be a bit ambiguous whether this metric is imposed or derived. To make this clearer you could write '... indicates the average MSE which is used as the deviation metric'.
page 8, bullets: missing whitespace 'n=11)and'; additional whitespace 'FORCE algorithm. The'
Fig. 5: It took me a bit to understand that there is nothing missing on the left Matlab side. Maybe, you'd like to add a note or indication to the plot that the simulation broke after x seconds, to make this clearer to the reader.
Thanks for the revision. I was occupied the last weeks. I will look at the updates by the end of this week, so that you can expect an answer by the beginning next week at the latest.
Thanks for your responses. @rgutzen I'll try to incorporate these comments, and upload the revised version by next week.
@rgutzen Thank you for your comments. I've made some minor changes in response to them.
@schmidDan did you have a look at the final version?
@benoit-girard @rsankar9 Thanks for your patience and apologies for the considerable delay (which was unforeseen on my end as well). I'm back in the loop now and have pulled the latest changes. I should be done with having a look at the final version by tomorrow evening.
@rsankar9 Thanks for your responses and incorporation of my previous comments, as well as your patience. The article as well as implementation make for a great read now.
Thanks to both the reviewers for your thorough and helpful comments. I've made some modifications to the manuscript. Hopefully these should address your concerns.
Review Summary
* The article as well as the implementation have been greatly improved since my last review. As well, all my points have been addressed satisfactorily. Additional (sub-)figures and clear metrics for judging the quality of simulation outcomes make for a great, understandable read. Given the specificities of the work and the code base provided by the original paper's authors, I appreciate that the article's authors tried to keep as many parallels as possible in their implementation w.r.t. the Matlab implementation - this will help potential users with cross-checking between the implementations and fosters understanding. Furthermore, the code is nicely formatted and documented with insightful `README.md` files. Overall, I believe, this qualifies the article and code at their current state for publication. From my side, no additional round of reviewing would be required. @benoit-girard
* While reviewing the article and code I nevertheless spotted some minor points, which I still want to point the authors to. In my view, these can be tackled without the need for another review. Please view these changes as optional either way.
Remarks w.r.t. the manuscript
* In Table 1 you mention `Note that for variant of task 2, the SUPERTREX statistics have been computed using only 2 simulations.`. I believe this is true for the `MATLAB` and `Python adaptation`, but probably you used all available 11 samples for the `Python re‐implementation` (and thus want to mention that)?
You're right. For the re-implementation, it uses all available samples. I've updated the caption to make the distinction.
* Some axis labels seem a little bit warped (stretched horizontally or compressed vertically)
I see what you mean. This is caused by the difference in aspect ratio between the original figure (produced by the scripts) and the figure displayed in the paper (scaled to fit by LaTeX). This can be solved by changing the aspect ratio / figure size when plotting the original figure. If it's ok, I would choose not to fix this, as it would require regenerating all the figures. Hopefully, the figures provided along with the manuscript, in their original aspect ratio, should be sufficient for readers who would like more details.
* It is nice to have the scale of the horizontal axis included in Fig. 1, bottom right. Probably the other figures would profit from a similar indication.
I've included it in the other figures now.
* Fig.4c subcaption could profit from mentioning which implementation has been used to produce the result.
The implementation used is the same as mentioned on top of the columns. I've added this information to the caption of Fig 4c to be clearer. I believe your doubt arises due to the difference in seeds. The reason for this is that since the default seed did not yield the desirable result (as shown in Fig 4b), I've shown a sample from one of the 10 extra runs. These runs use randomly chosen seeds for the generator and, hence, do not match.
* Fig.7 caption last sentence: "Using the MATLAB scripts, the readout weights increase uncontrollably rendering the model unable to learn." I'm not completely sure here, but probably it is a left-over from another figure? Seemingly, it doesn't fit the context of Fig.7.
Good catch. Yes, it is a carry-over mistake. Thanks for pointing it out.
* Minor typos etc. I spotted during reading (optional to be adapted):
  * p.1: singular/plural: "Most existing algorithms are built on fully supervised learning rules (e.g. FORCE [2]), which limits its potential" - should probably be "their potential"
  * p.5: "The mean deviation , for" - there's a whitespace in front of the comma
  * p.5: "[...] for both the original scripts (0.168±0.038; n=11)and" - there's a whitespace missing after the ")"
  * p.5: "Figure 1,2;" - there's a whitespace missing after the comma
  * p.12: "On the other hand, simulations of [...] seeds was able to" - should be plural: "[...] were able"
Fixed the minor typos.
* There are some overflow problems at the end of lines (optional to be adapted):
  * p.10: tables overflow onto the page's right margin and also some columns overflow (e.g. the "Task" column in Table 1)
  * p.12: "11; Modified Python re‐implementation: 0.140 ± 0.071, n=11)." overflows onto the page's right margin
  * p.16: "Python re-implementation" overflows onto the page's right margin
Fixed the overflow issues.
* Answers to previous discussion points:
I believe the "Original" butterfly plots are taken from the original publication. If so, this needs to be stated (and potential permission issues would have to be clarified with the holder of their copyrights).
- It's true. I will contact the authors and keep you posted. Meanwhile, I have added a footnote indicating these plots have been re-used from the original paper.
Thanks for adding the footnote. Just to make sure: Has there been an answer from the authors by now? (no action in the document required I guess)
We do have a response from the authors. They agree for both their code and their figures to be used with the appropriate license (included in the repository for the MATLAB scripts) and citation (included in the manuscript). I've filled in a request form (as per https://direct.mit.edu/journals/pages/rights-permissions) and am awaiting their response.
For the plots from the original article there seems to be some vertical gray artifact line to the left of each plot (it vanishes and appears w.r.t. my reader's zoom factor).
- I was unable to reproduce this. I have re-uploaded new figures, in any case. Could you let me know if the issue persists?
Seemingly the issue still persists for me, but I cannot tell why. Here is a screenshot from Fig. 3, b) at zoom factor 100% using Adobe Acrobat Reader on Windows 10 (although the issue is there for some of the other figures as well): link to the screenshot. Still, if this is just an issue on my end, then there's no action required here.
I remain unable to reproduce this issue. However, I use a different OS and pdf viewer. If it doesn't disrupt the understandability of the document, I reckon we let it be. I'm open to solutions if someone else is able to reproduce this error.
Remarks w.r.t. the implementation
In order to be able to run your code I needed to (would be nice to have those adapted):
* comment out the usage of `J_mat` in Python adaptation/ModelFORCE.py:101,116, as the file `var.mat` doesn't exist
* the requirement `json==2.0.9` throws an error (at least on Ubuntu, https://stackoverflow.com/questions/41466431/pip-install-json-fails-on-ubuntu) and `pip install -r requirements.txt` breaks, but everything is still functional if it is commented out in the respective `requirements.txt` files
* the requirement `pandas` is missing in the respective `requirements.txt` files
The corresponding changes have been made.
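For illustration, the resulting `requirements.txt` might look roughly like the following; the package list and pinned versions are assumptions, not the repository's actual file (the standard-library `json` entry is dropped and `pandas` is added):

```text
numpy==1.21.0
scipy==1.7.0
matplotlib==3.4.2
pandas==1.3.0
```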
I believe the "Original" butterfly plots are taken from the original publication. If so, this needs to be stated (and potential permission issues would have to be clarified with the holder of their copyrights).
- It's true. I will contact the authors and keep you posted. Meanwhile, I have added a footnote indicating these plots have been re-used from the original paper.
Thanks for adding the footnote. Just to make sure: Has there been an answer from the authors by now? (no action in the document required I guess)
We do have a response from the authors. They agree for both their code and their figures to be used with the appropriate license (included in the repository for the MATLAB scripts) and citation (included in the manuscript). I've filled in a request form (as per https://direct.mit.edu/journals/pages/rights-permissions) and am awaiting their response.
Just a quick update regarding this. The journal has granted nonexclusive permission to reprint the figures and equations from “A Reservoir Computing Model of Reward Modulated Motor Learning and Automaticity” in this article, provided that the original Neural Computation article is cited appropriately.
@rsankar9 thanks for the very last updates on the paper. Based on the positive feedback provided by the reviewers, @rgutzen @schmidDan, the paper is accepted for publication in ReScience C. I will now start the tedious editing process, and I will soon ask you to recompile your paper with updated data.
@rgutzen @schmidDan : thanks for the time and expertise invested in this review process!
@rsankar9 : I updated the metadata.yaml file, I made a Pull Request on your repository for that. I need you to recompile your pdf with the following data: updated editor & reviewer information, and dates of acceptance and publication. After that I will finalize publication.
@benoit-girard Thank you. I have merged your PR. The PDF has the updated information. Feel free to let me know if there's anything else to be done.
Nothing more to be done on your side, I just have to solve my identification problem with github, and then it should be over...
Ok, I see that the process has been finished (by @rougier ?), and that the paper is officially published and referenced on the journal website. Congratulations!
Original article: A Reservoir Computing Model of Reward-Modulated Motor Learning and Automaticity
PDF URL: https://github.com/rsankar9/Reimplementation-SUPERTREX/blob/main/ReScience_submission/article.pdf Metadata URL: https://github.com/rsankar9/Reimplementation-SUPERTREX/blob/main/ReScience_submission/metadata.yaml Code URL: https://github.com/rsankar9/Reimplementation-SUPERTREX/tree/main/Code
Scientific domain: Computational Neuroscience Programming language: Python, MATLAB Suggested editor: