[JOSS Review] Notes from @ha0ye

This includes all my notes for the unchecked boxes for my JOSS review, as well as additional suggestions for improvements.

Documentation

[ ] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?

(see comments on statement of need in "Software paper" below)

[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).

An example using the denmark included dataset is provided in the README. It is a bit long, but primarily shows the main functions of the package.

I don't know how useful the Ease of use section is at the bottom of the README. If it's an important draw for the package, then it should be higher up. Otherwise, it's not likely to be convincing after someone has already read through the example. You could consider expanding on both the example and the Ease of use section in a vignette. You could then provide advice in the README and perhaps .onAttach() for users to refer to the vignette for a detailed example.

[ ] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?

Yes, in the "Breaking down the ARDL package" section of the software paper. This should be moved somewhere else, possibly a separate vignette describing the model objects of the package. I appreciate the authors documenting this, but this is specifically noted as not fitting the guidelines for the software paper.

[ ] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?

The package is missing tests. Since there are only a few functions, a bit of dedicated effort to convert some of the examples into testthat tests would be helpful and not too much work, I think. See https://r-pkgs.org/tests.html for details.

[ ] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

No, but this is a relatively easy fix:

add a BugReports field to DESCRIPTION, with link to https://github.com/Natsiopoulos/ARDL/issues
add a CONTRIBUTING file to the github repo, rOpenSci has a template at https://github.com/ropensci/dotgithubfiles/blob/master/dotgithub/CONTRIBUTING.md

Software paper

[ ] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?

It is not clear with ARDL or ECM are, though it is implies that this package is implementing these approaches. Lines 13-17 of the Statement of Need could be effective at as a summary, if an additional sentence were added that lists the functionality of the package, along with references for these methods.

[ ] A statement of need: Does the paper have a section titled 'Statement of Need' that clearly states what problems the software is designed to solve and who the target audience is?

Currently, the statement of need repeats much the same information as the summary, and could be clearer about the problem being solved by the software and the audience. Here are suggestions on how to improve it:

Explain what kind of flexibility is encapsulated in ARDL and ECM. (the summary mentions they are very flexible, which is fine and good for modeling approaches, but what kind of situation warrants this flexibility?)
The software is designed to simplify the construction of these models (i.e. the user need only specify the order for the ARDL model). This should be emphasized as improving the accessibility of these modeling approaches for e.g. students and users who apply time series modeling, but are perhaps not experts in software, the mathematics of time series modeling, etc.
The link to the manual and recent example do not really fit here. Instead, link to a vignette with examples, and move the Qiu et al. to the summary.
[x] State of the field: Do the authors describe how this software compares to other commonly-used packages?

Mostly yes, but I think this would be improved by merging the content of the "Ease of use" section of the README here, as well as the "Conclusion" section of the software paper.

[ ] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?

The language is acceptable, though some expansion of the modeling approaches for the time series novice would be appreciated.

[ ] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

Citations should be added that expand on the time series methods.

Additional Notes

I recommend adding JOSS-paper to .Rbuildignore,
You have citation information in inst/CITATION already, so it'd be ideal to reduce duplication in the .onAttach() function. Something like format(citation("ARDL"), bibtex = FALSE) will generate much the same text.
Consider adding a CITATION.cff file to the github root (and add this to .Rbuildignore, too) - https://citation-file-format.github.io/
Consider creating a pkgdown documentation site. I ran it on my local clone of the repo without issue, and this will provide a nicer set of docs to link users to, e.g. from the README.
I didn't poke deeply into the code, but unlist(lapply(1:nrow(top_orders), FUN = function(z) sum(order2_back == top_orders[z,-ncol(top_orders)]))) is a bit awkward, and you might be able to get away with the built-in matrix::colSums() function or by using apply() over rows.

Hi @ha0ye Thank you for the thorough review. Here are our replies about how we treated each of your suggestions.

Functionality documentation:

Ha0ye: "Yes, in the "Breaking down the ARDL package" section of the software paper." "I appreciate the authors documenting this, but this is specifically noted as not fitting the guidelines for the software paper."

Answer: We removed the detailed descriptions of the equations to align with the guidelines for the software paper. We kept only the equations for ARDL and UECM as these are the building blocks of the package and along with the bounds tests they from the core functionality of the software.

Ha0ye: "This should be moved somewhere else, possibly a separate vignette describing the model objects of the package."

Answer: Thank you for the suggestion. This information is already included in the manual.

Automated tests:

Answer: Thank you for suggesting using testthat, it is indeed something that we plan to do. However, there are already excessive tests for every function, (most of them included in the internal function parsers as this is the workhorse behind every other function). These are not in an automated format though, but rather manual tests.

As manual tests are permitted, can this be considered as done?

Community guidelines:

Answer: Thank you for the suggestions and the useful links. BugReports and CONTRIBUTING have been added.

Summary:

Ha0ye: "It is not clear with ARDL or ECM are, though it is implies that this package is implementing these approaches."

Answer: As these are widely known models in undergraduate level of various fields of studies (e.g. economics, statistics, environmental studies etc), we don't think that this should be overspecified. Also, ARDL is self defined by its name and we believe that a technical definition does not fit the purpose of this summary, as this is supposed to describe the "high-level functionality and purpose of the software for a diverse, non-specialist audience". Although, we added a reference (Pesaran and Shin, 1999) which is the paper that introduces the ARDL and ECM as platforms for the cointegration test in the paper of Pesaran et al, (2001). Please let us know if you think otherwise.

Ha0ye: "...if an additional sentence were added that lists the functionality of the package, along with references for these methods"

Answer: Following your suggestion, we added the first sentence of the section "Statement of need" in the summary. This describes the functionality of the package. The reference of the cointegration test is already mentioned (Pesaran et al, 2001). If you refer to the ARDL and ECM models, these are not usually cited as they are too common and developed gradually through time by many researchers.

Statement of need:

Ha0ye: "Explain what kind of flexibility is encapsulated in ARDL and ECM. (the summary mentions they are very flexible, which is fine and good for modeling approaches, but what kind of situation warrants this flexibility?)"

Answer: We don't understand what you mean by "what kind of flexibility". Flexibility is simply the capability of a model to fit complex relationships (e.g. not linear/straight lines). This can be achieved because of the autoregressive (AR) and the distributed lag (DL, essentially AR terms of the independent variables) terms in the ARDL model. So, should we consider this point as done?

Ha0ye: "The software is designed to simplify the construction of these models (i.e. the user need only specify the order for the ARDL model). This should be emphasized as improving the accessibility of these modeling approaches for e.g. students and users who apply time series modeling, but are perhaps not experts in software, the mathematics of time series modeling, etc."

Answer: We found this advise very useful and we updated the "Statement of need" accordingly.

Ha0ye: "The link to the manual and recent example do not really fit here. Instead, link to a vignette with examples, and move the Qiu et al. to the summary."

Answer: The link to the manual was used to point to where the detailed explanation of every function can be found, as this is one of the main points in the statement of need. Sure, this kind of information could also appear in a dedicated vignette instead of the manual, but vignettes are not necessary parts of R packages nor of JOSS submissions. About the reference to Qiu et al. (2021), this is where the guidlines for authors and the example paper of JOSS suggests it should be. Should we consider this point as ok?

State of the field:

Ha0ye: "Mostly yes, but I think this would be improved by merging the content of the "Ease of use" section of the README here, as well as the "Conclusion" section of the software paper."

Answer: Thank you for the suggestion. We merged the information of the "Ease of use" and added some more information regarding this section.

Quality of writing:

Ha0ye: "The language is acceptable, though some expansion of the modeling approaches for the time series novice would be appreciated."

Answer: We believe that describing the ARDL and the ECM equations mathematically is sufficient for understanding these models as they are very commonly used even in undergraduate level. More details were given before, but following your suggestions, for the violation of the software paper's guidelines, we deleted the detailed descriptions and rewrote that part. Since we are a little bit confused about what you mean here, could you elaborate?

References:

Ha0ye: "Citations should be added that expand on the time series methods."

Answer: As we mentioned above about the "Summary", citations about the ARDL and ECM models would be very hard to be specified as they are very commonly used methods and their origins do not appear on specific papers. However, we added Pesaran and Shin, (1999) which is the paper that connects these common models with the paper of focus (Pesaran et al., 2001).

Additional Notes:

Answer: Thank you for these suggestions! They are indeed very useful and we fixed some of them. pkgdown is in our next plans and for sure we will take a look at the CITATION.cff.

The edits look good! I followed up on a few points below:

Automated tests:

Answer: Thank you for suggesting using testthat, it is indeed something that we plan to do. However, there are already excessive tests for every function, (most of them included in the internal function parsers as this is the workhorse behind every other function). These are not in an automated format though, but rather manual tests.

As manual tests are permitted, can this be considered as done?

Looking at the parsers function, you do seem to have extensive checking of inputs. However, the software testing that is expected is more like a correctness check. Usually of the form, given input X, the software produces result Y. This gives the user confidence that if they use the software, and the authors continue to evolve or improve on it, that they still have reproducible results with later versions. As mentioned in the review criteria, the tests do not have to be automated, but you would need to specify some kind of examples of the inputs and expected outputs.

Summary:

Ha0ye: "It is not clear with ARDL or ECM are, though it is implies that this package is implementing these approaches."

Answer: As these are widely known models in undergraduate level of various fields of studies (e.g. economics, statistics, environmental studies etc), we don't think that this should be overspecified. Also, ARDL is self defined by its name and we believe that a technical definition does not fit the purpose of this summary, as this is supposed to describe the "high-level functionality and purpose of the software for a diverse, non-specialist audience". Although, we added a reference (Pesaran and Shin, 1999) which is the paper that introduces the ARDL and ECM as platforms for the cointegration test in the paper of Pesaran et al, (2001). Please let us know if you think otherwise.

The review criteria expect that the summary be described "for a diverse, non-specialist audience." Maybe just add a phrase that these are time series models common in economics, statistics, environmental studies, etc. This way a reader who is unfamiliar with it can be sure that the software does not meet their needs, or has a bit of direction on how to search to find more background information.

Statement of need:

Ha0ye: "Explain what kind of flexibility is encapsulated in ARDL and ECM. (the summary mentions they are very flexible, which is fine and good for modeling approaches, but what kind of situation warrants this flexibility?)"

Answer: We don't understand what you mean by "what kind of flexibility". Flexibility is simply the capability of a model to fit complex relationships (e.g. not linear/straight lines). This can be achieved because of the autoregressive (AR) and the distributed lag (DL, essentially AR terms of the independent variables) terms in the ARDL model. So, should we consider this point as done?

The description that you gave here can be included in the statement of need. My question hinges on the fact that models are described as "flexible" but with different meanings, such as robust to missing data, noise, probability distribution, etc. A clarification is helpful! (Again, emphasizing a bit more explanation for individuals not familiar with these models specifically.)

The review criteria expect that the summary be described "for a diverse, non-specialist audience." Maybe just add a phrase that these are time series models common in economics, statistics, environmental studies, etc. This way a reader who is unfamiliar with it can be sure that the software does not meet their needs, or has a bit of direction on how to search to find more background information.

The description that you gave here can be included in the statement of need. My question hinges on the fact that models are described as "flexible" but with different meanings, such as robust to missing data, noise, probability distribution, etc. A clarification is helpful! (Again, emphasizing a bit more explanation for individuals not familiar with these models specifically.)

We treated these issues adding the missing information.

Looking at the parsers function, you do seem to have extensive checking of inputs. However, the software testing that is expected is more like a correctness check. Usually of the form, given input X, the software produces result Y. This gives the user confidence that if they use the software, and the authors continue to evolve or improve on it, that they still have reproducible results with later versions. As mentioned in the review criteria, the tests do not have to be automated, but you would need to specify some kind of examples of the inputs and expected outputs.

Assuming that all the other issues are ok now, we will start working on building the tests and we will come back when they are ready.

Hi @ha0ye, We created all the automated tests needed but there seems to be a problem that we can't fix yet. This is found when we run devtools::test() and devtools::check(). It is about the externally defined function AIC_pss() which we use in test-auto_ardl.R. Do you have any idea how we can fix it?

@Natsiopoulos I'm not sure. There's this - https://stackoverflow.com/questions/35941729/r-testthat-unit-test-data-and-helper-function-conventions

But I've always defined any additional variables or functions at the top of the individual test script that needs it.

@ha0ye - It seems that placing the function in the test files or placing the helper-*.R file in testthat folder is not an option anymore. It has to be as a non-exported function in the R folder, that's ok. Now, what's the next step? Should I set a proper version number (curently it is 0.2.0.9000 indicating the development state)?

Seems to work now. Maybe consider a release if JOSS approves the resubmission.

You could also look into add code coverage, to see where additional tests might be useful - https://github.com/r-lib/covr (100% coverage isn't needed, but you can verify that the important parts of the code are being tested).

I checked the package coverage with covr and also connected codecov with the repo. What's the next step now?

Natsiopoulos / ARDL