Closed by theogf 4 years ago
Hi @theogf,
In reply to the paper review: https://github.com/JuliaCon/proceedings-review/issues/51
Thanks for your nice comments and suggestions.
[ ] Regarding `fit!` and `transform!`: the type hierarchy looks like this:

```julia
MachineLearner <: TSLearner <: Transformer
Filter <: Transformer
Pipeline <: Transformer
```

`fit!` in a machine learner trains the parameters for regression or classification, so it requires both input and output. `fit!` in a filter, on the other hand, processes only the input because a filter does not learn any mapping between input and output; it is typically a straightforward computation of some statistics, such as normalization stats, a range, or PCA/ICA parameters for space embedding.

`transform!` in both the machine learner and the filter applies these computed or learned parameters to the new dataset. In many types of filter, `fit!` merely checks for errors or initializes some variables, and all the important operations happen during `transform!`. In a machine learner, the most critical operations happen during `fit!`, which learns the parameters of the input->output mapping, while `transform!` is just the application of those parameters to new data.
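Sketched as a minimal, hypothetical example (the names `ZNormalizer` and `MeanLearner` are illustrative, not TSML types), the contract above looks like this:

```julia
using Statistics

abstract type Transformer end

# A filter: fit! uses only the input to compute some statistics.
mutable struct ZNormalizer <: Transformer
    mean::Float64
    std::Float64
    ZNormalizer() = new(0.0, 1.0)
end

function fit!(f::ZNormalizer, input::Vector{Float64})
    f.mean = mean(input)   # fit! only gathers normalization stats
    f.std = std(input)
    return f
end

# transform! applies the computed stats to new data.
transform!(f::ZNormalizer, input::Vector{Float64}) =
    (input .- f.mean) ./ f.std

# A machine learner: fit! needs both input and output to learn a mapping.
mutable struct MeanLearner <: Transformer
    prediction::Float64
    MeanLearner() = new(0.0)
end

function fit!(m::MeanLearner, input::Vector{Float64}, output::Vector{Float64})
    m.prediction = mean(output)   # trivially "learn" the input->output mapping
    return m
end

transform!(m::MeanLearner, input::Vector{Float64}) =
    fill(m.prediction, length(input))
```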
[ ] Regarding individual results: No Free Lunch Theorem. The examples demonstrate that you can trivially parallelize model selection and parameter optimization using the TSML pipeline, thanks to the parallelism support built into Julia. The same parallelism may require much longer and more complex code if implemented in other languages.
[ ] Regarding filter names
[ ] Regarding `Dict` for the collection of parameters in a type/struct
[ ] Regarding inline documentation
[ ] Regarding the option to drop missing values
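The parallel model selection mentioned above can be sketched with Julia's `Distributed` standard library; the candidate names and the scoring function below are stand-ins, not TSML's actual API:

```julia
using Distributed
addprocs(2)   # spawn two worker processes

# Stand-in for fitting and cross-validating one candidate model;
# a real pipeline would call fit!/transform! here.
@everywhere function evaluate(modelname::String)
    score = length(modelname) * 1.0
    return (modelname, score)
end

candidates = ["RandomForest", "PrunedTree", "Adaboost"]
results = pmap(evaluate, candidates)   # each candidate scored on its own worker
best = argmax(last, results)           # keep the highest-scoring model
println(best)
```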
@theogf, @christianpeel, @matbesancon: The just-released TSML 2.3.9 now has inline documentation for the types and the most important functions. It includes inline examples too.
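Inline documentation of the kind described above is done with Julia docstrings, which is what makes `?SomeType` work in the REPL; a hypothetical example (`MovingAverager` is an illustrative name, not an actual TSML type):

```julia
# A triple-quoted string placed immediately before a definition becomes
# its docstring, shown by `?MovingAverager` in the REPL.
"""
    MovingAverager(; width = 3)

Smooth a numeric vector with a moving average over a window of `width` points.
"""
struct MovingAverager
    width::Int
    MovingAverager(; width = 3) = new(width)
end
```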
If it's okay, I am going to gather all my comments for the review in https://github.com/JuliaCon/proceedings-review/issues/51 here:
First, it's a very nice package which is for sure extremely useful. I really like the pipeline composition style and how easy and flexible it is to construct. The output part is also well made and gives a clear understanding of the results. But I still have a few comments about the paper:
[x] `fit!` and `transform!` do not always make sense to me; I think it would be good to clarify what they really mean, and perhaps list the differences between the different types of "modules", i.e. whether it is a reader, a filter, a classifier, etc. It would also help people wanting to create new filters. You also never explain what the output/input of each module in the pipeline should be.

More on the API side/documentation:
[x] I also share @christianpeel's opinion on the naming of the filters; for an external person it is not so clear what they do at first sight, but as in #89 this is a minor comment.
[x] Connected to the last point, is there a reason you always pass a `Dict` to your modules? When using autocomplete tools, in Juno for instance, it is always more practical to see what arguments are possible, especially when one is not familiar with all the packages.
[x] Similarly, it would be nice to have inline documentation: for example, calling `? DataValizer` would give a direct explanation of the filter and the possible arguments.

PS: Is there any reason why you don't give the option to simply drop the missing values?
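The `Dict`-passing pattern discussed above can be contrasted with keyword arguments in a hypothetical sketch (`GenericFilter` and `KeywordFilter` are illustrative names, not TSML types):

```julia
# Dict-based arguments: flexible, but an autocomplete tool cannot show
# which options are valid for this particular module.
struct GenericFilter
    args::Dict{Symbol,Any}
end
GenericFilter(; kwargs...) = GenericFilter(Dict{Symbol,Any}(kwargs))

# Keyword-based arguments: the valid options and their defaults are
# visible directly in the method signature (and to autocomplete tools).
struct KeywordFilter
    dateinterval::String
    strategy::Symbol
end
KeywordFilter(; dateinterval = "1 hour", strategy = :median) =
    KeywordFilter(dateinterval, strategy)
```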