cboettig opened this issue 3 years ago
Hi Dr. Boettiger, Thank you for taking the time to go through our research and leave these constructive comments! Reproducibility is important to us, so I will make the time to address your comments as best I can and will report back. Best, Tom
Thanks! And sorry for the long list; it's mostly things I could figure out anyway, but clarifying them could help others. The main place I'm stuck is just how to load the best_model saved by the dl_train script and run it on some example time series data. I couldn't see where that step happens.
@ThomasMBury Great to see your paper out in PNAS today! :tada: Congrats again on a really nice analysis.
Also, any updates on this thread?
Thank you @cboettig! I'm still working on making this repo as intuitive as possible for viewers to run on their own systems. So far I've made edits to the code in /training_data, and I've collected the code from my collaborator who trained the deep learning model, including the pkl files that contain the trained classifiers; it's in the repo under dl_train/. Combining the code to work with the same directory names etc. and writing up the workflow is something I haven't got around to yet, but it's high on my list of things to do (busy start to term!).
One other question regarding intermediate files. I uploaded intermediate files so viewers can run certain scripts, such as the ROC computations from the predictions made by the DL classifier and the EWS, without having to run the longer scripts that actually generate the training data and the DL model. Is this appropriate, or should intermediate files be removed altogether?
Cheers, Tom
Hi @ThomasMBury, thanks for the update! Looking forward to trying it out; in brief, this all sounds very good. A few replies:
In principle the cluster details can be abstracted away (e.g. in the #rstats world, drake and now targets do a reasonable job of handling execution over slurm or other queues, as well as other parallelizations). But I'm not too worried about this part (it worked out just fine for me after a few tweaks, at least). The slurm script should help document more precisely what you ran, and having the simulated training data on Zenodo means that many users won't have to actually run it.
Having the pkl files is great. Definitely looking forward to trying the trained classifier out on some other timeseries. If you're really serious about this line of work, I think it wouldn't be a bad idea to consider a little python packaging infrastructure so we can just install the trained classifier from PyPI (or at least from GitHub) as a standard python module and call it on a given timeseries (see the sketch after these replies for the kind of interface I mean), but of course all software engineering takes time. Definitely looking forward to seeing the workflow code, though; I think that will be a big help, since that's where I got stuck on my first pass.
Personally I'm all for including intermediate files! I think best practice in the field is to include these, together with some make-like workflow setup so that the end products can be quickly re-made from the stored intermediate products, while the ambitious user still has the option to run something like make clean to purge these intermediate objects and then see if they can reproduce things from scratch. (Again, there's various prior art on this and some literature, but I know it mostly on the R side. I'm going to take the liberty of tagging @wlandau, who is an expert in these things.)
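To make the packaging idea above concrete, here is a purely hypothetical sketch of the kind of interface a pip-installable wrapper around the trained classifier might expose (the package name ewsnet, the load_pretrained function, and the model identifier are all made up):

```python
# Hypothetical interface only; none of these names exist in the repo or on PyPI.
import numpy as np
from ewsnet import load_pretrained  # made-up package name

clf = load_pretrained("pnas-2021")          # made-up identifier for the bundled weights

residuals = np.loadtxt("my_residuals.csv")  # detrended time series, one value per line
probs = clf.predict(residuals)              # probabilities for each bifurcation class + null
print(probs)
```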
Cheers, Carl
Thanks, Carl. Anything specific I can help with?
Yes, intermediate files help break down the workflow into manageable pieces. targets does this by abstracting each target as an R object and saving it in a data store to be retrieved as an R object later. The intermediate file is there, but to the user it behaves like an R variable. IIRC there are Python pipeline toolkits that do this as well. Maybe Prefect? Definitely Metaflow (although both are more Airflow-like than Make-like).
https://github.com/pditommaso/awesome-pipeline has a bunch of other pipeline tools, mostly language-agnostic or Python-focused. Trying to remember the other one I saw with file/object abstraction, but it's not coming to me.
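For what it's worth, a minimal Metaflow sketch of that idea (the flow, its artifacts, and the placeholder computations are all made up; in a real flow they would be the simulation and training steps):

```python
from metaflow import FlowSpec, step


class EWSFlow(FlowSpec):
    """Toy flow: anything assigned to self is persisted as an artifact,
    so intermediate results behave like variables rather than loose files."""

    @step
    def start(self):
        # Placeholder for generating the training library.
        self.training_data = list(range(10))
        self.next(self.train)

    @step
    def train(self):
        # Placeholder for fitting the classifier; the 'model' artifact can be
        # inspected later without re-running the flow.
        self.model = sum(self.training_data)
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    EWSFlow()
```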
Thanks @wlandau I'll check those out.
Hi @cboettig, I've spent some time improving the workflow for training the DL classifiers (DL_training.py) and applying them to other time series data (DL_apply.py). They are also now connected via relative file paths. Let me know if you have any other comments/issues with running the code. Thanks again for your feedback.
Thanks for the heads up! We'll give it a spin.
Hi Dr. Bury,
I stumbled across this GitHub repository and found the analysis and results you present most exciting! Thanks for sharing this excellent research in your public repository! While all in all I found it rather nicely organized and documented, I still struggled to follow a few parts, so I thought I'd pass along some quick notes on potential stumbling blocks. As this may be a work in progress anyway, perhaps these notes can help you further improve the reproducibility of your analysis for other readers who will no doubt be interested in following in your footsteps. Apologies if I've misunderstood or misconstrued anything in the notes below; I'd definitely welcome your clarification on these issues!
Workflow
It would be nice to more clearly document the overall workflow of the project, from generating the training data and training the classifier, through generating the evaluation data, to the final evaluation and figure creation (a sketch of the kind of top-level driver I mean follows at the end of this section).
In particular, the repository sometimes includes 'intermediate' data objects, like the data used to generate the figures, that would be re-generated in a 'from scratch' reproduction, while at other times it provides no way to access other core 'intermediate' objects, like the training data library (see below). (For instance, I don't see any reference to where the best_model created by dl_train/DL_training.py is used in any other script -- presumably that trained model is used in generating the various data/ml_preds in each of the test_models and test_empirical sub-dirs, but I couldn't find where that happens...)
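As a purely illustrative sketch of the kind of top-level driver I have in mind (the script paths are my guesses from the current repo layout, and the final step is the one I could not locate):

```python
# Hypothetical driver script; paths are guesses based on the current layout.
import subprocess

steps = [
    ["bash", "training_data/run_job.sh"],    # 1. simulate the training library
    ["python", "dl_train/DL_training.py"],   # 2. train the classifier -> best_model
    # 3. apply the trained classifier to the test_models / test_empirical data
    #    (the step I could not locate in the current scripts)
]

for cmd in steps:
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```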
Training data
Consider archiving the training data. While the code provided can potentially replicate the training data, this takes some time and may not be fully reproducible. After attempting to reproduce this, a zip of the entire output directory generated for 500,000 time series weighed in at 5.6 GB; not tiny, but not in a range that would be difficult to archive freely on a public repository (e.g. the CERN-backed database Zenodo will archive files of up to 50 GB with a permanent DOI). (I also ran into a few errors around rb vs r file modes in some of the text file parsers -- perhaps encoding issues due to Windows/Unix differences?)
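In case it helps, this is the kind of change I mean; my assumption (not a confirmed diagnosis) is that opening these files in text mode with universal newlines avoids the Windows/Unix mismatch, and the path below is illustrative:

```python
# Read an auto-07p style text output file in text mode with universal newlines,
# so files written on Windows ("\r\n") and Unix ("\n") parse identically.
path = "training_data/output/b.out"  # illustrative path

rows = []
with open(path, "r", encoding="utf-8", newline=None) as f:  # rather than open(path, "rb")
    for line in f:
        rows.append(line.split())
```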
DL Training script
The training script refers to a hard-coded absolute path to an archive that doesn't exist. It would make more sense to align the scripts into a clean workflow, e.g. either have the training script refer directly to the relative path training_data/output as generated by the run_job.sh script there, or have run_job.sh generate the zip archive at the relative path assumed in dl_train/DL_training.py.
It's also a bit unclear how batches are to be dealt with. I believe the intent is to read in all files regardless of which batch they were generated in. For groups.csv and out_labels.csv, this probably means the files from each batch need to be stacked in the same order each time. The code doesn't handle this and presumably shows only single-batch logic? (See the sketch just below for one way this could work.)
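For instance, something along these lines would keep the label and group files aligned across batches; the batch directory pattern is my guess at the intended layout, and this is a sketch rather than your code:

```python
import glob
import os

import pandas as pd

# Sort the batch directories so groups.csv and out_labels.csv are always
# stacked in the same order, keeping rows in lock-step across batches.
batch_dirs = sorted(glob.glob("training_data/output/batch*"))  # assumed layout

groups = pd.concat(
    (pd.read_csv(os.path.join(d, "groups.csv")) for d in batch_dirs),
    ignore_index=True,
)
labels = pd.concat(
    (pd.read_csv(os.path.join(d, "out_labels.csv")) for d in batch_dirs),
    ignore_index=True,
)

assert len(groups) == len(labels)  # every series keeps its label
```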
Perhaps more importantly, it looks like the training script provided does not use the hyperparameters reported in the paper. For instance, the code appears to use hyperparameters corresponding to 1 CNN layer and no LSTM layers, while the reported architecture actually used for the results, if I've followed it correctly, was 1 CNN layer and 2 LSTM layers?
Lastly, though I stuck with installing dependencies from your requirements.txt, I could not get tensorflow to recognize the val_acc monitor or metric, and could get the code to execute only when that was commented out (see the commit diffs in my fork).
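One possible explanation, which is my guess rather than something I've confirmed against your setup: recent TensorFlow/Keras releases log the accuracy metric as val_accuracy rather than val_acc, so a callback keyed on the old name never finds it. The newer spelling would look like:

```python
from tensorflow import keras

# Assumption: the training script uses callbacks keyed on the accuracy metric.
# In recent Keras versions the logged name is 'val_accuracy', not 'val_acc'.
checkpoint = keras.callbacks.ModelCheckpoint(
    "best_model",             # illustrative output path
    monitor="val_accuracy",   # 'val_acc' in older Keras versions
    save_best_only=True,
)
early_stop = keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=5)
```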
Running the trained model
This is perhaps the most important bit and also where I'm currently stuck, which may just be my unfamiliarity with certain aspects of the toolset.
One of the most obvious applications of this work would be to run the trained agent on other data. I think the archive should include a copy of the trained agent (the best_model_1_1.pkl folder) created by your dl_train scripts. Ideally, the repository would also include a python script which takes data in the required format and uses the trained agent to report the classification probabilities -- exactly how this step is done is still entirely unclear to me, though presumably it is involved in generating the data in the various ml_preds directories. Being able to reproduce/evaluate the performance of the trained classifier on arbitrary timeseries residuals is at least as important as generating the training data or training the classifier, but the provided code seems more opaque to me on this point.
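For concreteness, here is roughly the script I had hoped to find, under two assumptions that may well be wrong: that best_model_1_1.pkl is actually a Keras SavedModel directory despite the .pkl suffix, and that the classifier expects a fixed-length, normalised residual series shaped (batch, timesteps, channels):

```python
import numpy as np
from tensorflow import keras

# Assumption: best_model_1_1.pkl is a Keras SavedModel directory.
model = keras.models.load_model("dl_train/best_model_1_1.pkl")

# Assumption: one residual value per line; the normalisation here is
# illustrative and may not match the preprocessing used for the paper.
resids = np.loadtxt("my_residuals.csv")
resids = (resids - resids.mean()) / resids.std()
x = resids.reshape(1, -1, 1)      # (batch, timesteps, channels)

probs = model.predict(x)          # classification probabilities for this series
print(probs)
```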
More Minor Issues
Several python packages used by the scripts were not listed in requirements.txt (and were not automatically pulled in as upstream dependencies when running pip install -r requirements.txt): scikit-learn, pandas, matplotlib, plotly, and kaleido.
These probably don't depend much on the exact versions anyway, but they would still be good to have. Better yet would be a more comprehensive requirements.txt listing the versions of all packages used by a fresh virtualenv that regenerates these results.
Additionally, it would be relatively straightforward to include an installation script for auto-07p; I have added one to my fork.