EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

Project Documentation Enhancement #32

Closed: rasbt closed this issue 8 years ago

rasbt commented 8 years ago

I was thinking that it may be worthwhile to set up a project documentation page beyond this GitHub repo -- for example, via Sphinx or MkDocs. This would have the advantage of letting us create & organize API documentation and tutorials/examples. I could set up something like http://rasbt.github.io/biopandas/ if you'd find it useful.

rhiever commented 8 years ago

What's the advantage over a standard README? How tough is it to maintain?

rasbt commented 8 years ago

Well, of course you can always put 'everything' into a README file as well, but depending on future additions, this README file can become huge and user-unfriendly. I'd say it's the same reason why people don't build websites as one large HTML/text file ... I think for larger projects, breaking the documentation down into logical sections (e.g., one document to list and describe version changes, one to document the API, and several for tutorials/examples) wouldn't hurt. A README file is still important, though; it should certainly contain the most important information about a project.

rhiever commented 8 years ago

Doesn't hurt to have the web page docs then. I don't think the project is large enough to merit that yet, but we will probably get there soon.

pronojitsaha commented 8 years ago

Hi @rhiever & @rasbt, I am quite interested in and motivated by the possibilities and potential impact of TPOT. If possible, I would like to contribute to it, and I think starting with the project documentation would be a good place, if you need the help. Looking forward to hearing from you guys.

Thanks.

rasbt commented 8 years ago

@rhiever Yes, I was also thinking more in terms of "in the long run." It would certainly help though to start early and document "as we go."

If we were to set up project documentation, we would probably want to use a static site generator like Sphinx, MkDocs, or Jekyll. I think it's typical for Python projects to use Sphinx. It's really a neat tool, but it's also a pretty complex beast, and personally, I find that the default themes are really clunky and ugly. I think MkDocs would work just fine, and I don't see any disadvantage of using Markdown over the reStructuredText format.

Once it's set up, it's actually pretty easy to maintain:

  1. make a change in the Markdown file(s)
  2. view the changes live via mkdocs serve
  3. build the HTML via mkdocs build --clean
  4. deploy the changes to GitHub Pages via mkdocs gh-deploy

That's basically it.
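
For anyone setting this up from scratch: all MkDocs really needs is a mkdocs.yml next to a folder of Markdown pages. The sketch below is only a placeholder -- the page names, file names, and theme are assumptions rather than TPOT's actual layout, and recent MkDocs versions use the nav: key where older ones used pages::

    # mkdocs.yml -- minimal sketch; page names, file names, and theme are assumptions
    site_name: TPOT
    theme: readthedocs
    nav:
      - Home: index.md
      - Installation: installation.md
      - Examples: MNIST_Example.md
      - Contributing: contributing.md

With something like that in place, mkdocs serve previews the site at http://127.0.0.1:8000/ and mkdocs gh-deploy pushes the built HTML to the gh-pages branch.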

rhiever commented 8 years ago

I would be happy for you two to take the helm on establishing the project docs. Once I get back on Monday, I'll be focusing on development again.

rasbt commented 8 years ago

@rhiever @pronojitsaha Alright, sounds like a plan. I offered to set up the MkDocs framework with the API generator and so on since I've done this for other projects already, but if @pronojitsaha wants to do it, that would be fine with me too. Just let me know so that we don't implement the same thing twice :).

pronojitsaha commented 8 years ago

@rhiever & @rasbt thanks.

@rasbt As you already have a similar framework in place, I don't believe it makes sense to reinvent the wheel! You can share the existing framework as a separate repository, and then we can decide on the structure and contribute to individual pages as mutually agreed. Does that work for you?

rasbt commented 8 years ago

@rhiever @pronojitsaha Sorry for the late response, I took a few days off over the long Thanksgiving weekend. Unfortunately, I am in the midst of wrapping up a few research projects before I go on vacation in a few days, so I probably won't get to it before January. But setting up a basic framework via Sphinx or MkDocs should be pretty straightforward, I guess. The gplearn library is actually a nice, lean example: https://gplearn.readthedocs.org/en/latest/examples.html

I would suggest using the README file as a template; I think the goal of the documentation would be to have an appealing layout with convenient navigation for finding relevant information. I think it will definitely pay off in the long run as the code base grows (regarding the API documentation), as well as the number of tutorials and examples.

Maybe I'd start with the following sections/pages

pronojitsaha commented 8 years ago

@rasbt Hope you had a good Thanksgiving. Ok, I will look into it and set up the initial framework using MkDocs, which we can then work on together once you are available in January. Enjoy the vacation!

rasbt commented 8 years ago

@pronojitsaha Just got home and read your message; I thought: setting up the template literally just takes 10 minutes, let's do this ;). See pull request #35

I basically just pasted the sections from the README file for now; you can see it live at http://rasbt.github.io/tpot/

(If you fetch or merge it, you can see it live locally by running mkdocs serve from the docs/source directory -- by default it's http://127.0.0.1:8000/.)

So, I guess I'll take a look at the API documentation in January then, but I wanted to set this up so that you guys can maybe write the rest of the documentation and come up with some more examples and tutorials in the meantime.

pronojitsaha commented 8 years ago

@rasbt Ok, great! Will delve into it further.

rhiever commented 8 years ago

Thanks for the great start on these docs. I've merged #35.

rhiever commented 8 years ago

@rasbt, I've been updating the docs for the new export functionality and it takes double the work to update both the README and the docs. Any recommendations to avoid this duplication of labor?

@pronojitsaha, now that we have docs up and running, I can think of a couple things that would be invaluable at this point:

1) Not all of the public TPOT functions are thoroughly documented. fit, score, and export in particular need more documentation, since those are the primary functions that people will be using. Currently we have a basic example of using them in the README, but it'd be great to expand on those docs and go into detail on what each function -- and each parameter of each function -- does.

2) More examples are always welcome! Currently we only have the MNIST example from sklearn, but it'd be great to provide code examples for many different types of data sets (a rough sketch of one possible example follows below).
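
As a starting point for both items, here is a rough sketch of what such a docs example might look like, modeled on the README snippet mentioned above. It is not taken verbatim from TPOT: the constructor arguments and the exact signatures of score() and export() are assumptions here and should be double-checked against the code before anything goes into the docs.

    # Rough sketch of a docs example (signatures are assumptions -- verify
    # against the actual TPOT code before publishing).
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from tpot import TPOT

    digits = load_digits()
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data, digits.target, train_size=0.75)

    tpot = TPOT(generations=5)               # assumed constructor parameter
    tpot.fit(X_train, y_train)               # evolve pipelines on the training split
    print(tpot.score(X_test, y_test))        # score the best pipeline on held-out data
    tpot.export('tpot_digits_pipeline.py')   # write the best pipeline to a standalone script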

rasbt commented 8 years ago

@rhiever I'd recommend not cramming too much into the README file but focusing on the 'essentials' like an overview, a quick example, installation, license info, and a short contributing section. I would then insert an 'important links' section at the top pointing to the actual documentation. Otherwise, I'd suggest just assembling README.md from the docs pages, e.g.,

    cat index.md installation.md contributing.md MNIST_Example.md ... > README.md

pronojitsaha commented 8 years ago

@rhiever Ok, I will look into the two points. I understand that at this point we have only implemented classification tasks, so for examples, the following are a few data sets I have in mind; please let me know your views:

  1. Iris Dataset
  2. Titanic Dataset
  3. Lending Club Data
  4. Facial Keypoint Detection
  5. Forest Cover Type Dataset

However, hardware is a challenge, as larger data sets will slow down TPOT considerably and increase the time involved. This also applies to #41 for unit testing. As such, have you thought about using EC2 instances for this project, or any other alternative to account for this?

UniqueFool commented 8 years ago

hardware is a challenge as increasing data set sizes will slow down TPOT considerably

FWIW, other Python-based GP projects tend to use OpenCL/PyOpenCL to make better use of dedicated CPU/GPU and FPGA resources. In fact, a number are even using CUDA (which is NVIDIA-specific).

rhiever commented 8 years ago

For now, I think we'll stick to smaller data sets (e.g., the sklearn MNIST subset) for the examples in the docs, i.e., examples that can be executed and show results in less than 10 minutes. I wouldn't want to require the user to fire up an EC2 instance or hop on an HPCC to run a basic TPOT example.

However, for some use cases it may take several hours to run TPOT -- especially with large data sets -- and I think it would be a good idea to note that in the docs. Perhaps in an "Expectations for TPOT" section of the docs?

UniqueFool commented 8 years ago

Note that OpenCL is just an abstraction mechanism, i.e., the underlying "kernels" (C-like code) will work on CPUs, GPUs, and FPGA hardware. Wrappers like pyopencl even hide the nitty-gritty details and expose all this flexibility to scripting space, which means that a Python script can implement heavy algorithms as "kernels" that will automatically make use of dedicated hardware if available. The only real issue is that OpenCL does not currently lend itself to clustering/distribution.
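
For anyone curious what that looks like in practice, below is a minimal PyOpenCL sketch -- not TPOT code, just an illustration of driving a C-like kernel from Python; it assumes pyopencl and a working OpenCL runtime are installed:

    import numpy as np
    import pyopencl as cl

    # two input vectors on the host
    a = np.random.rand(50000).astype(np.float32)
    b = np.random.rand(50000).astype(np.float32)

    ctx = cl.create_some_context()      # picks an available CPU/GPU device
    queue = cl.CommandQueue(ctx)

    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
    out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

    # the "kernel": C-like code compiled for whatever device the context found
    program = cl.Program(ctx, """
    __kernel void add(__global const float *a,
                      __global const float *b,
                      __global float *out) {
        int gid = get_global_id(0);
        out[gid] = a[gid] + b[gid];
    }
    """).build()

    program.add(queue, a.shape, None, a_buf, b_buf, out_buf)

    result = np.empty_like(a)
    cl.enqueue_copy(queue, result, out_buf)  # copy the result back to the host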

Since you mention MNIST: I suggest running a corresponding Google search; there are a number of examples where using the GPU instead of the CPU (via OpenCL/CUDA) provided a ~100x speedup on the MNIST dataset, e.g., see http://corpocrat.com/2014/11/09/running-a-neural-network-in-gpu/ (note that this is also using Python and scikit-learn).

http://www.cs.berkeley.edu/~demmel/cs267_Spr11/Lectures/CatanzaroIntroToGPUs.pdf

pronojitsaha commented 8 years ago

@rhiever Ok, I think it makes sense to work on small subsets now and focus more on the implementation of the examples. Will look into it.

bartleyn commented 8 years ago

Anyone working on documenting the pipeline operators and public functions? I've made some significant headway on it, but want to make sure I'm not duplicating labor.

pronojitsaha commented 8 years ago

Hi @bartleyn, I am not working on those at the moment.

rhiever commented 8 years ago

PR #71 is related and still in review (will get to it soon, promise -- I'm back from vacation now), but otherwise I believe that's the only pending change to the docs.