fastai / nbdev

Create delightful software with Jupyter Notebooks
https://nbdev.fast.ai/
Apache License 2.0

Use Cookie Cutter Data Science as a default template for nbdev_new? #344

Closed aronchick closed 3 years ago

aronchick commented 3 years ago

Have you looked at Cookie Cutter Data Science? It's a quite popular open source project (https://drivendata.github.io/cookiecutter-data-science/) that just has a standard repo structure. I wonder, if it aligns, if you could use that as the default structure for nbdev_new.

That would allow github actions built against either to use similar repo assumptions - and may join forces with their community.

Just an open conversation :)

hamelsmu commented 3 years ago

Hi @aronchick

I noted the cookie-cutter directory structure below. At first this seems compatible with nbdev out of the box, where you just change settings.ini, but you have to make some additional modifications to make sure nbdev and this project play nicely with each other. For example:

Once you have taken care of these things, or are at least aware of them, you must make the following changes in settings.ini:

nbs_path = notebooks 
lib_path = src
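
As a rough illustration of the two settings above, here is a minimal sketch that rewrites them with Python's standard `configparser`. It assumes the keys live under a `[DEFAULT]` section; the sample file contents and the `use_ccds_paths` helper are hypothetical, not part of nbdev.

```python
import configparser
import io

# Hypothetical, trimmed-down settings.ini; a real nbdev file has many more keys.
SETTINGS = """\
[DEFAULT]
lib_name = foobaz
nbs_path = .
lib_path = foobaz
"""

def use_ccds_paths(ini_text: str) -> str:
    """Return ini_text with nbs_path and lib_path pointed at the CCDS folders."""
    cfg = configparser.ConfigParser()
    cfg.read_string(ini_text)
    cfg["DEFAULT"]["nbs_path"] = "notebooks"  # notebooks live in notebooks/
    cfg["DEFAULT"]["lib_path"] = "src"        # generated modules go to src/
    out = io.StringIO()
    cfg.write(out)
    return out.getvalue()

updated = use_ccds_paths(SETTINGS)
print(updated)
```

Note that `configparser` drops comments when it writes a file back, so editing the two lines by hand may be preferable for a heavily commented settings.ini.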

@aronchick are you asking for some automation to do this type of thing automatically? @jph00 is this something that you want to support in nbdev?

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- Make this project pip installable with `pip install -e`
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io

jph00 commented 3 years ago

Cookie Cutter uses lots of tools, such as tox and Sphinx, to cover functionality that nbdev already provides out of the box. It doesn't make sense AFAICT to support both. I think it would be simpler for people to just use nbdev, which handles everything needed to get started.

aronchick commented 3 years ago

My proposal is less about using the tools to install a repo via CCDS, but a tacit agreement to use the same structure. Many (most?) folks won't be using nbdev to start, so they may set up their repo in an arbitrary way. I'm not saying CCDS is "the way" but it's certainly a popular alternative layout.

If you BOTH agreed on a similar format, then we could say "hey everyone, let's use the format for a data science repo, both nbdev and CCDS agree that it's the best." (or similar language)

I'm not suggesting replacing one line of functionality of nbdev_new, just the layout of the repo once installed. And the CCDS folks are also more than open to collaborating and aligning.

I'd propose re-opening this.

jph00 commented 3 years ago

Can you be more specific about what folders you suggest creating, for what purposes? Maybe some examples?

igorbrigadir commented 3 years ago

My proposal is less about using the tools to install a repo via CCDS, but a tacit agreement to use the same structure. Many (most?) folks won't be using nbdev to start, so they may set up their repo in an arbitrary way. I'm not saying CCDS is "the way" but it's certainly a popular alternative layout.

I like your idea, but given the extra bits @hamelsmu added, I think a cookiecutter nbdev setup would be more useful as a separate repository, like https://github.com/fastai/nbdev_template (maybe this is what you had in mind all along?). In that case no changes are needed to nbdev itself; you can maintain a cookiecutter / GitHub template repo with those extra adjustments.

aronchick commented 3 years ago

Sure - basically everything in the structure above. By having a well defined docs/model/src/etc folder that is the default (not required by any stretch!), TWO very large projects share a common repo layout as the default. That starts to move mountains IMO - for example, you could imagine an nbdev action which looked for (by default) all notebooks in a folder named /notebooks, or autobuilt docs from any directory named docs. It would do this because that's the standard place to look for notebooks in both default repo layouts.

Just a series of things like that.
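
To make the "look in /notebooks by default" idea concrete, here is a minimal sketch in Python. The `find_notebooks` helper and its `folder` parameter are hypothetical; nothing like this is claimed to exist in nbdev or in any GitHub Action.

```python
from pathlib import Path
import tempfile

def find_notebooks(repo_root, folder="notebooks"):
    """Return sorted .ipynb paths under repo_root/folder, or [] if it is absent."""
    nb_dir = Path(repo_root) / folder
    if not nb_dir.is_dir():
        return []
    # Skip Jupyter's autosave copies kept in .ipynb_checkpoints directories.
    return sorted(p for p in nb_dir.rglob("*.ipynb")
                  if ".ipynb_checkpoints" not in p.parts)

# Demo on a throwaway repo with one notebook inside notebooks/ and one outside.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "notebooks").mkdir()
    (root / "notebooks" / "00_core.ipynb").write_text("{}")
    (root / "index.ipynb").write_text("{}")  # in the repo root, so not picked up
    found = [p.name for p in find_notebooks(root)]
    print(found)
```

A tool built this way only needs the folder name as a shared convention; a repo that keeps notebooks elsewhere could pass a different `folder`.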

hamelsmu commented 3 years ago

After thinking about this very carefully (and giving extra deference to @aronchick because he is my friend and someone I respect), I still struggle to see how a marriage of CCDS and nbdev would be mutually beneficial. I understand the motivation for (and really appreciate) the idea of boosting the visibility of nbdev by giving it a hook into a popular repository format. That is very thoughtful, and I can relate to the excitement of bringing different communities/projects together (which David is really good at doing, by the way!).

However, the main issue is that nbdev isn't specific to data science at all! We have been using it for a while to build software like API clients, CLI tools, web utilities for networking, and all kinds of other things. The authors of nbdev certainly view nbdev as a very general software development framework, and this is how I view it, too. The presence of Jupyter Notebooks and the association with fastai sometimes (understandably) give an impression that this is a software development framework specifically for the data science community. But it's really just a tool for writing, distributing and documenting software, not aimed at data scientists specifically.

I would even venture to say that many types of ad-hoc exploratory data science work are not a good fit for nbdev initially, as nbdev is geared towards writing software, whereas data scientists may not need packaging or documentation for many of the activities they do, and that is where CCDS would make more sense.

Because nbdev and CCDS are orthogonal in their core purposes (literate programming vs. data science), it feels unnatural to try to force a merger. It would put a cognitive load on maintainers to try to figure out what to do with the conflicting parts (like docs), and how to unify certain goals when the goals are quite different.

I hope that helps, happy to continue the conversation, just thought I would clarify some commonly held misconceptions (to the point where I am considering if some of what I said above should be in the documentation re: not just for data science).

aronchick commented 3 years ago

LOL @hamelsmu, I'm just a guy!

Ok, I hear where you are all coming from, but imagine the alternative world. I don't THINK you all care specifically about the layout of a repo, you just chose one that is logical to you all.

E.g. people DO need a repo for /docs, /notebooks, etc etc. There's nothing specific to data science (of course) to these concepts.

I guess my take is that EVERYONE needs some structure and, today, most data scientists spin up whatever makes sense to them, creating folders when it suits them. Which is, generally, FINE. But I think they generally DON'T have opinions either - as evidenced by the fact that they're using nbdev_template as their repo. They just want something logically laid out - that's it.

This now goes to @igorbrigadir's point - let's say there was an international "standard" for how a repo was laid out. I don't think nbdev or CCDS would have any reason to vary from that - they could offer it as an option, but the default would be the same layout. That's kind of what I'm trying to do from the bottom up. So, imagine the following conversation:

1: Hi, @jph00, you build notebooks and repos ALL THE TIME, do you have some recommendations for how I'd lay it out?
2: Yep, we think you should definitely have a dir for your docs
1: Cool, what should I name it and where should I put it?
2: Anywhere you like, but I use /docs
1: K, what's next?
2: Yep, we think you should definitely have a dir for your source.
1: K, what should I name that?
2: Anything you want, I like the directory name /src
(goes on for some time)
1: Woah, that's a lot of dirs, do you have anything that would lay that out for me?
2: Yep, just use nbdev_new and you're golden.
1: This is a really nice layout! Anyone else use this?
2: Yep! This is the standard layout agreed to when you publish papers (by Papers with Code), by Cookie Cutter Data Science, etc etc.

I truly don't think there are any conflicting goals around the layout of the repo. What you do AFTER that, I get it. And I also get that if you're using nbdev for things that AREN'T data science, folders like model may not make sense. But I think most do? I'm not sure what to do about the fact that some dirs will be quite useless - we'll have to figure something out there.

And, to be clear, I'm not thinking about merging the projects at all - sorry if it came off that way! I only thought you could both 'agree' (for some definition of 'agree') on the default layout of repos.

hamelsmu commented 3 years ago

I like how you explained that structure is helpful for people especially those who are beginning something in order to organize their thoughts. That makes sense to me and I like how you champion the cause of those users!

Nbdev doesn’t mandate a particular directory structure; we often change the directory structure from the one that is there by default. For example, by default there is no notebooks folder! The only folder that exists immediately is a docs folder. Other folders are added by users as they progress or at their own discretion.

I’m afraid nbdev doesn’t have much of an opinion of a directory structure at all. If you pressed me to identify where there might be agreement:

However I would say that the docs and module folders are just artifacts of the limited infrastructure we have today. In an ideal world you wouldn’t need all these separate artifacts if the tools were better (you would only need a special IDE to view, navigate and render one thing). I’m just giving this background to share why we don’t have a strong opinion on directory structure.

So I guess there is nothing to agree on, re: directory structure. However, there are lots of substantive things to disagree on that are quite fundamental, like:

And docs and tests are a central concern of nbdev, so it’s hard to pretend that elephant doesn’t exist. If someone faithfully adheres to CCDS we would be asking them to rewrite their docs and tests (which could be in many cases the bulk of the work of a programmer)!

BTW I know that I can DM you to talk through this but I’m somewhat hoping this thoughtful exchange might be useful for others


aronchick commented 3 years ago

And docs and tests are a central concern of nbdev, so it’s hard to pretend that elephant doesn’t exist. If someone faithfully adheres to CCDS we would be asking them to rewrite their docs and tests (which could be in many cases the bulk of the work of a programmer)!

I guess this would be my ideal. Since you don't care about directory structure, but DO care about these things, what if you (or I) went to CCDS and said "hey, CCDS, we're happy to snap to your directory structure when someone does nbdev_init IFF you restructure in the following way?" That would be really compelling, I think!

jph00 commented 3 years ago

We don't have a directory structure because we only use notebooks in the repo root. The idea of a directory structure doesn't make sense for nbdev, since everything (tests, implementation, and docs) is done directly from notebooks. So there are no directories to structure.

aronchick commented 3 years ago

Hey Jeremy!

Sorry, I meant the directory structure provided by nbdev_new (https://nbdev.fast.ai/cli#nbdev_new). It lays down a repo structure that is seemingly compatible with nbdev defaults.


hamelsmu commented 3 years ago

I understand you would like to have a tool that converts CCDS to nbdev somehow (it doesn't matter that I don't understand why). It sounds like that may be important to your work. If that is the case, I would recommend making a cli tool that does the things you want outside nbdev, that will be the best way to unblock you.

We don't think nbdev is the right place to set or bless directory structures, only because this doesn't have anything to do with nbdev. nbdev_new is only one way to get started and is just a minimal set of files in a completely arbitrary format. Because of this, we don't want to add any kind of extra maintenance burden either cognitively or in practice that has to do with keeping any specific structure in mind.

If you want to discuss how you might make this CLI, feel free to ping me.

aronchick commented 3 years ago

I'm happy to ping - I'm just asking that, for JUST the nbdev_new tool, the layout you choose be compatible with CCDS. No conversion, no compatibility, no other commands need to support it. You have already chosen a structure; I'm just asking that it line up with theirs. That's all!

To be clear again, when I type: nbdev_new foobaz

It creates the following:

foobaz
├── 00_core.ipynb
├── CONTRIBUTING.md
├── LICENSE
├── MANIFEST.in
├── README.md
├── docker-compose.yml
├── docs
│   ├── Gemfile
│   ├── Gemfile.lock
│   ├── _data
│   │   ├── ...
│   ├── _includes
│   │   ├── ...
│   ├── _layouts
│   │   ├── ...
│   ├── css
│   │   ├── ...
│   │   ├── fonts
│   │   │   ├── ...
│   │   ├── ...
│   ├── feed.xml
│   ├── fonts
│   │   ├── ...
│   ├── images
│   │   ├── ...
│   ├── js
│   │   ├── ...
│   ├── licenses
│   │   ├── ...
│   ├── ...
├── index.ipynb
├── settings.ini
└── setup.py

Instead, I'd propose it installed the following (which is CCDS compatible) or something like it:

foobaz
├── CONTRIBUTING.md
├── LICENSE
├── MANIFEST.in
├── docker-compose.yml
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│   ├── Gemfile
│   ├── Gemfile.lock
│   ├── _data
│   │   ├── ...
│   ├── _includes
│   │   ├── ...
│   ├── _layouts
│   │   ├── ...
│   ├── css
│   │   ├── ...
│   │   ├── fonts
│   │   │   ├── ...
│   │   ├── ...
│   ├── feed.xml
│   ├── fonts
│   │   ├── ...
│   ├── images
│   │   ├── ...
│   ├── js
│   │   ├── ...
│   ├── licenses
│   │   ├── ...
│   ├── ...
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│   └── 00_core.ipynb
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- Make this project pip installable with `pip install -e`
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
├── index.ipynb
├── settings.ini
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io
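
For anyone who wants to try this layout without any change to nbdev_new, here is a minimal sketch that creates the directories from the tree above. The `scaffold` helper is hypothetical and is not what nbdev_new does; the folder names are copied from the proposed layout, and the `.gitkeep` placeholders are an added assumption.

```python
from pathlib import Path
import tempfile

# Folder names copied from the proposed layout above.
CCDS_DIRS = [
    "data/external", "data/interim", "data/processed", "data/raw",
    "docs", "models", "notebooks", "references", "reports/figures",
    "src/data", "src/features", "src/models", "src/visualization",
]

def scaffold(root):
    """Create the layout's directories under root and return the created paths."""
    root = Path(root)
    created = []
    for rel in CCDS_DIRS:
        d = root / rel
        d.mkdir(parents=True, exist_ok=True)
        # Git does not track empty directories, so drop a placeholder file.
        (d / ".gitkeep").touch()
        created.append(d)
    # Makes src an importable Python package, as in the tree above.
    (root / "src" / "__init__.py").touch()
    return created

# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    made = scaffold(Path(tmp) / "foobaz")
    n_dirs = len(made)
    has_init = (Path(tmp) / "foobaz" / "src" / "__init__.py").exists()
    print(n_dirs, has_init)
```

Running it after nbdev_new would still leave the notebooks to be moved into notebooks/ and settings.ini to be updated by hand.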

I'm a total outsider to this project - if this feels weird or wrong, I'm sorry.