Reorganise sections?

dougiesquire commented 2 years ago

I've been doing a full read-through of the book to try and decide where a new section on "recommendations when using conda" might best fit. I'm wondering if some reorganisation of sections might help users and contributors. I've had a stab at an example draft outline below to start some discussion.

Please note that I'm not wedded to this proposal at all, but I thought I should mention to the team that as the current structure evolves, I'm starting to find it difficult to know where to look for specific things.

Overview - as is, but add table-of-contents providing details of what each section aims to do.

Introduction - new section pulling info from a few existing sections and providing context for what's to come. Very high-level concepts like "there are lots of computation and storage resources in Australia", "for Big Data, data storage and compute must be considered together", "for Big Data it's best to utilise compute close to the data"... Include some of the list of platforms here (https://acdguide.github.io/BigData/platforms/platforms-intro.html), but save the "analysis environments" (e.g. OOD, ARE) for the Computation section below.

Data storage - overview taken from https://acdguide.github.io/BigData/data_storage.html#about-large-scale-data

Data standards - new subsection giving high level overview of CF conventions (and others?) and pointing to other resources
Chunking - taken from https://acdguide.github.io/BigData/chunking.html
Data formats - includes subsubsections on netcdf (https://acdguide.github.io/BigData/data_storage.html#netcdf, https://acdguide.github.io/BigData/format_metadata.html, https://acdguide.github.io/BigData/data_storage.html#netcdf-zarr), zarr (https://acdguide.github.io/BigData/data_storage.html#zarr), etc
Analysis-ready data - taken from https://acdguide.github.io/BigData/data_storage.html#advice-on-writing-datasets-for-efficient-use, also subsubsection on Pangeo Forge from https://acdguide.github.io/BigData/data_storage.html#pangeo-forge-an-open-source-framework-for-extraction-transformation-and-loading-of-scientific-data
Tools for streamlining data access - taken from https://acdguide.github.io/BigData/accessing_data.html#methods-of-accessing-data-and-metadata

Computation - generic overview taken from https://acdguide.github.io/BigData/computations.html#general-tools, https://acdguide.github.io/BigData/computations.html#command-line-tools and https://acdguide.github.io/BigData/computations.html#other-languages-matlab-r-etc

Analysis environments - New section including tools like conda, Jupyter and platforms like ARE, OOD (taken from https://acdguide.github.io/BigData/platforms/platforms-nci-ood.html), EasyHub?...
Chunking - "Some tools provide explicitly chunked data models that enable you to distribute your computations across multiple chunks in parallel". Different from the Data storage/Chunking section in that it provides a computational perspective. Show Scott's great examples using dask (https://acdguide.github.io/BigData/computations.html#common-tasks)
Tools for efficient computation - taken from https://acdguide.github.io/BigData/tools/intro.html#available-tools-and-their-best-use
Examples - New section for including example notebooks/scripts??
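
To illustrate the computational view of chunking mentioned in the Chunking item above, here is a minimal sketch using only the Python standard library. It is not dask and not one of Scott's examples; the function names (`split_into_chunks`, `chunk_stats`, `chunked_mean`) are hypothetical and chosen only for illustration. The idea is that each chunk is reduced independently and the partial results are then combined, which is the pattern that chunk-aware tools automate.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, chunk_size):
    """Split a sequence into consecutive chunks of at most chunk_size items."""
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def chunk_stats(chunk):
    """Reduce one chunk to a partial result: (sum, count)."""
    return sum(chunk), len(chunk)


def chunked_mean(values, chunk_size):
    """Compute the mean by reducing each chunk separately, then combining."""
    chunks = split_into_chunks(values, chunk_size)
    # Each chunk is handed to a worker; the chunks are independent of each other.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(chunk_stats, chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


data = list(range(1, 101))
print(chunked_mean(data, chunk_size=25))
```

Threads give little real speed-up for pure-Python arithmetic like this; the sketch is only meant to show the split, reduce and combine structure of a chunked computation.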

Resources - taken from https://acdguide.github.io/BigData/resources.html#resources

I think this includes all the existing sections/information (and adds a few). I'm not sure about the division into "Data storage" and "Computation" sections because they're so heavily related. I'm interested to hear what people think (@paigem, @hot007, @paolap). Please do feel free to tell me that this is not needed.

P.S. There are also a number of "hints" and "asides" scattered throughout the book. I wonder if putting these in boxes would help with clarity (e.g. https://acdguide.github.io/BigData/accessing_data.html#working-with-authorised-catalogues)?

dougiesquire commented 2 years ago

Just noticed that some of this may already have been addressed in #65

paolap commented 2 years ago

We should probably merge that; we were waiting for Paige's review. Then we can use it as a starting point for further changes. The structure can definitely be improved and, as you said, as we add content some of the initial structure might not make sense anymore.

paolap commented 2 years ago

Now that I'm reading this more carefully, formatting was addressing inconsistencies at notebook level, not a restructuring of this kind. I like the idea of restructuring, and I'm also finding it difficult to separate some of the new sections from the data storage and the computation parts. I'd like to have a go at a draft roughly following Dougie's suggestions in a separate branch so we can then share it at the meeting on Thursday; otherwise it might be tricky to visualise this.

paolap commented 2 years ago

After starting the process I'm wondering if we should have 3 broad categories

Data structure including:

data formats,
metadata, for this we should relate to the Governance book for conventions and general information and write here only about the implications of following/breaking the standards
chunking introduction,
anything else that describes the data and/or conversion between data formats

Analysis/computations including:

common analysis tasks (broadly speaking the computations notebook)
working in parallel (NB there's an entire training from Scott on this that we might be able to leverage)
dask (we mentioned it already but more advanced examples, including chunking aspects)
timeseries specific examples
Machine Learning / gpu ?? it would be empty now, but there's definitely interest and new projects going around
...

Platforms/tools, basically anything that defines a working environment: the platforms available, the intake, pangeo, etc. data collections, the packages and pre-defined software environments. Including:

Computing platform, I think it works as it is; I'm not convinced we need to separate OOD and ARE. We are discussing jupyter, jupyterlab etc in tools but these are specific examples of working environments
Analysis ready data (as in Dougie's comment)
tools for streamlining data access (as in Dougie's comment)
software environments (we could put the last addition from Dougie here; I currently placed it in tools, but possibly that and the Community section from tools (https://acdguide.github.io/BigData/tools/tools-python1.html#community) should be moved here)
tools, NB I added in my example a Julia section as it is gaining popularity

Resources section to close, as it is but making sure it includes all the materials we are listing elsewhere, there's already a section for example workflows in there.

As said before I'm trying to get an example of this before Thursday, as it might help us visualise the final product and make it easier moving sections around

dougiesquire commented 2 years ago

Thanks @paolap. Yes, it could be more extensible to have the 3 categories you suggest. One thing I notice with your new proposal is that some highly-related sections are now quite separated (e.g. analysis tasks and software environments). So we'd want to make sure we link things clearly in the text. Also it's not clear to me what constitutes a "tool". For example, the section on dask sits in "Analysis/computations" in your proposal, not in "Platforms/tools". Maybe we can come up with some clear definitions to help future contributors?

The big difference I see between my and your suggestions is that mine includes a "platforms/tools" subsection within each of the "data" and "computation" sections, whereas yours breaks this out into a new section. I see pros and cons to both. Perhaps we could see what others think in our upcoming meeting?

paolap commented 2 years ago

Yes more clarity would be great, it is probably the terms I'm using that are making the two approaches look more different than they are.

Also my approach is constantly shifting the more I try to fit what we got so far into some sort of structure. So what I'm currently trialling is a bit different from what I've written which stemmed from an attempt to apply your suggestions :-)

The "tools" section as it currently stands is meant as a list of useful software that can be linked (mostly they're in a glossary form) from other parts of the book. So while there are some comparisons between software falling in the same category, there are no actual examples of how to use any of them. For example, in my approach, dask would have an introduction in the "tools" section (as it has currently), but also a page with "dask in practice" examples and tips in the analysis/computation section.

Similarly, chunking appears in data format as an introduction to the topic, but then it will be expanded/demonstrated in the computations (as Scott basically has already done in his notebook) and in the dask examples section.

It will take a while to find a good structure. I'm aware that the changes I'm trying to get together in a separate branch might not work, and might end up being a waste of time, but I'm finding it really hard to think of a different structure without actually moving files around or even sections of text from one file to another.

dougiesquire commented 2 years ago

Great - thanks for having a stab at something. I think that's a great way to start and we can iterate from there if we want to

paigem commented 2 years ago

I really like the ideas here for reorganizing this book! Thank you @dougiesquire and @paolap for pushing these ideas forward!

I think it will be easier for me (and others) to give feedback if we can see the updates that @paolap is making in the book, so I'll hold off on comments until then. Thanks for getting a working example of this going @paolap!

Thomas-Moore-Creative commented 2 years ago

Thanks for pushing your restructured branch @paolap > https://github.com/ACDguide/BigData/tree/restructure_paola The instructions and local build worked fine for me.

Can we recap what we think the next steps are? A further discussion of the proposed new structure here in this issue #69 ?

paolap commented 2 years ago

Steps from here are:

try to come up with your own restructuring locally from the main branch
alternatively you can start from my branch restructure_paola
if you are short on time just add comments to either of them

Then we can discuss this again at the next meeting and hopefully come out with at least an initial restructure we all feel could work. As Dougie said we're not going to get it right.

One thing we all seem to agree on is that we need clear overviews at the start of the book and at the start of each chapter, so potential users can come up to speed. So far we have identified 3 possible kinds of users:

new to climate data analysis
experienced looking for specific advice
familiar with climate science analysis moving to a different system/language/institution etc.

It might be nice to show, where possible, two approaches for the examples we are showing: one that is maybe less efficient but simpler to adopt, and a more advanced example for experienced users.

Other comments on building the book and on my branch, which I sent via email:

For help building the Jupyter Book locally:

https://github.com/pabloinsente/jupyter-book-tutorial and https://jupyterbook.org/en/stable/start/your-first-book.html

There are a few warnings that will pop up the first time you build the book; they can usually be ignored. Subsequently, warnings are repeated only for the files you actually modified. If you think the book isn’t showing what you expect, then clear out previous builds, basically in the BigData/BigData folder:

rm -rf _build/
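
For anyone who prefers to do the same clean-up from Python (for example on a system without `rm`), a minimal sketch follows. The helper name `clear_previous_build` is hypothetical; it assumes the build output lives in a `_build` directory inside the book folder, as in the command above.

```python
import shutil
from pathlib import Path


def clear_previous_build(book_dir):
    """Delete book_dir/_build if it exists; return True if it was removed."""
    build_dir = Path(book_dir) / "_build"
    if build_dir.is_dir():
        shutil.rmtree(build_dir)
        return True
    return False


# Example usage from inside the book folder (uncomment to run):
# clear_previous_build(".")
```

After clearing, the book can be rebuilt as usual, e.g. with `jupyter-book build .` from the same folder.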

Finally, I tried not to remove any content, just move it around, but I might have missed out something and there are a few things I added:

Machine learning and dask skeleton notebooks
Content on compression in the data section
Julia to the newly renamed software part

Thomas-Moore-Creative commented 2 years ago

@paolap - FYI just subbed a trivial PR for typos, largely as a test of my GitNoob skills PRing from a patched non-main branch in a fork.

dougiesquire commented 2 years ago

Just noting that I am planning to have a stab at a reorganised structure early next week (probably building off @paolap's branch) - sorry for the delay in getting to this!

dougiesquire commented 2 years ago

I started going through and trying to reorganise according to my initial comment in this issue and now I feel your pain @paolap! I'm not sure it's sensible for us to all try and reorganise the book, as this is very fiddly/time-consuming and it will be very difficult to consolidate our attempts. Instead, perhaps it's more feasible for us to all go through the current main structure and take notes on what we like/don't like, what could be added/clarified/moved, etc. Then we can compare notes in the next meeting, see where we overlap, and try to come up with a structure that best addresses everyone's comments. I'd be happy to implement whatever we arrive at after the meeting.

I think the structure has improved substantially since opening this issue thanks to @paolap's effort. My notes are:

Overview - remove "Overview" from navbar, add some more introductory content around who the target audience is
Computing Platforms - add paragraph to intro recommending utilising compute close to the data, rather than downloading data
Computing Platforms - Add details about storage options to all relevant platforms (as is done already for NCI)
Methods of accessing data and metadata - Some of this is really written as a generic description of tools at the moment. Could we split out the stuff that is relevant to users wanting to access data on the computing platforms (i.e. the NCI Intake Catalogue atm) and move the rest to Available tools?
Metadata - I’m not sure this heading is necessary. Could we move Data formats, variables and metadata into Data Storage?
Computations - I think this could benefit from some reorganising. My opinion is that the concept of chunking needs to be introduced earlier on. Currently we show lots of in-depth computations that use chunking before we talk about what chunking is. Something like the structure I propose in the Computation section of my original post could work?
Resources - I think the expandable headings are unnecessary.

I think this restructure exercise will be most effective/easiest if we can get multiple people comparing what works or doesn’t for them. @hot007, @paigem, @Thomas-Moore-Creative, @AliciaTak, might you guys have time to make your own notes prior to our next meeting on Sept 8th? (Your notes might simply be "I like it exactly as it is", which would be great.)

hot007 commented 2 years ago

Okay, I'm going to have a crack too. I don't think I'm familiar enough with all the content to have strong preferences, but based on a read-through of what we've got, here are my thoughts. I am not sure we need a major restructure, though I'm happy to be led by the majority; I tend not to be able to see massive alternatives for the trees in front of me :)

Agree with Dougie, Overview can go or rather be replaced with a paragraph on background of ACDGuide, and 'how to use this book'. In particular call out what is in each section (e.g. in particular that links to doco and examples are found under 'Resources')
Computing platforms: I like this as a standalone section. Agree about recommending taking compute to the data (with the caveat that it may not be viable to co-locate all data so take the compute to the gravitational centre of the data?!). Not sure if we can add much about storage - we can for CSIRO but don't want to get too specific given the audience of this is much wider. We can mention a bit more, but there's probably a happy medium, seeing as we can't readily populate much for the other options, whereas we'd expect all readers to be NCI users, so that's the most important one to get right/thorough.
- Index needs cleaning up
- Maybe we should expand the Pawsey section more? Maybe once we're near publication-ready we could send it by their helpdesk with a request that they update and PR that page.
Methods of accessing data and metadata: this doesn't go here. I'd be inclined to put it in the Metadata section. Although this page is not just about metadata, the data access is a result of the metadata enabling cataloguing so it makes sense to me to put them together. Move the middle bit that's NCI-specific to Computations area with other NCI-specific resources.
Data storage - this might be better renamed "data formats and netCDF advice". Should the Pangeo Forge information be here or with Computations?
Metadata: data formats, variables and metadata - this does not cover multiple formats, this is specific metadata and variable information for netCDF, rename page accordingly. Otherwise this is in the right place. Can be partnered with the cataloguing page mentioned above in a Metadata section, and maybe we add an additional page "metadata standards" specifically for the call out to the data guidelines book.
Computations with large datasets - there's a fair bit here...
- there's a typo in here, 'applys' should be 'applies' in one of the callout boxes
- The first few sections could be broken out and used as the intro to "Tools" perhaps.
- From "Common tasks" onward, keep this notebook intact, but rename "data analysis concepts". Perhaps it should come after the Tools section.
- I think it's okay that this comes before the deep dive into chunking etc, so long as there's clear links/index so that this page is clearly just an intro.
- I don't think it's necessary to provide these demos for every tool, I think this is a neat concepts page and the key is really about the block mapping, we are not here to provide training in how to do every operation in every tool, I think it's fine for this section to be concepts with one or two languages demonstrated. Something about picking the right tool for the job, most languages can do most things and teaching the nuance of when to use CDO or NCO is... not simple?
Data chunking again, this is best done as a concepts page - needs fleshing out and spell checking but that's okay. I don't think we should do the stdev/min/max/%ile etc stuff here, I don't think it adds anything - I would remove "common tasks" onward.
Dealing with the time axis - this should definitely come after tools are introduced. This is a fairly bare section at the moment and I don't know what the intention is, but I'd probably talk generally about how netCDF uses a reference time, calendar vagaries, and probably a mention in the python context of Iris which I find to be particularly recalcitrant on the calendar front but also similarly great for climatologies etc as xarray.
Tools
- we don't mention Ferret except in passing in the intro?
- I still think that this intro (and therefore section) is fine as is but would sit well before Computations instead of after, even though in general you'd introduce concepts before specifics
- Objective of this section - we don't provide example code here, just describe libraries to help raise awareness of what might be useful. For examples, see Resources
- Python - this page needs an index at the top to help navigability. Alternatively, add another level in the page hierarchy and have sub pages for IDEs, package mgmt etc. The callout re not re-installing conda is probably worth elevating to/repeating at top level if you do restructure in this way. At the least, need a tip to use RHS navigation bar.
- at the end of the list of miscellaneous packages (or at the start of the python pages?), add a line saying 'if your favourite python package isn't listed here, raise an issue' with appropriate link.
- Data handling in python - rename "Python data handling" so it's more obvious in the LHS index that there's two python pages
- maybe add xarray-datatree to the bottom set (or does that go with cmip6-preprocessing/xmip/esmvaltool on the Python page?)
Resources - I like how clean this page looks with the drop downs - maybe that would be a way to tidy up the python pages? It'd be nice if we could keep an opening sentence/paragraph under each to preview what's going to be in there, if we were to roll it out more broadly across the book?
- where to get help - I think this page is worth keeping even if it gets changed.
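
As a small illustration of the reference-time and calendar point raised under "Dealing with the time axis" above, here is a hedged sketch in plain Python. It assumes a time axis stored as whole days since 2000-01-01 and compares the standard (Gregorian) calendar with a simplified 365-day "noleap" calendar. The function names are hypothetical, and real workflows would normally rely on a library that already handles these calendars rather than hand-rolled code like this.

```python
from datetime import datetime, timedelta

# Month lengths in a 365-day ("noleap") calendar: February always has 28 days.
NOLEAP_MONTH_LENGTHS = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]


def decode_standard(days, reference):
    """Decode 'days since reference' using Python's Gregorian datetime."""
    return reference + timedelta(days=days)


def decode_noleap(days, ref_year):
    """Decode whole 'days since ref_year-01-01' in a 365-day calendar.

    Returns a (year, month, day) tuple, since datetime cannot represent
    calendars other than the Gregorian one.
    """
    year = ref_year + days // 365
    day_of_year = days % 365
    for month, length in enumerate(NOLEAP_MONTH_LENGTHS, start=1):
        if day_of_year < length:
            return (year, month, day_of_year + 1)
        day_of_year -= length


reference = datetime(2000, 1, 1)
print(decode_standard(59, reference).date())  # falls on 29 February in a leap year
print(decode_noleap(59, 2000))                # the same offset lands on 1 March
```

The same stored value maps to different dates depending on the calendar attribute, which is why the calendar needs to be read alongside the reference time.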

paolap commented 2 years ago

Just a note on this: "Data chunking again, this is best done as a concepts page - needs fleshing out and spell checking but that's okay. I don't think we should do the stdev/min/max/%ile etc stuff here, I don't think it adds anything - I would remove "common tasks" onward." This page and the time one were generated as copies of the computations one; some of the content was left there as an example of how to format in the same way as the original notebook, and none of the content so far is relevant.

hot007 commented 2 years ago

Hah, that explains a lot! Please ignore me then :D

paigem commented 2 years ago

@dougiesquire your (and everyone who's contributed) restructure looks great!! Excited to discuss it more at our meeting today.

dougiesquire commented 2 years ago

Hi all. @paolap and I had a play about on Miro as we discussed in our last meeting. It looks like it could be a handy tool for visualising the book structure and planning any reorganisation.

I've set up key levels of the current book structure as a "sitemap". I think this works quite well as we can add notes, tags and assignees to each level. I've also had a first stab at a reorganised structure.

If you want to check them out before our next meeting, let me or @paolap know and we can email you the Miro link. Otherwise, hopefully we can use Miro to collectively arrive at a good structure in our next meeting!
