ACDguide / BigData

Working with big/challenging data collections
https://ACDguide.github.io/BigData

Reorganise sections? #69

Open dougiesquire opened 1 year ago

dougiesquire commented 1 year ago

I've been doing a full read-through of the book to try and decide where a new section on "recommendations when using conda" might best fit. I'm wondering if some reorganisation of sections might help users and contributors. I've had a stab at an example draft outline below to start some discussion.

Please note that I'm not wedded to this proposal at all, but I thought I should mention to the team that as the current structure evolves, I'm starting to find it difficult to know where to look for specific things.


Overview - as is, but add table-of-contents providing details of what each section aims to do.

Introduction - new section pulling info from a few existing sections and providing context for what's to come. Very high-level concepts like "there are lots of computation and storage resources in Australia", "for Big Data, data storage and compute must be considered together", "for Big Data it's best to utilise compute close to the data"... Include some of the list of platforms here (https://acdguide.github.io/BigData/platforms/platforms-intro.html), but save the "analysis environments" (e.g. OOD, ARE) for the Computation section below

Data storage - overview taken from https://acdguide.github.io/BigData/data_storage.html#about-large-scale-data

Computation - generic overview taken from https://acdguide.github.io/BigData/computations.html#general-tools, https://acdguide.github.io/BigData/computations.html#command-line-tools and https://acdguide.github.io/BigData/computations.html#other-languages-matlab-r-etc

Resources - taken from https://acdguide.github.io/BigData/resources.html#resources


I think this includes all the existing sections/information (and adds a few). I'm not sure about the division into "Data storage" and "Computation" sections because they're so heavily related. I'm interested to hear what people think (@paigem, @hot007, @paolap). Please do feel free to tell me that this is not needed.

P.S. There are also a number of "hints" and "asides" scattered throughout the book. I wonder if putting these in boxes would help with clarity (e.g. https://acdguide.github.io/BigData/accessing_data.html#working-with-authorised-catalogues)?

dougiesquire commented 1 year ago

Just noticed that some of this may already have been addressed in #65

paolap commented 1 year ago

We should probably merge that; we were waiting for Paige's review. Then we can use it as a starting point for further changes. The structure can definitely be improved and, as you said, as we add content some of the initial structure might not make sense anymore.

paolap commented 1 year ago

Now that I'm reading this more carefully, the formatting changes were addressing inconsistencies at the notebook level, not this kind of restructuring. I like the idea of restructuring, and I'm also finding it difficult to separate some of the new sections from the data storage and computation parts. I'd like to have a go at a draft roughly following Dougie's suggestions in a separate branch so we can share it at the meeting on Thursday; otherwise it might be tricky to visualise this.

paolap commented 1 year ago

After starting the process I'm wondering if we should have 3 broad categories

Data structure including:

Analysis/computations including:

Platforms/tools: basically anything that defines a working environment, i.e. the platforms available, the data collections (intake, pangeo, etc.), the packages and pre-defined software environments. Including:

Resources section to close, as it is, but making sure it includes all the materials we are listing elsewhere; there's already a section for example workflows in there.

As I said before, I'm trying to get an example of this ready before Thursday, as it might help us visualise the final product and make it easier to move sections around.

dougiesquire commented 1 year ago

Thanks @paolap. Yes, it could be more extensible to have the 3 categories you suggest. One thing I notice with your new proposal is that some highly-related sections are now quite separated (e.g. analysis tasks and software environments). So we'd want to make sure we link things clearly in the text. Also it's not clear to me what constitutes a "tool". For example, the section on dask sits in "Analysis/computations" in your proposal, not in "Platforms/tools". Maybe we can come up with some clear definitions to help future contributors?

The big difference I see between my and your suggestions is that mine includes a "platforms/tools" subsection within each of the "data" and "computation" sections, whereas yours breaks this out into a new section. I see pros and cons to both. Perhaps we could see what others think in our upcoming meeting?

paolap commented 1 year ago

Yes, more clarity would be great. It is probably the terms I'm using that are making the two approaches look more different than they are.

Also, my approach is constantly shifting the more I try to fit what we've got so far into some sort of structure. So what I'm currently trialling is a bit different from what I've written, which stemmed from an attempt to apply your suggestions :-)

The "tools" section as it currently stands is meant as a list of useful software that can be linked (mostly in glossary form) from other parts of the book. So while there are some comparisons between software falling in the same category, there are no actual examples of how to use any of them. For example, in my approach, dask would have an introduction in the "tools" section (as it has currently) but also a page with "dask in practice" examples and tips in the analysis/computation section.

Similarly, chunking appears in data format as an introduction to the topic, but then it will be expanded/demonstrated in the computations (as Scott has basically already done in his notebook) and in the dask examples section.

It will take a while to find a good structure. I'm aware that the changes I'm trying to get together in a separate branch might not work and could end up being a waste of time, but I'm finding it really hard to think of a different structure without actually moving files around, or even sections of text from one file to another.

dougiesquire commented 1 year ago

Great - thanks for having a stab at something. I think that's a great way to start and we can iterate from there if we want to

paigem commented 1 year ago

I really like the ideas here for reorganizing this book! Thank you @dougiesquire and @paolap for pushing these ideas forward!

I think it will be easier for me (and others) to give feedback if we can see the updates that @paolap is making in the book, so I'll hold off on comments until then. Thanks for getting a working example of this going @paolap!

Thomas-Moore-Creative commented 1 year ago

Thanks for pushing your restructured branch @paolap > https://github.com/ACDguide/BigData/tree/restructure_paola The instructions and local build worked fine for me.

Can we recap what we think the next steps are? A further discussion of the proposed new structure here in this issue #69 ?

paolap commented 1 year ago

Steps from here are:

One thing we all seem to agree on is that we need clear overviews at the start of the book and at the start of each chapter, so potential users can come up to speed. So far we have identified 3 possible kinds of users:

It might be nice to show, where possible, two approaches for the examples we are showing: one maybe less efficient but simpler to adopt, and a more advanced example for experienced users.

Other comments on building a book and my branch that I sent via email: For help building the jupyter book locally:

https://github.com/pabloinsente/jupyter-book-tutorial and https://jupyterbook.org/en/stable/start/your-first-book.html

There are a few warnings that will pop up the first time you build the book; they can usually be ignored. On subsequent builds, warnings are repeated only for the files you actually modified. If you think the book isn't showing what you expect, then clear out previous builds, i.e. in the BigData/BigData folder:

rm -rf _build/
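A slightly more defensive version of the clean-up step above could look like the sketch below. The function name `clean_book_build` is hypothetical (it is not part of the repository); it only wraps the same `rm -rf _build/` and takes the book folder as an optional argument.

```shell
# Hypothetical helper (not part of the repository): remove stale build output
# for a Jupyter Book so the next build starts from scratch.
clean_book_build() {
    book_dir="${1:-.}"   # folder containing _build/, defaults to the current one
    if [ -d "${book_dir}/_build" ]; then
        rm -rf "${book_dir}/_build"
        echo "removed ${book_dir}/_build"
    else
        echo "nothing to clean in ${book_dir}"
    fi
}
```

Assuming jupyter-book is installed and you are in the top-level BigData folder, running `clean_book_build BigData` followed by `jupyter-book build BigData` should give a fresh build.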

Finally, I tried not to remove any content, just move it around, but I might have missed out something and there are a few things I added:

Thomas-Moore-Creative commented 1 year ago

@paolap - FYI just subbed a trivial PR for typos, largely as a test of my GitNoob skills PRing from a patched non-main branch in a fork.

dougiesquire commented 1 year ago

Just noting that I am planning to have a stab at a reorganised structure early next week (probably building off @paolap's branch) - sorry for the delay in getting to this!

dougiesquire commented 1 year ago

I started going through and trying to reorganise according to my initial comment in this issue and now I feel your pain @paolap! I'm not sure it's sensible for us to all try and reorganise the book, as this is very fiddly/time-consuming and it will be very difficult to consolidate our attempts. Instead, perhaps it's more feasible for us to all go through the current main structure and take notes on what we like/don't like, what could be added/clarified/moved, etc. Then we can compare notes in the next meeting, see where we overlap, and try to come up with a structure that best addresses everyone's comments. I'd be happy to implement whatever we arrive at after the meeting.

I think the structure has improved substantially since opening this issue thanks to @paolap's effort. My notes are:

I think this restructure exercise will be most effective/easiest if we can get multiple people comparing what works or doesn’t for them. @hot007, @paigem, @Thomas-Moore-Creative, @AliciaTak, might you guys have time to make your own notes prior to our next meeting on Sept 8th (your notes might simply be "I like it exactly as it is", which would be great).

hot007 commented 1 year ago

Okay, I'm going to have a crack too. I don't think I'm familiar enough with all the content to have strong preferences, but based on a read-through of what we've got, here are my thoughts. I am not sure we need a major restructure, though I'm happy to be led by the majority; I tend not to be able to see massive alternatives for the trees in front of me :)

paolap commented 1 year ago

Just a note on this: "Data chunking again, this is best done as a concepts page - needs fleshing out and spell checking but that's okay. I don't think we should do the stdev/min/max/%ile etc stuff here, I don't think it adds anything - I would remove "common tasks" onward." This page and the time one were generated as copies of the computations one. Some of the content was left there as an example of how to format in the same way as the original notebook; none of the content so far is relevant.

hot007 commented 1 year ago

Hah, that explains a lot! Please ignore me then :D

paigem commented 1 year ago

@dougiesquire your (and everyone who's contributed) restructure looks great!! Excited to discuss it more at our meeting today.

dougiesquire commented 1 year ago

Hi all. @paolap and I had a play about on Miro as we discussed in our last meeting. It looks like it could be a handy tool for visualising the book structure and planning any reorganisation.

I've set up the key levels of the current book structure as a "sitemap". I think this works quite well as we can add notes, tags and assignees to each level. I've also had a first stab at a reorganised structure.

If you want to check them out before our next meeting, let me or @paolap know and we can email you the Miro link. Otherwise, hopefully we can use Miro to collectively arrive at a good structure in our next meeting!