biocore / American-Gut

American Gut open-access data and IPython notebooks

WIP: Preprocessing revitalized #160

Closed wasade closed 8 years ago

wasade commented 8 years ago

First pass. At its core this takes a different approach to processing. The central idea is to not tie the notebooks to a particular compute resource, which greatly reduces the amount of code needed to manage compute. Instead, the notebooks assume they have full access to hammer on the machine they're executing on. That way we can rely on direct system calls (e.g., !print_qiime_config.py) and on QIIME's parallelism within a single host, which just forks multiple processes. On a compute cluster, the expectation is that an individual would request an entire node for a reasonable amount of walltime to do the work.
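A minimal sketch of what "direct system calls plus forking on a single host" could look like from plain Python, outside of the notebook `!` syntax. The helper names `run_command` and `run_all` are hypothetical and not part of this PR or of QIIME; QIIME's own parallel scripts handle their forking internally.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_command(argv):
    """Run one command on the local host and return (returncode, output)."""
    result = subprocess.run(argv,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            universal_newlines=True)
    return result.returncode, result.stdout


def run_all(commands, workers=None):
    """Run several commands concurrently, one child process per command.

    The threads only wait on the children; the actual work happens in the
    forked processes. Results come back in the same order as `commands`.
    """
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_command, commands))
```

For example, `run_all([["print_qiime_config.py"]])` would run the same check as the notebook cell, assuming QIIME is installed and on the PATH. No scheduler or cluster-specific code is involved, which is the point of requesting a whole node.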

wasade commented 8 years ago

...also, this encourages much smaller notebooks which is refreshing to say the least.

mortonjt commented 8 years ago

I think this is mostly okay. There are a few comments in pick_otus.

gregcaporaso commented 8 years ago

Remove the recruited bloom sequences from the demultiplexed sequence data

Why do this at the sequence level and not the OTU level? (And if this isn't the kind of feedback you're looking for now, just let me know.)

rob-knight commented 8 years ago

Because doing it at the OTU level was too broad in earlier tests, though perhaps this should be revisited.

On Oct 7, 2015, at 4:17 PM, Greg Caporaso notifications@github.com wrote:

Remove the recruited bloom sequences from the demultiplexed sequence data

Why do this at the sequence level and not the OTU level? (And if this isn't the kind of feedback you're looking for now, just let me know.)

— Reply to this email directly or view it on GitHub https://github.com/biocore/American-Gut/pull/160#issuecomment-146362589.

gregcaporaso commented 8 years ago

We also need to specify what specific metadata category and value correspond indicate what samples are fecal.

Extra words here ("correspond indicate").

gregcaporaso commented 8 years ago

The bloom sequences are used as reference sequences for 97% closed-reference OTU picking, and then all sequences that cluster into the OTUs defined by those centroids are removed. So the "range" of sequences that are removed is the same size as if you did this at the OTU level, but the centroids are specifically defined by the bloom sequences. This is interesting and makes sense, but I think it'd be good to describe this specifically (where you mention how sequences are recruited above) as it wasn't the way that I would have thought to do this.
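To make the described procedure concrete, here is a hedged sketch of the removal step only. It assumes the closed-reference run against the bloom sequences has already produced a QIIME-style OTU map (tab-separated: OTU ID, then the sequence IDs in that OTU). The function names are hypothetical, and this is not the code in the notebooks.

```python
def recruited_ids(otu_map_lines):
    """Collect every sequence ID assigned to any OTU in the OTU map.

    Each line is tab-separated: the OTU ID first, then its member sequence IDs.
    """
    ids = set()
    for line in otu_map_lines:
        fields = line.rstrip("\n").split("\t")
        ids.update(field for field in fields[1:] if field)
    return ids


def drop_recruited(fasta_lines, ids_to_drop):
    """Yield FASTA lines, skipping records whose ID is in ids_to_drop."""
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].split()
            seq_id = header[0] if header else ""
            keep = seq_id not in ids_to_drop
        if keep:
            yield line
```

Because the OTUs are centered on the bloom sequences themselves, every read within the 97% clustering threshold of a bloom centroid ends up in the map and is dropped from the demultiplexed data.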

gregcaporaso commented 8 years ago

One really high-level comment on the notebooks - if these were written in markdown instead of ipynb they'd be easier to diff, have others contribute to, etc. This is what we're doing with IAB now (see my MSL blog post about this).

wasade commented 8 years ago

That is awesome. Yes, will look at that right now.

On Wed, Oct 7, 2015 at 5:31 PM, Greg Caporaso notifications@github.com wrote:

One really high-level comment on the notebooks - if these were written in markdown instead of ipynb they'd be easier to diff, have others contribute to, etc. This is what we're doing with IAB now (see my MSL blog post https://www.mozillascience.org/an-introduction-to-applied-bioinformatics-at-mozsprint about this).

gregcaporaso commented 8 years ago

This video shows how readers can submit changes as PRs 100% through their web browser.

gregcaporaso commented 8 years ago

Reviewed get_sequences_and_metadata.ipynb, that all makes sense.

gregcaporaso commented 8 years ago

In pick_otus.ipynb, you shouldn't need to pass -r or -t anymore - they're defaulting to the files you're wanting to use (I think).

gregcaporaso commented 8 years ago

It could help with debugging if your two pick_closed_reference_otus.py commands were in different cells.
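One way to picture both suggestions (dropping -r/-t and splitting the two runs) is a small helper that builds each command separately, so each can be executed and inspected in its own cell. This is only a sketch: `closed_ref_command` and all file paths below are hypothetical, and whether the defaults point at the intended files depends on the local QIIME configuration.

```python
def closed_ref_command(input_fp, output_dir, reference_fp=None, taxonomy_fp=None):
    """Build the argv for one pick_closed_reference_otus.py run.

    -r and -t are only added when given explicitly; otherwise the script
    falls back to whatever defaults the local QIIME install is configured with.
    """
    argv = ["pick_closed_reference_otus.py", "-i", input_fp, "-o", output_dir]
    if reference_fp is not None:
        argv += ["-r", reference_fp]
    if taxonomy_fp is not None:
        argv += ["-t", taxonomy_fp]
    return argv


# Hypothetical paths, one command per notebook cell.
first_run = closed_ref_command("first_input.fna", "first_otus")
second_run = closed_ref_command("second_input.fna", "second_otus",
                                reference_fp="custom_ref.fna",
                                taxonomy_fp="custom_tax.txt")
```

Each list could then be passed to `subprocess.check_call` in its own cell, so a failure points at exactly one run.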

gregcaporaso commented 8 years ago

@wasade, added some comments on your notebooks. These generally look good to me (clearly a lot better than the massive notebook you linked me to). If you want me to review a few more of these as you get them together let me know.

wasade commented 8 years ago

Not fully. I want to be able to exercise these notebooks via Travis, so I need to replace the reference: GG 97% isn't feasible for testing because indexing it with SortMeRNA takes 15 min, and it can't be included in the repo due to its size.

wasade commented 8 years ago

Thank you @gregcaporaso, this is awesome. I'm going to finish doing the indirection to get the testing up for pick OTUs, tie into .travis, and then do the ipymd bit you linked.
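The indirection mentioned here could take many forms. One possible sketch, assuming the test run signals a small stand-in reference through environment variables, is below. The variable names `AG_TEST_REFERENCE_SEQS` and `AG_TEST_REFERENCE_TAXONOMY` and the helper `reference_paths` are hypothetical and not taken from this PR.

```python
import os


def reference_paths(environ=os.environ):
    """Return (reference_fp, taxonomy_fp) for OTU picking.

    If both hypothetical test variables are set, use them; otherwise return
    (None, None), meaning "let QIIME use its configured defaults".
    """
    ref = environ.get("AG_TEST_REFERENCE_SEQS")
    tax = environ.get("AG_TEST_REFERENCE_TAXONOMY")
    if ref and tax:
        return ref, tax
    return None, None
```

On Travis, the two variables would point at a tiny reference checked into the repo, which avoids the long SortMeRNA indexing step. A normal run leaves them unset and gets the full reference.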

gregcaporaso commented 8 years ago

Ah, right, that makes sense (though you could probably add a .qiime_config to travis if you wanted to, but no difference either way).

squirrelo commented 8 years ago

Couple small things.

wasade commented 8 years ago

In the interest of simplifying review, as this PR is titanic, I'm going to close it out and open individual PRs for each notebook one-by-one.