biocore / American-Gut

American Gut open-access data and IPython notebooks
Other
108 stars 81 forks source link

Preprocessing notebook #126

Closed jwdebelius closed 9 years ago

jwdebelius commented 9 years ago

A notebook for the generation of clean, consistent tables for use in downstream analyses. Rather than spending half a notebook trying to get data in a desired format, this centralizes the process.

Notebook can be viewed here.

This relies on files in #125.

jwdebelius commented 9 years ago

Bumping this, since the tables are being distributed. @wasade, @ElDeveloper , @cuttlefishh: Could you review this?

The

wasade commented 9 years ago

few comments below:

beta diversity likely includes archaea as well

boolian -> boolean

I don't think pandas was described as a requirement for the notebook

beta diversity hyperlink in "beta diversity parameters" doesn't work

ag is picked against Greengenes 13_8

greengenes -> Greengenes

in split directories and files, why are the underscores escaped? eg split_raw_dir. this happens in multiple places when describing variables, like file pattern fill-in, etc

why does the notebook download the preprocessed data if the point is to produce the preprocessed data?

the hyperlinks for the references don't work, look like they're formatted incorrectly

jwdebelius commented 9 years ago

I only listed the requirements that were not superseded by QIIME or are ambitious in QIIME. (You need the newest IPython, but the version of pandas, matplotlib, etc. work fine with the QIIME dependencies.) Would it clarify things to explicitly list these packages?

I think the issue is mixing markdown and HTML. It worked in my Safari and Chrome, but it seems to be a problem in NBViewer. I will correct the issue.

The escapes are there because it explicitly forces the underscores to be displayed, rather than italicizing the remaining text. So, I've put them in explicitly to maintain the formatting.

I can't set up an explicit way to download the FTP as a hyperlink in an IPython notebook. Before the data was FTP, I had an in-text link to the HTML. I can assume that if people want to download the data, they can run other notebooks, but I sort of imagined that people looking at the notebook would want to get the data for other things.

On Mon, Mar 30, 2015 at 8:17 AM, Daniel McDonald notifications@github.com wrote:

few comments below:

beta diversity likely includes archaea as well

boolian -> boolean

I don't think pandas was described as a requirement for the notebook

beta diversity hyperlink in "beta diversity parameters" doesn't work

ag is picked against Greengenes 13_8

greengenes -> Greengenes

in split directories and files, why are the underscores escaped? eg split_raw_dir. this happens in multiple places when describing variables, like file pattern fill-in, etc

why does the notebook download the preprocessed data if the point is to produce the preprocessed data?

the hyperlinks for the references don't work, look like they're formatted incorrectly

— Reply to this email directly or view it on GitHub https://github.com/biocore/American-Gut/pull/126#issuecomment-87719926.

wasade commented 9 years ago

Hyperlinks work with ftp, just use: ftp://ftp.microbio.me/foo/bar.

Pandas is not a dep of any of those IIRC, but might be wrong On Mar 30, 2015 11:37 AM, "J W Debelius" notifications@github.com wrote:

I only listed the requirements that were not superseded by QIIME or are ambitious in QIIME. (You need the newest IPython, but the version of pandas, matplotlib, etc. work fine with the QIIME dependencies.) Would it clarify things to explicitly list these packages?

I think the issue is mixing markdown and HTML. It worked in my Safari and Chrome, but it seems to be a problem in NBViewer. I will correct the issue.

The escapes are there because it explicitly forces the underscores to be displayed, rather than italicizing the remaining text. So, I've put them in explicitly to maintain the formatting.

I can't set up an explicit way to download the FTP as a hyperlink in an IPython notebook. Before the data was FTP, I had an in-text link to the HTML. I can assume that if people want to download the data, they can run other notebooks, but I sort of imagined that people looking at the notebook would want to get the data for other things.

On Mon, Mar 30, 2015 at 8:17 AM, Daniel McDonald <notifications@github.com

wrote:

few comments below:

beta diversity likely includes archaea as well

boolian -> boolean

I don't think pandas was described as a requirement for the notebook

beta diversity hyperlink in "beta diversity parameters" doesn't work

ag is picked against Greengenes 13_8

greengenes -> Greengenes

in split directories and files, why are the underscores escaped? eg split_raw_dir. this happens in multiple places when describing variables, like file pattern fill-in, etc

why does the notebook download the preprocessed data if the point is to produce the preprocessed data?

the hyperlinks for the references don't work, look like they're formatted incorrectly

— Reply to this email directly or view it on GitHub <https://github.com/biocore/American-Gut/pull/126#issuecomment-87719926 .

— Reply to this email directly or view it on GitHub https://github.com/biocore/American-Gut/pull/126#issuecomment-87765751.

jwdebelius commented 9 years ago

When I tried to make them in IPython, the hyperlink hover-over is a curser, not the little link-click hand. I looked over what I could find in StackOverflow, python and IPython documentation, but I couldn't find a satisfactory solution for linking to an ftp in the markdown cells.

Pandas is a scikit-bio dependency AFAIK.

ElDeveloper commented 9 years ago

Ah yeah, this is a known problem with Markdown.

On (Mar-30-15|10:58), J W Debelius wrote:

When I tried to make them in IPython, the hyperlink hover-over is a curser, not the little link-click hand. I looked over what I could find in StackOverflow, python and IPython documentation, but I couldn't find a satisfactory solution for linking to an ftp in the markdown cells.

Pandas is a scikit-bio dependency AFAIK.


Reply to this email directly or view it on GitHub: https://github.com/biocore/American-Gut/pull/126#issuecomment-87773239

jwdebelius commented 9 years ago

With that in mind, how would you prefer to handle this?

The solutions I can come up with include something like the current implementation, where a flag gets set to allow download or dataset generation; removing a reference to the pre-generated dataset; or including a link somewhere, with the note that it needs to be copied into the browser.

I think all three have advantages and drawbacks.

On Mon, Mar 30, 2015 at 11:00 AM, Yoshiki Vázquez Baeza < notifications@github.com> wrote:

Ah yeah, this is a known problem with Markdown.

On (Mar-30-15|10:58), J W Debelius wrote:

When I tried to make them in IPython, the hyperlink hover-over is a curser, not the little link-click hand. I looked over what I could find in StackOverflow, python and IPython documentation, but I couldn't find a satisfactory solution for linking to an ftp in the markdown cells.

Pandas is a scikit-bio dependency AFAIK.


Reply to this email directly or view it on GitHub: https://github.com/biocore/American-Gut/pull/126#issuecomment-87773239

— Reply to this email directly or view it on GitHub https://github.com/biocore/American-Gut/pull/126#issuecomment-87774165.

jwdebelius commented 9 years ago

This fixes the markdown issues with appearance and the links.

The download flag is still included in the notebook.

I'd like to get this merged sooner because it reflects tables we're sending out with the manuscript. It can live in my repo indefinitely, but it would be better to have it in master.

ElDeveloper commented 9 years ago

That notebook is :star2: amazing :star2:!

wasade commented 9 years ago

@ElDeveloper, merge?

jwdebelius commented 9 years ago

@ElDeveloper: Any update on review?

Thank you for your help!

ElDeveloper commented 9 years ago

@JWDebelius and I are going through the notebook, we hope to have a final version around 4:00 pm PDT today.

jwdebelius commented 9 years ago

Running a little behind.

Thank you for all the awesome help today, @ElDeveloper.

I had one question: you suggested applying the scipy.spatial.distance.euclidean function when calculating the best rarefaction. When I try to apply the function, it requires a 1D array. My understanding is that performance wise, it's better to operate on 2D numpy arrays than to do a list comprehension and then cast to an array, but I'm happy to do which ever?

ElDeveloper commented 9 years ago

My main suggestion for using the euclidian distance function in scipy, was to improve the readability of those few lines of code, not really to improve performance. However, if you think this is not ideal and cannot be easily adapted, then it's find to leave as is.

On (Apr-13-15|20:20), J W Debelius wrote:

Running a little behind.

Thank you for all the awesome help today, @ElDeveloper.

I had one question: you suggested applying the scipy.spatial.distance.euclidean function when calculating the best rarefaction. When I try to apply the function, it requires a 1D array. My understanding is that performance wise, it's better to operate on 2D numpy arrays than to do a list comprehension and then cast to an array, but I'm happy to do which ever?


Reply to this email directly or view it on GitHub: https://github.com/biocore/American-Gut/pull/126#issuecomment-92579202

ElDeveloper commented 9 years ago

Thanks @JWDebelius