Improved documentation and Q2 summary visualizations; centralizing parameters

fedarko commented 5 years ago

An attempt to summarize new stuff:

A lot of changes to the README. I'll summarize a few of the important things here, but it's probably best if you just read over the new README.
- Reorganized things: add quick examples of running Songbird standalone + in Q2 to the top of the README, both examples using the Red Sea dataset (and using parameters that I've previously ensured make the model fit reasonably well). Both examples stress the importance of checking model fit.
- Added explicit sections on "Interpreting model fitting information", "Adjusting parameters to get reasonable fitting", and "Specifying a formula". A lot of these sections is cobbled together from stuff that was previously buried in the FAQs.
- Added screenshots of tensorboard and the q2 summarize-single visualizations for the same dataset, with the same parameters. (These URLs will be un-broken when this PR is merged in—for now, the screenshots can be viewed here (tensorboard) and here (q2).)
- Please double check to make sure that these plots look good to use as reference screenshots for "good" model fitting. I'm pretty sure they're good, but I didn't write Songbird :) If they need changing, I can change up the parameters or whatevs.
- Restructured the FAQs into separate sections: added more detailed explanations of the Q2 outputs, cleaned up some of the existing text, etc.
- Added a few sentences shilling Qurro ;) (mostly where the README is explaining what you can do with the differentials)
Modified the q2 summary visualizations to be a lot easier to interpret:
- Switched around the cv and loss plots to match tensorboard (CV on top, loss on bottom)
- Renamed "loglikehood" to loss for the now-bottom plot, to be consistent with tensorboard
- Added in-HTML text explaining:
  - the need to decrease --p-summary-interval if you don't see anything in the plots
  - how to interpret the plots (this links to the new "Interpreting..." section of the README)
  - how to adjust parameters to get models to fit reasonably (this links to the new "Adjusting..." section of the README)
- Tidied up formatting of the pseudo Q² value for paired summaries (the plots aren't uncomfortably shifted to the right now :P)
- These changes should close #72.
Removed the --i-feature-table parameter for qiime songbird summarize-single, since the number of samples is only used to compute Q² values in the paired summary. (Closes #76.)
- This necessitated making n an optional parameter for _summarize(). As you can see in the diff, I added some code to check for the never-should-happen case where n is None but baseline is not None (i.e. we're making a paired summary but don't have access to n for computing Q² values). _summarize() throws an error in this case.
- I also added a test to make sure that error is triggered properly.
All descriptions and defaults for all parameters (of multinomial, at least) are now stored in songbird/parameter_info.py. This centralized information is used by both the standalone and Q2 versions of Songbird, so now if you want to change a parameter you only need to change one thing.
- (This is the same way we solved this sort of problem in DEICODE and Qurro.)
- Making all of these changes messed up the styling of scripts/songbird, so instead of spending a bunch of time manually formatting it I just ran the file through black -l 79. (That's why that file looks so different in the diff.)
- Following up on that, I lowered the default summary interval from 60 seconds to 10 seconds. Since most people's first experiences with Songbird will likely be running a small model that stops quickly, using a smaller default summary interval should help them see something. (Closes #77.)

OK, I think that's the bulk of it. Please let me know if you'd like any clarification or for me to change anything. In sum, I think these changes will make using Songbird a lot more pleasant (and understandable) for many people. Thanks!

fedarko commented 5 years ago

Also, it'd probably be a good idea for multiple people to review this since this is a lot of changes, and it's important that they're correct. @lisa55asil if you'll be around tomorrow, would you mind sitting down and going over this with me? A lot of the documentation changes here involved your FAQs, so you'd probably be the most qualified person to comment on that :)

mortonjt commented 5 years ago

@fedarko , this is awesome! Will take a close look at it today / tomorrow.

fedarko commented 5 years ago

Thanks! @lisa55asil is looking over it right now, so I think we can merge this in once we incorporate her feedback.

One quick question—there's this part of the FAQ:

Q. What's up with the --training-column argument?

A. That is used for cross-validation if you have a specific reproducibility question that you are interested in answering. If this is specified, only samples labeled "Train" under this column will be used for building the model and samples labeled "Test" will be used for cross validation. In other words the model will attempt to predict the microbe abundances for the "Test" samples. The resulting prediction accuracy is used to evaluate the generalizability of the model in order to determine if the model is overfitting or not. If this argument is not specified, then 10 random samples will be chosen for the test dataset. If you want to specify more random samples to allocate for cross-validation, the --num-random-test-examples argument can be specified.

This says "10 random samples", but the default for --num-random-test-examples is 5. Should we change that?

fedarko commented 5 years ago

Ok, Lisa's changes have been integrated in. We've both made some changes (I fixed the --num-random-test-examples thing, as well as tidied up the discussions of # of iterations). Also tried to clean up the "baseline model" description in the summarize-paired help here.

@mortonjt, pending your final approval I think this should be good to merge in.

fedarko commented 5 years ago

...made a few last-minute commits showing off using qiime metadata tabulate on a differentials file, but i think that's it from me for right now :)

lisa55asil commented 5 years ago

That's awesome! If its easy and you already have it a link to the output of tabulating the differentials.qza from the Red Sea data would be helpful I think.

On Mon, Sep 23, 2019 at 3:31 PM Marcus Fedarko notifications@github.com wrote:

...made a few last-minute commits showing off using qiime metadata tabulate on a differentials file, but i think that's it from me for right now :)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/biocore/songbird/pull/80?email_source=notifications&email_token=AE6EV4WU2TFHC6UYXCRBDWDQLE7SLA5CNFSM4IYSYWZKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7MO2TA#issuecomment-534310220, or mute the thread https://github.com/notifications/unsubscribe-auth/AE6EV4UX7Y5V7ROPYBL4BI3QLE7SLANCNFSM4IYSYWZA .

fedarko commented 5 years ago

Ok! I'm not sure adding a link to a QZV is the best idea, since I guess that'd have to be periodically updated if we also update Songbird... but I added a screenshot of the QZV output. I think that's a good advertisement for that functionality :) (here's a link to the screenshot)

mortonjt commented 5 years ago

Noting qiime metadata tabulate is a great idea!

mortonjt commented 5 years ago

@fedarko are you finished with edits?

fedarko commented 5 years ago

Going to make one more pass over everything today. I'll try to have it done by 5pm EST.

Thanks! Marcus

On Tue, Sep 24, 2019, 9:00 AM Jamie Morton notifications@github.com wrote:

@fedarko https://github.com/fedarko are you finished with edits?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/biocore/songbird/pull/80?email_source=notifications&email_token=AA736PYLQZ5MMMHB364XZY3QLI2TPA5CNFSM4IYSYWZKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7O4EAA#issuecomment-534626816, or mute the thread https://github.com/notifications/unsubscribe-auth/AA736P4Y5QAKAM273AMFUITQLI2TPANCNFSM4IYSYWZA .

fedarko commented 5 years ago

@mortonjt I'm done! Added a small subsection clearly describing setting the reference for categorical formulas, since this has confused a lot of people (me included) in the past.

Should be ready to merge in—thanks @lisa55asil + @mortonjt for your help with this :)

lisa55asil commented 5 years ago

This is looking dope Marcus! I really like the clarification about the reference for categorical formulas. I added a short sentence that I think helps make this description more tangible.

On Tue, Sep 24, 2019 at 2:01 PM Marcus Fedarko notifications@github.com wrote:

@mortonjt https://github.com/mortonjt I'm done! Added a small subsection clearly describing setting the reference for categorical formulas, since this has confused a lot of people (me included) in the past.

Should be ready to merge in—thanks @lisa55asil https://github.com/lisa55asil + @mortonjt https://github.com/mortonjt for your help with this :)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/biocore/songbird/pull/80?email_source=notifications&email_token=AE6EV4XZ6EMQNRAACDYSSLDQLJ527A5CNFSM4IYSYWZKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7PZK6A#issuecomment-534746488, or mute the thread https://github.com/notifications/unsubscribe-auth/AE6EV4WIQYV5KM4JVLNRLYLQLJ527ANCNFSM4IYSYWZA .

fedarko commented 5 years ago

@mortonjt I'm done with editing, so provided you're cool with it I think this is ready to go. Thanks!

mortonjt commented 5 years ago

Awesome. Thanks @fedarko !

biocore / songbird

Improved documentation and Q2 summary visualizations; centralizing parameters #80