QC and downstream analysis

eboileau commented 6 years ago

I open this thread to follow-up on minor issues related to the analysis scripts, when running the example dataset (see also #76 ).

~~1. When running create-rpbp-preprocessing-report, call to pdflatex initially fails.~~ I suggest to include graphics path files without extension, in this way the graphics package looks for a supported graphics format automatically. ~~2. Bars are missing in the read length distributions bar plots.~~ ~~3. With show-read-length-bfs, compilation fails.~~

For rep 1, we don't have enough reads to visualize (read length periodicity), unless we set --min-visualization-count very low, say 10. Next release should probably include the full working example with suggested values for parameters/options.

~~Not tested yet with option --create-fastqc-reports.~~

FIX bio_utils.plotting.plot_read_length_distribution, create_rpbp_preprocessing_report and visualize_metagene_profile_bayes_factor (minor changes to plotting options, typos, latex commands modified, redundant lines removed, etc.) The report should now be created no matter which option is selected. This will be added to the next release.

eboileau commented 6 years ago

When running create-rpbp-predictions-report:

The report contains misformatted figures (axes, etc.), in particular the ORF bar plots and length distributions.
With the --show-chisq option, it seems that fraction and reweighting_iterations are missing when calling get_riboseq_predicted_orfs, as a result files (names without the frac-smoothing_fraction.rw-smoothing_reweighting_iterations) are reported as missing.

eboileau commented 6 years ago

Re point 2 above: estimate-orf-bayes-factors returns everything as a BED12+ file with frac-smoothing_fraction.rw-smoothing_reweighting_iterations in the file name, whether we only want the chi square value or not (it will be included by default). However, none of the file names account for the differences between is_chisq_values = [True, False] when selecting the final prediction sets. As a results, all fine names contain the string frac-smoothing_fraction.rw-smoothing_reweighting_iterations. For the QC/analysis, and in particular in create_rpbp_predictions_report, this is problematic with --show-chisq, since this results in a mismatch in the file names. This is only a matter of naming convention, but for consistency should be changed throughout the code.

bmmalone commented 6 years ago

Hi Etienne,

Just to clarify a bit here...

chisq stuff: in the paper, we show the the Bayes factor-based approach is essentially always better than the chi-square test-based approach. Since including those figures doubles the size of the reports (and since they aren't especially informative), I just quit using them. It is entirely possible that some those filenames are not consistent any more.
I tried a lot of different settings on the bar charts, and it is very difficult to select defaults that work for everything. For example, if you use the "auto" selection within mpl (I use "mpl" for matplotlib in lots of places), then visual comparison among different experiments can be misleading since the height of the bar always looks the same regardless of sequencing depth in the experiment. Similarly, if the sequencing depth is very low and the min-visualization-count filter removes everything, that is a pretty clear indication that the biological protocol failed for some reason. Likewise, using auto, if one sample is sequenced much deeper than all the others, then those bars will be very compressed. log scale is not very good since it distorts proportions.

My general approach was to create the reports using the defaults (for rpbp only, ignoring the chisq results), and then use the ipython notebooks to fine-tune plots of interest. You can then recompile the latex report to use the updated plots. Of course, this requires quite a lot manual work, so making that easier would certainly be welcome.

Especially for the example, though, you are right that the defaults are not very good. In principle, the various reporting parameters could be passed either through the command line or, maybe even better, through the config file. You'd need to update the relevant scripts to pull the values from the config file, but an advantage is that good values could already be included in the example config file.

eboileau commented 6 years ago

Hi Brandon,

In the same order:

Based on your comments, would you suggest completely removing the chisq stuff from the ORFs predictions (and hence from the analysis)? In the minimum, I would have to patch for the filenames, so that anyone using this test/score (for whatever reason) does not encounter errors.
I'm trying to start with the simplest stuff... the idea is to have QC/analysis scripts that works well in general (and in particular for the example, at least from a software release point of view), however I do not want to invest too much time, since this may not be the most important aspects. I tested the different options, to make sure all of them result in a successful outcome (i.e. without error or with no major formatting issues). This was not the case with all options. Any further improvements will be done in time.

As for the test example, indeed something should be done in the lines of #76 . I could probably use TestRpBp.py as a starting point, though I remember you mentioned something (not sure if this is what you were referring to?). I could then include the example-specific parameters in there, or else as you mention using the config file (and update the relevant scripts).

eboileau commented 6 years ago

Ok, first the chi stuff has been relegated to cases where the option chi_square_only is given, so we don't have all these files generated by default. As for the post-proc analysis (reports and plots), the reports now generate without any issues, but further testing/fine-tuning will be necessary. On this matter, I am also updating the docs for the QC/analysis, so will close this issue for now.

dieterich-lab / rp-bp

QC and downstream analysis #87