Analysis cleanup - Githubissues

ggsun commented 5 years ago

As we have discussed in the whole-cell meeting on October 25th, we will be initiating a team-wide effort to look through all analysis plots and do a general cleanup. This cleanup will include

Dividing up analysis scripts into differently tagged groups based on their purposes and uses - e.g. “core” (plots that should always be run), “paper” (plots that were used as figures for paper submissions), "metabolism" (plots that are helpful for looking at metabolism specifically).
Removing/fixing analysis scripts that are computationally expensive or not useful.
Attempting to fix broken scripts, and determining their fate.
Making sure the file names of plots match the names of the script files they are generated from.
Disabling the running of cohort analyses on single-seed simulations, if these analyses generate meaningful results only for multiseed sims. Likewise, disabling the running of multigen analyses on single-gen simulations, if these analyses generate meaningful results only for multigen sims.

Here's a link to a shared spreadsheet that contains everything we've discussed so far on each plot: https://docs.google.com/spreadsheets/d/1YIzY5QaxxqzHZ3ifWp_4OsMB77InU0ggmNKDTP1G_K0/edit?usp=sharing

Assignees for each plot/task will be finalized after we look through single plots for our meeting next week.

jmason42 commented 5 years ago

Some related thoughts and side discussions about the tag system:

Not everything needs a tag. No tag means that it can only be run manually, or...
Some command (like --all) should run every script.
core should be the default - and maybe core ought to be named default.
A big complicated script selection mini-language isn't needed, but I would like to be able to, at the very least, set up something like --include core metabolism --exclude intensive or --all --exclude paper. (If you want to get really slick, you could express these as set operations e.g. (core | metabolism) & ~intensive or ~paper.)
On that note, anything with big output or heavy computation should get an intensive tag.
There could be some 'hard' or 'inherited' tags related to the type of analysis script. E.g. every multi-gen analysis script gets a multigen tag by default.
Tags should follow our Python variable name conventions, even if that's not how they are used. Excluding special characters and white space simplifies parsing. Underscores are optional and capitalization should be used sparingly (e.g. trna_charging, tRNAcharging, tRNA_charging, trnacharging are all acceptable - I personally prefer the third and then the first, don't care much for the other two).
Tags should be useful. If a tag isn't ever used (either to include or exclude a set of scripts), it should be retired. Tags aren't a replacement for documentation or maintenance.
There should be a --dry-run option that only prints the name of the scripts that satisfy the tag parsing rules.
Likewise there should be an option that just prints out the names of all tags (perhaps sorted and annotated from most to least occurrences). Possibly even a warning if a tag only occurs once, since that's a sign that something may be mistyped.
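
A minimal sketch of the selection rules proposed in the list above, assuming a hypothetical registry that maps script names to tags. The script names, the registry, and the --list-tags flag name are invented for illustration; --include, --exclude, --all and --dry-run follow the wording of the comment. Selection is the set expression (union of included tags) minus (union of excluded tags), with core as the default.

```python
import argparse
from collections import Counter

# Hypothetical registry: script name -> set of tags (all names are made up).
SCRIPT_TAGS = {
    "mass_fractions": {"core"},
    "all_reaction_fluxes": {"metabolism", "intensive"},
    "amino_acid_counts": {"core", "metabolism"},
    "figure_2": {"paper"},
    "scratch_plot": set(),  # untagged: only reachable by name or with --all
}


def select_scripts(registry, include=(), exclude=(), run_all=False):
    """Return sorted names matching (any included tag) and (no excluded tag)."""
    include, exclude = set(include), set(exclude)
    if run_all:
        chosen = set(registry)
    else:
        chosen = {name for name, tags in registry.items() if tags & include}
    return sorted(name for name in chosen if not (registry[name] & exclude))


def tag_counts(registry):
    """Return (tag, occurrences) pairs, most frequent first."""
    counts = Counter(tag for tags in registry.values() for tag in tags)
    return counts.most_common()


def build_parser():
    parser = argparse.ArgumentParser(description="Select analysis scripts by tag (sketch).")
    parser.add_argument("--include", nargs="*", default=["core"])
    parser.add_argument("--exclude", nargs="*", default=[])
    parser.add_argument("--all", action="store_true", dest="run_all")
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--list-tags", action="store_true")  # hypothetical flag name
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    if args.list_tags:
        for tag, count in tag_counts(SCRIPT_TAGS):
            # A tag that occurs once may be a typo, so flag it.
            note = "  (used once: possible typo?)" if count == 1 else ""
            print(f"{tag}: {count}{note}")
        return []
    selected = select_scripts(SCRIPT_TAGS, args.include, args.exclude, args.run_all)
    if args.dry_run:
        print("\n".join(selected))
    # Actually running the selected scripts is outside the scope of this sketch.
    return selected
```

With these made-up tags, main(["--include", "core", "metabolism", "--exclude", "intensive", "--dry-run"]) prints amino_acid_counts and mass_fractions, and main(["--all", "--exclude", "paper"]) selects everything except figure_2.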

Even more tangential, and probably worth its own issue: I do like the fact that one script, one output file means that (apart from the extension) the file names can be the same. For multiple outputs we can be strict about using the original file name, plus a suffix. We could also say that anything with multiple outputs instead generates a directory with the same name as the script, and that directory then contains the various output files (with no limits on naming). That would be cleaner although it would break the flat directory structure, which @1fish2 has pointed out is desirable.
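
A small sketch of the "script name plus suffix" convention, which keeps the output directory flat. The helper name, the underscore separator, and the example file names are assumptions for illustration, not existing project code.

```python
from pathlib import Path


def output_path(script_file, out_dir, extension, suffix=""):
    """Build an output path whose file name starts with the script's own name."""
    stem = Path(script_file).stem  # e.g. "massFractions" from "massFractions.py"
    name = f"{stem}_{suffix}" if suffix else stem
    return Path(out_dir) / f"{name}.{extension}"
```

A script with a single output would call output_path(__file__, out_dir, "pdf"); a script with several outputs would pass a different suffix for each, so all of its files sort together under the shared script-name prefix.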

1fish2 commented 5 years ago

To add to these good ideas:

Let's use prefixes (not subdirectories) for classes with multiple output plots. That's easier for viewing (e.g. via Quick Look and Mojave gallery view) and for shell scripting.
Plot classes currently write multiple formats: PDF, PNG, SVG, and some HTML. Let's make interactive runs write just one of them and nightly builds write all of them for testing. (Nuke the HTML outputs?) Keep the subdirectories for the different formats so it remains easy to view all the plots in one format.
Consider keeping notes on deleted plots that might be useful to crib from in the future since deleted code is well hidden in source control.
Commented-out plots are only run by explicitly naming them on the command line. It's a temporary state for plots that need fixing or agreement to delete.
The nightly build should continue to run all plots that aren't commented out.
Make the default set be the "core" plots that people want to run manually during development.
Subsets can be implemented "bottom up" via tags like in nose tests or "top down" via lists, with minor tradeoffs like whether the order is controllable. The scripts currently implement one list per category (single, cohort, etc.) and they accept a list of plots on the command line. So more lists can be created by simple shell scripts or adding command line args that name Python lists. Or add command line args that select plots by tag.
The command line help could name and explain all the sets if they're coded into the analysis scripts or quickly discoverable at runtime. I wouldn't do that if it requires loading all analysis plots to discover their tags.
As John wrote, make tags (sets) that are useful for development, not for documentation. The need is just to look at more than the default plots without having to run all plots. The more plot classes and output formats we delete and optimize, the less it matters to skip the rest.
I'd take a wait-and-see approach on a way to include/exclude multiple sets. The first step towards that could be to deduplicate given plot lists.
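
The deduplication step mentioned in the last point could look roughly like this. It is a sketch that assumes plots are identified by name strings; the function name is hypothetical. It keeps the first occurrence of each plot so that the order of the given lists stays controllable.

```python
def merge_plot_lists(*plot_lists):
    """Concatenate plot lists, dropping repeats while preserving first-seen order."""
    seen = set()
    merged = []
    for plots in plot_lists:
        for name in plots:
            if name not in seen:
                seen.add(name)
                merged.append(name)
    return merged
```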

jmason42 commented 5 years ago

> Plot classes currently write multiple formats: PDF, PNG, SVG, and some HTML. Let's make interactive runs write just one of them and nightly builds write all of them for testing. (Nuke the HTML outputs?)

Sounds good to me. If I had to choose one, I'd go with PDF. PNG is raster, so it has those limitations. SVG support is imperfect, both on the ends of reading and writing, though I could be swayed. PDF is proprietary but widely supported and largely consistent. If we keep multiple formats (which I think is OK), then might I suggest that the topmost plot output directory not contain anything other than the 'pdf', 'png', etc. directories? Then those directories are clean. It does add a layer, which is unfortunate.

prismofeverything commented 5 years ago

This is great, I am already plotting a lightweight tag utility and these are great feature requests.

> If I had to choose one, I'd go with PDF. PNG is raster, so it has those limitations. SVG support is imperfect, both on the ends of reading and writing, though I could be swayed. PDF is proprietary but widely supported and largely consistent.

I would prefer SVG over PDF. SVG is a universal format and can be displayed directly in webpages, so could serve the purposes of both a PDF and HTML. You can also open them directly in editable vector programs like Illustrator or Inkscape (just like a PDF). Also, the files are human readable and writable unlike PDF which is a binary mess.

Have you had trouble reading and writing SVG files from python? It is basically markup just like HTML and I have used HTML utilities in the past to write SVG. I can do a survey of the existing SVG libraries and see what the state of the world is.

1fish2 commented 5 years ago

@prismofeverything, I'm interested in the lightweight tag utility. Does it require finding and loading all the classes on every run? Lists don't, so I'm liking that approach more as I think about it.

prismofeverything commented 5 years ago

> I'm interested in the lightweight tag utility. Does it require finding and loading all the classes on every run? Lists don't, so I'm liking that approach more as I think about it.

Hey @1fish2, I was thinking something like a class decorator, kind of like how nosetests does it (pass in a list of tags). Nosetests seems to boot pretty fast, even with our large codebase. I think it could be fairly low overhead, especially compared to the runtime of the analyses.

I agree that lists would be easier to implement, but they would be harder to maintain, add or remove from. I want to avoid any kind of situation where a list we forgot still contains a reference to an analysis that no longer exists. If we have a tag system, the only file you would have to edit would be the analysis implementation itself, whereas with a list system you would have to touch a potentially unbounded number of files and it would be easy to mess up. A tag system would just "do the right thing" if you removed that file entirely, no vestigial traces would be scattered throughout the rest of the code.
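
A rough sketch of the class-decorator idea, under the assumption that each analysis is a class. The decorator name, the registry, and the example class are hypothetical. The tags live next to the implementation, so deleting the file removes every trace of it.

```python
from collections import defaultdict

# Hypothetical global registry: tag -> names of the classes carrying it.
TAG_REGISTRY = defaultdict(set)


def tags(*names):
    """Class decorator that records the given tags on the class and in the registry."""
    def decorator(cls):
        cls.tags = frozenset(names)
        for name in names:
            TAG_REGISTRY[name].add(cls.__name__)
        return cls
    return decorator


@tags("core", "metabolism")
class AminoAcidCounts:
    """Stand-in for an analysis plot class (invented for this example)."""

    def do_plot(self):
        pass
```

The registry is only populated once the module that defines a class has been imported, which is the discovery cost raised in the question above.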

It is a tradeoff between "ease of use" and "ease of implementation", and since this is a convenience feature anyway, it seems like we should maximize the "ease of use" dimension. Those are my thoughts anyway, open to other perspectives of course (as in, perhaps the development time is not worth the improvement?)

1fish2 commented 5 years ago

Yes, I suspect the development work (including adding all the tags) is more than the payoff.

tahorst commented 5 years ago

We might also want to consider size and computation time when deciding on the file format. I noticed it takes a long time for some plots to save the SVG and PDF versions. For example, transcriptionEvents with 8 gens took me 2:43 to complete, vs 1:47 with PDF disabled, 1:14 with SVG disabled and just 23 seconds with both disabled. The SVG output is also 79.5 MB vs 4.5 MB for the PDF file. This makes me think PDF is a better default but would limit us if we're trying to embed in webpages, which I don't think should be a priority at this time. Either way, settling on just one format should save us a decent amount of computation time.

1fish2 commented 5 years ago

Proposal: Everyday manual runs only need one output format for viewing, selected for (1) enough resolution to review manually, then (2) analysis speed, then (3) space. Use a command line option or a nightly build to get the other formats, e.g. SVG to embed in a web page or PDF to print.
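
One way the command line option in this proposal might be sketched. The flag names (--formats, --all-formats) and the choice of PDF as the default are assumptions, since the thread had not settled on either; the per-format subdirectory layout follows the earlier suggestion to keep one directory per format.

```python
import argparse

ALL_FORMATS = ("pdf", "png", "svg")
DEFAULT_FORMATS = ("pdf",)  # assumption: the thread did not settle on a default


def parse_formats(argv=None):
    """Return the list of output formats requested on the command line."""
    parser = argparse.ArgumentParser(description="Choose plot output formats (sketch).")
    parser.add_argument("--formats", nargs="+", choices=ALL_FORMATS,
                        default=list(DEFAULT_FORMATS))
    parser.add_argument("--all-formats", action="store_true",
                        help="write every format, e.g. for a nightly build")
    args = parser.parse_args(argv)
    return list(ALL_FORMATS) if args.all_formats else args.formats


def output_files(plot_name, formats):
    """Keep one subdirectory per format so all plots in one format sit together."""
    return [f"{fmt}/{plot_name}.{fmt}" for fmt in formats]
```

An everyday run would then write a single file per plot, while a nightly build would pass --all-formats.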

jmason42 commented 5 years ago

@tahorst: The plot type used in that script is pretty weird - I've never used an 'event plot' before but it looks like they create a crazy number of short lines (in this context, I'm guessing a worst case scenario of millions of lines). A heat map might be a better replacement, though I'm only really guessing what this plot looks like. Regardless, I've noticed similar extreme differences between SVG and PDF file sizes.

prismofeverything commented 5 years ago

That is a kind of ridiculous size disparity, especially because svg is a subset of pdf. It sounds like a bad translation implementation. What svg library are we using?

How do the other plots compare? Is the size disparity comparable?

tahorst commented 5 years ago

I assume the default svg library. Don't know if we can specify a different one through matplotlib to get better performance. It looks like the svg versions of most other plots are about 3x the size of the pdf but a handful have higher multiples, generally the larger files.

jmason42 commented 5 years ago

You'd need to write your own backend if you wanted to swap SVG libraries.

I suspect that the PDF format represents numbers as bytes rather than strings, and furthermore, that there's some basic compression of vector objects. (Source diving confirms that the PDF backend is in fact compressing objects via zlib.) The example that @tahorst poses likely contains a large number of easily compressed, repeated line coordinates.

Incidentally, both the SVG and PDF backends appear to largely be written in pure Python (apart from e.g. their implicit matplotlib dependencies).
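
To illustrate the compression point, here is a sketch that builds a synthetic SVG made of many short line elements (a stand-in for an event plot, not the real transcriptionEvents output) and compresses it with zlib. The element layout and the line count are invented; only the general effect of compressing repetitive markup is being shown.

```python
import zlib


def synthetic_svg(n_lines):
    """Build SVG markup containing n_lines short vertical line segments."""
    parts = ['<svg xmlns="http://www.w3.org/2000/svg">']
    for i in range(n_lines):
        x = i % 1000  # coordinates repeat, as they might in a dense event plot
        parts.append(f'<line x1="{x}.5" y1="10.0" x2="{x}.5" y2="12.0" stroke="black"/>')
    parts.append("</svg>")
    return "\n".join(parts).encode("utf-8")


raw = synthetic_svg(100_000)
packed = zlib.compress(raw)
print(f"uncompressed: {len(raw)} bytes, zlib: {len(packed)} bytes")
```

Because nearly every line of markup repeats the same text, the compressed size is a small fraction of the original, which is consistent with the size gap reported for the uncompressed SVG output.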

prismofeverything commented 5 years ago

> You'd need to write your own backend if you wanted to swap SVG libraries.

Wouldn't be the first time : P

Yep, just tried this myself on a few plots, gzipped svg output is an identical size to pdf's (at least on the cases I tried):

-rw-rw-r-- 1 spanglry mcovert 6.3M Nov 11 10:16 allReactionFluxes.pdf
-rw-rw-r-- 1 spanglry mcovert 6.3M Nov 12 16:41 allReactionFluxes.svg.gz

Looks like around a third the size in general (started out as 19MB), so perhaps pdf's of vectorized images are just gzipped svg's? I wouldn't be surprised.

So we could gzip our svg's if we really care about output size, but we'd have to gunzip them before viewing. Or tar xzf the whole output directory. Or leave it in pdf but never be able to use the result outside of viewing. Everything is a tradeoff. I personally feel the ability to programmatically open and read/edit/write a file easily is better than having it smaller but opaque to introspection or use in other contexts, and that open standards are better than proprietary binary formats, but this is based on principle rather than any of the details about this particular use case (in which it doesn't seem like we will be editing these files after they have been created).

Either way, the png's are 1/20th the size of either so they win on raw tininess. If our main use case is viewing then the png's are better than the pdf's for that, and keep the svg's for our maximal resolution outputs, then compress/archive the svg directory if we care about size. Since we don't care about viewing the svg (that's what png's are for) it doesn't matter if that directory is compressed.
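
If the svg directory were compressed, the files could still be read programmatically without first unpacking them to disk. A sketch using only the standard library; the helper names are hypothetical and the ".svg.gz" naming simply mirrors the listing above.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(path):
    """Write a gzipped copy next to the original and return the new path."""
    path = Path(path)
    target = path.with_name(path.name + ".gz")
    with open(path, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


def read_gzipped_text(path):
    """Read a gzipped text file (such as plot.svg.gz) straight into a string."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return handle.read()
```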

CovertLab / wcEcoli

Analysis cleanup #373