CovertLab / wcEcoli

Whole Cell Model of E. coli

Analysis cleanup #373

Open ggsun opened 5 years ago

ggsun commented 5 years ago

As we discussed in the whole-cell meeting on October 25th, we will be starting a team-wide effort to look through all analysis plots and do a general cleanup. This cleanup will include

Here's a link to a shared spreadsheet that contains everything we've discussed so far on each plot: https://docs.google.com/spreadsheets/d/1YIzY5QaxxqzHZ3ifWp_4OsMB77InU0ggmNKDTP1G_K0/edit?usp=sharing

Assignees for each plot/task will be finalized after we look through single plots for our meeting next week.

jmason42 commented 5 years ago

Some related thoughts and side discussions about the tag system:

Even more tangential, and probably worth its own issue: I do like the fact that one script, one output file means that (apart from the extension) the file names can be the same. For multiple outputs we can be strict about using the original file name, plus a suffix. We could also say that any script with multiple outputs instead generates a directory with the same name as the script, and that directory then contains the various output files (with no limits on naming). That would be cleaner, although it would break the flat directory structure, which @1fish2 has pointed out is desirable.
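For illustration, the two naming conventions described above could be sketched roughly as follows. The helper names `suffixed_output` and `directory_output` are hypothetical and not part of wcEcoli; the paths are made up.

```python
from pathlib import Path


def suffixed_output(script_path, out_dir, suffix="", ext="pdf"):
    """Flat layout: <out_dir>/<script name>[_<suffix>].<ext>."""
    stem = Path(script_path).stem
    name = f"{stem}_{suffix}" if suffix else stem
    return Path(out_dir) / f"{name}.{ext}"


def directory_output(script_path, out_dir, file_name):
    """Nested layout: <out_dir>/<script name>/<file_name>, free naming inside."""
    return Path(out_dir) / Path(script_path).stem / file_name


# Hypothetical script and output directory, for illustration only.
print(suffixed_output("analysis/examplePlot.py", "out"))
print(suffixed_output("analysis/examplePlot.py", "out", suffix="zoom", ext="png"))
print(directory_output("analysis/examplePlot.py", "out", "anything.svg"))
```

The first option keeps the output directory flat; the second trades that for unrestricted file names inside a per-script directory.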

1fish2 commented 5 years ago

To add to these good ideas:

jmason42 commented 5 years ago

Plot classes currently write multiple formats: PDF, PNG, SVG, and some HTML. Let's make interactive runs write just one of them and nightly builds write all of them for testing. (Nuke the HTML outputs?)

Sounds good to me. If I had to choose one, I'd go with PDF. PNG is raster, so it has those limitations. SVG support is imperfect, both on the ends of reading and writing, though I could be swayed. PDF is proprietary but widely supported and largely consistent. If we keep multiple formats (which I think is OK), then might I suggest that the topmost plot output directory not contain anything other than the 'pdf', 'png', etc. directories? Then those directories are clean. It does add a layer, which is unfortunate.

prismofeverything commented 5 years ago

This is great, I am already plotting a lightweight tag utility and these are great feature requests.

If I had to choose one, I'd go with PDF. PNG is raster, so it has those limitations. SVG support is imperfect, both on the ends of reading and writing, though I could be swayed. PDF is proprietary but widely supported and largely consistent.

I would prefer SVG over PDF. SVG is a universal format and can be displayed directly in webpages, so could serve the purposes of both a PDF and HTML. You can also open them directly in editable vector programs like Illustrator or Inkscape (just like a PDF). Also, the files are human readable and writable unlike PDF which is a binary mess.

Have you had trouble reading and writing SVG files from python? It is basically markup just like HTML and I have used HTML utilities in the past to write SVG. I can do a survey of the existing SVG libraries and see what the state of the world is.

1fish2 commented 5 years ago

@prismofeverything, I'm interested in the lightweight tag utility. Does it require finding and loading all the classes on every run? Lists don't, so I'm liking that approach more as I think about it.

prismofeverything commented 5 years ago

I'm interested in the lightweight tag utility. Does it require finding and loading all the classes on every run? Lists don't, so I'm liking that approach more as I think about it.

Hey @1fish2, I was thinking something like a class decorator, kind of like how nosetests does it (pass in a list of tags). Nosetests seems to boot pretty fast, even with our large codebase. I think it could be fairly low overhead, especially compared to the runtime of the analyses.

I agree that lists would be easier to implement, but they would be harder to maintain, add or remove from. I want to avoid any kind of situation where a list we forgot still contains a reference to an analysis that no longer exists. If we have a tag system, the only file you would have to edit would be the analysis implementation itself, whereas with a list system you would have to touch a potentially unbounded number of files and it would be easy to mess up. A tag system would just "do the right thing" if you removed that file entirely, no vestigial traces would be scattered throughout the rest of the code.

It is a tradeoff between "ease of use" and "ease of implementation" and since this is a convenience anyway it seems like we are designing something we want to maximize the "ease of use" dimension. Those are my thoughts anyway, open to other perspectives of course (as in, perhaps the development time is not worth the improvement?)
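A minimal sketch of what a class-decorator tag system could look like, assuming a simple in-memory registry. The names `tags`, `TAG_REGISTRY`, `analyses_with`, and the example classes are all hypothetical, not the utility being planned here.

```python
from collections import defaultdict

# Hypothetical registry: tag name -> list of analysis classes carrying that tag.
TAG_REGISTRY = defaultdict(list)


def tags(*names):
    """Class decorator that records the decorated class under each tag."""
    def decorate(cls):
        cls.tags = frozenset(names)
        for name in names:
            TAG_REGISTRY[name].append(cls)
        return cls
    return decorate


def analyses_with(tag):
    """Return the names of all registered classes with the given tag."""
    return sorted(cls.__name__ for cls in TAG_REGISTRY.get(tag, []))


@tags("core", "metabolism")
class ExampleFluxPlot:
    pass


@tags("core")
class ExampleMassPlot:
    pass


print(analyses_with("core"))
print(analyses_with("metabolism"))
```

Deleting an analysis file removes its tags with it, which is the maintenance benefit argued for above. The cost is that registration only happens when the module defining the class is imported, so selecting by tag means importing every analysis module first.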

1fish2 commented 5 years ago

Yes, I suspect the development work (including adding all the tags) is more than the payoff.

tahorst commented 5 years ago

We might also want to consider size and computation time when deciding on the file format. I noticed it takes a long time for some plots to save the SVG and PDF versions. For example, transcriptionEvents with 8 gens took me 2:43 to complete, vs 1:47 with PDF disabled, 1:14 with SVG disabled, and just 23 seconds with both disabled. The SVG output is also 79.5 MB vs 4.5 MB for the PDF file. This makes me think PDF is a better default but would limit us if we're trying to embed in webpages, which I don't think should be a priority at this time. Either way, settling on just one format should save us a decent amount of computation time.
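A comparison like this could be repeated per format with a small timing helper. This is only a sketch: `time_formats` is a hypothetical name, and the `save` callable stands in for whatever actually writes the plot.

```python
import time


def time_formats(save, formats):
    """Call save(fmt) once per format and return {fmt: elapsed seconds}."""
    timings = {}
    for fmt in formats:
        start = time.perf_counter()
        save(fmt)
        timings[fmt] = time.perf_counter() - start
    return timings


# Stand-in save function; a real run would write the figure instead.
written = []
timings = time_formats(written.append, ["pdf", "svg", "png"])
for fmt, seconds in timings.items():
    print(f"{fmt}: {seconds:.6f} s")
```

With matplotlib, `save` could be something like `lambda fmt: fig.savefig(f"plot.{fmt}")`, assuming `fig` is an existing figure.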

1fish2 commented 5 years ago

Proposal: Everyday manual runs only need one output format for viewing, selected for (1) enough resolution to review manually, then (2) analysis speed, then (3) space. Use a command line option or a nightly build to get the other formats, e.g. SVG to embed in a web page or PDF to print.
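One way the command line option could look, sketched with `argparse`. The flag names `--format` and `--all-formats` and the PDF default are assumptions for illustration; the thread has not settled on either.

```python
import argparse

ALL_FORMATS = ("pdf", "png", "svg")  # formats named in this thread
DEFAULT_FORMAT = "pdf"               # assumption; not yet decided


def parse_formats(argv=None):
    """Return the list of output formats requested on the command line."""
    parser = argparse.ArgumentParser(description="Select plot output formats.")
    parser.add_argument(
        "--format", dest="formats", action="append", choices=ALL_FORMATS,
        help="output format; repeat the flag to write several")
    parser.add_argument(
        "--all-formats", action="store_true",
        help="write every format, e.g. for a nightly build")
    args = parser.parse_args(argv)
    if args.all_formats:
        return list(ALL_FORMATS)
    # No --format flags given: fall back to the single default format.
    return args.formats or [DEFAULT_FORMAT]


print(parse_formats([]))
print(parse_formats(["--format", "svg", "--format", "png"]))
print(parse_formats(["--all-formats"]))
```

Everyday runs would then pass no flag and get a single format, while the nightly build passes `--all-formats`.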

jmason42 commented 5 years ago

@tahorst: The plot type used in that script is pretty weird - I've never used an 'event plot' before but it looks like they create a crazy number of short lines (in this context, I'm guessing a worst case scenario of millions of lines). A heat map might be a better replacement, though I'm only really guessing what this plot looks like. Regardless, I've noticed similar extreme differences between SVG and PDF file sizes.

prismofeverything commented 5 years ago

That is a kind of ridiculous size disparity, especially because svg is a subset of pdf. It sounds like a bad translation implementation. What svg library are we using?

How do the other plots compare? Is the size disparity comparable?

tahorst commented 5 years ago

I assume the default svg library. Don't know if we can specify a different one through matplotlib to get better performance. It looks like the svg versions of most other plots are about 3x the size of the pdf but a handful have higher multiples, generally the larger files.

jmason42 commented 5 years ago

You'd need to write your own backend if you wanted to swap SVG libraries.

I suspect that the PDF format represents numbers as bytes rather than strings, and furthermore, that there's some basic compression of vector objects. (Source diving confirms that the PDF backend is in fact compressing objects via zlib.) The example that @tahorst poses likely contains a large number of easily compressed, repeated line coordinates.

Incidentally, both the SVG and PDF backends appear to largely be written in pure Python (apart from e.g. their implicit matplotlib dependencies).
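The compression effect can be illustrated with zlib on synthetic SVG-style text. This is a rough sketch only: the segment markup and coordinates below are made up and are not the actual output of the matplotlib SVG backend.

```python
import zlib

# Synthetic stand-in for an event plot: many short vertical line segments
# written as SVG-style text. Coordinates are invented for illustration.
segments = [
    f'<path d="M {x % 500} {y} L {x % 500} {y + 1}" stroke="black"/>'
    for x in range(20000)
    for y in (10, 20, 30)
]
text = "\n".join(segments).encode("utf-8")
compressed = zlib.compress(text)

ratio = len(text) / len(compressed)
print(f"raw: {len(text)} bytes, zlib: {len(compressed)} bytes, ratio: {ratio:.1f}x")
```

The exact ratio depends on the data, but highly repetitive markup like this shrinks substantially under zlib.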

prismofeverything commented 5 years ago

You'd need to write your own backend if you wanted to swap SVG libraries.

Wouldn't be the first time : P

Yep, just tried this myself on a few plots, gzipped svg output is an identical size to pdf's (at least on the cases I tried):

-rw-rw-r-- 1 spanglry mcovert 6.3M Nov 11 10:16 allReactionFluxes.pdf
-rw-rw-r-- 1 spanglry mcovert 6.3M Nov 12 16:41 allReactionFluxes.svg.gz

Looks like around a third the size in general (started out as 19MB), so perhaps pdf's of vectorized images are just gzipped svg's? I wouldn't be surprised.

So we could gzip our svg's if we really care about output size, but we'd have to gunzip them before viewing. Or tar xzf the whole output directory. Or leave it in pdf but never be able to use the result outside of viewing. Everything is a tradeoff. I personally feel the ability to programmatically open and read/edit/write a file easily is better than having it smaller but opaque to introspection or use in other contexts, and that open standards are better than proprietary binary formats, but this is based on principle rather than any of the details about this particular use case (in which it doesn't seem like we will be editing these files after they have been created).

Either way, the png's are 1/20th the size of either so they win on raw tininess. If our main use case is viewing, then the png's are better than the pdf's for that; we could keep the svg's as our maximal resolution outputs, then compress/archive the svg directory if we care about size. Since we don't care about viewing the svg (that's what png's are for) it doesn't matter if that directory is compressed.
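As a sketch of the "still programmatically readable" point: a gzipped SVG can be parsed from Python without a separate gunzip step, using only the standard library. The helper names and the tiny example document are hypothetical.

```python
import gzip
import os
import tempfile
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def write_svgz(svg_text, path):
    """Write SVG markup to a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write(svg_text)


def count_elements(path, tag):
    """Parse a gzipped SVG directly and count elements with the given tag."""
    with gzip.open(path, "rb") as handle:
        root = ET.parse(handle).getroot()
    return sum(1 for _ in root.iter(f"{{{SVG_NS}}}{tag}"))


# Tiny made-up SVG with two line elements.
svg = (
    f'<svg xmlns="{SVG_NS}">'
    '<line x1="0" y1="0" x2="1" y2="1"/>'
    '<line x1="1" y1="0" x2="1" y2="2"/>'
    '</svg>'
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.svg.gz")
    write_svgz(svg, path)
    n_lines = count_elements(path, "line")
print(n_lines)
```

This keeps the compressed files usable from scripts; whether a given viewer opens compressed SVG directly is a separate question and varies by tool.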