cerebis / meta-sweeper

Parametric sweep of simulated microbial communities and metagenomic sequencing.
GNU General Public License v3.0

Adopt a space efficient graph interchange format #47

Open cerebis opened 7 years ago

cerebis commented 7 years ago

Though flexible, GraphML is not a space-efficient format (shock! XML). This becomes a real consideration with larger projects. Some particularly compact formats exist (e.g. Sparse6), but they often achieve that compactness by leaving out many facilities.

Frequent limitations include:

Without these abilities, information would have to be moved to secondary files. Does this accomplish anything in terms of storage?

Side-effects:

Short-Term

As a short-term solution, GML is a more compact text format and supports what we require. It is actually more readable than GraphML, as it has little cruft.

Long-Term

Adopting surrogate keys requires storing details such as the source genomic entity (name, coords), because there will come a time in a workflow when "where" things came from matters to the researcher. Adopting finer-scale graphs, such as SNV-based ones, would greatly increase graph order, making a change to integer surrogates increasingly beneficial. As the scale is reduced, the nodes themselves become increasingly abstract anyhow; i.e. the names dealt with now (contig names) are comfortable, but will not be a long-term endpoint in HiC analysis.
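To make the surrogate-key idea concrete, here is a minimal sketch in plain Python. It is only an illustration under assumptions: the `Origin` fields, the `SurrogateIndex` class and the contig names are hypothetical and are not part of meta-sweeper. The graph itself would carry only small integers, while a side table keeps the "where did this come from" details.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Origin:
    """Source genomic entity for a node (hypothetical field names)."""
    name: str   # e.g. a contig name
    start: int  # start coordinate on that entity
    end: int    # end coordinate on that entity


class SurrogateIndex:
    """Maps each Origin to a stable integer key and back again."""

    def __init__(self):
        self._by_origin = {}  # Origin -> int
        self._origins = []    # int -> Origin (position is the key)

    def key_for(self, origin):
        """Return the integer key for origin, assigning one if it is new."""
        key = self._by_origin.get(origin)
        if key is None:
            key = len(self._origins)
            self._by_origin[origin] = key
            self._origins.append(origin)
        return key

    def origin_of(self, key):
        """Recover the source entity for an integer key."""
        return self._origins[key]


index = SurrogateIndex()
a = index.key_for(Origin("contig_0001", 0, 5000))
b = index.key_for(Origin("contig_0002", 0, 7200))

# Edges now refer to integers only; the weight is an arbitrary example value.
edges = {(a, b): 12.5}
```

The side table would still need to be persisted somewhere (node attributes or a secondary file), so the saving comes from not repeating long names on every edge rather than from dropping the information.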

koadman commented 7 years ago

are you already using gzio? how much smaller are the graph xml getting with compression?

cerebis commented 7 years ago

I've not got the numbers but it's significant. Later versions of NetworkX will un/compress a file automagically if its name ends in gz.

GML is a lot smaller. It's a lot like the gains between JSON and XML. You can also use bare YAML or JSON. You could probably discard a lot of newlines etc. as well.

The compactness achieved with Sparse6 is amazing, though the lack of support for real-valued edge weights is a big limitation.

I'll put up some file sizes for interest's sake. Thinking about things though, this topic is bigger than just persistent storage footprint.
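As a rough way of gathering such numbers without any project data, the sketch below uses only the Python standard library. It is an illustration under assumptions: the random graph, the contig-style names and the simplified GraphML-like and GML-like layouts are made up here and are not byte-for-byte what NetworkX writes, so the absolute sizes will differ from real files.

```python
import gzip
import random
import xml.etree.ElementTree as ET


def make_edges(n_nodes=200, n_edges=1000, seed=1):
    """Random weighted edges between hypothetical contig names."""
    rng = random.Random(seed)
    names = [f"contig_{i:05d}" for i in range(n_nodes)]
    edges = []
    for _ in range(n_edges):
        u, v = rng.sample(names, 2)
        edges.append((u, v, round(rng.random(), 4)))
    return names, edges


def to_graphml_like(names, edges):
    """Simplified GraphML-style XML as bytes."""
    root = ET.Element("graphml")
    ET.SubElement(root, "key", {"id": "d0", "for": "edge",
                                "attr.name": "weight", "attr.type": "double"})
    graph = ET.SubElement(root, "graph", edgedefault="undirected")
    for name in names:
        ET.SubElement(graph, "node", id=name)
    for u, v, w in edges:
        edge = ET.SubElement(graph, "edge", source=u, target=v)
        ET.SubElement(edge, "data", key="d0").text = str(w)
    return ET.tostring(root)


def to_gml_like(names, edges):
    """Simplified GML-style text as bytes, with integer ids and name labels."""
    ids = {name: i for i, name in enumerate(names)}
    lines = ["graph ["]
    for name, i in ids.items():
        lines.append(f'  node [ id {i} label "{name}" ]')
    for u, v, w in edges:
        lines.append(f"  edge [ source {ids[u]} target {ids[v]} weight {w} ]")
    lines.append("]")
    return "\n".join(lines).encode("ascii")


names, edges = make_edges()
sizes = {}
for label, payload in (("graphml", to_graphml_like(names, edges)),
                       ("gml", to_gml_like(names, edges))):
    # Record (raw bytes, gzip-compressed bytes) for each format.
    sizes[label] = (len(payload), len(gzip.compress(payload)))
    print(f"{label:8s} raw={sizes[label][0]:7d}  gz={sizes[label][1]:7d}")
```

Part of the GML-like saving here comes from edges referring to integer ids rather than repeating names, which is the same effect the surrogate-key discussion is after.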


koadman commented 7 years ago

great. does that mean we can just gzip the graphml and get enough savings to make it work for big bird? (hehe) seems simplest that way.

cerebis commented 7 years ago

GraphML is sufficing for now; it's just that it is clearly not what you'd have settled on with the benefit of hindsight.
