PathwayCommons / cpath2

Biological pathway data integration and access platform (Pathway Commons)
http://www.pathwaycommons.org/pc2/
MIT License

create-downloads: can we split BioPAX by organism, data source, etc. faster? #212

Closed IgorRodchenkov closed 8 years ago

IgorRodchenkov commented 9 years ago

It's already a multi-threaded process (see this and that methods).

We could use 20 or 30 threads in the ExecutorService instead of the current 10. We could also script the process to run on different machines at the same time...
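To make the thread-count idea concrete, here is a minimal sketch of a fixed-size pool whose size is a parameter rather than a hard-coded 10. It is not the actual cpath2 code: the class `SplitPoolSketch`, the helper `exportSubModel`, and the `split.threads` system property are all hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SplitPoolSketch {
    // Hypothetical stand-in for writing one sub-model (by organism or by data source).
    static String exportSubModel(String key) {
        return key + ".owl";
    }

    // Runs one export task per key on a pool of `threads` workers and
    // returns the results in the same order as the input keys.
    static List<String> exportAll(List<String> keys, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String key : keys) {
                tasks.add(() -> exportSubModel(key));
            }
            List<String> files = new ArrayList<>();
            // invokeAll blocks until every task has completed
            for (Future<String> f : pool.invokeAll(tasks)) {
                files.add(f.get()); // surfaces a task failure, if any
            }
            return files;
        } catch (Exception e) {
            throw new RuntimeException("export failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // "split.threads" is an assumed property name; 10 mirrors the current pool size
        int threads = Integer.getInteger("split.threads", 10);
        System.out.println(exportAll(List.of("reactome", "intact", "panther"), threads));
    }
}
```

Whether 20 or 30 threads actually help depends on whether the export is CPU-bound or I/O-bound, so the pool size is best left configurable and measured.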

Emek and Igor have just discussed another option. At first glance, it seems that we could at least try to produce and save the by-data-source BioPAX files (sub-models) during the Merge stage instead of during -create-downloads. For this, we would modify the Merger to process each data source independently (match and merge it with our UniProt+ChEBI BioPAX Warehouse) and save the intermediate result files before merging them all into the main model.

Still, we would have to generate the "Detailed" and by-organism BioPAX files during -create-downloads anyway. And this is not the only problem with the above optimization idea.

The main problem is that "data source" means two different things. By design, in Premerge and Merge (and on the PC2 Providers page) it means a cpath2 Metadata entry, referred to by metadata.identifier (e.g., "reactome_human", "intact_complex_human") and having a standard name such as "Reactome" or "IntAct". For the purpose of splitting the main model into archives in the downloads, for full-text search (using the filter &datasource=reactome), or for showing access logs, "data source" in fact means a data provider, organization, or authority, as defined by its standard name (e.g., "IntAct", "PANTHER Pathway"), regardless of how we configured the PC2 Metadata. For example, "IntAct" or "PANTHER Pathway" could be imported using either one or many metadata entries. We actually prefer the latter, i.e., having metadata.identifiers like "panther_human", "panther_mouse", "intact_human", "intact_complex_human", because this allows for applying very specific data Cleaners.
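The mismatch between the two meanings can be illustrated with a small sketch that groups metadata identifiers by standard name, which is the granularity the downloads and the search filter use. The `Entry` class below is a hypothetical, simplified stand-in for a cpath2 Metadata entry (identifier and standard name only), not the real class.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class DatasourceGroupingSketch {
    // Hypothetical, simplified view of a Metadata entry.
    static final class Entry {
        final String identifier;   // e.g. "intact_complex_human"
        final String standardName; // e.g. "IntAct"

        Entry(String identifier, String standardName) {
            this.identifier = identifier;
            this.standardName = standardName;
        }
    }

    // Groups metadata identifiers by the provider's standard name, so that
    // several Metadata entries collapse into one "data source" for downloads.
    static Map<String, Set<String>> byProvider(List<Entry> entries) {
        Map<String, Set<String>> groups = new TreeMap<>();
        for (Entry e : entries) {
            groups.computeIfAbsent(e.standardName, k -> new TreeSet<>()).add(e.identifier);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Entry> entries = List.of(
            new Entry("reactome_human", "Reactome"),
            new Entry("intact_human", "IntAct"),
            new Entry("intact_complex_human", "IntAct"),
            new Entry("panther_human", "PANTHER Pathway"),
            new Entry("panther_mouse", "PANTHER Pathway"));
        System.out.println(byProvider(entries));
    }
}
```

Under this assumption, per-Metadata sub-models saved during Merge would still have to be combined per standard name before they could serve as the by-data-source download archives.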

IgorRodchenkov commented 9 years ago

Builds ok now.