GeneMANIA / pipeline

GeneMANIA data processing pipeline

simplify organism data merging #19

Open kzuberi opened 9 years ago


A current bottleneck in the pipeline is merging the separate per-organism datasets into a single multi-organism dataset. Historically, the data design placed items such as genes and networks for all organisms into shared database tables keyed by an internal ID. This created a unified ID space for looking up the data, but it also caused interdependencies between organisms when building a dataset. The revised pipeline allows each organism to be built independently and adds a merge step that handles the interdependencies.

However, the merge step duplicates data (it simply re-indexes files to map them into a unified ID space), wasting time and disk space during the build, and it requires re-running every data-processing step that follows it (the merge runs at the generic_db level).
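To make the re-indexing concrete, here is a minimal sketch of the kind of ID remapping the merge step performs. This is illustrative only: the function name, row shapes, and offset-based scheme are assumptions, not the actual generic_db schema or merge implementation.

```python
def merge_id_spaces(datasets):
    """Merge per-organism ID spaces into one unified ID space.

    datasets: a list (one entry per organism) of lists of
    (local_id, value) rows, where local_id is an integer ID
    assigned independently within each organism's build.

    Returns (merged_rows, id_maps), where id_maps[i] maps
    organism i's local IDs to their new unified IDs.

    NOTE: hypothetical sketch of an offset-based remap, not the
    pipeline's actual merge code.
    """
    merged = []
    id_maps = []
    offset = 0
    for rows in datasets:
        id_map = {}
        for local_id, value in rows:
            unified_id = offset + local_id
            id_map[local_id] = unified_id
            merged.append((unified_id, value))
        id_maps.append(id_map)
        # advance past this organism's largest local ID so the
        # next organism's IDs cannot collide
        if rows:
            offset += max(local_id for local_id, _ in rows) + 1
    return merged, id_maps


# Example: two organisms whose local IDs both start at 0.
human = [(0, "BRCA1"), (1, "TP53")]
yeast = [(0, "GAL4")]
merged, maps = merge_id_spaces([human, yeast])
# merged is [(0, "BRCA1"), (1, "TP53"), (2, "GAL4")]
```

Because every row is rewritten with a new ID, the whole dataset is effectively copied, which is the duplication cost described above; it also invalidates any derived files built from the old per-organism IDs, forcing the downstream steps to re-run.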

This could be improved in a couple of ways:

The 'right' way (but could affect users, requires care):

The 'wrong' way (but won't affect users):