dictyBase / Modware-Loader

Various data munging and loading scripts for genome database

Design of loader #92

Closed cybersiddhu closed 10 years ago

cybersiddhu commented 10 years ago

Preamble

These are more or less my thoughts on how to structure a bulk loader for Chado. Most of the ideas come from writing the obo2chado loader. That loader still lacks the design I am aiming for now, but most of the upcoming loaders will follow it, and the plan is to refactor the obo loader into the same mold.

Design

Scope and expectation

There should be an object-oriented interface for reading data from flat files. The reader object is expected to be passed along to other classes. For example, for the obo2chado loader I have used the ONTO-Perl module.
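
As a rough illustration of such a reader interface, here is a minimal sketch in core Perl. The class names `Sketch::FlatFile` and `Sketch::FlatFile::Iterator`, the tab-delimited format with a header line, and the sample rows are all assumptions made for the example. This is not how ONTO-Perl works; it only shows the shape of an object that hands out rows one at a time and can be passed to other classes.

```perl
use strict;
use warnings;

# Hypothetical reader for a tab-delimited flat file with a header line.
package Sketch::FlatFile;

sub new {
    my ($class, %args) = @_;
    return bless { fh => $args{fh} }, $class;
}

# Reads the header line and returns an iterator over the remaining rows.
sub iterator {
    my ($self) = @_;
    my $fh     = $self->{fh};
    my $header = <$fh>;
    chomp $header;
    my @columns = split /\t/, $header;
    return Sketch::FlatFile::Iterator->new(fh => $fh, columns => \@columns);
}

# Hypothetical iterator: one call to next() yields one row.
package Sketch::FlatFile::Iterator;

sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}

# Returns the next row as a hash reference keyed by column name,
# or undef once the file is exhausted.
sub next {
    my ($self) = @_;
    my $fh   = $self->{fh};
    my $line = <$fh>;
    return unless defined $line;
    chomp $line;
    my %row;
    @row{ @{ $self->{columns} } } = split /\t/, $line;
    return \%row;
}

package main;

# In-memory "file" so the sketch runs without touching disk.
my $text = "id\tname\nGO:0008150\tbiological_process\nGO:0003674\tmolecular_function\n";
open my $fh, '<', \$text or die "cannot open in-memory file: $!";

my $data_file = Sketch::FlatFile->new(fh => $fh);
my $iter      = $data_file->iterator;
my @names;
while (my $data_row = $iter->next) {
    push @names, $data_row->{name};
    print "$data_row->{id}\t$data_row->{name}\n";
}
```

Because the consumer only calls `iterator` and `next`, the same loading code could in principle be fed by any parser that exposes those two methods.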

Database interaction

Probably one of the most important parts. It's better to have an ORM that supports multiple backends and bulk loading. For Perl code, I have used BCS (Bio::Chado::Schema), a DBIx::Class layer for the Chado database.

Loading in the staging area

This part is supposed to get the data from the flat file into temporary tables in the RDBMS. To start with, let's assign a class that will manage everything related to this task. Here are the responsibilities that I could think of:

Now let's figure out what kind of information the class needs in order to perform those tasks.

So, let's take a first pass at the interface. First, the fields/attributes:

Attributes

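# Feed rows into the staging loader, flushing the cache in batches
my $iter = $data_file->iterator;
while (my $data_row = $iter->next) {
    $staging_loader->add_data($data_row);
    if ($staging_loader->entries_in_cache > $staging_loader->cache_threshold) {
        $staging_loader->load_data;
        $staging_loader->clear_cache;
    }
}
# Load whatever is still cached after the last row
if ($staging_loader->entries_in_cache) {
    $staging_loader->load_data;
    $staging_loader->clear_cache;
}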
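
To make the calls in the loop above concrete, here is a minimal sketch of a class that would satisfy them. The method names `add_data`, `entries_in_cache`, `cache_threshold`, `load_data` and `clear_cache` come from the loop. The package name `Sketch::StagingLoader`, the extra accessors `staged_rows` and `flushes`, and the in-memory array standing in for a temp table are assumptions for illustration only. A real implementation would do a bulk insert through the ORM inside `load_data`. The sketch also flushes any rows left in the cache once the input is exhausted.

```perl
use strict;
use warnings;

# Hypothetical staging loader; an array stands in for the temp table.
package Sketch::StagingLoader;

sub new {
    my ($class, %args) = @_;
    my $self = {
        cache_threshold => $args{cache_threshold} // 1000,
        cache           => [],    # rows waiting to be loaded
        staging         => [],    # stand-in for the RDBMS temp table
        flushes         => 0,     # number of load_data calls so far
    };
    return bless $self, $class;
}

sub cache_threshold  { return $_[0]->{cache_threshold} }
sub entries_in_cache { return scalar @{ $_[0]->{cache} } }
sub staged_rows      { return scalar @{ $_[0]->{staging} } }
sub flushes          { return $_[0]->{flushes} }

# Queue one row for the next batch.
sub add_data {
    my ($self, $row) = @_;
    push @{ $self->{cache} }, $row;
    return;
}

# Copy the cached batch into the stand-in staging table.
# Assumption: a real loader would issue a bulk insert here instead.
sub load_data {
    my ($self) = @_;
    push @{ $self->{staging} }, @{ $self->{cache} };
    $self->{flushes}++;
    return;
}

sub clear_cache {
    my ($self) = @_;
    $self->{cache} = [];
    return;
}

package main;

my $loader = Sketch::StagingLoader->new(cache_threshold => 3);
my @rows   = map { +{ id => $_ } } 1 .. 7;

for my $data_row (@rows) {
    $loader->add_data($data_row);
    if ($loader->entries_in_cache > $loader->cache_threshold) {
        $loader->load_data;
        $loader->clear_cache;
    }
}

# Final flush for rows that never crossed the threshold.
if ($loader->entries_in_cache) {
    $loader->load_data;
    $loader->clear_cache;
}

printf "%d rows staged in %d flushes\n", $loader->staged_rows, $loader->flushes;
```

Keeping the threshold as an attribute lets the batch size be tuned per backend without touching the loop.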

We also need some helper classes that could have the following responsibilities:

cybersiddhu commented 10 years ago

Migrated to dictybase developers blog.