These are more or less my thoughts on how to structure a bulk loader for chado. The majority of the ideas come from writing the obo2chado loader. It still lacks the design that I am aiming for now, but most of the upcoming loaders will follow it. The future plan is to refactor the obo loader into that mold.
Design
Scope and expectation
The input would be some sort of flat file.
The data will be loaded into a relational backend. This could definitely be generalized, but that is not being considered at the moment.
Reading data
There should be an object-oriented interface for reading data from flat files. That object is expected to be passed along to other classes. For example, for the obo2chado loader I have used the ONTO-Perl module.
Database interaction
Probably one of the most important parts. It's better to have an ORM that supports multiple backends and bulk loading. For Perl code, I have used BCS, a DBIx::Class layer for the chado database.
Loading in the staging area
This part is supposed to get the data from the flat file into the temp tables of the RDBMS. To start with, let's assign a class which will manage everything related to this task. Here are the responsibilities that I could think of:
Create temp tables.
Create indexes/constraints in temp tables as necessary.
Drop temp tables if necessary (probably not needed).
Load data into those temp tables, preferably in bulk mode. If there are multiple data sections going to different temp tables and they are independent, then the loading could be parallelized.
Provide some sort of interface to give a ballpark figure for the number of rows in those temp tables.
Remember, there will be a separate manager class for each backend. However, they should all share an identical interface.
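To make the "identical interface" idea a bit more concrete, here is a minimal sketch in plain Perl of how one could check that every backend-specific manager provides the same methods. All package names are made up for illustration, and row_count is a hypothetical name for the row estimate method; the stub bodies only stand in for real backend-specific SQL.

```perl
use strict;
use warnings;

package StagingManager::Interface;    # hypothetical name

# Methods every backend-specific manager is expected to provide.
# row_count is an assumed name for the "ballpark number of rows" method.
our @REQUIRED = qw(create_tables create_indexes drop_tables load_data row_count);

# Return the names of the required methods that $class does not implement.
sub missing_methods {
    my ($class) = @_;
    return grep { !$class->can($_) } @REQUIRED;
}

package StagingManager::Postgres;     # hypothetical, complete backend

sub new            { my ($class) = @_; return bless {}, $class }
sub create_tables  { return 1 }       # backend-specific SQL would go here
sub create_indexes { return 1 }
sub drop_tables    { return 1 }
sub load_data      { return 1 }
sub row_count      { return 0 }

package StagingManager::SQLite;       # hypothetical, incomplete backend

sub new           { my ($class) = @_; return bless {}, $class }
sub create_tables { return 1 }
sub load_data     { return 1 }

package main;

# Report which backends satisfy the shared interface.
for my $backend (qw(StagingManager::Postgres StagingManager::SQLite)) {
    my @missing = StagingManager::Interface::missing_methods($backend);
    print "$backend: ", (@missing ? "missing @missing" : "ok"), "\n";
}
```

In a real loader a role system such as Moose roles with requires would probably do this job better; the sketch only shows the contract.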
Now let's figure out what kind of information the class needs in order to perform those tasks.
An ORM/database object for all database-centric tasks. If it's an ORM, then it had better provide access to some bulk mode operation, or at least to low-level objects that support bulk loading.
So, let's take a first pass at the interface. First, the fields/attributes.
Attributes
schema:
chunk_threshold: I kind of threw this in; it will be used for bulk loading in chunks. For details, check the methods section below.
Methods
create_tables:
drop_tables:
create_indexes:
load_data:
add_data: This would more or less add a row of data (a data_object) to the manager. The manager caches the rows until the cache grows above the threshold, at which point load_data is invoked.
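    my $iter = $data_file->iterator;
    while (my $data_row = $iter->next) {
        $staging_loader->add_data($data_row);
        # Flush the cache to the temp tables once it grows past the threshold.
        if ($staging_loader->entries_in_cache > $staging_loader->chunk_threshold) {
            $staging_loader->load_data;
            $staging_loader->clear_cache;
        }
    }
    # Load whatever is still cached after the last row.
    if ($staging_loader->entries_in_cache) {
        $staging_loader->load_data;
        $staging_loader->clear_cache;
    }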
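For what it's worth, here is a rough sketch of what the manager side of that loop could look like. It is not the actual obo2chado code: the StagingLoader class name, the rows_loaded counter and the bulk_insert code ref (standing in for whatever bulk operation the ORM provides) are all assumptions made for illustration. Only add_data, load_data, clear_cache, entries_in_cache and chunk_threshold come from the design above.

```perl
use strict;
use warnings;

package StagingLoader;    # hypothetical class name

sub new {
    my ($class, %arg) = @_;
    my $self = {
        chunk_threshold => $arg{chunk_threshold} // 1000,
        # Stand-in for the ORM's bulk operation (an assumption of this sketch).
        bulk_insert     => $arg{bulk_insert} || sub { },
        cache           => [],
        rows_loaded     => 0,
    };
    return bless $self, $class;
}

sub chunk_threshold  { return $_[0]{chunk_threshold} }
sub entries_in_cache { return scalar @{ $_[0]{cache} } }
sub rows_loaded      { return $_[0]{rows_loaded} }

# Cache one row; nothing touches the database here.
sub add_data {
    my ($self, $row) = @_;
    push @{ $self->{cache} }, $row;
    return;
}

# Hand all cached rows to the bulk operation in one go.
sub load_data {
    my ($self) = @_;
    return unless $self->entries_in_cache;
    $self->{bulk_insert}->($self->{cache});
    $self->{rows_loaded} += $self->entries_in_cache;
    return;
}

sub clear_cache {
    my ($self) = @_;
    $self->{cache} = [];
    return;
}

package main;

my @chunks;    # records the size of every chunk passed to bulk_insert
my $loader = StagingLoader->new(
    chunk_threshold => 2,
    bulk_insert     => sub { push @chunks, scalar @{ $_[0] } },
);

for my $id (1 .. 5) {
    $loader->add_data({ id => $id });
    if ($loader->entries_in_cache > $loader->chunk_threshold) {
        $loader->load_data;
        $loader->clear_cache;
    }
}
# Flush the rows left over after the last full chunk.
if ($loader->entries_in_cache) {
    $loader->load_data;
    $loader->clear_cache;
}

print "chunks: @chunks, rows loaded: ", $loader->rows_loaded, "\n";
# prints "chunks: 3 2, rows loaded: 5"
```

Keeping load_data and clear_cache separate mirrors the loop above; folding the clearing into load_data would be an equally reasonable choice.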
In addition, we need some helper classes that could take on the following responsibilities:
Passing information back and forth between the loader and the database.
Data transformation.
Managing data caches.
However, these are not set in stone and could vary from loader to loader. But it's important to share the helper classes across the different backend-specific manager classes. So, the helper classes should have a defined interface.
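As one possible shape for such a helper, here is a small sketch of a lookup cache that remembers values fetched from the database, so that any backend-specific manager could reuse it. The LookupCache class, its get method and the fetch code ref (standing in for a real database query) are all hypothetical, and the term names and ids are made-up sample data.

```perl
use strict;
use warnings;

package LookupCache;    # hypothetical helper class

sub new {
    my ($class, %arg) = @_;
    # 'fetch' is a code ref standing in for a real database query.
    my $self = { fetch => $arg{fetch}, cache => {}, hits => 0, misses => 0 };
    return bless $self, $class;
}

# Return the cached value for $key, calling the backend only on a miss.
sub get {
    my ($self, $key) = @_;
    if (exists $self->{cache}{$key}) {
        $self->{hits}++;
        return $self->{cache}{$key};
    }
    $self->{misses}++;
    return $self->{cache}{$key} = $self->{fetch}->($key);
}

sub hits   { return $_[0]{hits} }
sub misses { return $_[0]{misses} }

package main;

# Made-up sample data in place of a database table.
my %fake_db = (is_a => 1, part_of => 2);
my $lookup  = LookupCache->new(fetch => sub { return $fake_db{ $_[0] } });

my @ids = map { $lookup->get($_) } qw(is_a part_of is_a is_a);
print "ids: @ids, hits: ", $lookup->hits, ", misses: ", $lookup->misses, "\n";
# prints "ids: 1 2 1 1, hits: 2, misses: 2"
```

Because the helper only sees a code ref, the same class works no matter which backend the manager talks to.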