bio4j / bio4j-titan

Titan-specific bio4j implementation
https://github.com/bio4j/bio4j

Improving performance of the importing process #38

Open pablopareja opened 9 years ago

pablopareja commented 9 years ago

Perhaps we should put in practice some of the recommendations here:

http://s3.thinkaurelius.com/docs/titan/current/bulk-loading.html

@eparejatobes What do you think?

eparejatobes commented 9 years ago

sure, but what are we doing right now?

pablopareja commented 9 years ago

Nothing special I guess... :smiley:

Here is the only configuration detail we set up explicitly:

conf.setProperty("autotype", "none");

eparejatobes commented 9 years ago

you had autotype on?? :fearful:

pablopareja commented 9 years ago

nooo! read it again, "none" :smile:

eparejatobes commented 9 years ago

ok ok sorry I misread

marina-manrique commented 9 years ago

@eparejatobes padawan

pablopareja commented 9 years ago

oi oi oi oi..... :stuck_out_tongue_closed_eyes:

eparejatobes commented 9 years ago

OK so

the other options do not seem all that relevant or useful here.

laughedelic commented 9 years ago

as far as I remember, the non-trivial part here was deciding when transactions need to be committed..

pablopareja commented 9 years ago

Well, about that, most of the time a transaction must simply be committed as soon as a new element is created. For instance, when importing Uniprot, whenever a new vertex such as an Interpro motif, Keyword, etc. is found for the first time, the transaction must be committed. This must be done before moving on to the next protein, otherwise redundant vertices would be created and everything would be a mess...
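
The commit rule described above can be sketched roughly as follows. This is only an illustration, not the bio4j import code: `CommitOnCreateSketch`, `getOrCreate` and the two in-memory maps are hypothetical stand-ins, and the sketch assumes (as the comment implies) that a lookup does not see vertices created in a transaction that is still open.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a graph store. "committed" plays the role of the
// data that lookups can see; "pending" is the still-open transaction.
// Assumption: lookups do NOT see uncommitted vertices.
final class CommitOnCreateSketch {
    private final Map<String, Long> committed = new HashMap<>();
    private final Map<String, Long> pending = new HashMap<>();
    private final boolean commitOnCreate;
    private long nextId = 0;
    private int commits = 0;

    CommitOnCreateSketch(boolean commitOnCreate) {
        this.commitOnCreate = commitOnCreate;
    }

    // Returns the id of the vertex for this key, creating it when the lookup misses.
    long getOrCreate(String key) {
        Long existing = committed.get(key);
        if (existing != null) {
            return existing;
        }
        long id = nextId++;
        pending.put(key, id);
        if (commitOnCreate) {
            commit(); // make the new vertex visible before the next protein is processed
        }
        return id;
    }

    void commit() {
        committed.putAll(pending);
        pending.clear();
        commits++;
    }

    long vertexCount() { return nextId; }

    int commitCount() { return commits; }

    public static void main(String[] args) {
        for (boolean commitOnCreate : new boolean[] {true, false}) {
            CommitOnCreateSketch graph = new CommitOnCreateSketch(commitOnCreate);
            graph.getOrCreate("Kinase"); // keyword referenced by a first protein
            graph.getOrCreate("Kinase"); // same keyword referenced by the next protein
            graph.commit();
            System.out.println("commitOnCreate=" + commitOnCreate
                    + " -> vertices created: " + graph.vertexCount());
        }
    }
}
```

With commit-on-create the second lookup finds the existing vertex; without it, the same keyword is created twice, which is the redundancy the comment warns about.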

eparejatobes commented 9 years ago

I think that there's not a lot of room for improvement with our current import code; any significant reduction in import times should come from rethinking our data import strategy. About this, we should be able to take advantage of the fact that we can reorder our writes as we please. We could do one sequential file read per element type, with the order being determined by a kind of grading on the types. For example, you can first load all protein nodes, so that when you later load all edges incident with them you don't need to worry about any check on that side. In general, this means replacing local commits at the level of instances of types with, so to speak, global ones, where you load all the instances of a given type at once and then "commit" (close connection etc).
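
A rough sketch of that type-ordered strategy, under loud assumptions: everything here is hypothetical (`TypeOrderedImportSketch`, `Entry`, the in-memory sets standing in for the graph), each loop stands in for one sequential read of the source file, and `commits++` stands in for a real commit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration: load one element type per pass, commit once per pass.
final class TypeOrderedImportSketch {
    // Hypothetical parsed record: a protein accession plus the keywords it references.
    static final class Entry {
        final String protein;
        final List<String> keywords;

        Entry(String protein, List<String> keywords) {
            this.protein = protein;
            this.keywords = keywords;
        }
    }

    final Set<String> proteins = new LinkedHashSet<>();
    final Set<String> keywords = new LinkedHashSet<>();
    final List<String> edges = new ArrayList<>();
    int commits = 0;

    void load(List<Entry> entries) {
        // Pass 1: all protein vertices, then a single "global" commit.
        for (Entry e : entries) {
            proteins.add(e.protein);
        }
        commits++;

        // Pass 2: all keyword vertices, deduplicated within the pass.
        for (Entry e : entries) {
            keywords.addAll(e.keywords);
        }
        commits++;

        // Pass 3: edges only. Both endpoints are known to exist already,
        // so no existence check is needed on either side.
        for (Entry e : entries) {
            for (String k : e.keywords) {
                edges.add(e.protein + "->" + k);
            }
        }
        commits++;
    }

    public static void main(String[] args) {
        TypeOrderedImportSketch importer = new TypeOrderedImportSketch();
        importer.load(Arrays.asList(
                new Entry("P1", Arrays.asList("Kinase", "Membrane")),
                new Entry("P2", Arrays.asList("Kinase"))));
        System.out.println(importer.proteins.size() + " proteins, "
                + importer.keywords.size() + " keywords, "
                + importer.edges.size() + " edges, "
                + importer.commits + " commits");
    }
}
```

The number of commits here depends on the number of element types, not on the number of instances, which is the point of the proposal.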

pablopareja commented 9 years ago

hey hey! I just got some results from this program: ImportUniprotGoTestTitan.java, which I implemented to check how much time is spent only on reading and parsing the XML file (that includes building all the temporary DOM structures needed to extract the specific data at each step of the program). Here you go:

Statistics for program ImportUniprotGoTest:
Input file: uniprot_sprot.xml
There were 546000 proteins inserted.
The elapsed time was: 0h 5m 3s

[ec2-user@ip-10-33-18-242 tests]$ cat ImportUniprotGoTestStats_uniprot_trembl.txt
Statistics for program ImportUniprotGoTest:
Input file: uniprot_trembl.xml
There were 83955074 proteins inserted.
The elapsed time was: 7h 21m 29s

If you go to the bottom of this doc page: https://github.com/bio4j/bio4j-titan/blob/master/docs/ImportingTitanBio4j.md you will find the different times spent on each of the modules (they are updated as soon as I get the numbers...)

As you can see, there's a big difference between the time spent only reading/parsing the XML file and the time spent doing the whole import, which means there should be plenty of room for improvement in DB writing performance.

WDYT?

eparejatobes commented 9 years ago

@pablopareja I don't quite understand those numbers; could you give a bit more detail?

pablopareja commented 9 years ago

I just added some explanatory text plus a few more recorded times; let me know if you still don't understand what they mean

eparejatobes commented 9 years ago

@pablopareja nice thanks. What I still don't see is what you mentioned above about XML processing time vs DB interaction

pablopareja commented 9 years ago

Yeah, what I meant is just that parsing the whole SwissProt (or TrEMBL) XML file actually does not take too long. That can be seen in the times obtained for, for example, ImportUniprotEnzymeDB, which takes around 5 minutes; that would imply that most of the time elapsed when running, for instance, ImportUniprot with either of those two source files is due to DB interaction...

eparejatobes commented 9 years ago

but where are those times for XML file juggling?? :cactus:

pablopareja commented 9 years ago

I didn't add those times here because they are part of tests. As I mentioned before, so far I have only implemented a test for the UniprotGo module, and these are the differences in time:

| UniprotGoTest SwissProt | UniprotGo SwissProt |
| --- | --- |
| 0h 5m 3s | 2h 20m 35s |

Here's the test program: ImportUniprotGoTestTitan.java
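
To make the gap explicit, here is a small worked calculation on the two SwissProt times quoted above. It is only arithmetic on those figures, assuming the two runs are otherwise comparable; the class and method names are hypothetical.

```java
import java.time.Duration;

// Hypothetical helper: how much of the full UniprotGo run is NOT explained
// by the parse-only test run, using the two times quoted in the thread.
final class ImportTimeBreakdown {
    static Duration hms(long hours, long minutes, long seconds) {
        return Duration.ofHours(hours).plusMinutes(minutes).plusSeconds(seconds);
    }

    // Fraction of the full run left over after subtracting the parse-only time.
    static double nonParsingShare(Duration parseOnly, Duration full) {
        return 1.0 - (double) parseOnly.getSeconds() / full.getSeconds();
    }

    public static void main(String[] args) {
        Duration parseOnly = hms(0, 5, 3);   // UniprotGoTest, SwissProt
        Duration full = hms(2, 20, 35);      // UniprotGo, SwissProt
        Duration rest = full.minus(parseOnly);
        System.out.printf("time not spent parsing: %d s (%.1f%% of the run)%n",
                rest.getSeconds(), 100 * nonParsingShare(parseOnly, full));
    }
}
```

The difference is 8132 seconds (2h 15m 32s), so roughly 96% of the full run goes to something other than reading and parsing the XML, presumably DB interaction.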

eparejatobes commented 9 years ago

In that case we could certainly improve on that; we don't need any consistency checks etc because

  1. all proteins/GO terms are already there
  2. we are not creating anything linked with what we create at this step

pablopareja commented 9 years ago

So, how can we change that?

eparejatobes commented 9 years ago

you can switch on basically everything described in http://s3.thinkaurelius.com/docs/titan/current/bulk-loading.html

pablopareja commented 9 years ago

OK I had a look at that and I added

conf.setProperty("storage.batch-loading","true");

and

conf.setProperty("ids.block-size", "1500000");

where possible

eparejatobes commented 9 years ago

but only for that module right? also all types are created before etc?

pablopareja commented 9 years ago

"storage.batch-loading" only for modules that don't create any vertex and "ids.block-size" just for ImportUniprot

eparejatobes commented 9 years ago

sounds good!
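
For reference, the per-module split agreed above could be sketched like this. It is a hedged illustration only: `java.util.Properties` is used as a stand-in for whatever type `conf` has in the import code, and `configFor` with its two flags is hypothetical. The property keys and values are the ones quoted in this thread.

```java
import java.util.Properties;

// Hypothetical helper: choose bulk-loading settings per import module.
// java.util.Properties is a stand-in for the real configuration object.
final class ModuleConfigSketch {
    static Properties configFor(boolean createsVertices, boolean isImportUniprot) {
        Properties conf = new Properties();
        conf.setProperty("autotype", "none"); // already set explicitly before this issue
        if (!createsVertices) {
            // Only for modules that don't create any vertex.
            conf.setProperty("storage.batch-loading", "true");
        }
        if (isImportUniprot) {
            // Larger id blocks just for ImportUniprot.
            conf.setProperty("ids.block-size", "1500000");
        }
        return conf;
    }

    public static void main(String[] args) {
        System.out.println("edge-only module: " + configFor(false, false));
        System.out.println("ImportUniprot:    " + configFor(true, true));
    }
}
```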

eparejatobes commented 9 years ago

keep me posted :)