Open pablopareja opened 9 years ago
sure, but what are we doing right now?
Nothing special I guess... :smiley:
Here is the only configuration detail we set up explicitly:
conf.setProperty("autotype", "none");
you had autotype on?? :fearful:
nooo! read it again, "none" :smile:
ok ok sorry I misread
@eparejatobes padawan
oi oi oi oi..... :stuck_out_tongue_closed_eyes:
OK so
storage.batch-loading
I'm not so sure on this one, but I think it should be ok to enable it
ids.block-size
their advice is to set it to the number of vertices you expect to add per Titan instance per hour
the other options do not seem that relevant/useful here.
as far as I remember, the non-trivial part here was deciding when transactions need to be committed..
Well, about that, most of the time transactions must simply be committed as soon as a new element is created. For instance, when importing Uniprot, whenever a new vertex such as an Interpro motif, Keyword, etc... is found for the first time, the transaction must be committed. This must be done prior to moving to the next protein, otherwise redundant vertices would be created and everything would be a mess...
I think that there's not a lot of room for improvement with our current import code; any significant reduction in import times should come from rethinking our data import strategy. About this, we should be able to take advantage of the fact that we can reorder our writes as we please. We could do one sequential file read per element type, with the order being determined by a kind of grading on the types. For example, you can first load all protein nodes, so that when loading all edges incident with them you don't need to worry about any check on that side. In general, this means replacing local commits at the level of instances of types by, so to speak, global ones where you load all the instances of a given type at once, then "commit" (close connection etc).
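To make the idea above a bit more concrete, here is a minimal sketch of a type-by-type load. It is only an illustration: the class `TwoPassImportSketch`, its maps and the `commits` counter are hypothetical stand-ins for the real graph and its transactions, not Titan or Bio4j code, and a small in-memory map plays the role of the parsed XML file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of a type-by-type import; not the Titan or Bio4j API.
class TwoPassImportSketch {
    final Map<String, Integer> proteinIds = new HashMap<>();
    final Map<String, Integer> keywordIds = new HashMap<>();
    final List<int[]> edges = new ArrayList<>();
    int commits = 0;          // stands in for real transaction commits
    private int nextId = 0;

    // entries maps a protein accession to the keyword names attached to it.
    void load(Map<String, List<String>> entries) {
        // Pass 1: create every protein vertex, then commit once.
        for (String accession : entries.keySet()) {
            proteinIds.computeIfAbsent(accession, k -> nextId++);
        }
        commits++;

        // Pass 2: create every keyword vertex (deduplicated), then commit once.
        for (List<String> keywords : entries.values()) {
            for (String keyword : keywords) {
                keywordIds.computeIfAbsent(keyword, k -> nextId++);
            }
        }
        commits++;

        // Pass 3: both endpoints are known to exist, so edges need no existence checks.
        for (Map.Entry<String, List<String>> e : entries.entrySet()) {
            for (String keyword : e.getValue()) {
                edges.add(new int[] { proteinIds.get(e.getKey()), keywordIds.get(keyword) });
            }
        }
        commits++;
    }

    public static void main(String[] args) {
        Map<String, List<String>> entries = new LinkedHashMap<>();
        entries.put("P1", Arrays.asList("kinase", "membrane"));
        entries.put("P2", Arrays.asList("kinase"));

        TwoPassImportSketch sketch = new TwoPassImportSketch();
        sketch.load(entries);
        System.out.println("proteins=" + sketch.proteinIds.size()
                + " keywords=" + sketch.keywordIds.size()
                + " edges=" + sketch.edges.size()
                + " commits=" + sketch.commits);
    }
}
```

The number of commits here depends on the number of element types, not on the number of proteins. In a real import each pass would be its own sequential read of the file, which is the trade-off to weigh against the per-element commits we do now.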
hey hey! I just got some results from this program: ImportUniprotGoTestTitan.java that I implemented to check out the time that was spent only on reading and parsing the XML file (that includes building all the temporary DOM structures needed to extract the specific data at each step of the program). Here you go:
Statistics for program ImportUniprotGoTest:
Input file: uniprot_sprot.xml
There were 546000 proteins inserted.
The elapsed time was: 0h 5m 3s

[ec2-user@ip-10-33-18-242 tests]$ cat ImportUniprotGoTestStats_uniprot_trembl.txt
Statistics for program ImportUniprotGoTest:
Input file: uniprot_trembl.xml
There were 83955074 proteins inserted.
The elapsed time was: 7h 21m 29s
If you go to the bottom of this doc page: https://github.com/bio4j/bio4j-titan/blob/master/docs/ImportingTitanBio4j.md you will find the different times spent on each of the modules (they are updated as soon as I get the numbers...)
As you can see there's a big difference between the time spent only reading/parsing the XML file and the time spent doing the whole thing, which means that there should be a lot of room for improvement in terms of DB writing performance.
WDYT?
@pablopareja I don't quite understand those numbers; could you give a little more detail?
I just added some new explanation text plus a few more recorded times, let me know if you still don't understand what they mean
@pablopareja nice thanks. What I still don't see is what you mentioned above about XML processing time vs DB interaction
Yeah, what I meant is just that parsing the whole SwissProt (or TrEMBL) XML file actually does not take too long. That can be seen on the times obtained for example in ImportUniprotEnzymeDB which takes around 5 minutes; that would imply that the largest part of the time elapsed when executing for instance ImportUniprot with any of those two source files would be due to DB interaction...
but where are those times for XML file juggling?? :cactus:
I didn't add those times here because they are part of tests. As I mentioned before so far I only implemented a test for the module UniprotGo and these are the differences in time:
UniprotGoTest SwissProt | UniprotGo SwissProt |
---|---|
0h 5m 3s | 2h 20m 35s |
Here's the test program: ImportUniprotGoTestTitan.java
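For reference, the gap between the two SwissProt times above can be worked out with a few lines of Java. This is just the arithmetic on the figures in the table; the class and method names are made up for the example, and it assumes the test run and the full run do the same parsing work on comparable hardware.

```java
// Arithmetic on the two SwissProt timings quoted in this thread; names are illustrative only.
class ImportTimeBreakdown {
    static long toSeconds(long hours, long minutes, long seconds) {
        return hours * 3600 + minutes * 60 + seconds;
    }

    // How many times longer the full import takes than the parse-only run.
    static double slowdown(long fullSeconds, long parseSeconds) {
        return (double) fullSeconds / parseSeconds;
    }

    // Percentage of the full run that parsing alone does not account for.
    static double nonParsingShare(long fullSeconds, long parseSeconds) {
        return 100.0 * (fullSeconds - parseSeconds) / fullSeconds;
    }

    public static void main(String[] args) {
        long parseOnly = toSeconds(0, 5, 3);     // UniprotGoTest SwissProt
        long fullImport = toSeconds(2, 20, 35);  // UniprotGo SwissProt
        System.out.println("parse-only seconds: " + parseOnly);
        System.out.println("full import seconds: " + fullImport);
        System.out.println("slowdown factor: " + slowdown(fullImport, parseOnly));
        System.out.println("non-parsing share (%): " + nonParsingShare(fullImport, parseOnly));
    }
}
```

With these figures the full import takes roughly 28 times as long as the parse-only run, so about 96% of the elapsed time is spent on something other than reading and parsing the XML.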
In that case we could certainly improve on that; we don't need any consistency checks etc because
So, how can we change that?
you can switch on basically everything of what's described in http://s3.thinkaurelius.com/docs/titan/current/bulk-loading.html
OK I had a look at that and I added
conf.setProperty("storage.batch-loading","true");
and
conf.setProperty("ids.block-size", "1500000");
where possible
but only for that module right? also all types are created before etc?
"storage.batch-loading" only for modules that don't create any vertex and "ids.block-size" just for ImportUniprot
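A rough sketch of that per-module rule, in case it helps to see it written down. It uses plain `java.util.Properties` instead of our actual `conf` object so that it runs on its own; the helper `settingsFor` and the module name `SomeEdgeOnlyModule` are hypothetical, and only the three property keys and values come from this thread.

```java
import java.util.Properties;

// Illustrative only: java.util.Properties stands in for the real configuration object.
class BulkLoadSettingsSketch {

    // createsVertices says whether the import module adds new vertices to the graph.
    static Properties settingsFor(String module, boolean createsVertices) {
        Properties settings = new Properties();
        // The one detail that was already set explicitly.
        settings.setProperty("autotype", "none");
        // Batch loading only for modules that don't create any vertex.
        if (!createsVertices) {
            settings.setProperty("storage.batch-loading", "true");
        }
        // The larger id block only for ImportUniprot.
        if (module.equals("ImportUniprot")) {
            settings.setProperty("ids.block-size", "1500000");
        }
        return settings;
    }

    public static void main(String[] args) {
        System.out.println(settingsFor("ImportUniprot", true));
        // "SomeEdgeOnlyModule" is a made-up name for a module that only adds edges.
        System.out.println(settingsFor("SomeEdgeOnlyModule", false));
    }
}
```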
sounds good!
keep me posted :)
Perhaps we should put into practice some of the recommendations here:
http://s3.thinkaurelius.com/docs/titan/current/bulk-loading.html
@eparejatobes What do you think?