brmson / yodaqa

A Question Answering system built on top of the Apache UIMA framework.
http://ailao.eu/yodaqa

Setting Up Freebase Node: Strange Numbers #26

Closed k0105 closed 8 years ago

k0105 commented 8 years ago

Hi,

so I'm still setting up the Freebase node. Since the first approach is still running and taking forever (6.2 million seconds so far for 2 billion DB entries), I decided to try again. I built a second PC: also a quad-core, also 20GB RAM, this time a 3TB instead of a 2TB HDD, AMD instead of Intel, and Debian 8.2 instead of CentOS 7 - essentially the same specs, but it runs much faster [so I probably messed something up in the first run]. However, it is still confusing: it has now added 2,947,600,000 entries in 237,113 seconds (less than 66 hours), yet according to Google's webpage Freebase should only have 1.9 billion triples [cf. https://developers.google.com/freebase/data?hl=en - no warranty for external links]. Also, the average rate steadily declines: from about 25,000 I'm now down to about 12,500.

Is this still on track? Any idea how much longer this will take / how many entries there are in total?

Best wishes, Joe

pasky commented 8 years ago

Glad to hear the import is going better this time around.

I'm not sure what you mean by "entry", but some time ago I counted the lines in the Turtle file for an unrelated reason (boy, that took a long time too):

$ zcat /d-raw/freebase-rdf-2015-01-11.ttl.gz | wc -l
1811027188

So you really should see around this number of triples.
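
(A quicker way to get the same count, in case anyone repeats this: pigz can't fully parallelize gzip decompression, but it usually still beats plain zcat by a fair margin, if it happens to be installed:)

$ pigz -dc /d-raw/freebase-rdf-2015-01-11.ttl.gz | wc -l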

I think the import took about 3 days for me, but that included the sort time, and it sounds like you are still in the gradual import phase. What filesystem are you using? This should be a pretty much linear-time operation, I think, so the slowdown doesn't sound right to me. Does Fuseki use a lot of RAM?
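
(A few quick ways to check these things while the import runs - a sketch; the TDB path is a placeholder and iostat comes from the sysstat package:)

$ df -T /srv/fuseki/tdb   # filesystem type under the TDB data directory
$ free -h                 # overall RAM usage, including cache
$ iostat -dxm 5           # per-disk I/O load in 5-second intervals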

k0105 commented 8 years ago

Well, since this system is much faster, I can actually inspect system parameters. Here goes nothing:

Fuseki uses 15,913,xxx K, so roughly 14.9GB of 20GB is used. Regarding the numbers: I also don't know what an "entry" is, but that's what the terminal shows. For instance: INFO Add: 3,043,300,000 Data (Batch: 2,355 / Avg: 11,749)
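
(One way to watch that average over time, assuming the server output is being redirected to a file - fuseki.log here is just a placeholder name:)

$ tail -f fuseki.log | grep --line-buffered -o 'Avg: [0-9,]*'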

CPU is almost unused - 1-2%. RAM is at about 75%. What is really strange: /tmp only takes up 44.0KB. Since my main HDDs only have 100GB and 250GB respectively, I mapped /tmp onto a folder on the second internal 2/3TB HDD like this: mount --bind /media///tmp/ /tmp. This seems to work - when I enter /tmp I end up on the 3TB HDD, and when I manually create a file in /tmp it shows up on the secondary HDD as expected. There is still 1.7TB of free space (in addition to over 80% free space on the primary HDD), so that shouldn't be a problem. The filesystem is ext4 (I checked via df -T /dev/sdb1). Finally: the system is quite responsive, it doesn't seem challenged at all.
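
(For anyone replicating this, the bind mount plus a quick verification - /media/bigdisk/tmp is a placeholder for the real directory on the data disk:)

$ mkdir -p /media/bigdisk/tmp
$ mount --bind /media/bigdisk/tmp /tmp
$ findmnt /tmp            # should list the bind source on the big disk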

I found two interesting problems: according to iotop, IO is at 85-99%, so that seems to be the bottleneck. Also, it lists the process as java -Xmx1200M ..., which is of course totally wrong. Interesting - I definitely changed the number in the fuseki-server script to 6400M (I just verified this), so this got overridden somehow. I usually set it to 4096 globally [export _JAVA_OPTIONS="-Xmx4096m" and export JAVA_OPTS="-Xmx4096m" in .profile], but since I haven't done anything on this machine yet, I did NOT set this in .profile, and the default should indeed be around 1200M. Changing the script as instructed does not seem to be enough.
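
(If the launcher is the stock Apache Jena fuseki-server shell script, the heap default comes from its JVM_ARGS variable (JVM_ARGS=${JVM_ARGS:--Xmx1200M}), which would explain the 1200M showing up despite the edit. Exporting the variable before launch should override it - the TDB path and dataset name below are placeholders:)

$ export JVM_ARGS="-Xmx6400M"
$ ./fuseki-server --loc=/path/to/tdb /freebase
$ ps -o args= -p "$(pgrep -f fuseki)" | grep -o -- '-Xmx[0-9]*[MG]'   # verify the running JVM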

I really wish I could estimate my progress: is 3 billion much? Since I have enough other work, I guess I'll just let it run for a while. If there are 1.9 billion triples (as you confirmed), the worst case I can imagine is that each one results in three entries, i.e. 5.7 billion in total. I now have over 3 billion entries after 261,199 seconds (almost exactly 3 days), so if performance keeps decreasing like this, I should be at 5.7 billion in about 5 days.
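
(A quick sanity check on that estimate - back-of-the-envelope, using the worst-case total from above and the average rate from the log line:)

$ awk 'BEGIN { total = 5.7e9; done = 3.0e9; rate = 11749;
  printf "%.1f days at the current rate\n", (total - done) / rate / 86400 }'
2.7 days at the current rate

At a constant rate that's under 3 days; since the average keeps falling, roughly doubling it lines up with the 5-day guess.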

pasky commented 8 years ago

Hmm, let's see how it goes then. It's been a year now since I did that import, and I don't have the logs anymore. It might be that the Java memory issue is slowing things down because of garbage collection, but it might not be worth restarting at this point.

Just be aware that after the initial add-data phase, sorting will ensue and that'll take pretty long too. (Like - maybe a day? ...)

k0105 commented 8 years ago

Things are getting stranger: the add phase just finished with INFO Total: 3,130,753,066 tuples : 287,562.31 seconds : 10,887.22 tuples/sec

I'm now in the Index phase; Index SPO & Build SPO are done, Index POS is running now. We'll see how it goes, but I'm cautiously optimistic.

k0105 commented 8 years ago

OK, I've been busy for the last 12 hours, but it finished in the meantime. A couple of notes for anyone doing this at a later time:

a) Set export _JAVA_OPTIONS="-Xmx6400m" and export JAVA_OPTS="-Xmx6400m" manually; changing the script might not be enough.
b) 3-4 days on a normal quad-core or octa-core system with 20GB RAM is a good estimate - if it takes much longer, you might want to start worrying.
c) 3,130,753,066 tuples is a strange number, but apparently correct.
d) If you need to map /tmp to a secondary drive, mount --bind /media/.../.../tmp/ /tmp should work (see the fstab note below to make it permanent).
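
(To make the bind mount from d) survive reboots, an /etc/fstab entry along these lines should work - /media/bigdisk/tmp is a placeholder for the real directory on the data disk:)

/media/bigdisk/tmp  /tmp  none  bind  0  0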

Thanks much, Petr - as always, you've been a great help.