HazyResearch / deepdive

DeepDive
deepdive.stanford.edu

Crash when running the has_spouse example with signalmedia-1m.jsonl #522

Closed lanphan closed 8 years ago

lanphan commented 8 years ago

Hi all, I'm trying out DeepDive and it works pretty well on the small dataset from the has_spouse example in the tutorial. Since DeepDive should support larger datasets, I downloaded the signalmedia-1m dataset (around 1 GB of data), used articles.tsv.sh (customized a little) to extract all the content into a full articles-1m.tsv file (around 210 MB), and tried running DeepDive again with that file.
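
For reference, such an extraction can be sketched roughly as follows (a sketch only; I'm assuming the JSONL exposes id and content fields, and the actual articles.tsv.sh may differ in detail):

    # sketch: flatten the 1M-article JSONL into a two-column TSV (id, content);
    # jq's @tsv filter escapes embedded tabs/newlines so each article stays on one line
    jq -r '[.id, .content] | @tsv' signalmedia-1m.jsonl > articles-1m.tsv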

The source data is rather small (210 MB) compared with my PC's configuration (macOS, 16 GB RAM, Core i7), but surprisingly it crashed very early, during document parsing. I think it was in the step that creates the sentences table, because lots of Java processes were running and consuming a great deal of RAM (around 2.2 GB each, 5 or 6 Java processes in total).

Is this the bug mentioned in #478? Do you have a plan to fix it?

Thanks in advance.

PS: Attached is a screenshot of my Postgres database; there's not much data there.

[screenshot: deepdive-postgresql]
netj commented 8 years ago

There is a known rebooting issue with the latest OS X (10.11) that we're looking into. It's caused by a component called mkmimo, which streams the data using non-blocking I/O; the latest OS X kernel seems to have trouble with a high rate of poll syscalls, possibly in combination with the hardware (MBP). We don't have a clean solution yet, but you can either use a Linux machine instead or tune the following environment variables to keep reasonable throughput while not crashing:

export THROTTLE_SLEEP_MSEC=10   # higher is safer but slower
export DEEPDIVE_NUM_PROCESSES=1 # lower is safer but slower
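
For example, you'd export these and then retry the failing step (a sketch; sentences is the parsing target name from the spouse tutorial):

    # sketch: throttle mkmimo's polling, then re-run the step that crashed
    export THROTTLE_SLEEP_MSEC=10
    export DEEPDIVE_NUM_PROCESSES=1
    deepdive redo sentences   # "sentences" per the spouse tutorial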

I'll update here once we find a fix for this.

lanphan commented 8 years ago

@netj Thanks for your quick response. However, as in #509, I see there is actually a hard-coded value for THROTTLE_SLEEP_MSEC in the macOS case:

    # OS specific workarounds via tweaking the environment
    case $(uname) in
        Darwin)
            # XXX mkmimo can reboot Mac unless its use of poll(2) is throttled
            export THROTTLE_SLEEP_MSEC=1
            ;;
    esac

Does that mean I need to customize the value of THROTTLE_SLEEP_MSEC inside DeepDive itself?

I'll try and report here soon.

netj commented 8 years ago

Yes, you'll have to remove that part from the installation for the moment. I'll push an update soon that gets rid of it, and hopefully a new version of mkmimo that mitigates this issue by default.
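
If you're not sure where that snippet lives, something like this should find it (a sketch; it searches from the parent of the util/ directory used below):

    # sketch: locate the hard-coded THROTTLE_SLEEP_MSEC=1 workaround so you
    # can remove it by hand
    grep -rn 'THROTTLE_SLEEP_MSEC=1' "$(deepdive whereis installed util/)/.."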

lanphan commented 8 years ago

@netj Hi Jaeho, I did as you said, but it still crashed after around 9 minutes. Is there any way to work around this bug? Should I increase THROTTLE_SLEEP_MSEC to 20 or 100?

Attached below is my Postgres DB; compared with the image above, the data has increased a little, but it's still stuck on the sentences table creation.

[screenshot: Postgres DB, 2016-03-15 at 11:37 AM]
netj commented 8 years ago

@lanphan You can increase the THROTTLE_SLEEP_MSEC parameter to make it less likely to crash, but the throughput will become awful. Since you're eagerly looking for a solution, let's try some workarounds we currently have. These all involve replacing the mkmimo executable installed under util/ of your DeepDive installation.

First, let's keep a backup:

(set -eu; cd $(deepdive whereis installed util/); cp -pf mkmimo mkmimo.orig)
  1. If you clone the fix-for-mac-reboots branch and run make, you get a replacement mkmimo executable.

    Actually, you can just run the following command to patch your installation, assuming deepdive is on your $PATH:

    (set -eu; git clone https://github.com/netj/mkmimo.git mkmimo-wip --branch fix-for-mac-reboots; cd mkmimo-wip; make; install -v mkmimo $(deepdive whereis installed util/mkmimo))

    With this one, you can use a higher value for THROTTLE_SLEEP_USEC (note this is *_USEC, in microseconds, not milliseconds) without sacrificing much throughput in some cases, e.g., export THROTTLE_SLEEP_USEC=100, which is 0.1 ms. A value of 10 gives good throughput but crashes quite often; you can try higher values like 1000, 10000, 20000, or even 100000 to be safe at the cost of some throughput.

  2. If it's still hard to find the right parameter that doesn't crash your Mac, or you just want something that works, try this dumb version written in bash. It's dumb and inefficient, incurring a lot of disk I/O, but it should get you through the data flow without crashing your Mac. You can download it and replace the util/mkmimo file with it, making sure you turn on the executable bit. The following command does exactly that:

    (set -eu; cd $(deepdive whereis installed util/); curl -fRLO https://github.com/netj/mkmimo/raw/bash-impl-poc/mkmimo.sh; chmod -v +x mkmimo.sh; install -v mkmimo.sh mkmimo)

Finally, if you want to restore the backed-up original, here's the one-liner:

(set -eu; cd $(deepdive whereis installed util/); install -v mkmimo.orig mkmimo)
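
To double-check which variant is currently in place, you can compare against the backup (a quick sketch):

    # sketch: report whether util/mkmimo still matches the backed-up original
    (cd "$(deepdive whereis installed util/)" &&
      if cmp -s mkmimo mkmimo.orig; then echo "original"; else echo "patched"; fi)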

Hope this helps!

lanphan commented 8 years ago

@netj Before trying your proposed solution, I should report that I ran DeepDive with THROTTLE_SLEEP_MSEC=50. It ran well for around 15 minutes (I say "ran well" because CPU and RAM usage were under control: CPU at ~20%, RAM at ~4 GB, and there was only one Java process), but it still crashed after that. Data throughput was less than in my 2nd try above.

Now I'll try the new mkmimo patch and report back here soon. Thanks for your support.

lanphan commented 8 years ago

@netj Below are my results after trying your first approach (the fix-for-mac-reboots branch of mkmimo):

1st try, THROTTLE_SLEEP_USEC=1000: crashed after 1h37m; attached is my data in Postgres.

[screenshot: 4th_try]

2nd try, THROTTLE_SLEEP_USEC=500: crashed after 7m, with data throughput a little above the case in my first comment.

3rd try, THROTTLE_SLEEP_USEC=20000: crashed after 5m, with data throughput almost the same as my 2nd try above.

--> Conclusion: it seems the bug is still there. I don't know why my first try with USEC=1000 lasted 1 hour 37 minutes, while my second try (USEC=500 < 1000) and third try (USEC=20000 > 1000) both failed so quickly.

lanphan commented 8 years ago

@netj I think it runs OK with your dumb version written in bash (your approach 2). However, after running for 3 hours, I got an error from Postgres (issue #523).

Below is my quick comparison between the dumb version and the official mkmimo:

I'm going to evaluate DeepDive on Linux (Ubuntu) later this week.

lanphan commented 8 years ago

@netj I can run successfully with approach 2 now that issue #523 is fixed. However, it took me 8 hours to finish "deepdive do spouse_feature". Should I wait for your fix to improve the speed, Jaeho?

In the meantime, I'm switching to an Ubuntu desktop PC (32 GB RAM, Core i7) to see whether it runs well and faster.

lanphan commented 8 years ago

DeepDive ran very well on Ubuntu; it finished "deepdive do spouse_feature" in around 3h40m (using the official deepdive/mkmimo version, no patch). @netj I wonder whether I can use your first patch to set _USEC (10 to 100) and improve performance further?

netj commented 8 years ago

@lanphan Glad to hear that it works fine on Linux. There's no throttling done on Linux (those parameters default to zero), so the versions we tried on the Mac won't make much difference. They may actually give a marginal improvement, so there's no harm in trying. The same instructions apply.

If you're using Postgres, increasing DEEPDIVE_NUM_PARALLEL_UNLOADS and DEEPDIVE_NUM_PARALLEL_LOADS from 1 to 3 or 4 may give you some more speedup.
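
For example (a sketch; has_spouse is the target name from the spouse tutorial):

    # sketch: unload/load with a few parallel streams, then re-run a step
    export DEEPDIVE_NUM_PARALLEL_UNLOADS=3
    export DEEPDIVE_NUM_PARALLEL_LOADS=3
    deepdive redo has_spouse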

lanphan commented 8 years ago

@netj Thanks for your tips, I'll try them soon. Can these parameters (DEEPDIVE_NUM_PARALLEL_UNLOADS and DEEPDIVE_NUM_PARALLEL_LOADS) be used with Greenplum too?

lanphan commented 8 years ago

@netj Setting DEEPDIVE_NUM_PARALLEL_UNLOADS=3 and DEEPDIVE_NUM_PARALLEL_LOADS=3 (running on Ubuntu) improved performance a lot: it took only 2h30m to finish "deepdive do has_spouse" (has_spouse is a step after spouse_feature, and "deepdive do spouse_feature" in my previous comment already took 3h40m). Are these parameters useful for other databases (Greenplum, Postgres-XL) as well?

netj commented 8 years ago

@lanphan Yes, those same flags work with different database drivers.