PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

where to speed Falcon up #345

Open dgordon562 opened 8 years ago

dgordon562 commented 8 years ago

Hi, Chris,

1) Forget about trying to speed up the 0-rawreads daligner (rj_*) step. daligner is running at 390% cpu, so it would not benefit from copying the .raw_reads.bps to /tmp. You were right and I was wrong about this. No benefit here. I guess you could, however, write the temporary las files to /tmp so that LAsort/merge can run on them there. That would help, although it is a minor issue compared to 2 and 3 below (It doesn't require copying .raw_reads.bps--just copying the L1 las files back to the network filesystem after all LAsort/merge are completed.)

2) The 0-rawreads L1/L2 LAsort/merge (rp_) is a bear on nfs-mounted file systems. In fact, in a typical assembly, it so overwhelms nfs that it times out, putting a few jobs in Eqw state so Falcon can't proceed. I don't know the solution to this. We're working on it. If you have a good idea, great.

3) The 0-rawreads LA4Falcon (c_) is THE WORST. It always has been. I originally copied all files to /tmp to run it there. That improved performance by 50x. More recently we cache .raw_reads.bps in memory (either using the linux buffers or using /dev/shm) and that gives another 10x improvement. I haven't yet put the latter into run.py so I have to kill Falcon, do some manual steps, and restart Falcon. These changes are very site-specific but absolutely necessary. If you can think of some general solution for everyone, great.

4) 1-ovl and 2-asm stages are so brief compared with those above that they probably aren't worth worrying about.

Best wishes, David

pb-cdunn commented 8 years ago

2) The 0-rawreads L1/L2 LAsort/merge (rp_)

We could add a concurrency setting for merge, separate from consensus. Merge is quick, but it hammers the filesystem. It probably should be run serially (concurrency=1).

Do you think we need to copy the inputs into /tmp? My theory is that the bigger problem is from writing into NFS a little at a time, since reading can be cached by the filesystem. Data would be helpful.

3) The 0-rawreads LA4Falcon (c_)

fc_consensus is pretty slow, so I'd hate to lose its concurrency. We'll have to prioritize running that in /tmp.

we cache .raw_reads.bps in memory (either using the linux buffers or using /dev/shm)

We altered fasta2DB to use open(file, "rm") instead of just "r" so it uses mmap. We could try that for .bps too.

dgordon562 commented 8 years ago

Do you think we need to copy the inputs into /tmp?

Dunno. Requires some performance testing. It certainly would speed up LAsort/merge. The question is whether it speeds it up enough to make up for the rsync time of copying to/from /tmp.

Would it solve the hammering of nfs-mounted filesystems? In our case this is so severe that it causes nfs to time out (never respond), so the LAsort/merge job goes into Eqw state (error) and is suspended. That means that even if all other LAsort/merge jobs complete, fc_run.py waits forever on this one Eqw job. Sometimes that occurs in the middle of the night; I notice it in the morning and continue the job, which completes quickly, and Falcon can then proceed. But you lose a lot of time that way. I have now set up a program that does nothing except check every 2 minutes for jobs in Eqw state and continue them. That should give you an idea of the magnitude of the problem for us.
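For reference, a minimal sketch of what such a babysitter can look like, assuming SGE's qstat/qmod (qmod -cj clears a job's error state); the awk field number depends on the local qstat output layout:

#!/bin/bash
# Sketch only: clear the error state of any job stuck in Eqw, every 2 minutes.
while true; do
    for jobid in $(qstat | awk '$5 == "Eqw" {print $1}'); do
        echo "$(date): clearing error state of job ${jobid}"
        qmod -cj "${jobid}"   # tell SGE to reschedule the job
    done
    sleep 120
done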

pb-cdunn commented 8 years ago

This is actually an internal priority as well, and we do appreciate your help in providing measurements and suggestions.

I now have set up a program that does nothing except check every 2 minutes for jobs in Eqw state and continue them.

That's interesting! The new process-watcher (used by fc_run1) can tell when a job has stalled or died because all jobs periodically write to a "heartbeat" file. But you're saying that a stalled job can actually be continued. We might be able to do that for you eventually. But of course, the main goal is to avoid hammering NFS in the first place.

dgordon562 commented 8 years ago

I've got a suggestion for the LAsort/merge stage of 0-rawreads.

First, just to give you an idea of how slow the LAsort/merge step is, look at this top output in which LAmerge is in top's D state (waiting on disk) and the processes are getting 8.6%, 4.0%, 3.3%, 3.0% and 2.6% cpu time (miserable).

118193 dgordon   20   0 8942m 4.3g 2.1g D  8.6  1.3   0:04.94 LAmerge                                   
118318 dgordon   20   0 4478m 1.2g 593m D  4.0  0.4   0:01.44 LAmerge                                   
118324 dgordon   20   0 4041m 407m 195m D  3.3  0.1   0:00.62 LAmerge                                   
118397 dgordon   20   0 3871m  51m  25m D  3.0  0.0   0:00.09 LAmerge                                   
118393 dgordon   20   0 4002m 244m 121m D  2.6  0.1   0:00.32 LAmerge    

I've never seen these at more than 11% cpu and often they are at less than 1% cpu.

My suggestion is to do what you used to do: Write the L2 and L3 las files to /tmp. Then copy the final raw_reads.las back to the network disk. It might be ok to copy the L1's to /tmp before you start, but then again it might be faster to read them off the network disk.

dgordon562 commented 8 years ago

Well, I just did a very rough experiment (checking what stage 10 rp processes were in), which shows that 90% of the rp process time is spent in the LAmerge that merges L1 files into L2 files. So I would say that my suggestion above might only make a small improvement. Perhaps tarring/copying/untarring the L1's to /tmp is the way to go.

dgordon562 commented 8 years ago

The 0-rawreads LAmerge step is bringing down our cluster: dozens of these jobs go into error state because both nfs and gpfs time out. People are complaining and having me suspend most of my jobs. I am proceeding as we speak to modify the rp*sub.sh script to copy the L1's to /var/tmp, run LAmerge there, and copy the single raw_reads.*.las back. I am doing this outside of fc_run.py for now, but might want to put it into fc_run.py later--any advice is welcome...
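A rough sketch of that kind of staging, where the scratch path and the "L1.1.*.las" naming pattern are placeholders (adapt them to the actual rp*sub.sh contents):

# Sketch only: stage the level-1 .las inputs to local scratch, merge there,
# and copy back only the single merged file.
SCRATCH=/var/tmp/$USER/falcon/m_00001
mkdir -p "$SCRATCH"
rsync -a L1.1.*.las "$SCRATCH"/                      # read the L1's off the network once
( cd "$SCRATCH" && LAmerge raw_reads.1 L1.1.*.las )  # merge on local disk
rsync -a "$SCRATCH"/raw_reads.1.las .                # write back only the merged result
rm -rf "$SCRATCH"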

pb-cdunn commented 8 years ago

It will end up in pwatcher, which you are not even using.

pb-cdunn commented 8 years ago

For now, we only write into /tmp. After a task has fully completed, we move all outputs back to NFS. That will solve your problem with restarts, which should rarely be needed now anyway.

You should have the 50x speed-up for LAmerge and fc_consensus. I remain skeptical of the extra 10x speed-up for LA4Falcon. It's possible that the primary cost is just caching the entire file, and that would be paid once even if we copied it to /tmp first. I'd like to see measurements of only LA4Falcon (sans fc_consensus) running several times sequentially, in both tmp and NFS. If you could try that on your own network, that would provide valuable data.
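For concreteness, the kind of measurement being asked for might look like this; the NFS path is a placeholder, and the -H15000/-fo options mirror the invocations shown later in this thread:

# Sketch only: time LA4Falcon alone (output discarded, so fc_consensus is not
# part of the measurement), several runs in a row, against an NFS copy and a
# local /var/tmp copy of the DB.
for prefix in /net/nfs/somewhere/raw_reads /var/tmp/$USER/raw_reads; do
    for run in 1 2 3; do
        ( time LA4Falcon -H15000 -fo "$prefix" raw_reads.1.las > /dev/null ) 2>> timings.txt
    done
done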

dgordon562 commented 8 years ago

Hi, Chris,

I haven't done LA4Falcon on either /tmp or nfs (or gpfs) for the last 6 months. My 10x performance increase came from running LA4Falcon in memory. That resulted in one of my assemblies going from 3 weeks to 3 days. I now copy the files to /dev/shm and run LA4Falcon from there.

But be aware that these performance improvements are only apparent when you are running ~100 jobs concurrently. If there are just a few jobs, you don't see the huge performance degradation of nfs. And I really don't want to sink our filesystem (as I used to) to get those stats. Long ago we did such a measurement and saw a 50x improvement (nfs vs /tmp, which we no longer use). But I didn't time just LA4Falcon--it was the whole c_ job.

I recently modified Falcon to rsync .raw_reads.bps to /dev/shm, and it works great so far... Yes, the speed-up is with LA4Falcon. In fact, all 8 fc_consensus workers are running at 100% cpu and they slow LA4Falcon down.

I hope this answers some of your questions.

Best wishes, David

pb-cdunn commented 8 years ago

I really don't want to sink our filesystem (as I used to) to get those stats.

I understand. But could you compare /dev/shm with /tmp? I doubt that shared memory is critical. I suspect that the problem is random-access of a large file in NFS.

dgordon562 commented 8 years ago

OK, I'm copying over the .raw_reads.bps. I'll do the test tomorrow barring crises.

dgordon562 commented 8 years ago

Hi, Chris, You've helped me out a lot in the past so I put a serious effort into getting you these numbers. I did a rough test (not perfect because I had problems running both tests on the same processor).

processor 1, having .raw_reads.db in /var/tmp:

time LA4Falcon -H15000  -fo /var/tmp/dgordon/falcon/test_dunn/raw_reads ../las_files/raw_reads.1.las >~/sunny/temp

which gave:

real    33m48.237s
user    0m48.499s
sys     0m27.420s

processor 2, having .raw_reads.db in /dev/shm:

time LA4Falcon -H15000  -fo /dev/shm/dgordon/falcon/test_dunn/raw_reads ../las_files/raw_reads.1.las >~/sunny/temp

gave:

real    1m19.552s
user    0m52.882s
sys     0m12.889s

I'm not sure which processor is faster, but you can tell that in the 1st case the real time is 33x as much as cpu time while in the 2nd case they are about equal.

This fits with my experience. If you still aren't convinced, and want me to do both experiments on the same processor, that will have to wait until they aren't loaded, which won't be for at least a week.

My point is that I get a 10x performance improvement keeping .raw_reads.bps in memory vs on local disk, and a 50x performance improvement of local disk over network disk. But, again, you only see these kinds of performance improvements with large assemblies.

Your thoughts?

Best wishes, David

pb-cdunn commented 8 years ago

I suspect that your local disk is not really a "local" disk.

Optionally using /dev/shm for inputs is certainly a possibility for FALCON. I just want to understand this first. If the files do not fit into physical memory, then /dev/shm is using virtual memory, in which case there should be no speed-up at all over the disk that is used for swap-space, which presumably is /var/tmp.

One other thing: Does your machine have /var/tmp? I rely on Python's tempfile.gettempdir() to tell me what to use. Could you run this? python -c 'import tempfile; print(tempfile.gettempdir())'

pb-cdunn commented 8 years ago

I suspect that your /var/tmp is actually on a network, possibly SAN (Storage Area Network, often used with Citrix virtualization, and known to be slow for cached files).

But here are a few more ideas/questions:

$ mount -l
...
none on /run/shm type tmpfs (rw,nosuid,nodev)
/dev/mapper/Ubuntu--14--vg-root on / type ext4 (rw,errors=remount-ro)

Those correspond to the filesystems listed by df earlier. They show that we are using tmpfs for /dev/shm (/run/shm) and ext4 for /dev/mapper/Ubuntu... (/tmp and /var/tmp).

And finally, would you mind repeating your test with a slightly altered version of LA4Falcon? I'd like to use open("raw_reads.bps", "r") instead of open("raw_reads.bps", "rm"), since memory-mapping might be the problem. That is in DAZZ_DB/DB.c, Fopen() (at line 82). Just change line 93 to use mode instead of newmode. Then somehow clean and rebuild DAZZ_DB and DALIGNER, and run with that version of LA4Falcon.

dgordon562 commented 8 years ago

Hi, Chris,

I got the data you wanted. First, the most important question: is /var/tmp really on the network? It is not. According to our system admins, each node has a single SATA disk which is used for both /tmp and /var/tmp and is not shared with any other node.

e219 (the node where I ran the /var/tmp test) has 132 GB of RAM.

Other jobs swapping? I don't think so--our sge requires jobs to reserve physical memory. If they use more than they reserve, they are killed, and sge doesn't allow more reservations than the amount of physical memory.

> swapon -s
Filename                                Type            Size    Used    Priority
/dev/dm-1                               partition       16777212        32648   -1
{dgordon}e219:~

>  ls -lhH .raw_reads.*
-r--r--r-- 1 dgordon eichler-vol26  88G Apr 21 09:25 .raw_reads.bps
-r--r--r-- 1 dgordon eichler-vol26 1.3G Apr 21 09:25 .raw_reads.idx

> ls -lhH ../las_files/raw_reads.1.las ~/sunny/temp
-r--r--r-- 1 dgordon eichler-vol26 5.5G May  3 21:06 ../las_files/raw_reads.1.las
-rw-rw-r-- 1 dgordon eichlerlab    7.0G May  5 18:43 /net/gs/vol1/home/dgordon/sunny/temp

How long did it take to copy .raw_reads.* into /dev/shm? 30 minutes (see below). But it saves much more time than that.

> time rsync -v --bwlimit=50000 .raw_reads.bps .raw_reads.idx raw_reads.db /dev/shm/dgordon/falcon/test_dunn
.raw_reads.bps
.raw_reads.idx
raw_reads.db

sent 94961896701 bytes  received 69 bytes  50958892.82 bytes/sec
total size is 94950305895  speedup is 1.00

real    31m2.205s
user    8m15.357s
sys     4m47.772s

> python -c 'import tempfile; print(tempfile.gettempdir())'
/tmp

We use /var/tmp rather than /tmp here because there is a program that deletes files from each, and it gives files in /var/tmp a longer lifetime than files in /tmp.

> df /var/tmp /tmp /dev/shm
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/system-root
                     944615844 230833316 665792156  26% /
/dev/mapper/system-root
                     944615844 230833316 665792156  26% /
tmpfs                 66039608        48  66039560   1% /dev/shm

> echo $TMPDIR
/tmp

> mount -l
/dev/mapper/system-root on / type ext4 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
/dev/sda1 on /boot type ext4 (rw)

Convinced yet? (grin)

Best wishes, David

pb-cdunn commented 8 years ago

I always believed you. I just needed to understand why. Those are very helpful data. But I'm still confused about this:

tmpfs                 66039608        48  66039560   1% /dev/shm

That's 66GB. I don't understand how you can fit 90GB into a 66GB filesystem. That's an open mystery for now.

So, it takes 30m to copy 90GB from nfs to shm or tmp. And it takes 30m to read that same DB from tmp. That means nfs is just as good as tmp for you. We might still improve network performance by copying-into/reading-from tmp, so we'll make that possible someday. But the gain seems to be much less than the gain for writing into tmp.

We'll make the tmpdir configurable, so you can use /var/tmp. Eventually, we'll also make the DB-dir configurable, so you can use /tmp, /var/tmp, or even /dev/shm. Not everyone has enough memory available to make it the default.

It's also important to note that your speed-up occurs only after the 30m cost of copying into /dev/shm on a given machine. So the biggest gain comes from running fc_consensus sequentially on a given machine with the DB in /dev/shm. So we need a mode which combines several (or all?) consensus jobs and runs them sequentially as a single pypeflow task. The question is: How many consensus tasks should be combined? You've shown that reading the DB via LA4Falcon takes only 1m. Do you have any idea how long fc_consensus.py takes after the DB has been copied to /dev/shm? (The command is piped: LA4Falcon -Hn -fo raw_reads foo.las | fc_consensus.py ...)
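For illustration, a combined task might look roughly like this; the block numbers, paths, and fc_consensus options are placeholders. The point is to pay the /dev/shm staging cost once and then run several per-block consensus jobs back to back on the same node:

# Sketch only: stage the DB into shared memory once, then run several
# consensus blocks sequentially against it.
SHM=/dev/shm/$USER/falcon
mkdir -p "$SHM"
rsync -a .raw_reads.bps .raw_reads.idx raw_reads.db "$SHM"/

for block in 1 2 3 4; do
    LA4Falcon -H15000 -fo "$SHM/raw_reads" raw_reads.${block}.las \
        | fc_consensus.py --n_core 8 > cns_${block}.fasta
done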

dgordon562 commented 8 years ago

Hi, Chris,

I don't combine the LA4Falcon tasks. What I do is restrict them to certain nodes. There are a variety of ways to do that with sge parameters, and I've tried several of them. Most recently I had some nodes designated with a requestable resource which we call "eichler_falcon_node". I make a hard request for it in qsub:

qsub -l eichler_falcon_node=TRUE

As each LA4Falcon job starts, it takes out a write lock and checks whether raw_reads.db is already in /dev/shm. If not, it copies the 3 big files there and then releases the lock; if they are already there, it releases the lock immediately. In either case it then runs LA4Falcon.
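A minimal sketch of that per-node staging step, using flock(1) for the write lock described above (the lock-file path, the BLOCK variable, and the fc_consensus options are placeholders):

# Sketch only: the first LA4Falcon job on a node copies the DB into /dev/shm
# under an exclusive lock; later jobs find it already there and skip the copy.
SHM=/dev/shm/$USER/falcon
mkdir -p "$SHM"
(
    flock -x 9                                # serialize staging on this node
    if [ ! -e "$SHM/raw_reads.db" ]; then
        rsync -a .raw_reads.bps .raw_reads.idx raw_reads.db "$SHM"/
    fi
) 9> "$SHM/.staging.lock"                     # lock is released when the subshell exits

LA4Falcon -H15000 -fo "$SHM/raw_reads" raw_reads.${BLOCK}.las | fc_consensus.py --n_core 8 > cns_${BLOCK}.fasta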

I think the reason my timing results show the same time for nfs vs /tmp is that nfs is not saturated. In real life (running Falcon--not just doing a single timing test), I see a 50x performance difference of /var/tmp over nfs.

You asked how long LA4Falcon | fc_consensus.py takes. It is roughly 1/2 hr with 8 workers each running at 100% cpu. fc_consensus.py is the bottleneck, but I'm not complaining--I'm ecstatic to get this done in 1/2 hr. If I used more workers, I would have fewer jobs scheduled, so I'm not sure throughput would increase.

I did look at your most recent changes. You still have "set -vex" and "LAmerge -v". On our cluster, that kills us. There is just so much output from all of those jobs to sge_log that nfs times out and the job is immediately put into error state, before any code runs. I strongly urge you to eliminate all of that output. I still have LAmerge -v, but I've changed the "set -vex" to "set -e".

Best wishes, David

pb-cdunn commented 8 years ago

I did look at your most recent changes. you still have "set -vex" and "LAmerge -v" .

Good point. I'll suppress that by default.

flowers9 commented 8 years ago

I guess you could, however, write the temporary las files to /tmp so that LAsort/merge can run on them there. That would help, although it is a minor issue compared to 2 and 3 below (It doesn't require copying .raw_reads.bps--just copying the L1 las files back to the network filesystem after all LAsort/merge are completed.)

Just to (belatedly) chime in, it does prevent having to read & write the entire stage-0 alignment data remotely once, which can be quite a lot of data. It also helps space out the writing load, as LAmerge/sort happening all at once tends to max out i/o channels if you let it. You don't even need to write the L1 files to the local disk - you can just have LAmerge take inputs locally and write output remotely.

pb-cdunn commented 8 years ago

@flowers9, I'll try to clarify your points and let you tell me if I've misunderstood anything.

it does prevent having to read & write the entire stage-0 alignment data remotely once

Each level of merge involves the same amount of data in .las files. By combining the first level with each daligner job, we have removed only 1 remote write (of typically about 3). The merges are staggered because the daligner jobs finish at different times, which is good. But they are restricted to the same resources used by the daligner jobs, which is not ideal, since we want different restrictions: for daligner jobs, we want a certain amount of CPU and memory available; for sort/merge, we want a certain amount of network bandwidth available. So we might want lower "concurrency" for sort/merge. But I think the natural staggering makes the current solution work pretty well.

You don't even need to write the L1 files to the local disk - you can just have LAmerge take inputs locally and write output remotely.

If you run with use_tmpdir = true, that will happen today. A daligner job will run in tmp, and the sort and first-level merge will also happen right there. Only the next level .las files are written back to the network. (That's what you mean by "written remotely", yes?)

Note that because of my simple solution, we only write into /tmp today. dgordon has suggested that we might see a significant improvement by first copying .bps into /dev/shm (shared memory, much faster than tmp), at least for the consensus tasks, since they have so much random access. But that's separate from sort/merge.

flowers9 commented 8 years ago

Yes, I meant you skip one round of reading/writing the las files in the primary directories (doing it instead in temporary directories). As each round touches each las file once, this means skipping one read/write of the entire las database, as I said.

use_tmpdir does pretty much address my issue (it's not an especially obvious option, so I hope you'll excuse my overlooking it ;). The current use_tmpdir is slightly inefficient, as LAsort will write the final files into the tmp directory, and then they are copied to the final directory. If the wrapper script were more integrated with the program script, it would be possible to avoid writing the final files to the temporary directory.

pb-cdunn commented 8 years ago

I see. You like the 100% win of writing purely intermediate files into the tmp directory, but you are skeptical for final files.

I assert that, in practice on a variety of users' filesystems, writing into tmp and copying to network is better than writing directly to network, even for serial writes. For example, in lustre the difference is huge, since lustre loves to work with large files. And I doubt that an extra write to tmp ever costs much.

... But it could compete with other jobs on the same machine. Is that your concern?