bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

(Optionally) Describe caveats for the use of network filesystems with the pipeline #153

Closed: lbeltrame closed this issue 10 years ago

lbeltrame commented 10 years ago

The reason for this issue is that, while reorganizing the layout of one of our two clusters, I attempted to use a distributed FS to reduce the load on a single machine. The issue that emerged is that the intensive (and fast) reads and writes of the bamprep, recalibration, and variant calling steps (plus the file transactions) make it unreliable unless there is very fast connectivity (fiber / InfiniBand as opposed to Gbit Ethernet).

Perhaps some information on the setups used with bcbio-nextgen should be collected to help with new deployments.

chapmanb commented 10 years ago

Luca; Great point. I added some documentation about networked file systems here. Please feel free to add details about your experience and setup to it:

https://bcbio-nextgen.readthedocs.org/en/latest/contents/parallel.html#io-and-network-file-systems

The blog post linked from that section has more details about the improvements made by getting faster connections. My experience is that splitting up processing into smaller file sections helps immensely over reading/writing large BAM files, with the latter being incredibly slow on systems with limited bandwidth.

The pipeline has been evolving to avoid disk IO as much as possible for exactly this reason. I'm not sure if you saw our latest post, but with the FreeBayes and GATK HaplotypeCaller pipelines you can skip the whole recal/realign step, which avoids a lot of this IO penalty:

http://bcbio.wordpress.com/2013/10/21/updated-comparison-of-variant-detection-methods-ensemble-freebayes-and-minimal-bam-preparation-pipelines/

I'm planning on having an optimized BAM-prep free version in the next release. Hope this helps, Brad
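
For reference, a minimal sketch of what the relevant algorithm section of a sample YAML could look like with BAM prep turned off; sample names and file paths are placeholders, and exact option names can vary between releases:

```yaml
# Hypothetical sample entry (file names and description are placeholders).
- files: [sample_R1.fastq.gz, sample_R2.fastq.gz]
  description: Sample1
  analysis: variant2
  genome_build: GRCh37
  algorithm:
    aligner: bwa
    variantcaller: freebayes
    recalibrate: false   # skip base quality score recalibration
    realign: false       # skip indel realignment
```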

lbeltrame commented 10 years ago

splitting up processing into smaller file sections helps immensely over reading/writing large BAM files, with the latter being incredibly slow on systems with limited bandwidth.

Indeed, it's a great help; however, it is ill-suited to distributed filesystems where network connectivity is not optimal (like my cluster, but it's a donated resource, so there's little I can do in that regard), as the reads/writes tend to cause out of sync scenarios. Not a flaw by any means, but a caveat to watch for.

with the FreeBayes and GATK HaplotypeCaller pipelines you can skip the whole recal/realign step, which avoids a lot of this IO penalty

Unfortunately I'm stuck doing paired tumor/normal work. ;) I'll need to investigate whether I can optimize the variant calling part for tumor/normal samples.

chapmanb commented 10 years ago

Luca; What do you mean by out of sync scenarios? Could you be more specific about the errors you're seeing? My experience on setups with less connectivity (1GigE) is that they'll be slower due to all the network traffic, but nothing beyond that in terms of errors. Are you finding that writing large files all at once does better than having it in sections?

lbeltrame commented 10 years ago

What do you mean by out of sync scenarios? Could you be more specific about the errors you're seeing?

Setup:

Two groups of eight servers (which are both compute and storage) with two glusterfs volumes (distributed, with a replica count of 2): one holding the indices and the large files, and another for the runs themselves (containing the "work" directory and the projects).

In total, 34 nodes do the computation. The connection is Gbit Ethernet, but the servers do not share the same enclosures.

What happens during bamprep and recalibration is that, for some reason, the intensive reads and writes trigger Gluster's self-heal, which then fails, leaving some elements of the chain with files that aren't replicated properly. The result is a slew of odd errors such as missing directories or permission problems.

The fact that I can't separate compute from storage makes matters worse (load averages around 24 when running the pipeline).

It may well be a setup issue, but as I said, these are donated resources with no support whatsoever, and system administration falls entirely on the people doing bioinformatics at my institution (me and one other person).

Are you finding that writing large files all at once does better than having it in sections?

To be honest I'm not sure, as I'm having a hard time pinpointing the exact cause of the issue.

jwm commented 10 years ago

Hi Luca, I work for the support org that runs some of the infrastructure Brad uses.

We have some experience with Gluster, and the fact it's triggering a self-heal during normal operation is usually bad news. Just to confirm, you're only having the pipeline write to the glusterfs volume itself, and not directly to any of the brick filesystems? Unless there's some network event that disconnects one or more brick servers, the volume should remain consistent regardless of the I/O workload you're throwing at it. We've sent heavy workloads over single gig-e connections to a couple of Gluster 3.3 volumes here, and they've been fine.

Are all of the brick servers up (and reachable on the network) the entire time you're running the pipeline? It might help to take a look at the physical layer: are all of the network interfaces running clean (i.e., does 'ifconfig' show any drops/errors/etc.)?

What version of Gluster is this, 3.3 or 3.4? Is there anything relevant in the Gluster logs (/var/log/glusterfs/*), and have you tried increasing its debug output (http://gluster.org/community/documentation/index.php/Translators/debug/trace)?
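
Concretely, these are the sorts of checks I have in mind; the interface and volume names below are placeholders for whatever your setup uses:

```sh
# Check for physical-layer problems on each brick server
# (look at the errors/dropped/overruns counters).
ifconfig eth0

# Check whether Gluster thinks any files still need healing,
# and whether all bricks are online and connected.
gluster volume heal work-volume info
gluster volume status work-volume

# Look for recent error-level entries in the client/brick logs
# (the " E " severity marker assumes the Gluster 3.x log format).
grep " E " /var/log/glusterfs/*.log | tail -n 50
```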

Finally, if you don't need Gluster's replication, I'd recommend against running it. The self healing and potential for split-brain conflicts that one has to reconcile manually weren't worth it for us, and we stopped doing replicated volumes. I'm not sure what your underlying disk subsystems look like, but we were running brick filesystems on RAIDed volumes, so we could handle disk failure on the RAID level instead of relying on Gluster replication.
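
Creating a purely distributed volume (no replica count) looks roughly like this; hostnames and brick paths are placeholders:

```sh
# Distributed-only volume: files are spread across bricks but not
# replicated, so a failed brick means the files on it are unavailable.
gluster volume create work-dist transport tcp \
    server1:/export/brick1 server2:/export/brick1 \
    server3:/export/brick1 server4:/export/brick1
gluster volume start work-dist
```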

lbeltrame commented 10 years ago

Hello John,

you're only having the pipeline write to the glusterfs volume itself, and not directly to any of the brick filesystems?

Indeed, no writes happen directly on the bricks. I fear, however, that having storage and compute on the same nodes causes issues (see my previous post on why I can't do otherwise).

are all of the network interfaces running clean (i.e., does 'ifconfig' show any drops/errors/etc.)?

I'll check back, but I had no issues in connectivity AFAICR.

What version of Gluster is this, 3.3 or 3.4? Is there anything relevant in the Gluster logs (/var/log/glusterfs/*)?

3.4 for now. I noticed in the logs that files were being created with null GFIDs, which isn't supposed to happen (I searched around, but most of the information I found was outdated or the questions went unanswered).
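
The GFID can be inspected directly on a brick with something like the following (the brick path is a placeholder); a healthy file should carry a non-zero trusted.gfid xattr:

```sh
# Run as root on a brick server against the brick path
# (not the mounted volume) to dump the trusted.* xattrs.
getfattr -d -m . -e hex /export/brick1/work/some_file.bam
```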

Finally, if you don't need Gluster's replication, I'd recommend against running it.

I actually don't need it, as the data is also kept off-network on a RAID filesystem that is replicated and backed up daily.

The self healing and potential for split-brain conflicts that one has to reconcile manually weren't worth it for us, and we stopped doing replicated volumes.

I will try a purely distributed volume. My servers are single-disk, but as I said, I have to work with what I was given.

Thanks for the insight!