genome / gms

The Genome Modeling System installer
https://github.com/genome/gms/wiki
GNU Lesser General Public License v3.0

How to provision AWS instance with more than two volumes #195

Closed by julyin 8 years ago

julyin commented 8 years ago

Hi,

When using an Amazon instance with more than two disks (e.g. i2.8xlarge), how should the storage be provisioned? Do I need to combine all the disks into one large volume mounted at /opt, or do I mount them separately and GMS will be able to see and utilise them? Do they need to be named in any particular way?

Many thanks, Julia

gatoravi commented 8 years ago

Hi Julia,

The way we provision the system is described here: https://github.com/genome/gms/blob/ubuntu-12.04/setup/aws/preinstall.sh

As you can see, the "# Mount ephemeral storage" section shows the mounting of the different volumes. You could have more than two volumes mounted using a similar procedure (I think, per Amazon's convention, these would just be named xvdb, xvdc, xvdd, etc.).

We don't have any hard numbers for required storage, but in my experience, to be safe, /tmp will need at least 200 GB; the rest of the disk space can go on /opt. A lot of the annotation data and other files go on /opt, so it is recommended that /opt have more space allocated to it than /tmp.

Let us know how this works for you,
Avi
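For reference, mounting each extra ephemeral device separately would look roughly like the following sketch. The device letters and mount points are assumptions based on the xvdb/xvdc naming convention mentioned above, not taken from preinstall.sh itself; the function only prints the plan (the real mkfs/mount commands would need root and would destroy existing data):

```shell
# Sketch: print the format-and-mount plan for each extra ephemeral device.
# Device names (/dev/xvdb ...) and mount points are hypothetical; run the
# printed commands yourself, as root, only after checking them.
plan_mounts() {
  for letter in "$@"; do
    dev="/dev/xvd${letter}"
    mnt="/mnt/ephemeral-${letter}"
    echo "mkfs.ext4 ${dev}"
    echo "mkdir -p ${mnt}"
    echo "mount ${dev} ${mnt}"
  done
}

plan_mounts b c d
```

Adjust the list of letters to match however many instance-store devices your instance type exposes.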

GrubLord commented 8 years ago

Thanks for your response, @gatoravi - however, how do you recommend that users combine the various xvd[x] volumes into the one /opt store?

My solution would be to reformat everything and use LVM to create a single logical volume from all available storage drives, mounted under /opt - however, your provided AMI does not use LVM at all, so... is there a more convenient option that doesn't require wiping everything?

Can GMS be configured to use multiple volumes for storage?
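The LVM approach described above can be sketched as follows. This is only an illustration, not a tested recipe for this AMI: the device names and the volume-group/logical-volume names (gms_vg, opt_lv) are made up, and pvcreate/vgcreate/mkfs destroy existing data on the named devices, so the function just prints the plan:

```shell
# Sketch: print the LVM plan for combining several devices into one logical
# volume mounted at /opt. Names are hypothetical; the printed commands are
# destructive, so review them before running anything as root.
lvm_plan() {
  devs="$*"
  echo "pvcreate $devs"
  echo "vgcreate gms_vg $devs"
  echo "lvcreate -l 100%FREE -n opt_lv gms_vg"
  echo "mkfs.ext4 /dev/gms_vg/opt_lv"
  echo "mount /dev/gms_vg/opt_lv /opt"
}

lvm_plan /dev/xvdb /dev/xvdc /dev/xvdd
```

Since the provided AMI does not use LVM, this would indeed mean reformatting the data volumes, which is the trade-off GrubLord raises.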

gatoravi commented 8 years ago

Interesting idea. Would it be hard to set up LVM under that AMI? Unfortunately I am not the best person to advise on disk management; we would strongly advise checking with other systems personnel if possible. I'm hoping someone else might chime in on this issue.

sakoht commented 8 years ago

Can GMS be configured to use multiple volumes for storage?

We had the exact same situation in the original GMS environment. There are petabytes of disk spread across multiple volumes (in the system as "Genome::Disk::Volume", on the command-line as "genome disk volume ...").

The GMS uses a "disk allocation system" and tracks the available volumes in the database. There is one initial volume representing the local disk on the master host itself, but you can add others.

I'll need to get onto a machine and run a test before I can give you more precise instructions, since inside WashU an automated cron keeps this list in sync with activity from the sysadmins.

If you are impatient you can try this:

genome disk volume list

and:

perl -e 'use Genome; my @v = Genome::Disk::Volume->get(); print Data::Dumper::Dumper(\@v);'

Note that you see the one volume for your local disk.

If you were making a new volume via your own scripts, it would look something like this, with "..." replaced with similar properties to those you see above, but a different relative path:

use Genome; Genome::Disk::Volume->create(...); UR::Context->commit();

I'll be back with you after seeing if there is an elegant command-line alternative.

GrubLord commented 8 years ago

Nice - I getcha.

Then I'll wait and see your command line stuff, or maybe try and code up a command myself with what you pointed me to.

-- Liviu


sakoht commented 8 years ago

So there is currently no "genome disk volume create" command.

I made a test environment and scripted the following to add an additional volume:

use strict;
use warnings;
use Genome;

# Create the volume in the database.
my $v = Genome::Disk::Volume->create(
    hostname      => $ENV{GENOME_SYS_ID},
    total_kb      => 1000000,   # change to your actual volume size (in KB)
    disk_status   => "active",
    can_allocate  => 1,
    mount_path    => "/opt/gms/$ENV{GENOME_SYS_ID}/fs/$ENV{GENOME_SYS_ID}.2",
    physical_path => "/opt/gms/$ENV{GENOME_SYS_ID}/fs/$ENV{GENOME_SYS_ID}.2",
);

# Assign it to all of the possible disk groups (limit the list if you choose).
for my $group_name (qw/reads info_alignments info_genome_models info_apipe_ref/) {
    my $g = Genome::Disk::Group->get(disk_group_name => $group_name);
    $v->add_assignment(group => $g);
}

# Commit all changes.
UR::Context->commit();

Note that the system expects all volumes used by the system with ID XXX to be mounted under /opt/gms/XXX/fs/SOMEDIR. The actual name of the SOMEDIR directory must be unique, but can otherwise be anything. The default value is just a mirror of the system ID itself; here I just appended ".2" to the existing disk name.
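That path convention can be sketched as follows. The system ID and the device name (/dev/xvdc) are placeholders for illustration, and the mkdir/mount commands are only printed rather than executed:

```shell
# Sketch: build the mount point following the /opt/gms/<ID>/fs/<SOMEDIR>
# layout described above, then (hypothetically) mount the new device there.
# GENOME_SYS_ID and /dev/xvdc are placeholders; substitute your own values.
GENOME_SYS_ID="${GENOME_SYS_ID:-XXX}"
new_fs_dir="/opt/gms/${GENOME_SYS_ID}/fs/${GENOME_SYS_ID}.2"
echo "$new_fs_dir"

DRYRUN=echo   # replace with empty string to actually run, as root
$DRYRUN mkdir -p "$new_fs_dir"
$DRYRUN mount /dev/xvdc "$new_fs_dir"
```

The directory printed here is the same path that mount_path and physical_path point at in the Perl script above, so the database record and the on-disk mount stay in agreement.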

julyin commented 8 years ago

Thanks so much for your help, gatoravi and sakoht. GrubLord has resolved this issue with our AWS instance now.