Split the server for greater scalability

shankari commented 7 years ago

The server scalability had deteriorated to the point where we were unable to run the pipeline even once per day. While part of this is probably just the way we are using mongodb, part of it is also that the server resources are running out.

So I turned off the pipeline around a month ago (last run was on 2017-10-24 21:41:18).

Now, I want to re-provision with a better, split architecture, and reserved instances for lower costs.

shankari commented 7 years ago

Here are the current servers that e-mission is running.

aws-otp-server: m3.large
aws-nominatim: m3.large
habitica-server: m3.large
aws-webapp: m3.xlarge

The OTP and nominatim servers seem to be fine. habitica server sometimes has registration issues (https://github.com/e-mission/e-mission-server/issues/522), but doesn't seem to be related to performance.

The biggest issue is in the webapp. The performance of the webapp + server (without the pipeline running), seems acceptable. So the real issue is the pipeline + the database running on the same server. To be reasonable, we should probably split the server into three parts.

database
webapp
pipeline (backend)

Technically, the pipeline can later become a really small launcher for serverless computation if that's the architecture that we choose to go with.

For now, we want a memory optimized instance for the database, since mongodb caches most results in memory. The webapp and pipeline can probably remain as general-purpose instances, but a bit more powerful.

shankari commented 7 years ago

wrt https://github.com/e-mission/e-mission-server/issues/530#issuecomment-346752431, we probably want the following:

- aws-otp-server: m3.large/m4.large
- aws-nominatim: m3.large/m4.large
- habitica-server: m3.large/m4.large

- aws-em-webapp: m3.xlarge/m4.xlarge
- aws-em-analysis: m3.xlarge/m4.xlarge
- aws-em-mongodb: m4.2xlarge/r3.xlarge/r4.xlarge/r3.2xlarge/r4.2xlarge

shankari commented 7 years ago

Looking at the configuration in greater detail:

for the m3.large/m4.large decision: the m3* series comes with SSD storage large = 32GB, but m4* only supports EBS. So we have to pay extra for storage for the m4* series. So it would be vastly preferable to use the m3 series, at least for the 3 standalone systems which have to include their own data

Instance Type	vCPU	Memory (GiB)	Storage (GB)	Networking Performance
m4.large	2	8	EBS Only	Moderate
m4.xlarge	4	16	EBS Only	High
m3.large	2	7.5	1 x 32 SSD	Moderate
m3.xlarge	4	15	2 x 40 SSD	High

for the database, the difference between the r3* and r4* series seems similar - e.g.

Instance	vCPU	RAM	Network	Local storage
r4.xlarge	4	30.5	Up to 10 Gigabit	EBS-Only
r4.2xlarge	8	61	Up to 10 Gigabit	EBS-Only
r3.xlarge	4	30.5	Moderate	1 x 80
r3.2xlarge	8	61	Moderate	1 x 160

In this case, though, since the database is already on an EBS disk, the overhead should be low.

shankari commented 7 years ago

EBS storage costs are apparently unpredictable, because we pay for both storage and I/O. https://www.quora.com/Whats-cons-and-pros-for-EBS-based-AMIs-vs-instance-store-based-AMIs Some people actively advise against using EBS. And of course, the instance based storages also have a ton of ephemeral storage and mostly work (except the habitica server) work off static datasets . So for the otp, habitica and nominatim servers, it is pretty much a no-brainer to use the m3 instances.

shankari commented 7 years ago

unsure whether m3* instances are available for reserved pricing, though. https://aws.amazon.com/ec2/pricing/reserved-instances/pricing/ and the IOPS pricing only applies to provisioned IOPS instances. https://aws.amazon.com/ebs/pricing/

General purpose instances are 10 cents/GB-month. So the additional cost for going from *3 -> *4 is:

m3.large -> m4.large: 32 * 0.1 = max $3.2/month
m3.xlarge -> m4.xlarge: 80 * 0.1 = max $8/month
r3.large -> r4.large: 80 * 0.1 = max $8/month
m3.xlarge -> m4.xlarge: 160 * 0.1 = max $160/month

So the additional cost is minimal.

Also, all the documentation says that instance storage is ephemeral, but I know for a fact that when I shut down and restart my m3 instances, the data in the root volume is retained. I do see that apparently all AMIs are currently launched with EBS root volumes by default https://stackoverflow.com/a/36688645/4040267 and this is consistent with what I see in the console.

and except for the special database EB2 instance, are typically size 8GB. Does this means that m3 instances now include EBS storage by default? Am I paying for them? I guess so, but 8GB is so small (< 10 cents a month max) that I probably don't notice.

Also, it looks like the EBS instances also do have emphemeral storage (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/RootDeviceStorage.html). So we should go with the `3instances if there are reserved instances that support them - otherwise, we should go with4*` instances - the difference in both cost and functionality is negligible compared to the savings of the reserved instance.

shankari commented 7 years ago

wrt ephemeral storage for instances, they can apparently be added at the time the instance is launched (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/add-instance-store-volumes.html)

You can specify the instance store volumes for your instance only when you launch an instance. > You can't attach instance store volumes to an instance after you've launched it.

shankari commented 7 years ago

from

So we should go with the 3 instances if there are reserved instances that support them - otherwise, we should go with 4 instances - the difference in both cost and functionality is negligible compared to the savings of the reserved instance.

There are reserved instances that support every single kind of on-demand instance including *3*.

shankari commented 7 years ago

I looked at one m3 instance and one m4 instance and they both seem to be identical - one block device, which is the root device and is EBS.

`m4.large`	`m3.large`

shankari commented 7 years ago

Asked a question on stackoverflow https://serverfault.com/questions/885042/m3-instances-have-root-ebs-volume-by-default-so-now-what-is-the-difference-betw

But empirically, it looks like there is ephemeral storage on m3 instances but not on m4. So the m3 instance has a 32 GB /dev/xvdb, but the m4 instance does not. So why would you use m4 instead of m3? More storage is always good, right?

m3

ubuntu@ip-10-157-135-115:~$ sudo fdisk -l

Disk /dev/xvda: 8589 MB, 8589934592 bytes
255 heads, 63 sectors/track, 1044 cylinders, total 16777216 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

    Device Boot      Start         End      Blocks   Id  System
/dev/xvda1   *       16065    16771859     8377897+  83  Linux

Disk /dev/xvdb: 32.2 GB, 32204390400 bytes
255 heads, 63 sectors/track, 3915 cylinders, total 62899200 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

ubuntu@ip-10-157-135-115:~$ mount | grep ext4
/dev/xvda1 on / type ext4 (rw)

m4

$ sudo fdisk -l
Disk /dev/xvda: 8 GiB, 8589934592 bytes, 16777216 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xea059137

Device     Boot Start      End  Sectors Size Id Type
/dev/xvda1 *     2048 16777182 16775135   8G 83 Linux

ubuntu@ip-172-30-0-54:~$ mount | grep ext4
/dev/xvda1 on / type ext4 (rw,relatime,discard,data=ordered)

shankari commented 7 years ago

I am going to create m3.* reserved instances instead of m4.* instances across the board. For the r3.* versus r4.*, there is actually some question since the r4.* instance has better network, which is important for a database.

Note that the EBS volume that hosts the database is currently associated with 9216 IOPS. Is that used or proviosioned? Let's check. According to the docs:

baseline performance is 3 IOPS per GiB, with a minimum of 100 IOPS and a maximum of 10000 IOPS.

The volume uses 3072 GB, so this is 3072 3 = 9216 = the baseline performance. Let us see the actual performance. No more than 2 IOPS. But of course, we weren't running the pipeline. I am tempted to go with `r4.` for the database server, just to be on the safe side.

shankari commented 7 years ago

Given those assumptions, the monthly budget for one installation is:

- aws-otp-server: m3.large ($50) so we have storage
- aws-nominatim: m3.large ($50)
- habitica-server: m3.large ($50)

- aws-em-webapp: m3.xlarge ($90)
- aws-em-analysis: m3.xlarge ($90)
- aws-em-mongodb: r4.2xlarge ($245)

Storage:

- 3072 GB * 0.1 /GB = $307 (biggest expense by far, likely to grow bigger going forward, need to check causes of growth, but may be unavoidable)
- 40 GB * 0.1 / GB = $4 (probably want to put the e-mission server configuration on persistent storage)
- logs can stay on ephemeral storage, which we will have access to given planned m3.* creation

So current total per month:

$150 shared infrastructure,
$425 compute
$310 storage, increasing every month

$885 per month, increasing as we get more storage

When I provision the servers for the eco-escort project, the costs will go up by

$425 compute
$310 storage, increasing every month

$735 per month, increasing as we get more storage

to $885 + $735 = $1620 per month.

Storage details

Current mounts on the server:

From the UI, EBS block devices are

/dev/sda1
/dev/sdd
/dev/sdf

$ mount  | grep ext4
/dev/xvda1 on / type ext4 (rw,discard)
/dev/xvdd on /home/e-mission type ext4 (rw)
/dev/mapper/xvdb on /mnt type ext4 (rw)
/dev/mapper/xvdc on /mnt/logs type ext4 (rw)
/dev/mapper/xvdf on /mnt/e-mission-primary-db type ext4 (rw)

$ df -h
Filesystem        Size  Used Avail Use% Mounted on
/dev/xvda1        7.8G  5.2G  2.2G  71% /
/dev/xvdd         7.8G  326M  7.1G   5% /home/e-mission
/dev/mapper/xvdb   37G   14G   22G  39% /mnt
/dev/mapper/xvdc   37G   19G   17G  54% /mnt/logs
/dev/mapper/xvdf  3.0T  141G  2.7T   5% /mnt/e-mission-primary-db

$ sudo fdisk -l

Disk /dev/xvda: 8589 MB, 8589934592 bytes
    Device Boot      Start         End      Blocks   Id  System
/dev/xvda1   *       16065    16771859     8377897+  83  Linux

Disk /dev/xvdb: 40.3 GB, 40256929792 bytes
Disk /dev/xvdb doesn't contain a valid partition table

Disk /dev/xvdc: 40.3 GB, 40256929792 bytes
Disk /dev/xvdc doesn't contain a valid partition table

Disk /dev/xvdd: 8589 MB, 8589934592 bytes
Disk /dev/xvdd doesn't contain a valid partition table

Disk /dev/xvdf: 3298.5 GB, 3298534883328 bytes
Disk /dev/xvdf doesn't contain a valid partition table

Disk /dev/mapper/xvdb: 40.3 GB, 40254832640 bytes
Disk /dev/mapper/xvdb doesn't contain a valid partition table

Disk /dev/mapper/xvdc: 40.3 GB, 40254832640 bytes
Disk /dev/mapper/xvdc doesn't contain a valid partition table

Disk /dev/mapper/xvdf: 3298.5 GB, 3298532786176 bytes
Disk /dev/mapper/xvdf doesn't contain a valid partition table

So it looks like we have 3 EBS devices:

/ which primarily has the OS, and /tmp/

2.4G    /home
1.7G    /tmp
974M    /usr
391M    /var

$ du -sh /home/*
308M    /home/e-mission
2.1G    /home/ubuntu

$ du -sh /home/ubuntu/*
1.6G    /home/ubuntu/anaconda
393M    /home/ubuntu/Anaconda2-4.0.0-Linux-x86_64.sh
4.0K    /home/ubuntu/gencert
4.0K    /home/ubuntu/tmp

/home/e-mission which primarily has some logs

$ du -sm /home/e-mission/*
1       /home/e-mission/app_store_review_test.stdinoutlog
1       /home/e-mission/Berkeley_sections.stdinout.log
1       /home/e-mission/iphone_2_test.stdinoutlog
1       /home/e-mission/lost+found
1       /home/e-mission/migration.log
2       /home/e-mission/moves_collect.stdinoutlog
2       /home/e-mission/pipeline.stdinoutlog
1       /home/e-mission/pipeline_with_perf.log
1       /home/e-mission/precompute_results.stdinoutlog
65      /home/e-mission/remotePush.stdinoutlog
240     /home/e-mission/silent_ios_push.stdinoutlog

/mnt/e-mission-primary-db which has the database

And we have two ephemeral volumes:

/mnt, which has the e-mission server install
/mnt/logs which has the periodic logs

shankari commented 6 years ago

wrt https://github.com/e-mission/e-mission-server/issues/530#issuecomment-346866854,

I am going to create m3. reserved instances instead of m4. instances across the board. For the r3. versus r4., there is actually some question since the r4.* instance has better network, which is important for a database.

It turns out that m4. is actually cheaper than m3. (https://serverfault.com/a/885060/437264). The difference for large is $24.09/month (m3.large = $69.35, m4.large = 45.26), which is enough to pay for the equivalent EBS storage is $3/month. https://github.com/e-mission/e-mission-server/issues/530#issuecomment-346766952

and we can add ephemeral disks to m4* instances for free when we create them. That settles it, going with m4*.

shankari commented 6 years ago

Creating a staging environment first. This can be the open data environment used by the test phones. Since this is an open data environment, we need an additional server that runs the public ipython notebook server. We can't re-use the analysis server since we need to have a read-only connection to the database.

There is now a new m5 series, so we can just get a head start by deploying to that. It's about the same price, but has much greater EBS bandwidth.

Turns out that we can't create ephemeral storage for these instances, though. I went to the Add Storage tab and tried to add a volume, and the only option was the http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/block-device-mapping-concepts.html

We also need to set up a VPC between the servers so that the database cannot be accessed from the general internet. It looks like the VPC is free as long as we don't need a VPN or a NAT. Theoretically, though, we can just configure the incoming security policy for mongodb, even without a VPC. https://aws.amazon.com/vpc/pricing/

I have created:

aws-op-webapp: m5.xlarge, 40GB storage ($90) (54.196.134.233)
aws-op-analysis: m5.xlarge, 40GB storage ($90) (52.87.159.49)
aws-op-public: m5.xlarge, 40GB storage ($90) (52.87.159.49)
aws-op-database: r4.2xlarge, 3 TB storage ($245) (34.201.243.180)

shankari commented 6 years ago

After deploying the servers, we need to set them up. The first big issue in setup is securing the database server. We will use two methods to secure the server:

we will restrict network access to the database port to the associated servers
we will turn on authentication and access control

Restricting network access (at least naively) is pretty simple - we just need to set up the firewall correctly. Later, we should explore the creation of a VPC for greater security.

wrt authentication, the viable options are:

SCRAM-SHA-1
MONGODB-CR
x.509

The first two are both username/password based authentication, which I am really reluctant to use. There is no classic public-key authentication mechanism.

shankari commented 6 years ago

I am reluctant to use the username/password based authentication because then I would need to store the password in a filesystem somewhere and make sure to copy/configure it every time. But in terms of attack vector, it seems around the same as public-key based authentication.

If the attacker gets access to the connecting hosts (webapp or analysis), it seems like she would have access to both the password and the private key.

The main differences are:

if the attacker gets access to the place where we have stored the passwords for the long-term, the password based solution is compromised, although the public key solution is not. We can avoid this by storing the password securely, just like the private key to the webapp.
if the public key is sent to the database over plaintext, the database is not compromised, but it is, if the database is compromised. We can avoid this by encrypting connections between the database and the webapp. This may also allow us to use x.509 based authentication, which is pretty close to public key authentication.

shankari commented 6 years ago

We can avoid this by encrypting connections between the database and the webapp. This may also allow us to use x.509 based authentication

We can do this, but we need to get SSL certificates for TLS-based encryption. I guess a self-signed certificate should be fine, since the mongodb is only going to be connected to the analysis and webapp hosts, which we control. But we can also probably avoid it if all communication is through an internal subnet on the VPC.

Basically, it seems like there are multiple levels of hardening possible:

configure incoming and outgoing connections in the firewall, no auth Ease of use: 6 (easy, simple security group UI) Security: 1 (weak, since data transfer flows over the public internet without encryption)
listen only to the private IP, all communication to/from the database is in the VPC, no auth Ease of use: 4 (can set up VPC via UI) Security: 5 (pretty good, since all unencrypted data flow is internal. The only attack vector is if the hacker somehow compromises any of the services. Once this is done, she can either connect to the database directly, or run a packet sniffer on the network
listen only to the private IP, all communication to/from the database is in the VPC, SSL certificates used, no auth Ease of use: 1 (need to get SSL certificates and setup a bunch of configuration) Security: 7 (pretty close to optimal, since even packet sniffers can't see anything)

Adding authentication

If we use option 2+ above, adding authentication does not appear to provide very much additional protection from external hackers. Assuming no firewall bugs, if a hacker wants to access the database, they need to first hack into one of the service hosts to generate the appropriate source header. And if they do that, they can always just see the auth credentials in the config file.

However, it can prevent catastrophic issues if there really is a firewall or VPC bug, and a hacker is able to inject malicious packets that purportedly come from the service hosts. Unless there is an encryption bug, moving to option (3) will harden the option further.

Authentication seems most useful when it is combined with Rule-based Access Control. RBAC can be used to separate read-only exploration (e.g. on a public server) from read-write computation. But it can go beyond that - we can make the webapp write to the timeseries and read-only from the aggregate, but make the analysis server read-only from the timeseries but write to to the analysis database

shankari commented 6 years ago

wrt https://github.com/e-mission/e-mission-server/issues/530#issuecomment-350631885, given the tradeoffs articulared, I have decided to go with option (2) with no auth.

Concrete proposal

listen only to the private IP, all communication to/from the database is in the VPC, no auth Ease of use: 4 (can set up VPC via UI) Security: 5 (pretty good, since all unencrypted data flow is internal.

shankari commented 6 years ago

It looks like all instances created in the past year are assigned to the same VPC and the same subnet in the VPC (http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/default-vpc.html). In general, we don't want to share the subnet with other servers, because then if a hacker got access to one of the other subnets, they could packet sniff all the data and potentially figure out the data. For the open data servers, this may be OK since the data is open, and we have firewall restrictions on where we can get messages from.

But what about packet spoofing and potentially deleting data? Let's just make another (small) subnet.

shankari commented 6 years ago

I can't seem to find a way to list all the instances in a particular subnet. Filed https://serverfault.com/questions/887552/aws-how-do-i-find-the-list-of-instances-associated-with-a-particular-subnet

shankari commented 6 years ago

Ok, just to experiment with this for the future, we will set up a small subnet that hosts only the database and the analysis server.

From https://aws.amazon.com/vpc/faqs/ The minimum size of a subnet is a /28 (or 14 IP addresses.) for IPv4. Subnets cannot be larger than the VPC in which they are created.

multi-tier website, with the web servers in a public subnet and the database servers in a private subnet.

So basically, this scenario: http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Scenario2.html

wait, analysis server cannot be in private subnet then because it needs to talk to external systems such as habitica and the real time bus etc. We should really split analysis server into two subnets too - external facing and internal facing. But since that will require some additional software restructuring, let's just put it in the public subnet for now.

I won't provision a NAT gateway for now - will explore ipv6-only options which will not require a (paid) NAT gateway and can use the (free) egress-only-internet gateway. http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/egress-only-internet-gateway.html

shankari commented 6 years ago

Ok so followed the VPC wizard for scenario 2 and created

aws-op-vpc
aws-op-public-subnet, aws-op-private-subnet
a NAT gateway and an egress-only internet gateway, and
aws-op-public-route, aws-op-private-route

Only aws-op-private-subnet has IPv6 enabled.

aws-op-public-route was associated with aws-op-public-subnet, but aws-op-private-route was marked as main and not associated with any subnet. That is consistent with

In this scenario, the VPC wizard updates the main route table used with the private subnet, and creates a custom route table and associates it with the public subnet.

In this scenario, all traffic from each subnet that is bound for AWS (for example, to the Amazon EC2 or Amazon S3 endpoints) goes over the Internet gateway. The database servers in the private subnet can't receive traffic from the Internet directly because they don't have Elastic IP addresses. However, the database servers can send and receive Internet traffic through the NAT device in the public subnet.

Any additional subnets that you create use the main route table by default, which means that they are private subnets by default. If you want to make a subnet public, you can always change the route table that it's associated with.

shankari commented 6 years ago

The default wizard configuration turns off "Auto-assign Public IP" because the assumption appears to be that we will use elastic IPs. Testing this scenario by editing the network interface for our provisioned servers and then turning it on later or manually assigning IPs.

shankari commented 6 years ago

Service instances

Turns out you can't edit the network interface, but you can create a new one and attach the volumes.

Before migration

IP: 54.196.134.233 Able to ssh in

Migrate

Create a new m5.xlarge instance
attach it to the aws-op-vpc, aws-op-public-subnet and override the assignment settings for public IP and ipv6.
create security groups for the different kinds of instances
- webapp
  - incoming SSH from home and HTTPS from the eecs hostname redirect
  - all outgoing traffic to both 0.0.0.0/0 and ::/0 (seems like we can tighten this)
- analysis
  - incoming SSH from home
  - all outgoing traffic to both 0.0.0.0/0 and ::/0 (seems like we can tighten this)
- public
  - incoming ssh from home and ports 8888 - 9999 for ipython notebook
  - all outgoing traffic to both 0.0.0.0/0 and ::/0
- database
  - incoming ssh from webapp and mongodb from webapp, analysis and public
  - outgoing traffic to all ports on webapp, analysis and public over ipv4 (seems like we should add routes for patches)

After migration

Can ssh directly to all three public-facing servers
attached non-root EB2 instances were also deleted! Good we figured this out now! Created instances and attached them

shankari commented 6 years ago

Database instance

Migration

Recreating instance, putting it into the private subnet, no assigned ipv4 address. It looks like after the instance is created, I can add a new private ip address, but not a public one.

Ah!

You can only use the auto-assign public IPv4 feature for a single, new network interface with the device index of eth0. For more information, see Assigning a Public IPv4 Address During Instance Launch.

No matter - that is what I want.

Ensure that the security group allows ssh from the webserver.

Try to ssh from the webserver. Works!

Try to ssh from the analysis server. Doesn't work!

Try to ssh to the private address from outside Obviously doesn't work.

Tighten up the outbound rules on all security groups to be consistent with http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Scenario2.html

Couple of modifications needed for this to work.

outbound ssh rule from the webapp to the database server to allow us to log in

DNS resolution needed to be enabled for the VPC Looking at http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/vpc-dns.html, DNS resolution is supposed to be enabled for VPCs created through the wizard, but is off for our VPC although it was created using the wizard

$ ping www.google.com
PING www.google.com (172.217.13.228) 56(84) bytes of data.
64 bytes from iad23s61-in-f4.1e100.net (172.217.13.228): icmp_seq=1 ttl=45 time=1.61 ms
64 bytes from iad23s61-in-f4.1e100.net (172.217.13.228): icmp_seq=2 ttl=45 time=1.61 ms
64 bytes from iad23s61-in-f4.1e100.net (172.217.13.228): icmp_seq=3 ttl=45 time=1.59 ms
64 bytes from iad23s61-in-f4.1e100.net (172.217.13.228): icmp_seq=4 ttl=45 time=1.64 ms
^C

DNS servers only support ipv4, so if we want to access the internet from the private subnet, we need to continue using the NAT gateway instance that the wizard setup for us.
```
[ec2-user@ip-192-168-1-100 ~]$ ping www.google.com
PING www.google.com (172.217.8.4) 56(84) bytes of data.
<HANGS>
^C
--- www.google.com ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 4081ms
```
This is because the incoming rules for the nat only supported the default security group. Changing it to the database security group caused everything to start working.

shankari commented 6 years ago

Attaching the database instances back, and then I think that setup is all done. I'm a bit unhappy about the NAT, but figuring out how to do DNS for ipv6 addresses is a later project, I think.

shankari commented 6 years ago

Cannot attach instances because they are in a different availability zone. Per http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumes.html you need to migrate the instances to a different zone using their snapshots.

shankari commented 6 years ago

and our volumes don't have snapshots. creating snapshots to explore this option... can't create snapshot - selected it and nothing happened. so it looks like reserved iops volumes have their snapshots under "snapshots", not linked to the volume.

Restoring.... That worked. Attached the three volumes back to the database.

Getting started with code now...

shankari commented 6 years ago

Main code changes required:

support database hostname as part of configuration. There's already a field for this, but we should actually use it. Or potentially split it out into it's own conf file.
split out all the public stuff since it was really kludgy and is going to be on a separate server anyway

shankari commented 6 years ago

changes to server done (https://github.com/e-mission/e-mission-server/pull/535) now it is time to deploy!

shankari commented 6 years ago

installing mongodb now...

shankari commented 6 years ago

configuring mongodb to listen to the other private IP addresses, requires appending them to bindIp. Since we also have an ipv6 address on the private subnet, we also need to do this workaround. https://dba.stackexchange.com/questions/173781/bind-mongodb-to-ipv4-as-well-as-ipv6/192406

shankari commented 6 years ago

began configuring the webapp

added two EBS stores to the instance
found that they were now /dev/nvme0n1p. This is true of m5 instances. http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html
mounted them as /code and /logs
git cloned the code onto /code
configured the database by setting the url to the ipv6 address of the database (note that database is running without auth)
try to connect - it works. Will experiment further with connections and permissions tomorrow.

shankari commented 6 years ago

connections worked! wrote a script to setup and teardown permissions for a basic scenario

shankari commented 6 years ago

while waiting for that script to finish running, collecting the public data for the test phones. 4 android phones, 4 iPhones, 4 JF phones.

We can use the bin/debug/extract_timeline_for_day_range_and_user.py but need to figure out the date range. Or we can just use the time range that extends before I started work on the project to now, and it should work.

The current webapp server was set up in August 2014, and the first big data collection is from Dec 2015. The emails to Kattt about the phones were from Nov 2015. So starting from January 2015 should be good enough.

shankari commented 6 years ago

Create a file with the 8 public email addresses (attached). Mapped the email addresses to uuids.

$ ./e-mission-py.bash bin/public/extract_challenge_uuids.py file_public_ids file_public_ids.uuid
DEBUG:root:Mapped email ucb.sdb.android.1@gmail.com
 to uuid e471711e-bd14-3dbe-80b6-9c7d92ecc296
DEBUG:root:Mapped email ucb.sdb.android.2@gmail.com
 to uuid fd7b4c2e-2c8b-3bfa-94f0-d1e3ecbd5fb7
DEBUG:root:Mapped email ucb.sdb.android.3@gmail.com
 to uuid 86842c35-da28-32ed-a90e-2da6663c5c73
DEBUG:root:Mapped email ucb.sdb.androi.4@gmail.com
 to uuid 3bc0f91f-7660-34a2-b005-5c399598a369
DEBUG:root:Mapped email ucb.sdb.iphone.1@gmail.com
 to uuid 079e0f1a-c440-3d7c-b0e7-de160f748e35
DEBUG:root:Mapped email ucb.sdb.iphone.2@gmail.com
 to uuid c76a0487-7e5a-3b17-a449-47be666b36f6
DEBUG:root:Mapped email cub.sdb.iphone.3@gmail.com
 to uuid c528bcd2-a88b-3e82-be62-ef4f2396967a
DEBUG:root:Mapped email ucb.sdb.iphone.4@gmail.com
 to uuid 95e70727-a04e-3e33-b7fe-34ab19194f8b
DEBUG:root:Mapped email nexus7itu01@gmail.com
 to uuid 70968068-dba5-406c-8e26-09b548da0e4b
DEBUG:root:Mapped email nexus7itu02@gmail.com
 to uuid 6561431f-d4c1-4e0f-9489-ab1190341fb7
DEBUG:root:Mapped email motoeitu01@gmail.com
 to uuid 92cf5840-af59-400c-ab72-bab3dcdf7818
DEBUG:root:Mapped email motoeitu02@gmail.com
 to uuid 93e8a1cc-321f-4fa9-8c3c-46928668e45d

Then extracted from the uuid list.

$ ./e-mission-py.bash bin/debug/extract_timeline_for_day_range_and_user.py file_public_ids.uuid 2015-01-01 2017-12-31 /tmp/public_data/dump 2>&1 | tee /tmp/dump_public_data.log
....

shankari commented 6 years ago

While waiting for the extraction to complete, setup users on the webapp.

(emission) ubuntu@ip-192-168-0-80:/code/e-mission-server$ ./e-mission-py.bash setup/db_auth.py -s
Created admin user, result = {'ok': 1.0}
At current state, list of users = {'users': [{'_id': 'admin.<...admin...>', 'user': '<...admin...>', 'db': 'admin', 'roles': [{'role': 'userAdminAnyDatabase', 'db': 'admin'}]}], 'ok': 1.0}
Created RW user, result = {'ok': 1.0}
At current state, list of users = {'users': [{'_id': 'admin.<...admin...>', 'user': '<...admin...>', 'db': 'admin', 'roles': [{'role': 'userAdminAnyDatabase', 'db': 'admin'}]}, {'_id': 'admin.<...rw...>', 'user': '<...rw...>', 'db': 'admin', 'roles': [{'role': 'readWrite', 'db': 'Stage_database'}]}], 'ok': 1.0}
Created new role, result = {'ok': 1.0}
At current state, list of roles = {'roles': [{'role': 'createIndex', 'db': 'Stage_database', 'isBuiltin': False, 'roles': [], 'inheritedRoles': [], 'privileges': [{'resource': {'db': 'Stage_database', 'collection': ''}, 'actions': ['createIndex']}], 'inheritedPrivileges': [{'resource': {'db': 'Stage_database', 'collection': ''}, 'actions': ['createIndex']}]}], 'ok': 1.0}
Created RO user, result = {'ok': 1.0}
At current state, list of users = {'users': [{'_id': 'admin.<...admin...>', 'user': '<...admin...>', 'db': 'admin', 'roles': [{'role': 'userAdminAnyDatabase', 'db': 'admin'}]}, {'_id': 'admin.<...ro...>', 'user': '<...ro...>', 'db': 'admin', 'roles': [{'role': 'readWrite', 'db': 'Stage_database'}]}, {'_id': 'admin.<...rw...>', 'user': '<...rw...>', 'db': 'admin', 'roles': [{'role': 'readWrite', 'db': 'Stage_database'}]}], 'ok': 1.0}

shankari commented 6 years ago

Now, configure the webapp as follows:

Changing

conf/log/webserver.conf.sample
conf/net/api/webserver.conf.sample
conf/net/ext_service/habitica.json.sample
conf/net/ext_service/nominatim.json.sample
conf/net/ext_service/push.json.sample
conf/storage/db.conf.sample

Unused + reason

conf/clients/testclient.settings.json.sample: no client-specific functionality
conf/log/intake.conf.sample: not going to run the intake pipeline
conf/net/auth/google_auth.json.sample: going to use `skip` auth mode
conf/net/auth/openid_auth.json.sample: going to use `skip` auth mode
conf/net/auth/token_list.json.sample: going to use `skip` auth mode
conf/net/ext_service/googlemaps.json.sample: not using googlemaps for anything
conf/net/keys.json.sample: not using SSL

Remaining conf files removed as part of https://github.com/e-mission/e-mission-server/pull/537

shankari commented 6 years ago

Now, turn on auth on the database, restart and ensure that

logging in with proper credentials works

In [1]: import emission.core.get_database as edb

In [2]: edb.get_timeseries_db().find()
Out[2]: <pymongo.cursor.Cursor at 0x7f292cb2fbe0>

In [3]: edb.get_timeseries_db().find().count()
Out[3]: 0

logging in with improper credentials does not work

In [6]: conn = pymongo.MongoClient("mongodb://<...admin...>:<...admin-pw...>@<hostname>/admin?authMechanism=SCRAM-SHA-1")
   ...:

In [7]: conn.Stage_database.Stage_timeseries.find().count()
---------------------------------------------------------------------------
OperationFailure                          Traceback (most recent call last)
<ipython-input-7-9583ab7660a8> in <module>()
...
OperationFailure: not authorized on Stage_database to execute command { count: "Stage_timeseries", query: {} }

So auth is configured correctly!

shankari commented 6 years ago

Tried to access the web page. Couple of fixes:

Had to update six to version 1.11. It is currently in conda, but upgrading it will switch to a custom version of anaconda. Might just do this with a separate pip command instead. PR forthcoming...
Had to install bower (https://stackoverflow.com/questions/21491996/installing-bower-on-ubuntu)
Had to enable outbound requests to 9418 from aws-op-webapp to support retrieving bower packages.

shankari commented 6 years ago

works! sent email to rise-support asking for a DNS name.

shankari commented 6 years ago

wrt https://github.com/e-mission/e-mission-server/issues/530#issuecomment-351561385, worked except for iphone1, for which the cursor timed out. Retrieving it in stages (2015, 2016 and 2017 separately) instead.

$ ./e-mission-py.bash bin/debug/extract_timeline_for_day_range_and_user.py 079e0f1a-c440-3d7c-b0e7-de160f748e35 2015-01-01 2015-12-31 /tmp/public_data/dump_2015
$ ./e-mission-py.bash bin/debug/extract_timeline_for_day_range_and_user.py 079e0f1a-c440-3d7c-b0e7-de160f748e35 2016-01-01 2016-12-31 /tmp/public_data/dump_2016
...

Originally failed logs + retry attached. dump_public_data.continue.1.log.gz dump_public_data.log.gz

shankari commented 6 years ago

I also need to migrate over the pipeline state so that we don't spend a lot of time re-running the pipeline for the existing data. Alternatively, we could delete the analysis results and re-run the pipeline to debug the pipeline running, which would also give us a sense of the scalability of the new split server.

This is only data for 12 phones, and only intermittent data at that. And the brazil open data stuff is really only for a week, so pretty small too. Let's go ahead and re-run the pipeline.

We could always dump and re-restore the values if we needed to. This is the whole reproducibility aspect after all :)

Dun-dun-dun!

shankari commented 6 years ago

continuing with https://github.com/e-mission/e-mission-server/issues/530#issuecomment-351747819, setup supervisord. It only runs on python 2.7, so created a new python virtual env to run it.

$ conda create -n py27 python=2.7 anaconda
$ source activate py27
$ pip install supervisor
$ echo_supervisord_conf
$ echo_supervisord_conf > supervisord.conf
$ vim supervisord.conf
(add emissionpy section)
$ supervisord -c supervisord.conf

webapp still runs fine.

shankari commented 6 years ago

setup the filesystem

$ sudo mkfs.ext4 /dev/nvme1n1
$ sudo mkfs.ext4 /dev/nvme2n1

fstab to mount attached EBS volumes

/dev/nvme1n1            /code    ext4   defaults,auto,noatime,exec 0 0
/dev/nvme2n1            /log     ext4   defaults,auto,noatime,noexec 0 0

Then mount them!

$ sudo mount /code
$ sudo mount /log

And change permissions

$ sudo chown -R ubuntu:ubuntu /code
$ sudo chown -R ubuntu:ubuntu /log

shankari commented 6 years ago

Configuration for the analysis server

Changing

conf/log/intake.conf.sample
conf/net/ext_service/habitica.json.sample
conf/net/ext_service/nominatim.json.sample
conf/net/ext_service/push.json.sample
conf/storage/db.conf.sample

Ignored

conf/clients/testclient.settings.json.sample
conf/log/webserver.conf.sample
conf/net/api/webserver.conf.sample
conf/net/auth/google_auth.json.sample
conf/net/auth/openid_auth.json.sample
conf/net/auth/token_list.json.sample
conf/net/ext_service/googlemaps.json.sample

shankari commented 6 years ago

While setting up the public server, we don't want the analysts messing around with the install, so we will run ipython under a separate account (analyst). Since we just really need the access control, this account should have no password (no additional attack vectors). So we will create it as a system user. https://unix.stackexchange.com/questions/56765/creating-an-user-without-a-password

$ sudo adduser \
--system \
--shell /bin/bash \
--gecos 'User for running the notebooks' \
--group \
--disabled-password \
--home /notebooks \
analyst
Adding system user `analyst' (UID 112) ...
Adding new group `analyst' (GID 116) ...
Adding new user `analyst' (UID 112) with group `analyst' ...
Not creating home directory `/home/analyst'.

$ sudo -s -u analyst
$ source activate emission
$ export HOME=/notebooks
$ ./e-mission-jupyter.bash notebook --notebook-dir=/notebooks

shankari commented 6 years ago

Only configuration for the public server is the database, since we won't be running any ongoing services. Set up a password because we are making this available publicly. Using the simple version instead of jupyterhub. http://jupyter-notebook.readthedocs.io/en/stable/public_server.html

shankari commented 6 years ago

There are issues with storing the notebooks in a separate directory. Since the kernel is started in the notebooks directory, conf/storage/db.conf does not exist.

I tried specifying --notebook-dir, and linking the /notebooks directory, but neither of them worked. Modifying some of the paths may help, need to experiment. For now, changed the emission.core.get_database code to use the absolute path (/code/e-mission-server/conf/...)

shankari commented 6 years ago

Now let's see whether we can make a template for this setup. It seems like it would be pretty useful for a more compehensive data collection effort, and there should be some option that allows you to create a cluster of VMs.

Aha, there is AWS CloudFormation, in which you can use a designer to create a virtual appliance. https://aws.amazon.com/cloudformation/

Can I create one from my current configuration? Apparently, I can use cloudformer, which I have to install into an instance in EC2 and then it can create a template for me. Let's create this now. http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-using-cloudformer.html

This adds new IAM roles, which I need to delete after I am done. First attempt at creating failed - second succeeded.

shankari commented 6 years ago

Logged in - it is now "analysing resources"

e-mission / e-mission-docs