Here are the current servers that e-mission is running.
The OTP and nominatim servers seem to be fine. The habitica server sometimes has registration issues (https://github.com/e-mission/e-mission-server/issues/522), but that doesn't seem to be related to performance.
The biggest issue is in the webapp. The performance of the webapp + server (without the pipeline running) seems acceptable. So the real issue is the pipeline + the database running on the same server. To fix this properly, we should probably split the server into three parts.
Technically, the pipeline can later become a really small launcher for serverless computation if that's the architecture that we choose to go with.
For now, we want a memory optimized instance for the database, since mongodb caches most results in memory. The webapp and pipeline can probably remain as general-purpose instances, but a bit more powerful.
wrt https://github.com/e-mission/e-mission-server/issues/530#issuecomment-346752431, we probably want the following:
- aws-otp-server: m3.large/m4.large
- aws-nominatim: m3.large/m4.large
- habitica-server: m3.large/m4.large
- aws-em-webapp: m3.xlarge/m4.xlarge
- aws-em-analysis: m3.xlarge/m4.xlarge
- aws-em-mongodb: m4.2xlarge/r3.xlarge/r4.xlarge/r3.2xlarge/r4.2xlarge
Looking at the configuration in greater detail:
m3.large/m4.large
decision: the m3* series comes with SSD storage (large = 32 GB), but m4* only supports EBS. So we have to pay extra for storage for the m4* series. So it would be vastly preferable to use the m3 series, at least for the 3 standalone systems which have to include their own data.

Instance Type | vCPU | Memory (GiB) | Storage (GB) | Networking Performance
---|---|---|---|---
m4.large | 2 | 8 | EBS Only | Moderate
m4.xlarge | 4 | 16 | EBS Only | High
m3.large | 2 | 7.5 | 1 x 32 SSD | Moderate
m3.xlarge | 4 | 15 | 2 x 40 SSD | High
The r3* and r4* series seem similar - e.g.

Instance | vCPU | RAM | Network | Local storage
---|---|---|---|---
r4.xlarge | 4 | 30.5 | Up to 10 Gigabit | EBS-Only
r4.2xlarge | 8 | 61 | Up to 10 Gigabit | EBS-Only
r3.xlarge | 4 | 30.5 | Moderate | 1 x 80
r3.2xlarge | 8 | 61 | Moderate | 1 x 160
In this case, though, since the database is already on an EBS disk, the overhead should be low.
EBS storage costs are apparently unpredictable, because we pay for both storage and I/O (https://www.quora.com/Whats-cons-and-pros-for-EBS-based-AMIs-vs-instance-store-based-AMIs). Some people actively advise against using EBS. And of course, the instance-store-backed instances also have a ton of ephemeral storage and (except for the habitica server) mostly work off static datasets. So for the otp, habitica and nominatim servers, it is pretty much a no-brainer to use the m3 instances.
Unsure whether m3* instances are available for reserved pricing, though: https://aws.amazon.com/ec2/pricing/reserved-instances/pricing/
And the IOPS pricing only applies to provisioned IOPS volumes: https://aws.amazon.com/ebs/pricing/
General purpose (gp2) EBS storage is 10 cents/GB-month. So the additional storage cost for going from *3 -> *4 is:
- m3.large -> m4.large: 32 * 0.1 = max $3.20/month
- m3.xlarge -> m4.xlarge: 80 * 0.1 = max $8/month
- r3.large -> r4.large: 80 * 0.1 = max $8/month
- r3.2xlarge -> r4.2xlarge: 160 * 0.1 = max $16/month

So the additional cost is minimal.
Also, all the documentation says that instance storage is ephemeral, but I know for a fact that when I shut down and restart my m3 instances, the data in the root volume is retained. I do see that apparently all AMIs are currently launched with EBS root volumes by default https://stackoverflow.com/a/36688645/4040267 and this is consistent with what I see in the console.
These root volumes, except for the special database EBS volume, are typically 8 GB in size. Does this mean that m3 instances now include EBS storage by default? Am I paying for them? I guess so, but 8 GB is so small (< 10 cents a month max) that I probably don't notice.
Also, it looks like the EBS-backed instances do have ephemeral storage (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/RootDeviceStorage.html). So we should go with the *3 instances if there are reserved instances that support them - otherwise, we should go with *4 instances - the difference in both cost and functionality is negligible compared to the savings of the reserved instance.
wrt ephemeral storage for instances, they can apparently be added at the time the instance is launched (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/add-instance-store-volumes.html)
> You can specify the instance store volumes for your instance only when you launch an instance. You can't attach instance store volumes to an instance after you've launched it.
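For example, something like the following (a rough sketch - the AMI ID and device name are placeholders) would map an instance store volume at launch time:

# map an ephemeral (instance store) volume while launching; cannot be done afterwards
aws ec2 run-instances \
    --image-id ami-xxxxxxxx \
    --instance-type m3.large \
    --block-device-mappings '[{"DeviceName": "/dev/sdb", "VirtualName": "ephemeral0"}]'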
> So we should go with the *3 instances if there are reserved instances that support them - otherwise, we should go with *4 instances - the difference in both cost and functionality is negligible compared to the savings of the reserved instance.
There are reserved instances that support every single kind of on-demand instance, including *3*.
I looked at one m3 instance and one m4 instance and they both seem to be identical - one block device, which is the root device and is EBS.
Asked a question on serverfault: https://serverfault.com/questions/885042/m3-instances-have-root-ebs-volume-by-default-so-now-what-is-the-difference-betw
But empirically, it looks like there is ephemeral storage on m3 instances but not on m4. So the m3 instance has a 32 GB /dev/xvdb, but the m4 instance does not. So why would you use m4 instead of m3? More storage is always good, right?
ubuntu@ip-10-157-135-115:~$ sudo fdisk -l
Disk /dev/xvda: 8589 MB, 8589934592 bytes
255 heads, 63 sectors/track, 1044 cylinders, total 16777216 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
Device Boot Start End Blocks Id System
/dev/xvda1 * 16065 16771859 8377897+ 83 Linux
Disk /dev/xvdb: 32.2 GB, 32204390400 bytes
255 heads, 63 sectors/track, 3915 cylinders, total 62899200 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
ubuntu@ip-10-157-135-115:~$ mount | grep ext4
/dev/xvda1 on / type ext4 (rw)
$ sudo fdisk -l
Disk /dev/xvda: 8 GiB, 8589934592 bytes, 16777216 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xea059137
Device Boot Start End Sectors Size Id Type
/dev/xvda1 * 2048 16777182 16775135 8G 83 Linux
ubuntu@ip-172-30-0-54:~$ mount | grep ext4
/dev/xvda1 on / type ext4 (rw,relatime,discard,data=ordered)
I am going to create m3.* reserved instances instead of m4.* instances across the board.
For the r3.* versus r4.*, there is actually some question, since the r4.* instance has better network, which is important for a database.
Note that the EBS volume that hosts the database is currently associated with 9216 IOPS. Is that used or provisioned? Let's check. According to the docs:
> baseline performance is 3 IOPS per GiB, with a minimum of 100 IOPS and a maximum of 10000 IOPS.
The volume uses 3072 GB, so this is 3072 * 3 = 9216 = the baseline performance. Let us see the actual performance. No more than 2 IOPS. But of course, we weren't running the pipeline. I am tempted to go with r4.* for the database server, just to be on the safe side.
Given those assumptions, the monthly budget for one installation is:
- aws-otp-server: m3.large ($50) so we have storage
- aws-nominatim: m3.large ($50)
- habitica-server: m3.large ($50)
- aws-em-webapp: m3.xlarge ($90)
- aws-em-analysis: m3.xlarge ($90)
- aws-em-mongodb: r4.2xlarge ($245)
Storage:
- 3072 GB * 0.1 /GB = $307 (biggest expense by far, likely to grow bigger going forward, need to check causes of growth, but may be unavoidable)
- 40 GB * 0.1 / GB = $4 (probably want to put the e-mission server configuration on persistent storage)
- logs can stay on ephemeral storage, which we will have access to given planned m3.* creation
So current total per month:
- $150 shared infrastructure
- $425 compute
- $310 storage, increasing every month
- Total: $885 per month, increasing as we get more storage
When I provision the servers for the eco-escort project, the costs will go up by:
- $425 compute
- $310 storage, increasing every month
- Total increase: $735 per month, increasing as we get more storage
bringing the overall total to $885 + $735 = $1620 per month.
Current mounts on the server:
From the UI, EBS block devices are
/dev/sda1
/dev/sdd
/dev/sdf
$ mount | grep ext4
/dev/xvda1 on / type ext4 (rw,discard)
/dev/xvdd on /home/e-mission type ext4 (rw)
/dev/mapper/xvdb on /mnt type ext4 (rw)
/dev/mapper/xvdc on /mnt/logs type ext4 (rw)
/dev/mapper/xvdf on /mnt/e-mission-primary-db type ext4 (rw)
$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/xvda1 7.8G 5.2G 2.2G 71% /
/dev/xvdd 7.8G 326M 7.1G 5% /home/e-mission
/dev/mapper/xvdb 37G 14G 22G 39% /mnt
/dev/mapper/xvdc 37G 19G 17G 54% /mnt/logs
/dev/mapper/xvdf 3.0T 141G 2.7T 5% /mnt/e-mission-primary-db
$ sudo fdisk -l
Disk /dev/xvda: 8589 MB, 8589934592 bytes
Device Boot Start End Blocks Id System
/dev/xvda1 * 16065 16771859 8377897+ 83 Linux
Disk /dev/xvdb: 40.3 GB, 40256929792 bytes
Disk /dev/xvdb doesn't contain a valid partition table
Disk /dev/xvdc: 40.3 GB, 40256929792 bytes
Disk /dev/xvdc doesn't contain a valid partition table
Disk /dev/xvdd: 8589 MB, 8589934592 bytes
Disk /dev/xvdd doesn't contain a valid partition table
Disk /dev/xvdf: 3298.5 GB, 3298534883328 bytes
Disk /dev/xvdf doesn't contain a valid partition table
Disk /dev/mapper/xvdb: 40.3 GB, 40254832640 bytes
Disk /dev/mapper/xvdb doesn't contain a valid partition table
Disk /dev/mapper/xvdc: 40.3 GB, 40254832640 bytes
Disk /dev/mapper/xvdc doesn't contain a valid partition table
Disk /dev/mapper/xvdf: 3298.5 GB, 3298532786176 bytes
Disk /dev/mapper/xvdf doesn't contain a valid partition table
So it looks like we have 3 EBS devices:
/, which primarily has the OS and /tmp:
2.4G /home
1.7G /tmp
974M /usr
391M /var
$ du -sh /home/*
308M /home/e-mission
2.1G /home/ubuntu
$ du -sh /home/ubuntu/*
1.6G /home/ubuntu/anaconda
393M /home/ubuntu/Anaconda2-4.0.0-Linux-x86_64.sh
4.0K /home/ubuntu/gencert
4.0K /home/ubuntu/tmp
/home/e-mission, which primarily has some logs:
$ du -sm /home/e-mission/*
1 /home/e-mission/app_store_review_test.stdinoutlog
1 /home/e-mission/Berkeley_sections.stdinout.log
1 /home/e-mission/iphone_2_test.stdinoutlog
1 /home/e-mission/lost+found
1 /home/e-mission/migration.log
2 /home/e-mission/moves_collect.stdinoutlog
2 /home/e-mission/pipeline.stdinoutlog
1 /home/e-mission/pipeline_with_perf.log
1 /home/e-mission/precompute_results.stdinoutlog
65 /home/e-mission/remotePush.stdinoutlog
240 /home/e-mission/silent_ios_push.stdinoutlog
/mnt/e-mission-primary-db, which has the database.
And we have two ephemeral volumes:
/mnt, which has the e-mission server install
/mnt/logs, which has the periodic logs
wrt https://github.com/e-mission/e-mission-server/issues/530#issuecomment-346866854,
> I am going to create m3.* reserved instances instead of m4.* instances across the board. For the r3.* versus r4.*, there is actually some question since the r4.* instance has better network, which is important for a database.
It turns out that m4.* is actually cheaper than m3.* (https://serverfault.com/a/885060/437264). The difference for large is $24.09/month (m3.large = $69.35, m4.large = $45.26), which is more than enough to pay for the equivalent EBS storage (~$3/month).
Per https://github.com/e-mission/e-mission-server/issues/530#issuecomment-346766952, we can add ephemeral disks to m4* instances for free when we create them. That settles it: going with m4*.
Creating a staging environment first. This can be the open data environment used by the test phones. Since this is an open data environment, we need an additional server that runs the public ipython notebook server. We can't re-use the analysis server since we need to have a read-only connection to the database.
There is now a new m5 series, so we can just get a head start by deploying to that. It's about the same price, but has much greater EBS bandwidth.
Turns out that we can't create ephemeral storage for these instances, though. I went to the Add Storage tab and tried to add a volume, and the only option was EBS (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/block-device-mapping-concepts.html).
We also need to set up a VPC between the servers so that the database cannot be accessed from the general internet. It looks like the VPC is free as long as we don't need a VPN or a NAT. Theoretically, though, we can just configure the incoming security policy for mongodb, even without a VPC. https://aws.amazon.com/vpc/pricing/
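For example, the restriction could be expressed as a security group rule along these lines (a sketch - the group IDs are placeholders), allowing mongodb traffic only from the webapp's security group:

# allow port 27017 only from instances in the webapp security group
aws ec2 authorize-security-group-ingress \
    --group-id sg-database \
    --protocol tcp --port 27017 \
    --source-group sg-webapp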
I have created:
After deploying the servers, we need to set them up. The first big issue in setup is securing the database server. We will use two methods to secure the server:
Restricting network access (at least naively) is pretty simple - we just need to set up the firewall correctly. Later, we should explore the creation of a VPC for greater security.
wrt authentication, the viable options are:
SCRAM-SHA-1
MONGODB-CR
x.509
The first two are both username/password based authentication, which I am really reluctant to use. There is no classic public-key authentication mechanism.
I am reluctant to use the username/password based authentication because then I would need to store the password in a filesystem somewhere and make sure to copy/configure it every time. But in terms of attack vector, it seems around the same as public-key based authentication.
If the attacker gets access to the connecting hosts (webapp or analysis), it seems like she would have access to both the password and the private key.
The main differences are:
We can avoid this by encrypting connections between the database and the webapp. This may also allow us to use x.509 based authentication
We can do this, but we need to get SSL certificates for TLS-based encryption. I guess a self-signed certificate should be fine, since the mongodb is only going to be connected to the analysis and webapp hosts, which we control. But we can also probably avoid it if all communication is through an internal subnet on the VPC.
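If we do go down the self-signed certificate route, it would presumably look something like this (a sketch; the CN and file names are placeholders):

# generate a self-signed certificate and combine it with the key,
# since mongod expects the key and certificate in a single PEM file
openssl req -newkey rsa:2048 -nodes -x509 -days 365 \
    -subj "/CN=aws-em-mongodb" \
    -keyout mongodb.key -out mongodb.crt
cat mongodb.key mongodb.crt > mongodb.pem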
Basically, it seems like there are multiple levels of hardening possible:
1. configure incoming and outgoing connections in the firewall, no auth. Ease of use: 6 (easy, simple security group UI). Security: 1 (weak, since data transfer flows over the public internet without encryption).
2. listen only to the private IP, all communication to/from the database is in the VPC, no auth. Ease of use: 4 (can set up VPC via UI). Security: 5 (pretty good, since all unencrypted data flow is internal. The only attack vector is if the hacker somehow compromises any of the services. Once this is done, she can either connect to the database directly, or run a packet sniffer on the network).
3. listen only to the private IP, all communication to/from the database is in the VPC, SSL certificates used, no auth. Ease of use: 1 (need to get SSL certificates and set up a bunch of configuration). Security: 7 (pretty close to optimal, since even packet sniffers can't see anything).
If we use option 2+ above, adding authentication does not appear to provide very much additional protection from external hackers. Assuming no firewall bugs, if a hacker wants to access the database, they need to first hack into one of the service hosts to generate the appropriate source header. And if they do that, they can always just see the auth credentials in the config file.
However, it can prevent catastrophic issues if there really is a firewall or VPC bug, and a hacker is able to inject malicious packets that purportedly come from the service hosts. Unless there is an encryption bug, moving to option (3) will harden the option further.
Authentication seems most useful when it is combined with Role-Based Access Control (RBAC). RBAC can be used to separate read-only exploration (e.g. on a public server) from read-write computation. But it can go beyond that - we can make the webapp write to the timeseries and read-only from the aggregate, but make the analysis server read-only from the timeseries and write to the analysis database.
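For reference, a sketch of what that kind of separation might look like in the mongo shell (the user names and passwords here are placeholders; the actual user creation for this repo is scripted in setup/db_auth.py):

mongo admin --eval '
  db.createUser({user: "webapp_rw",   pwd: "<...>", roles: [{role: "readWrite", db: "Stage_database"}]});
  db.createUser({user: "notebook_ro", pwd: "<...>", roles: [{role: "read",      db: "Stage_database"}]});
'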
wrt https://github.com/e-mission/e-mission-server/issues/530#issuecomment-350631885, given the tradeoffs articulated, I have decided to go with option (2) with no auth.
> listen only to the private IP, all communication to/from the database is in the VPC, no auth. Ease of use: 4 (can set up VPC via UI). Security: 5 (pretty good, since all unencrypted data flow is internal).
It looks like all instances created in the past year are assigned to the same VPC and the same subnet in the VPC (http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/default-vpc.html). In general, we don't want to share the subnet with other servers, because then if a hacker got access to one of the other subnets, they could packet sniff all the data and potentially figure out the data. For the open data servers, this may be OK since the data is open, and we have firewall restrictions on where we can get messages from.
But what about packet spoofing and potentially deleting data? Let's just make another (small) subnet.
I can't seem to find a way to list all the instances in a particular subnet. Filed https://serverfault.com/questions/887552/aws-how-do-i-find-the-list-of-instances-associated-with-a-particular-subnet
Ok, just to experiment with this for the future, we will set up a small subnet that hosts only the database and the analysis server.
From https://aws.amazon.com/vpc/faqs/:
> The minimum size of a subnet is a /28 (or 14 IP addresses) for IPv4. Subnets cannot be larger than the VPC in which they are created.
The scenario we want is the
> multi-tier website, with the web servers in a public subnet and the database servers in a private subnet.
So basically, this scenario: http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Scenario2.html
Wait - the analysis server cannot be in the private subnet then, because it needs to talk to external systems such as habitica and the real-time bus information etc. We should really split the analysis server across two subnets too - external facing and internal facing. But since that will require some additional software restructuring, let's just put it in the public subnet for now.
I won't provision a NAT gateway for now - will explore ipv6-only options which will not require a (paid) NAT gateway and can use the (free) egress-only-internet gateway. http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/egress-only-internet-gateway.html
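For reference, the equivalent CLI calls would look roughly like this (IDs and CIDR blocks are placeholders); in practice, the VPC wizard below creates the subnets for us:

# small private subnet for the database + egress-only gateway for (free) IPv6-only outbound traffic
aws ec2 create-subnet --vpc-id vpc-xxxxxxxx --cidr-block 192.168.1.0/28
aws ec2 create-egress-only-internet-gateway --vpc-id vpc-xxxxxxxx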
Ok so I followed the VPC wizard for scenario 2 and created:
- aws-op-vpc
- aws-op-public-subnet, aws-op-private-subnet
- aws-op-public-route, aws-op-private-route

Only aws-op-private-subnet has IPv6 enabled.
aws-op-public-route was associated with aws-op-public-subnet, but aws-op-private-route was marked as main and not associated with any subnet. That is consistent with:
> In this scenario, the VPC wizard updates the main route table used with the private subnet, and creates a custom route table and associates it with the public subnet.
> In this scenario, all traffic from each subnet that is bound for AWS (for example, to the Amazon EC2 or Amazon S3 endpoints) goes over the Internet gateway. The database servers in the private subnet can't receive traffic from the Internet directly because they don't have Elastic IP addresses. However, the database servers can send and receive Internet traffic through the NAT device in the public subnet.
> Any additional subnets that you create use the main route table by default, which means that they are private subnets by default. If you want to make a subnet public, you can always change the route table that it's associated with.
The default wizard configuration turns off "Auto-assign Public IP" because the assumption appears to be that we will use elastic IPs. Testing this scenario by editing the network interface for our provisioned servers and then turning it on later or manually assigning IPs.
Turns out you can't edit the network interface, but you can create a new one and attach the volumes.
IP: 54.196.134.233. Able to ssh in.
Launched an m5.xlarge instance into aws-op-vpc / aws-op-public-subnet and overrode the assignment settings for public IP and ipv6. Ah!
> You can only use the auto-assign public IPv4 feature for a single, new network interface with the device index of eth0. For more information, see Assigning a Public IPv4 Address During Instance Launch.
No matter - that is what I want.
- Ensure that the security group allows ssh from the webserver.
- Try to ssh from the webserver. Works!
- Try to ssh from the analysis server. Doesn't work!
- Try to ssh to the private address from outside. Obviously doesn't work.
- Tighten up the outbound rules on all security groups to be consistent with http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Scenario2.html
Couple of modifications needed for this to work:
- an outbound ssh rule from the webapp to the database server, to allow us to log in
- DNS resolution needed to be enabled for the VPC. Looking at http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/vpc-dns.html, DNS resolution is supposed to be enabled for VPCs created through the wizard, but it was off for our VPC even though it was created using the wizard. (A CLI sketch for this follows the ping output below.)
$ ping www.google.com
PING www.google.com (172.217.13.228) 56(84) bytes of data.
64 bytes from iad23s61-in-f4.1e100.net (172.217.13.228): icmp_seq=1 ttl=45 time=1.61 ms
64 bytes from iad23s61-in-f4.1e100.net (172.217.13.228): icmp_seq=2 ttl=45 time=1.61 ms
64 bytes from iad23s61-in-f4.1e100.net (172.217.13.228): icmp_seq=3 ttl=45 time=1.59 ms
64 bytes from iad23s61-in-f4.1e100.net (172.217.13.228): icmp_seq=4 ttl=45 time=1.64 ms
^C
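For the record, the DNS fix above can also be applied from the CLI - roughly (the VPC ID is a placeholder):

# enable DNS resolution for the VPC
aws ec2 modify-vpc-attribute --vpc-id vpc-xxxxxxxx --enable-dns-support '{"Value": true}'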
DNS servers only support ipv4, so if we want to access the internet from the private subnet, we need to continue using the NAT gateway instance that the wizard set up for us.
[ec2-user@ip-192-168-1-100 ~]$ ping www.google.com
PING www.google.com (172.217.8.4) 56(84) bytes of data.
<HANGS>
^C
--- www.google.com ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 4081ms
This is because the incoming rules for the NAT only supported the default security group. Changing it to the database security group caused everything to start working.
Attaching the database volumes back, and then I think that setup is all done. I'm a bit unhappy about the NAT, but figuring out how to do DNS for ipv6 addresses is a later project, I think.
Cannot attach the volumes because they are in a different availability zone from the new instance. Per http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumes.html you need to migrate the volumes to a different zone using their snapshots, and our volumes don't have snapshots. Creating snapshots to explore this option... can't create a snapshot - selected the volume and nothing happened. So it looks like provisioned IOPS volumes have their snapshots under "Snapshots", not linked to the volume.
Restoring.... That worked. Attached the three volumes back to the database.
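Roughly, the snapshot-based migration looks like this (the IDs, zone, and device name are placeholders):

# snapshot the old volume, recreate it in the target AZ, and attach it to the new instance
aws ec2 create-snapshot --volume-id vol-xxxxxxxx --description "migrate db volume to new AZ"
aws ec2 create-volume --snapshot-id snap-xxxxxxxx --availability-zone us-east-1b
aws ec2 attach-volume --volume-id vol-yyyyyyyy --instance-id i-zzzzzzzz --device /dev/sdf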
Getting started with code now...
Main code changes required:
Changes to the server are done (https://github.com/e-mission/e-mission-server/pull/535) - now it is time to deploy!
installing mongodb now...
Configuring mongodb to listen to the other private IP addresses requires appending them to bindIp. Since we also have an ipv6 address on the private subnet, we also need to do this workaround: https://dba.stackexchange.com/questions/173781/bind-mongodb-to-ipv4-as-well-as-ipv6/192406
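The relevant part of /etc/mongod.conf then looks roughly like this (the addresses are placeholders for the instance's private IPv4 and IPv6 addresses):

net:
  port: 27017
  bindIp: 127.0.0.1,192.168.1.100,fd00:ec2::100
  ipv6: true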
began configuring the webapp
The attached EBS volumes show up as NVMe devices (/dev/nvme0n1p...); this is true of m5 instances: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html
Set them up as /code and /logs.
Connections worked! Wrote a script to set up and tear down permissions for a basic scenario.
While waiting for that script to finish running, collecting the public data for the test phones: 4 android phones, 4 iPhones, 4 JF phones.
We can use bin/debug/extract_timeline_for_day_range_and_user.py, but need to figure out the date range. Or we can just use a time range that extends from before I started work on the project to now, and it should work.
The current webapp server was set up in August 2014, and the first big data collection is from Dec 2015. The emails to Kattt about the phones were from Nov 2015. So starting from January 2015 should be good enough.
Create a file with the 8 public email addresses (attached). Mapped the email addresses to uuids.
$ ./e-mission-py.bash bin/public/extract_challenge_uuids.py file_public_ids file_public_ids.uuid
DEBUG:root:Mapped email ucb.sdb.android.1@gmail.com to uuid e471711e-bd14-3dbe-80b6-9c7d92ecc296
DEBUG:root:Mapped email ucb.sdb.android.2@gmail.com to uuid fd7b4c2e-2c8b-3bfa-94f0-d1e3ecbd5fb7
DEBUG:root:Mapped email ucb.sdb.android.3@gmail.com to uuid 86842c35-da28-32ed-a90e-2da6663c5c73
DEBUG:root:Mapped email ucb.sdb.androi.4@gmail.com to uuid 3bc0f91f-7660-34a2-b005-5c399598a369
DEBUG:root:Mapped email ucb.sdb.iphone.1@gmail.com to uuid 079e0f1a-c440-3d7c-b0e7-de160f748e35
DEBUG:root:Mapped email ucb.sdb.iphone.2@gmail.com to uuid c76a0487-7e5a-3b17-a449-47be666b36f6
DEBUG:root:Mapped email cub.sdb.iphone.3@gmail.com to uuid c528bcd2-a88b-3e82-be62-ef4f2396967a
DEBUG:root:Mapped email ucb.sdb.iphone.4@gmail.com to uuid 95e70727-a04e-3e33-b7fe-34ab19194f8b
DEBUG:root:Mapped email nexus7itu01@gmail.com to uuid 70968068-dba5-406c-8e26-09b548da0e4b
DEBUG:root:Mapped email nexus7itu02@gmail.com to uuid 6561431f-d4c1-4e0f-9489-ab1190341fb7
DEBUG:root:Mapped email motoeitu01@gmail.com to uuid 92cf5840-af59-400c-ab72-bab3dcdf7818
DEBUG:root:Mapped email motoeitu02@gmail.com to uuid 93e8a1cc-321f-4fa9-8c3c-46928668e45d
Then extracted from the uuid list.
$ ./e-mission-py.bash bin/debug/extract_timeline_for_day_range_and_user.py file_public_ids.uuid 2015-01-01 2017-12-31 /tmp/public_data/dump 2>&1 | tee /tmp/dump_public_data.log
....
While waiting for the extraction to complete, setup users on the webapp.
(emission) ubuntu@ip-192-168-0-80:/code/e-mission-server$ ./e-mission-py.bash setup/db_auth.py -s
Created admin user, result = {'ok': 1.0}
At current state, list of users = {'users': [{'_id': 'admin.<...admin...>', 'user': '<...admin...>', 'db': 'admin', 'roles': [{'role': 'userAdminAnyDatabase', 'db': 'admin'}]}], 'ok': 1.0}
Created RW user, result = {'ok': 1.0}
At current state, list of users = {'users': [{'_id': 'admin.<...admin...>', 'user': '<...admin...>', 'db': 'admin', 'roles': [{'role': 'userAdminAnyDatabase', 'db': 'admin'}]}, {'_id': 'admin.<...rw...>', 'user': '<...rw...>', 'db': 'admin', 'roles': [{'role': 'readWrite', 'db': 'Stage_database'}]}], 'ok': 1.0}
Created new role, result = {'ok': 1.0}
At current state, list of roles = {'roles': [{'role': 'createIndex', 'db': 'Stage_database', 'isBuiltin': False, 'roles': [], 'inheritedRoles': [], 'privileges': [{'resource': {'db': 'Stage_database', 'collection': ''}, 'actions': ['createIndex']}], 'inheritedPrivileges': [{'resource': {'db': 'Stage_database', 'collection': ''}, 'actions': ['createIndex']}]}], 'ok': 1.0}
Created RO user, result = {'ok': 1.0}
At current state, list of users = {'users': [{'_id': 'admin.<...admin...>', 'user': '<...admin...>', 'db': 'admin', 'roles': [{'role': 'userAdminAnyDatabase', 'db': 'admin'}]}, {'_id': 'admin.<...ro...>', 'user': '<...ro...>', 'db': 'admin', 'roles': [{'role': 'readWrite', 'db': 'Stage_database'}]}, {'_id': 'admin.<...rw...>', 'user': '<...rw...>', 'db': 'admin', 'roles': [{'role': 'readWrite', 'db': 'Stage_database'}]}], 'ok': 1.0}
Now, configure the webapp as follows:
conf/log/webserver.conf.sample
conf/net/api/webserver.conf.sample
conf/net/ext_service/habitica.json.sample
conf/net/ext_service/nominatim.json.sample
conf/net/ext_service/push.json.sample
conf/storage/db.conf.sample
conf/clients/testclient.settings.json.sample: no client-specific functionality
conf/log/intake.conf.sample: not going to run the intake pipeline
conf/net/auth/google_auth.json.sample: going to use `skip` auth mode
conf/net/auth/openid_auth.json.sample: going to use `skip` auth mode
conf/net/auth/token_list.json.sample: going to use `skip` auth mode
conf/net/ext_service/googlemaps.json.sample: not using googlemaps for anything
conf/net/keys.json.sample: not using SSL
Remaining conf files removed as part of https://github.com/e-mission/e-mission-server/pull/537
Now, turn on auth on the database, restart, and ensure that the access controls behave as expected:
In [1]: import emission.core.get_database as edb
In [2]: edb.get_timeseries_db().find()
Out[2]: <pymongo.cursor.Cursor at 0x7f292cb2fbe0>
In [3]: edb.get_timeseries_db().find().count()
Out[3]: 0
In [6]: conn = pymongo.MongoClient("mongodb://<...admin...>:<...admin-pw...>@<hostname>/admin?authMechanism=SCRAM-SHA-1")
...:
In [7]: conn.Stage_database.Stage_timeseries.find().count()
---------------------------------------------------------------------------
OperationFailure Traceback (most recent call last)
<ipython-input-7-9583ab7660a8> in <module>()
...
OperationFailure: not authorized on Stage_database to execute command { count: "Stage_timeseries", query: {} }
So auth is configured correctly!
Tried to access the web page. Couple of fixes:
- Had to update six to version 1.11. It is currently in conda, but upgrading it will switch to a custom version of anaconda. Might just do this with a separate pip command instead. PR forthcoming...
- Had to install bower (https://stackoverflow.com/questions/21491996/installing-bower-on-ubuntu)
- Had to enable outbound requests to port 9418 (git protocol) from aws-op-webapp to support retrieving bower packages.
works! sent email to rise-support asking for a DNS name.
wrt https://github.com/e-mission/e-mission-server/issues/530#issuecomment-351561385, worked except for iphone1, for which the cursor timed out. Retrieving it in stages (2015, 2016 and 2017 separately) instead.
$ ./e-mission-py.bash bin/debug/extract_timeline_for_day_range_and_user.py 079e0f1a-c440-3d7c-b0e7-de160f748e35 2015-01-01 2015-12-31 /tmp/public_data/dump_2015
$ ./e-mission-py.bash bin/debug/extract_timeline_for_day_range_and_user.py 079e0f1a-c440-3d7c-b0e7-de160f748e35 2016-01-01 2016-12-31 /tmp/public_data/dump_2016
...
Originally failed logs + retry attached. dump_public_data.continue.1.log.gz dump_public_data.log.gz
I also need to migrate over the pipeline state so that we don't spend a lot of time re-running the pipeline for the existing data. Alternatively, we could delete the analysis results and re-run the pipeline to debug the pipeline running, which would also give us a sense of the scalability of the new split server.
This is only data for 12 phones, and only intermittent data at that. And the brazil open data stuff is really only for a week, so pretty small too. Let's go ahead and re-run the pipeline.
We could always dump and re-restore the values if we needed to. This is the whole reproducibility aspect after all :)
Dun-dun-dun!
continuing with https://github.com/e-mission/e-mission-server/issues/530#issuecomment-351747819, set up supervisord. It only runs on python 2.7, so created a new python 2.7 environment to run it.
$ conda create -n py27 python=2.7 anaconda
$ source activate py27
$ pip install supervisor
$ echo_supervisord_conf
$ echo_supervisord_conf > supervisord.conf
$ vim supervisord.conf
(add emissionpy section)
$ supervisord -c supervisord.conf
webapp still runs fine.
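The emissionpy section is along these lines (a sketch - the program name, paths, and command are assumptions about this particular install):

[program:emissionpy]
; run the e-mission webapp under the emission conda env (paths are placeholders)
command=/home/ubuntu/anaconda/envs/emission/bin/python emission/net/api/cfc_webapp.py
directory=/code/e-mission-server
user=ubuntu
autostart=true
autorestart=true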
setup the filesystem
$ sudo mkfs.ext4 /dev/nvme1n1
$ sudo mkfs.ext4 /dev/nvme2n1
fstab to mount attached EBS volumes
/dev/nvme1n1 /code ext4 defaults,auto,noatime,exec 0 0
/dev/nvme2n1 /log ext4 defaults,auto,noatime,noexec 0 0
Then mount them!
$ sudo mount /code
$ sudo mount /log
And change permissions
$ sudo chown -R ubuntu:ubuntu /code
$ sudo chown -R ubuntu:ubuntu /log
Configuration for the analysis server
conf/log/intake.conf.sample
conf/net/ext_service/habitica.json.sample
conf/net/ext_service/nominatim.json.sample
conf/net/ext_service/push.json.sample
conf/storage/db.conf.sample
conf/clients/testclient.settings.json.sample
conf/log/webserver.conf.sample
conf/net/api/webserver.conf.sample
conf/net/auth/google_auth.json.sample
conf/net/auth/openid_auth.json.sample
conf/net/auth/token_list.json.sample
conf/net/ext_service/googlemaps.json.sample
While setting up the public server, we don't want the analysts messing around with the install, so we will run ipython under a separate account (analyst). Since we just really need the access control, this account should have no password (no additional attack vectors). So we will create it as a system user.
https://unix.stackexchange.com/questions/56765/creating-an-user-without-a-password
$ sudo adduser \
--system \
--shell /bin/bash \
--gecos 'User for running the notebooks' \
--group \
--disabled-password \
--home /notebooks \
analyst
Adding system user `analyst' (UID 112) ...
Adding new group `analyst' (GID 116) ...
Adding new user `analyst' (UID 112) with group `analyst' ...
Not creating home directory `/home/analyst'.
$ sudo -s -u analyst
$ source activate emission
$ export HOME=/notebooks
$ ./e-mission-jupyter.bash notebook --notebook-dir=/notebooks
Only configuration for the public server is the database, since we won't be running any ongoing services. Set up a password because we are making this available publicly. Using the simple version instead of jupyterhub. http://jupyter-notebook.readthedocs.io/en/stable/public_server.html
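The password setup from the linked instructions is essentially the following, run as the analyst user so the config lands in its home directory:

$ jupyter notebook --generate-config
$ jupyter notebook password    # prompts for a password and stores the hash in ~/.jupyter/jupyter_notebook_config.json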
There are issues with storing the notebooks in a separate directory. Since the kernel is started in the notebooks directory, the relative path conf/storage/db.conf does not exist. I tried specifying --notebook-dir, and linking the /notebooks directory, but neither of them worked. Modifying some of the paths may help; need to experiment. For now, changed the emission.core.get_database code to use the absolute path (/code/e-mission-server/conf/...).
Now let's see whether we can make a template for this setup. It seems like it would be pretty useful for a more comprehensive data collection effort, and there should be some option that allows you to create a cluster of VMs.
Aha, there is AWS CloudFormation, in which you can use a designer to create a virtual appliance. https://aws.amazon.com/cloudformation/
Can I create one from my current configuration? Apparently, I can use cloudformer, which I have to install into an instance in EC2 and then it can create a template for me. Let's create this now. http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-using-cloudformer.html
This adds new IAM roles, which I need to delete after I am done. First attempt at creating failed - second succeeded.
Logged in - it is now "analysing resources"
The server scalability had deteriorated to the point where we were unable to run the pipeline even once per day. While part of this is probably just the way we are using mongodb, part of it is also that the server resources are running out.
So I turned off the pipeline around a month ago (last run was on 2017-10-24 21:41:18). Now, I want to re-provision with a better, split architecture, and reserved instances for lower costs.