This repository contains the Shelby GT API server code.
Sitemaps are generated and stored on the API server, since it already has all the Rails models available for reading information easily from the database.
But since the sitemaps all point to shelby.tv URLs instead of api.shelby.tv URLs, a redirect is maintained in the shelby.tv nginx config file to make it easy to submit a sitemap index URL that matches the shelby.tv domain.
Robots.txt has also been modified to specify the sitemap index file. Both api.shelby.tv and shelby.tv are verified domains inside of Google Webmaster Tools.
The main goal during generation is to iterate over all videos in the Shelby database and add URLs to the sitemap for each video's SEO page.
The sitemap_generator
gem (https://github.com/kjvarga/sitemap_generator) is used to create the sitemap index and sitemap files.
The code for generating the sitemap is at shelby_gt/config/sitemap.rb
. The code here is mostly straightforward -- the only important concept to grasp is that iterating over all the videos in the database would consume too much memory if the mongomapper identity map were allowed to cache all the objects. So the code explicitly clears the monogmapper identity map to make sure memory usage is kept to a minimum.
The basic process of generating the sitemap is:
ssh gt@api.shelby.tv
cd api/current
RAILS_ENV=production rake sitemap:refresh:no_ping
Originally both web (just URL) and video sitemap files were generated. Video sitemaps must contain much more data -- the video title, description, thumbnail URL, etc.
Generating the video sitemap takes much more time and storage space.
Google can just as easily parse the video metadata out from the page using the metadata from the player's iframe, or using metadata on the containing page in RDFa, Facebook Share, or schema.org format.
So for large amounts of videos, it seems preferable to only generate a web sitemap and trust the search engines to correctly parse the metadata to be included in video search.
The sitemap files currently are generated into and stored inside api/current/public/system
. This public/system
directory stays constant throughout API server updates, so the sitemap files don't need to be regenerated all the time.
For a since of storage requirments -- 18.5 million URLs currently take roughly 560 MB of storage for the sitemap files.
An example runtime for generating web-only (no video) sitemaps is:
Sitemap stats: 18,483,514 links / 370 sitemaps / 400m42s
One could make incremental sitemap additions much faster by storing the mongodb id of the last video processed in a file. The sitemap generation code could be altered to read in this id and only add new videos to the existing sitemap.
The C APIs provide a more efficient way of querying the Shelby database for substantial amounts of data, processing that data, and returning it as a JSON data structure to the calling code.
Rails, accessing mongodb through mongomapper, must convert the mongodb bson data into a Ruby (Rails) data structure. Eventually this goes through other manipulations to convert the Ruby data structure back into JSON for output.
The C API code avoids one translation of the data by keeping the mongodb data in its native bson format. It also is more efficient at doing the JSON conversion, yielding overall much improved performance over Rails APIs performing the same tasks.
The C code itself is stored in the badly named cpp
directory (it was named when there was still C++ code in the mix as well).
Source code is in cpp/source
or cpp/lib
. Sconstruct
and the site_scons
directory contain the scons code for building. scons
is what the cool kids use nowadays -- only old school folks use make
.
Two of the libraries used by the C APIs (mongo-c-driver
and yajl
) are actually git submodules. To initialize and update the git submodules, one must do the following:
git submodule init
from the top level of shelby_gt
git submodule update
from the top level of shelby_gt
These commands must complete successfully to guarantee functionality. If you see any errors during this process, you must debug and resolve them.
These commands are safe to run multiple times, so if you've forgotten whether or not you've done so on a particular tree, you can always run them again.
It should be safe to run these commands from either inside the vagrant VM or just from your Mac, as long as git is installed and working in both places.
Once the submodules have been initialized and updated, the next step is ensuring that you have scons
available inside your vagrant VM. Run this command to install scons
:
sudo apt-get install scons
Once this successfully completes, inside your vagrant VM, do the following:
cd shelby_gt/cpp
scons
The C APIs should build successfully.
To clean up the build tree completely, removing all build output, do:
cd shelby_gt/cpp
scons -c
Part of the building process is that final binaries are copied into cpp/bin
. These are being checked into the git tree at the moment so that compilation doesn't have to happen during API server deployment.
The build process also automatically generates debug binaries with complete debugging symbols for use with gdb
. These are located in the cpp/build
directory after running scons
and are named with the API executable name with -debug
appended.
The mongo-c-driver
is a very simple mongodb
driver. One feature it lacks is the ability to automatically balance requests between the primary and secondary servers in a replica set. It also has issues during primary failover if the primary machine is completely unreachable, since it waits for the HTTP timeout of the first machine in its replica set list before trying other machines.
To get around these issues, the C APIs connect to locally running mongos
processes that are designed to handle the connections to replica sets and shards, doing proper failover and load-balancing.
Also, it's important to note that the C APIs all take an environment parameter which determines the database servers to use. This allows the same binary to be used for connecting to development, test, and production databases.
The main remaining hack is in building the mrjson
library and having it properly consume the yajl
library headers.
Inside of cpp/lib/mrjson/yajl-hacks
is a directory full of symlinks to header files that get populated with git submodule update
. The mrjson
library consumes the yajl
header files from this location to avoid some silly problems with paths.
Normally when yajl
is installed as a system library, this isn't a problem. But in our case, we compile all C API executables as static programs, both for ease-of-use as well as launch speed, so consuming the correct version of yajl
and its headers from the source tree is preferrable.
The C APIs are executed from within Rails Metal Controllers. Mostly these controllers are just taking URL parameters and passing them to the C APIs.
The C APIs are just executed using backticks to execute the programs and capture the output. Launching C APIs is somewhat slow (~30-50 ms), but some time is shaved off by using Rails Metal Controllers, which load less Rails code and save ~30ms over regular Rails controllers.
Because of the current launch time for C APIs, it only makes sense to convert APIs that have significant runtimes in Rails. See the next section for modifications that could be made to make almost any APIs faster in C.
Note that having the controllers contain the word "Metal" in them makes them easy to identify in NewRelic performance stats.
In an ideal world, the C APIs would be even faster by making them be continuously-running single API-servers that communicate with nginx over FastCGI.
Basically, FastCGI support would be added to each API program, and it would be run as a long-running daemon process on the API server machine. Requests to these C APIs would bypass Rails entirely.
The main problem with this approach is w.r.t. authentication and cookies. Rails by default uses its own Marshal format for cookies, for which there doesn't seem to be an existing C library that's easy to use.
However, other folks have handled split Rails/C authentication by changing the default Rails serialization format to another format easily understood by multiple languages -- MessagePack (http://msgpack.org/) seems to be a popular option.
In addition to this, the C APIs would need extra work to ensure no memory leaks, become multithreaded if necessary to handle more concurrent requests, etc.