Closed harlantwood closed 12 years ago
I'm excited to see this work proceed. Some small points:
I got a version of SFW up and running on Heroku: http://oyp-htw-sfw.herokuapp.com/
with minimal changes: https://github.com/harlantwood/Smallest-Federated-Wiki/commits/heroku
-- Note that this is a test only; none of the major work described above has been done yet. Wiki changes will appear to save correctly, but due to the Heroku "ephemeral filesystem", all changes will be blown away on application restart.
Some notes on the recursive calls:
view as the site also work fine, eg: http://oyp-htw-sfw.herokuapp.com/view/welcome-visitors/view/smallest-federated-wiki
Of course, in the example above you could substitute view for the last oyp-htw-sfw.herokuapp.com -- but not so in a farm situation, eg: http://john.wiki.me/view/johns-page/bob.wiki.me/bobs-page -- if wiki.me was running on a thin server, this would crash.
So the solution seems to be: when the farm page is available to the server locally, don't try to make a remote request for it.
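A minimal sketch of that short-circuit, assuming a farm keeps each site's pages under data/farm/<host>/pages/<slug> (the layout shown in the Couch paths in this thread). The names local_page_path and fetch_page, and the remote_fetch callable, are hypothetical and not taken from the SFW code:

```ruby
# Hypothetical helper: where a farm would keep a page for a given host.
# A real server would also validate host and slug before building a path.
def local_page_path(farm_root, host, slug)
  File.join(farm_root, host, 'pages', slug)
end

# Serve the page from disk when this server already hosts the site, and
# only fall back to an HTTP request (remote_fetch) for sites it does not have.
def fetch_page(farm_root, host, slug, remote_fetch)
  path = local_page_path(farm_root, host, slug)
  return File.read(path) if File.file?(path)

  remote_fetch.call(host, slug)
end
```

Because the local case never opens a connection back to the same process, a single-threaded server is not asked to answer a request while it is still busy serving one.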
Cool. I added a couple of paragraphs. Editing works fine. How about we have a free-for-all up there knowing that it will all go away?
Regarding remote requests, I've been tempted to add the short-circuit logic but stopped myself on philosophical grounds, at least until we've exhausted alternatives.
Experts have confirmed that this is a well known problem.
I've reverted to Webrick when I run farms. It does not exhibit this behavior.
Node.js does not exhibit the problem either.
Hmmm, the server side does seem to be the place you need to fix this. I'm not sure the client can really know if two domain names are on the same server. It could compare IP addresses, but that still isn't a guarantee, since a server may be handling many IP addresses.
Remind me why we care about running on thin server?
@WardCunningham I can understand the philosophical objections. I'm using Passenger to run a farm, and that also handles recursive calls just fine.
@GerryG thin is what Heroku uses, so we only care if we want to be Heroku deployable.
I'm not sure we need to support farms on Heroku. Heroku is already a farm.
Well, at least it would be doable as single server. That may be the desirable pattern anyway for a service like heroku, having lots of people deploying small free single servers.
One advantage of not having a farm is that the Heroku instance supports only one user.
Is there some simple way to identify the owner of the Heroku instance from information in their web requests? If so we could dispense with the claim mechanism which requires setting up (or having) an OpenID from another source.
I took a crack at unrolling the recursive calls (before reading your responses above). Here is the new code: https://github.com/harlantwood/Smallest-Federated-Wiki/compare/480f83c...5238d8f (Note that the state of current code is to get feedback only -- if we went this direction I would expect to DRY it up, etc.)
Because you can't access the Heroku file system directly to add the data/farm directory, I added a FARM_MODE environment variable which indicates the same thing.
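As a rough sketch of how the two signals could be combined (not the code in the linked diff): the helper name farm_mode? and the set of accepted flag values are assumptions made for illustration.

```ruby
# Hypothetical helper: farm mode is on when FARM_MODE is set (for hosts
# like Heroku where the directory cannot be created by hand) or when the
# data/farm directory exists (the filesystem convention).
# The accepted flag values below are an assumption.
def farm_mode?(data_root, env = ENV)
  flag = env['FARM_MODE'].to_s.strip.downcase
  return true if %w[1 true yes on].include?(flag)

  File.directory?(File.join(data_root, 'farm'))
end
```

Passing env as a parameter keeps the check easy to exercise in a spec without touching the real environment.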
I added a domain name to the heroku app so I could try out subdomains in farm mode. Recursive calls now work:
Given that x.forkthis.net and y.forkthis.net exist in the data/farm directory (ie they have been visited)
Then recursive calls like http://x.forkthis.net/view/welcome-visitors/y.forkthis.net/welcome-visitors will work as expected.
The only URLs that would still be problematic would be those referencing subdomains that do not yet exist in the data/farm dir -- eg http://x.forkthis.net/view/welcome-visitors/brand-new-subdomin.forkthis.net/welcome-visitors (will crash)
Question: is there ever a case where we create a subdirectory in the data/farm directory which the server does not own? eg do we ever cache remote servers' pages in our data/farm directory?
@WardCunningham and @nrn, I can see the advantages of only supporting non-farm mode on Heroku. On the other hand, if we did go in the direction of unrolling recursive calls, there is a significant performance gain, both speed for the end user, and removing the load of the server making HTTP calls to itself.
Interesting idea @WardCunningham about the claim mechanism on Heroku. The idiomatic way to interact with your Heroku app is through environment variables, so from the command line the app owner could say:
heroku config:set OPEN_ID=http://myidentitysite.me/
I was hoping that we could abandon OpenID when the user clearly has ownership of the Heroku instance. I guess it is still required to "share a secret" between client and server somehow. I'm having trouble thinking of any more convenient way.
I'm not sure I understand, Ward. What would "the user clearly has ownership" mean? It has to mean the possession of some sort of authentication token, typically a login or session.
Share a secret isn't exactly the protocol. The server needs to know who to trust, and it can validate you by knowing a couple of public facts. Using the identity of the user and trusted public key registrars, it can establish a shared secret between the server and whoever possesses the private key for the public identity involved.
How does the server find out the identity of a trusted user? The command-line OPEN_ID pattern seems like one reasonable pattern, but you could also store the identity in the claim operation, now it's in the database, right?
I did a spike on storing and retrieving page data in CouchDB
The Couch document ID is currently the absolute path of the file, eg
/app/data/farm/1.sfw.forkthis.net/pages/new-pages
...which should become the path relative to the app root, eg
data/farm/1.sfw.forkthis.net/pages/new-pages
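One way the relative ID could be derived, sketched with Ruby's standard Pathname. The helper name doc_id_for is hypothetical, and /app as the app root is taken from the example path above:

```ruby
require 'pathname'

# Hypothetical helper: turn an absolute file path into a document ID
# that is relative to the application root.
def doc_id_for(absolute_path, app_root)
  Pathname.new(absolute_path).relative_path_from(Pathname.new(app_root)).to_s
end

doc_id_for('/app/data/farm/1.sfw.forkthis.net/pages/new-pages', '/app')
# => "data/farm/1.sfw.forkthis.net/pages/new-pages"
```

Relative IDs would stay valid if the app were deployed under a different root directory.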
The Couch document contains just a "data" key containing the file as a string:
1.9.2-p290 :041 > puts $couch.get("/app/data/farm/1.sfw.forkthis.net/pages/new-pages")['data']
{
"title": "new pages",
"story": [
{
"type": "paragraph",
"id": "cfe1dde740b6185e",
"text": "I am a new page"
}
],
"journal": [
...
Next steps:
@WardCunningham and @GerryG, I don't think we need to change the current OpenID strategy for Heroku -- I think the current code will work fine. Even the one-line change I suggested to the OpenID::Store above I now believe to be unnecessary.
Thanks for taking persistence on.
In our modularization efforts we will want to end up where Local Storage in the browser, flat files on a personal laptop, and recognized document database in the cloud are captured as variations on a theme. The strengths of each are as follows:
Once we have these tiers normalized and robust we can attend to the problem of moving content en masse between tiers. That may be too much to think about now. Better to get robust and then consistent first.
Happy to stay with OpenID now. It's working and there is a new generation of web login coming up behind it.
Spike 2 went smoothly: favicons are now persisted in Couch:
I manually base64 encode the favicons, and store them in Couch with the "path" as the key, exactly the same way as other documents -- happily, I avoided using the Couch "attachment" feature entirely. Live example from the console:
1.9.2-p290 :063 > ap $couch.get("/app/data/farm/z.forkthis.net/status/favicon.png")
#<CouchRest::Document:0x7ff0134c3168
attr_reader :_attributes = {
"_id" => "/app/data/farm/z.forkthis.net/status/favicon.png",
"_rev" => "1-62345e227a77187c779e9c44d57cfe51",
"data" => "iVBORw0KGgoAAAANSUhEUgAAACAAAAAgCAYAAABzenr0AAAGDElEQVR4nK2W\nZ1OVVxRGz//IZJJJYmJierFgAenSO9J7772JYsOGigVFRbGgoFiwgQiiiBUV\nsWHDhr13TRxnnuzznreB997xw/0Ha/Z61pnDPA5tgvuRLXA/2gC34zvhemI3\nXE7tgfPpFjh174PT2QMYd74Djj1H4HDpGOwvn4Dd1S7YXeuG7Y1zsLnVA+u+\nyxh75yqs7l2H1YNbsHx0G2Me38Popw8x6vkTjHz5DCNfv4TFmzcY8e4dhv/7\nH4Z9+IihHwHm1b4Bnh118DhUTyBb4XZsO9w6OUgjgTQLkDNtGHeunUAOySDH\nYX/lJOx6u2B77QyBnCeQi7C+TSB3ewnkBiwlkDsY8+Q+Rj97JIM8h8XrV7B4\nSyDv3xPIBzDvtjXw2l8Dr/ZaAtkIj8Ob6RrbJBDXzl1wPdkEl669cO5uJZD9\nBHIQjhcI5OJROFzuJJBTBHIattfPwubmBdj0cZArBHINVvdvwvJhH13jLoE8\nECAvnmLkqxd0jdcY8fYtmG9LFXxaq+HdthZeB9bD82AtuBaPw1sEyPEdmhYJ\nRNFCID2H4XBRaLG/ekrTQiDWfZckLWO5FglE0fKArvGYQEjLq5dgfk2V8G1e\nAZ+WlfDZtxre+9fJIJoW92PKPhq1fZAWp7Ptun1wLco+FC09QsudXt0+ZC3y\nPpj/rkXwb1wCvz3L4LuXQFpXwXsf10Ig0j5kLfI+hBYC6eL7ULS0a1rUfSha\nlH1cEfu4f0NoeXRXAmEB2+dj/M4F8N+9GH5NS+HXzEH0WmokLZ4dm7R9KFrU\nfXAtyj6EFmkfXEtvt7oPrsWaa1H3cRsscOtsBDTMRcAOAtm1kEAqJBDf5uWa\nFnUfdTB3tiy4fjqCtsxE4LY5CNg+F+N3lBOIoqVSaGnhWuR9KFoGZntSybZV\ny/bCwGy5lv7ZspC6yQjeNBVBm0sRtHWWDDJPp0XZR5XYh5ytp5FsXYxk62Ak\nWxa2fgJCaychZCMHmUYgM8C1BDaUCS07F6r74Fp8uRaj2TaYyLbDYLYsfG0e\nwmoKEbqhWAaZAk2Lso9ysY/GCqFlYLbtBrLt/LxsWUR1FiLW5CB8bT6BFBHI\nRITUlUhagutLdfuQtRjNtsZEtk1Gs2VRVamIXJUBDhK+Jhfh6wogtHAQRYu8\nD67FnNnSa8qilyUgakUyoqrSCCQTEauzoWpR96FoEfvoly3XIr2mxrKtN5Ft\nG1jskhjELI1D9LJEAklB5Mp0RFZzEEWL2AfXErJxKsydLYtbFI7YxVGIWRKL\nmMp4RC9PgqSFQPprUfYxMNsyE9lWi2wPGMiWa6F9sITyIMQvCEXcogjEVkTT\nNWLBtUQvT4a6D0WLug/zZcsSy/yRMC8Q8eXBiFsYhrjFkTKIXouyjxzosw2p\nLTGYraalf7b9tMj7YEmzvJA0xxeJc8cjYT6BLAghEK6FQPg+FC3yPtRs1xnI\ndrOBbHcbyLZNy5allLoheaYHkmZ7I7HMj67BQfRaxD6iKxO0fUjZZkv7CDOY\nLddiOFtJiy5bljp1HFKmuyBlBoHM8iQQHxkkQNISz7Wo+4iDubNl6SW2SJvi\ngNRpTkgpdUHyDHcCUbT4Cy3lIdo+FC0Ds+33mhrLdt4n2bKMYiukT7JG2mQ7\nAnGUQVw1Leo+gsQ+zJwtyyocicwJY5AxkYPYEIg9uJbU6c5Cy0xP3T4CkMC1\nGM02zUS2XMun2bLsvKHIKhiBzKJRyJBAxkLTIvYhtBDIHB+hxYzZspzsP5GT\n+zey84YRiAWBjEZGsaWkJb3ETrcPWYvRbKNMZJtrNFuWl/ELcrN+BwfJzv0H\nWfnDIbRwEEWLHWlxFFo+J9ulBrJdbSBbek1Zfupg5KUPQV7GrwTyB3Jy/oKq\nRd2Hr
EXex6fZ+pvINtFEtsVghUnfoSD5e+Sn/kggPyM38ze6BgdRtMj74Fom\nWsNQtklGso01km2ELltWFP8VChO+QUHSIBSk/ID8tJ8gaZFAhJZsSYuyD122\nXMs05/7ZlumzDRXZVhjKVnyCWHHMF5gQ+yWK4r9GYeK3dI1ByE8ZTCBDoO6D\na8kdqtuH+bL9H1M2VJkTgpNBAAAAAElFTkSuQmCC\n"
}
...
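A small sketch of the base64 round trip described above, with a plain Hash standing in for the Couch database. The helper names put_blob and get_blob are hypothetical. String packing with the "m" directive is what Base64.encode64 does under the hood, including the newline after every 60 encoded characters that shows up as \n in the dump:

```ruby
# A Hash stands in for the Couch database in this sketch.
# Binary content is base64-encoded and stored under the "data" key,
# the same shape as the page documents.
def put_blob(store, path, bytes)
  store[path] = { 'data' => [bytes].pack('m') }
end

# Returns the original bytes, or nil when no document exists for the path.
def get_blob(store, path)
  doc = store[path]
  doc && doc['data'].unpack1('m')
end
```

Storing the encoded string in an ordinary document keeps favicons on the same get/put path as pages, at the cost of roughly a third more storage than the raw bytes.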
@WardCunningham the 3 tiers make perfect sense, although I am a complete n00b on local storage, in this app and in general. Some thoughts on syncing between cloud and desktop:
Persisting identity claims in Couch should be easy. The last remaining juicy item is the Couch implementation of /recent-changes.json and /global-changes.json.
I had originally thought of making a new db for each subdomain in the farm, but I think the way that couch wants it done is a single db for the whole farm, and a "view" to allow us to pull the data for a particular subdomain; or to pull the top n changes for each subdomain, as we now do in ruby from the filesystem.
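To make the single-db idea concrete, here is a plain-Ruby sketch of what such a view would need to compute. It does not talk to Couch; each Hash stands in for one document, and the site and updated_at fields are hypothetical names chosen for illustration:

```ruby
# Most recent changes for one subdomain, newest first.
def recent_changes(docs, site, limit)
  docs.select { |d| d['site'] == site }
      .sort_by { |d| -d['updated_at'] }
      .first(limit)
end

# Top n changes for every subdomain in the farm, keyed by site.
def global_changes(docs, per_site)
  docs.group_by { |d| d['site'] }
      .transform_values { |ds| ds.sort_by { |d| -d['updated_at'] }.first(per_site) }
end
```

In Couch itself the same effect would presumably come from a view keyed on something like [site, updated_at] and queried with a key range per subdomain, but that mapping is an assumption rather than what the branch implements.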
So, next steps:
/recent-changes.json and /global-changes.json
Caution: I was experiencing some problem with /global-changes.json where an interaction with the page module caused copies of pages to be stored one level up from where they were meant to be stored. And this was when READING the pages!! That's why I commented out the code until I could think through the path name construction.
You'll find awkward code around data_root and page that became that way early in development of rspec tests. On the server side alone there are three paths to storage, not counting your new work:
Also note that Page is more an abstraction of storage than an abstraction of a page.
The mechanisms employed to effect these choices have accreted under various circumstances and are overdue for a rethinking. I mention Local Storage because we have the same sort of accretion present on the client side. When you refactor to add another branch to this choice tree you will face the question of where/how to encode the choice and whether to dig into the artificial complexity that exists there now.
I wish I could say that we left you a better place to work. It is what it is. Don't feel any need to take on more of this refactoring than necessary either. But if you do go after a deeper refactoring, be aware of all the cases and the ultimate desire to have some alignment of abstractions between client and server.
(@nrn has rethought some of this in the express implementation too.)
Thanks Ward for expanding on the current state of page storage. Question: do we need the "non-farm" storage location at all? It seems like a "non-farm" SFW instance could be defined simply as a server at which only one domain points, eg mydomain.com. All of the pages would be stored at:
/data/farm/mydomain.com/data/page
The site would be redefined as a farm if I simply pointed the DNS for *.mydomain.com at the same server:
/data/farm/x.mydomain.com/data/page
/data/farm/y.mydomain.com/data/page
/data/farm/z.mydomain.com/data/page
-- or is there something I am missing here? How concerned would we be about breaking compatibility for existing sites with a change like this BTW?
There is a little bit of logic that gets turned on or off by the farm logic. For example if you direct example.com and www.example.com to the same server, then as a farm it would be two sites, while without farm they would be one and the same.
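The difference can be sketched as a path lookup. The helper name pages_dir and the non-farm location data/pages are assumptions for illustration; the farm location follows the data/farm/<host>/pages paths shown earlier in the thread:

```ruby
# Hypothetical helper: in farm mode every host name gets its own pages
# directory; without a farm, all host names share a single one.
def pages_dir(data_root, host, farm_mode)
  return File.join(data_root, 'pages') unless farm_mode

  File.join(data_root, 'farm', host.downcase, 'pages')
end
```

With farm_mode true, example.com and www.example.com resolve to two different directories; with it false, both resolve to the same one.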
The 3rd spike is complete. Everything is now persisted to Couch. We are using couch "views" to request the recent-changes.json for the current site in a farm, or the default site in non-farm mode.
If you want to browse recent commits, they are here: https://github.com/harlantwood/Smallest-Federated-Wiki/commits/couchdb
local-identity and openid.identity are persisted to Couch. I turned off the OpenID gem writing to the filesystem at all, as it gets confused and crashes when Heroku erases the open ID data directory on application restart. Current version:
def openid_consumer
  @openid_consumer ||= OpenID::Consumer.new(session, nil)
end
This is called "stateless mode" by the OpenID gem. They say: "Stateless mode may be slower, put more load on the OpenID provider, and trusts the provider to keep you safe from replay attacks." (from https://github.com/openid/ruby-openid/blob/master/lib/openid/consumer.rb)
Next step: brainstorm and create a simple architecture for swapping in either filesystem-based or couchdb-based persistence.
Ideas so far: the Page and Server classes could have a @store variable, of type FederatedWiki::Store::File or FederatedWiki::Store::CouchDB. The store would have a few methods:
@store.put_page # pages contain special metadata like a timestamp
@store.put_blob # for favicons
@store.put # for arbitrary strings, eg local-identity
...and analogous get methods. Something like that. The interface should fall out of the similar parts of the current file store and CouchDB store. Any ideas very welcome.
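A sketch of how two interchangeable stores might look. This is only an illustration of the idea: the StoreSketch namespace and its class names are hypothetical, MemoryStore is an in-memory stand-in for the CouchDB store (it does not talk to Couch), and put_blob is left out for brevity.

```ruby
require 'json'
require 'fileutils'

module StoreSketch
  # Page helpers shared by every backend; they rely only on put/get.
  module Pages
    def put_page(key, page)
      put(key, JSON.pretty_generate(page))
      page
    end

    def get_page(key)
      raw = get(key)
      raw && JSON.parse(raw)
    end
  end

  # Flat files under a root directory.
  class FileStore
    include Pages

    def initialize(root)
      @root = root
    end

    def put(key, string)
      path = File.join(@root, key)
      FileUtils.mkdir_p(File.dirname(path))
      File.write(path, string)
      string
    end

    def get(key)
      path = File.join(@root, key)
      File.file?(path) ? File.read(path) : nil
    end
  end

  # In-memory stand-in for a document database such as CouchDB:
  # one document per key, with the content under "data".
  class MemoryStore
    include Pages

    def initialize
      @docs = {}
    end

    def put(key, string)
      @docs[key] = { 'data' => string }
      string
    end

    def get(key)
      doc = @docs[key]
      doc && doc['data']
    end
  end
end
```

A caller holding a @store only needs put/get and put_page/get_page, so swapping the backend becomes a matter of which object is constructed at startup.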
Very cool, I'm looking forward to implementing the couch stuff on node when this is finished, now that I'll have a reference for all the hard parts :)
I created a pull request for the CouchDB storage: #204. This is the bulk of the code that will get us Heroku-able. I'll create another pull request or two for the Heroku-specific code once the Couch code is happily integrated.
That's great @nrn that you want to Couch up the node server too! I found Couch very pleasant to work with.
If someone has been thinking of digging into Harlan's work on CouchDB, reading through this issue and beefing up the ReadMe would be a good way to start.
With the Couch work merged into master, I am tackling the last major item needed for Heroku: handling of recursive calls with the thin web server. I've done this in a minimally invasive way: only if you set the ENV variable
SINGLE_THREADED_SERVER=true
then we look for sites locally on the current server. The code:
https://github.com/harlantwood/Smallest-Federated-Wiki/compare/4d1fc53...670298d0
Note that this diff includes some refactoring as well.
@WardCunningham, take a look at the minor README changes in this diff, and let me know what you see that still needs beefing up -- I certainly want these changes to be usable and accessible to the community. When the Heroku work is all merged to master, I will also add a heroku section to the "Hosting and Installation Guide" wiki page.
I'm still investigating one other pathway before creating a pull request -- it seems that there are ways to handle async requests on thin using EventMachine. From the Heroku docs:
The herokuapp.com routing stack can be used for async or multi-threaded apps that wish to handle more than one connection simultaneously. Ruby webservers such as Goliath, Thin (with a suitable web framework such as Async Sinatra), or your own custom EventMachine web process are some examples.
This is a bit of a deep rabbithole, not sure if it's worth it or not...
Boy, it is too late in the day to read those docs. One simple solution is to not run farms on Heroku. We're not yet running them on EC2 to avoid these issues there too.
Hm. I was running farms with no problem on EC2 under Passenger. I added good installation docs to the "Hosting and Installation Guide" wiki page for this setup.
I would like to get farms up and running on Heroku, at least for my own purposes -- hopefully I can do it cleanly and non-intrusively enough that you're happy to pull the changes into master too.
After spending a little time with EventMachine, async_sinatra, and friends, I think I'm going to leave this powerful but complex territory for another day, and create a pull request based on the diff I referenced above.
Created pull request #221, Heroku support.
Are we there yet? Maybe a step-by-step for those unfamiliar with Heroku. It doesn't have to be fancy. Here's what I wrote for EC2: http://ward.c2.com/view/welcome-visitors/view/sfw-on-ec2
I wrote up instructions in this page: https://github.com/WardCunningham/Smallest-Federated-Wiki/wiki/Hosting-and-Installation-Guide -- search for "Using Heroku".
I think we are there! Please let me know any issues you encounter, happy to help.
In the hangout today we discussed deploying the app to Heroku. I am opening up this ticket to discuss strategies and options.
The two things that I am aware of to get us to Heroku deployability are:
1) @WardCunningham wrote in #152 "If we got rid of the recursive web service calls then we could go back to thin"
To reproduce the issue in question, I started a thin server:
And then hit these URLs:
The first two work fine, but the last one loads the "xxx" left panel, but a blank panel for the "yyy" page on the right. Furthermore, the thin server is borked, and must be aggressively killed and restarted.
If there are other kinds of recursive call issues with thin, please add them to this issue.
2) Heroku has a read-only file system, so we would have to write data somewhere else -- this could be a NoSQL data store or Amazon S3, for example.
A sample dump of data stored on the file system inside a farm instance:
Which basically boils down to:
The open ID data we can write to the Heroku app's ./tmp directory (which is standard practice, at least when using openID via omniauth on Heroku) eg:
For the rest of the files, my hope is to be somewhat simplistic: if we use NoSQL, just treat it as a key-value store:
Some options for non-local storage:
I would really like the whole heroku package to be free for low-volume sites -- the simplicity of deploying an instance to Heroku + freeness could create a major uptick in SFW deployments.