Closed ghost closed 12 years ago
rickw: Please Review
metson: Replying to [comment:53 sfoulkes]:
FNAL is requiring us to protect the couch server with a user account. So I have the option of either firewalling it so it can only be accessed from the machine it's running on, or having user accounts...
IMHO firewall should be the aim (long term, appreciate it's not appropriate now). We'd have three classes of set ups:
sfoulkes: The problem is that the couch server I use for development isn't running on my local machine. I'd have to tunnel to the development machine to be able to use futon or debug couch apps. This isn't that big of a deal, though; everyone is used to tunneling everything at this point.
lat: Replying to [comment:46 sryu]:
Hi Lassi, could you review the code on validation? I am not sure what the best practice is. I didn't use a regex since each specific directory contains only a specific type of file. [attachment:patch-series-sryu.patch:ticket:808 patch-series-sryu.patch]
I would prefer a regexp for sanitisation, to avoid various weaknesses in the sanitiser you have (e.g. it allows empty directory components, which when used with certain python tools results in '//', which results in path truncation, which results in a security hole). Something like this (untested):
{{{
htmlpath = re.compile(r"^([0-9A-Za-z]+/)*[0-9A-Za-z]+\.html$")
jspath = re.compile(r"^([0-9A-Za-z]+/)*[0-9A-Za-z]+\.js$")
csspath = re.compile(r"^([0-9A-Za-z]+/)*[0-9A-Za-z]+\.css$")

path = "/".join(args)
if not re.match(htmlpath, path):
    raise ...
}}}
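The sketch above can be made runnable; a minimal version (hedged: the character whitelist and the `ValueError` behaviour here are assumptions, not project policy) might look like:

```python
import re

# Anchored whitelist: one or more non-empty components, ending in .html.
HTML_PATH = re.compile(r"^([0-9A-Za-z]+/)*[0-9A-Za-z]+\.html$")

def safe_html_path(*components):
    path = "/".join(components)
    # The anchored regexp rejects empty components, so "a//b.html"
    # (the path-truncation hazard mentioned above) can never pass.
    if not HTML_PATH.match(path):
        raise ValueError("invalid path: %r" % (path,))
    return path
```

With this, `safe_html_path("templates", "index.html")` is accepted, while inserting an empty component raises.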
lat: Replying to [comment:48 rickw]:
I'm using the following regular expressions for validation:
Agreed on one central place for validation regexps and routines, in particular for things like validating dataset names.
Recommend to explicitly anchor regexps in beginning, i.e. add {{{^}}} to the beginning of regexp. This avoids silly mistakes with different semantics when using re.match vs. re.search. (Actually I think your text has the caret, but it's not displayed by trac - I recommend using triple brace syntax for code quotes to avoid these type of problems.)
Not sure about the {1,100}. Why not just +?
You don't need to quote "-", "." or ":".
All regexps should be {{{r"regexp"}}} (leading r on regexp string, i.e. raw encoding) so escapes don't get swallowed by python.
For couch you may need to allow {{{\d+}}} instead of literal 5984. There's some excess parentheses. The following ought to be sufficient:
{{{
couchUrl = re.compile(r'^http://([-a-zA-Z0-9_]+\.)+(fnal\.gov|cern\.ch):\d+')
}}}
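A quick sanity check of the suggested pattern (with the dots escaped so they match literally; the hostnames below are illustrations only, not real endpoints):

```python
import re

couchUrl = re.compile(r'^http://([-a-zA-Z0-9_]+\.)+(fnal\.gov|cern\.ch):\d+')

# Hosts under fnal.gov or cern.ch with an explicit port match...
assert couchUrl.match("http://couch.fnal.gov:5984")
# ...while other domains do not.
assert not couchUrl.match("http://evil.example.com:5984")
```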
That said however, do you actually allow user to specify the URL to the couchdb instance? Why would that be necessary? Shouldn't it be a choice of among some specific list of server instances the web server is pre-configured to know about? Why would it accept to talk to some random couch server someone happens to install, even if it's at CERN or FNAL?
lat: Replying to [comment:45 sryu]:
More importantly, it should be obvious from the code generating a link whether it's generating an internal link, e.g. to some RequestManager API/web page call, or an external link.
They are external links in the sense that they are not from the same host, but all the links come from the WMCore/WorkQueue APIs. The links are not random, but they are not previously known either. They come from the WorkQueue and WMBS APIs, which propagate their host link (in some cases adding a suffix). I wonder whether there still needs to be some verification on the javascript end.
Two issues here. Where you are generating an internal link, it should be obvious from the code the link cannot be arbitrary.
The second is that I am somewhat uncomfortable with a design where arbitrary links are possible. Why is that necessary? Why isn't it possible to have a restricted set of known resource providers / link targets? I have a very difficult time understanding why arbitrary URLs are necessary. (I can understand very well that someone thinks they are necessary but that is not what I am interested in; I am looking for explanation why arbitrary linking actually is genuinely, truly needed.)
lat: Replying to [comment:49 swakef]:
ps. Does the cmssw expression allow for X_X_X_patch10? Not sure we have ever gone that high, but... Actually, it does, but it also allows 'CMSSW_3_10_10_pre111pre111'.
Yes. We have strange names like X_Y_Z_preNNabc.
I'd be extremely happy with validators high upstream that stop silly names cold, right from the start, so that in order for work requests even to be accepted the names have to validate against an actually stated naming policy.
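As an illustration only (the naming policy itself would have to be agreed first, and lat notes real releases include odd forms like X_Y_Z_preNNabc, so this hypothetical policy may be too strict), a strict upstream validator could look like:

```python
import re

# Hypothetical strict policy: "CMSSW" plus three numeric fields plus an
# optional pre/patch suffix of at most two digits. Rejects doubled
# suffixes such as 'CMSSW_3_10_10_pre111pre111'.
CMSSW_RX = re.compile(r"^CMSSW(_\d+){3}(_(pre|patch)\d{1,2})?$")

def valid_cmssw(name):
    return bool(CMSSW_RX.match(name))
```

Anchoring with both `^` and `$` is what stops the doubled-suffix case cold.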
lat: Replying to [comment:56 sfoulkes]:
This isn't that big of a deal, though; everyone is used to tunneling everything at this point.
Agreed, need to tunnel shouldn't stop people these days. I use tunneling everywhere - even within CERN - and it pretty much works just fine.
sryu: Replying to [comment:59 lat]:
Replying to [comment:45 sryu]:
More importantly, it should be obvious from the code generating a link whether it's generating an internal link, e.g. to some RequestManager API/web page call, or an external link.
They are external links in the sense that they are not from the same host, but all the links come from the WMCore/WorkQueue APIs. The links are not random, but they are not previously known either. They come from the WorkQueue and WMBS APIs, which propagate their host link (in some cases adding a suffix). I wonder whether there still needs to be some verification on the javascript end.
Two issues here. Where you are generating an internal link, it should be obvious from the code the link cannot be arbitrary.
The second is that I am somewhat uncomfortable with a design where arbitrary links are possible. Why is that necessary? Why isn't it possible to have a restricted set of known resource providers / link targets? I have a very difficult time understanding why arbitrary URLs are necessary. (I can understand very well that someone thinks they are necessary but that is not what I am interested in; I am looking for explanation why arbitrary linking actually is genuinely, truly needed.)
The links are not completely arbitrary (I just don't know in advance what they will be). Maybe it is better for you to give advice once I explain how the links are generated. RequestMgr runs at a certain site waiting for a GlobalQueue to contact the service (GlobalQueue knows where RequestManager is, but RequestMgr doesn't know the GlobalQueue); GlobalQueue registers itself with its URL to RequestManager. The same goes for the LocalQueue and GlobalQueue interaction. That is where the links come from. Unless I have the list of sites where GlobalQueue and LocalQueue will be installed, I can't restrict which links are allowed. But I think that checking should be done when the queue gets registered; I'm not sure whether it is necessary to check again in javascript.
lat: There are two separate issues here. The main location of validation always needs to be in the server (python) code; any validation in javascript is purely for the user interface, never for security.
From what you explain, whenever the user is expected to provide a global queue choice, it should be from a drop-down list of already registered global queue instances -- not a text box for entering a URL. The server should correspondingly validate the selected queue. Of course in this case there's little need to pass URLs around; a simple label ("CERN queue"?) will be sufficient to identify the server. Maybe the URL doesn't even need to be displayed anywhere. I'd see this somewhat like the current selection of a PhEDEx instance on the PhEDEx web (you have three choices), or DBS discovery offering a drop-down list of known DBS instances (you have slightly more choices). If you do display these as URLs somewhere, you would sanitise them normally, but you don't need restrictions.
I agree most of the validation needs to happen at the point of registration of the URL. Clearly this needs to be an authorised operation: it shouldn't be possible for just anyone to invoke the API which makes the request manager aware of a global queue. Presumably this API call then remembers the URL and the friendly label. Both would be sanitised to consist only of certain limited patterns (so they can't be arbitrary URLs, only "reasonable" ones), but you don't need to "deep" verify the URLs.
It would be excellent if, reading the javascript code, it were manifestly clear what output is generated and where it came from. I'd advise against multiple layers of general utilities on top of general utilities which completely hide what you just described.
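The registration flow described above might be sketched as follows (all names and patterns here are hypothetical, for illustration only; real code would also enforce authorisation):

```python
import re

# Limited "reasonable URL" and label patterns, per the sanitisation advice.
URL_RX = re.compile(r"^https?://[-a-zA-Z0-9_.]+(:\d+)?(/[-a-zA-Z0-9_./]*)?$")
LABEL_RX = re.compile(r"^[A-Za-z][-A-Za-z0-9 ]{0,30}$")

registered_queues = {}  # friendly label -> queue URL

def register_queue(label, url):
    # An authorised API call records the queue; both fields are sanitised.
    if not LABEL_RX.match(label) or not URL_RX.match(url):
        raise ValueError("bad queue registration")
    registered_queues[label] = url

def queue_url(label):
    # The web UI offers only registered labels in a drop-down; the server
    # resolves the label, so no URL is ever taken from user input.
    return registered_queues[label]
```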
lat: Hm, maybe I need to clarify something to avoid confusion. I'm speaking separately of validation and sanitisation, and both need to be handled. Javascript should always sanitise all output it generates; we've covered that before. So I was mainly focusing on validation above, and there the last layer of validation is always in server code - javascript just helps the user interface.
However they are intertwined. If you have a text box, then you need to sanitise whenever you generate HTML that sets the value of that text box. You need to also sanitise any URLs you generate using data from text fields (e.g. if you take some URL root and append text field value to it -> use encodeURIComponent()). Web server will do the validation.
The reason we got here was that from reading the code it wasn't at all clear to me where the URLs came from and why they were getting handled as they were. There was a good explanation above -- it would be very good if the above description was manifestly clear when reading the code too (javascript and/or python).
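In python terms, a hedged analogue of the sanitisation described above (here `html.escape` and `urllib.parse.quote` stand in for the browser-side escaping and `encodeURIComponent`; the function names are illustrations, not project code):

```python
from html import escape
from urllib.parse import quote

def textbox_html(name, value):
    # Escape before embedding user data in an HTML attribute value.
    return '<input type="text" name="%s" value="%s"/>' % (
        escape(name, quote=True), escape(value, quote=True))

def build_url(root, field_value):
    # Percent-encode a text-field value appended to a URL root,
    # the server-side equivalent of encodeURIComponent.
    return root.rstrip("/") + "/" + quote(field_value, safe="")
```

Validation of the resulting values would still happen on the web server, as described above.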
sryu: Replying to [comment:64 lat]:
Hm, maybe I need to clarify something to avoid confusion. I'm speaking separately of validation and sanitisation, and both need to be handled. Javascript should always sanitise all output it generates; we've covered that before. So I was mainly focusing on validation above, and there the last layer of validation is always in server code - javascript just helps the user interface.
However they are intertwined. If you have a text box, then you need to sanitise whenever you generate HTML that sets the value of that text box. You need to also sanitise any URLs you generate using data from text fields (e.g. if you take some URL root and append text field value to it -> use encodeURIComponent()). Web server will do the validation.
The reason we got here was that from reading the code it wasn't at all clear to me where the URLs came from and why they were getting handled as they were. There was a good explanation above -- it would be very good if the above description was manifestly clear when reading the code too (javascript and/or python).
Hi Lassi, thanks very much for the explanation and advice.
Also, I want to clarify the validation and registration of the global/local queue service: Global/LocalQueue is registered by running those components with the right configuration; there is no web user interface for that. However, the API is public, and as you said we need some sort of authentication to access it. We will also add some sort of verification alongside the authentication, as you suggest (I will talk to Rick about that).
FYI, the javascript code you were reviewing is monitoring code (which doesn't alter the backend database); the links are generated by the registered services (Global/LocalQueue) and the javascript merely formats them (which, as you said, I need to sanitize).
Hope I didn't misunderstand your suggestion. Thanks
rickw: Please Review
rickw: I hope these are the last few fixes before ReqMgr is deployable, so as soon as these are in, and the new setup.py is in, I should be able to test full-chain.
lat: The patch includes what looks like an unrelated change to WMCore/RequestManager/RequestDB/Oracle/Create.py.
lat: The change effectively allows arbitrary host names for CouchDB URLs because '.' is allowed in the initial wildcard part. If you want to allow just 'localhost' in addition to hosts at CERN or FNAL, you should restrict the regexp widening to exactly that. I'm reluctant to accept the change as it is.
rickw: Please Review
rickw: OK. I put back the CERN/FNAL restriction, and added the r's in front of each regexp. This patch supersedes the last one.
def couchurl(candidate): return check(r'http://(localhost|(([a-zA-Z0-9:\@.-_]){1,100})(.fnal.gov|.cern.ch)):5984', candidate)
For requests made through the web interface, the user has to use the couch we specify. But we also support submitting requests directly through JSON/REST, and there the user can specify a Couch URL, so we need some validation.
rickw: Because production databases use separate read and write accounts, I probably want to factor reqMgrBrowser into read-only and write-enabled components. I guess now is a good time to redo those awkward names, so I'll work on splitting reqMgrBrowser into four pages, "admin", "approve", "assign", and "view", and rename WebRequestSchema to "create".
lat: The updated regexp looks better. Note, though, that it and many other regexps I've seen quote a lot of characters that don't need quoting. That has confused me, and I'm not sure why it's done; I've not understood whether it means the programmer isn't sure how regexps work, or whether it's a habit carried over from some other language/context where some of those characters are sensitive inside strings.
For example, in the above rx the quoting of ':', '@' and '-' is unnecessary, and quoting '.' inside the character class is unnecessary; you'd need to make '-' the first character in the class to avoid it being interpreted as a range, though.
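Applying those comments to the `couchurl` check above gives something like the following (a hedged sketch, not the patch actually applied: '-' leads the character class, the unnecessary escapes are dropped, and the dots in the domain suffixes are escaped so they match literally):

```python
import re

# '-' first in the class so it is a literal, not a range; ':', '@' and
# '.' need no escaping inside a character class.
COUCH_RX = re.compile(
    r'^http://(localhost|[-a-zA-Z0-9:@._]{1,100}(\.fnal\.gov|\.cern\.ch)):5984')

def couchurl(candidate):
    return bool(COUCH_RX.match(candidate))
```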
metson: Replying to [comment:71 rickw]:
def couchurl(candidate): return check(r'http://(localhost|(([a-zA-Z0-9:\@.-_]){1,100})(.fnal.gov|.cern.ch)):5984', candidate)
FWIW if you're using the couch instance on cmsweb you don't need the port number (everything goes over 443).
rickw: Please Review
sfoulkes: Couple issues with the latest patch:
Can you submit another patch that fixes this? I'm not going to be able to apply this as it breaks too much.
lat: Some comments:
rickw: Steve: That's weird. I tested it pretty thoroughly, and the one running on cmssrv49:8240 works. Could you tell me where the links think they're pointing to? Could the problem be with the construct href="../approve"?
Lassi:
sfoulkes: This is on the page you get after assigning a request, the link points to "approval" instead of "approve".
rickw: That "approval" bug was introduced in #825, and should have been fixed in yesterday's patch here.
lat: Regarding relative redirects, no, HTTP redirects must use full URL. So the application needs to know where it is mounted in (public) URL space. My general recommendation is to mount back-end at the same URL as the front-end (and users) will see and use, to avoid unnecessary confusion. That means the back-end needs to be aware of the actual full URL. There is no way to avoid that if you want to make sure everything always works correctly.
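For illustration, one way the back-end could build the absolute Location it must emit (the base URL here is an assumed configuration value, not the actual deployment setting):

```python
from urllib.parse import urljoin

# Assumed public mount point, supplied by deployment configuration.
BASE_URL = "https://cmsweb.cern.ch/reqmgr/"

def redirect_location(relative):
    # HTTP redirects require a full URL in the Location header, so the
    # app resolves relative targets against its known public base URL.
    return urljoin(BASE_URL, relative)
```

With this, `redirect_location("approve")` yields `https://cmsweb.cern.ch/reqmgr/approve` regardless of where the back-end host actually sits.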
rickw: Please Review
rickw: This patch fixes the issues that Steve and Lassi brought up, both here and in #825.
rickw: Please Review
rickw: Please Review
sfoulkes: (In dca953cd6f31c94abf77d15f3aabca3dbd5bd33f) Fix security case sensitivity. Fixes #497.
From: Rick Wilkinson rpw@fnal.gov Signed-off-by: Steve Foulkes sfoulkes@fnal.gov
Deploy ReqMgr application, initially just to dev instance. Ensure app developers can use it and work with it. Requires all the standard bits -- contact, accounts, ssh keys, packaging, deploy and manage scripts, monitoring, etc. Also requires update to WMCore python security module, ticket #472.