dmwm / DBS

CMS Dataset Bookkeeping Service
Apache License 2.0

Input Validation #79

Closed ericvaandering closed 10 years ago

ericvaandering commented 10 years ago

Original TRAC ticket 1949, reported by giffels: Implementation of input validation for DBS3 as proposed in #499

ericvaandering commented 10 years ago

Author: yuyi Replying to [comment:50 valya]:

Don't treat DAS as an AI system. My opinion is that the underlying data-service should perform its own syntax/semantics checking. For instance, a user should be allowed to pass a non-existent dataset path which satisfies the correct syntax, e.g. dataset=/a/b/c. If dataset=/a/b/c does not exist in DBS it is not an ERROR, since the path is correct; it is simply a fact that the dataset does not exist in DBS. And the only way to find what exists in DBS is to query the DB, rather than tweak the lexicon.

I agree.

I think we agreed that usage of wild-cards should be limited to certain APIs. If the dataset API allows wild-cards, I don't want to perform special tweaking to convert * -> /*/*/*. It is dangerous per se. For example, one user can type *, which means any dataset, while another can type *Run*. In the latter case it is unclear to which part of the path it should be applied: should it be /*Run*/*/* or /*/*Run*/*? How about the *EC* pattern? It can be in three different places: /*EC*/*/*, /*/*EC*/* or /*/*/*EC* (yes, it does match the RECO tier). So it's up to DBS to apply such a wild-card to all datasets and find the correct result set, rather than a job of Lexicon to decide about the correctness and placement of the pattern. Yes, I do understand the optimization issue, but we said before that wild-cards would be limited to certain APIs, and some APIs just need to accept them. For instance, the dataset API should do that, and I don't see any optimization issue here, since the DB will use an index on a single DB column.

Just for the discussion, I copy/pasted Simon's patch for the Lexicon.py block-search RE as an example:

r"^/(\*|[a-zA-Z0-9_\*]{1,100})/(\*|[a-zA-Z0-9_\.\-_\*]{1,100})/(\*|[A-Z]{3,10})#(\*|[a-zA-Z0-9_\.\-_\*]{1,100})$"

Dataset name and file name will be similar.

A search on block_name=QCD* will be rejected by DBS input validation under the suggested format. Is this something we want to do in DBS?
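For reference, a minimal sketch of how this RE behaves (Python; the block names are illustrative, not taken from the ticket):

import re

# Simon's proposed block-name RE, copied from the patch quoted above.
BLOCK_RE = re.compile(
    r"^/(\*|[a-zA-Z0-9_\*]{1,100})/(\*|[a-zA-Z0-9_\.\-_\*]{1,100})"
    r"/(\*|[A-Z]{3,10})#(\*|[a-zA-Z0-9_\.\-_\*]{1,100})$")

for name in ("/QCD_Pt_120/Summer11-v1/AODSIM#abc-123",  # fully qualified: accepted
             "/*/*/*#*",                                # wild-card in every slot: accepted
             "QCD*"):                                   # no leading slash: rejected
    print(name, bool(BLOCK_RE.match(name)))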

ericvaandering commented 10 years ago

Author: valya

Just for the discussion, I copy/pasted Simon's patch for the Lexicon.py block-search RE as an example:

r"^/(\*|[a-zA-Z0-9_\*]{1,100})/(\*|[a-zA-Z0-9_\.\-_\*]{1,100})/(\*|[A-Z]{3,10})#(\*|[a-zA-Z0-9_\.\-_\*]{1,100})$"

Dataset name and file name will be similar.

A search on block_name=QCD* will be rejected by DBS input validation under the suggested format. Is this something we want to do in DBS?

I think block_name=QCD* is not valid by definition, since there is no wild-card in front of the QCD word. The pattern in Lexicon seems to be right (within its own limitations, see below). I think that to enforce such patterns you must ensure that the tools which inject data into DBS and/or PhEDEx also use them. Otherwise you always have the possibility of a mismatch between the injection and retrieval tools.

Here are the obvious flaws with this pattern:

  • the tier part of this pattern is not applicable to datasets, since something as simple as AAA is allowed by this regex.
  • the hash part is not of variable length; it has a fixed, well-defined length (or at least a higher lower bound, e.g. > 10, and a lower upper bound, e.g. < 50). It should be defined more precisely. Right now the pattern allows #2a, which I would consider a bad hash.
  • the primary dataset part allows names starting with numbers, e.g. /0123 will match, but I doubt that we have those.

ericvaandering commented 10 years ago

Author: yuyi Replying to [comment:52 valya]:

Just for the discussion, I copy/pasted Simon's patch for the Lexicon.py block-search RE as an example:

r"^/(\*|[a-zA-Z0-9_\*]{1,100})/(\*|[a-zA-Z0-9_\.\-_\*]{1,100})/(\*|[A-Z]{3,10})#(\*|[a-zA-Z0-9_\.\-_\*]{1,100})$"

Dataset name and file name will be similar.

A search on block_name=QCD* will be rejected by DBS input validation under the suggested format. Is this something we want to do in DBS?

I think block_name=QCD* is not valid by definition, since there is no wild-card in front of the QCD word. The pattern in Lexicon seems to be right (within its own limitations, see below). I think that to enforce such patterns you must ensure that the tools which inject data into DBS and/or PhEDEx also use them. Otherwise you always have the possibility of a mismatch between the injection and retrieval tools.

block_name=QCD* was a typo; I meant to say block_name=*QCD*. We enforce input validation for insertion into DBS using the names (block, dataset and so on) defined in Lexicon.py. Not sure about PhEDEx.

Here are the obvious flaws with this pattern:

  • the tier part of this pattern is not applicable to datasets, since something as simple as AAA is allowed by this regex.
  • the hash part is not of variable length; it has a fixed, well-defined length (or at least a higher lower bound, e.g. > 10, and a lower upper bound, e.g. < 50). It should be defined more precisely. Right now the pattern allows #2a, which I would consider a bad hash.
  • the primary dataset part allows names starting with numbers, e.g. /0123 will match, but I doubt that we have those.

OK, I will fix the pattern. DBS will require dataset and block names to match the /*/*/* pattern from now on. Do we all agree on this?
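For reference, a quick check of the quoted flaws against Simon's RE (Python; the block names are illustrative, not from the ticket):

import re

BLOCK_RE = re.compile(
    r"^/(\*|[a-zA-Z0-9_\*]{1,100})/(\*|[a-zA-Z0-9_\.\-_\*]{1,100})"
    r"/(\*|[A-Z]{3,10})#(\*|[a-zA-Z0-9_\.\-_\*]{1,100})$")

print(bool(BLOCK_RE.match("/Prim/Proc-v1/AAA#2a")))    # True: bogus AAA tier and two-character hash pass
print(bool(BLOCK_RE.match("/0123/Proc-v1/RECO#abcd"))) # True: primary part may start with digits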

ericvaandering commented 10 years ago

Author: valya

OK, I will fix the pattern. DBS will require dataset and block names to match the /*/*/* pattern from now on. Do we all agree on this?

I'm not sure it is clear. I mean, are you enforcing all three slashes to be in place, even if a slash is followed by a wild-card? For example, are you enforcing a search for datasets with QCD in the name to be written as

*QCD* or /*QCD*/*/* or /*QCD* or /*/*QCD*/*

If you propose a mandatory first slash and optional second/third ones, it makes sense to me; otherwise I don't understand how you're going to check where the pattern should be present (in the primary, processed or tier part of the dataset). I would appreciate concrete examples of what would be allowed, using QCD as an example. In particular, I want clarification on how to search for datasets with QCD in the name. If I need to place several calls just because the lexicon does not allow it, I will vote NO on that. I need to place one API call to find all datasets which match my pattern. To be concrete, here is an example of the API call I would expect to find ALL datasets with QCD in their path:

datasets(*QCD*)

instead of calls like that

dataset(/*QCD*/*/*) dataset(/*/*QCD*/*) dataset(/*/*/*QCD*)

ericvaandering commented 10 years ago

Author: valya Replying to [comment:54 valya]:

OK, I will fix the pattern. DBS will require dataset and block names to match the /*/*/* pattern from now on. Do we all agree on this?

I'm not sure it is clear. I mean, are you enforcing all three slashes to be in place, even if a slash is followed by a wild-card? For example, are you enforcing a search for datasets with QCD in the name to be written as

*QCD* or /*QCD*/*/* or /*QCD* or /*/*QCD*/*

If you propose a mandatory first slash and optional second/third ones, it makes sense to me; otherwise I don't understand how you're going to check where the pattern should be present (in the primary, processed or tier part of the dataset). I would appreciate concrete examples of what would be allowed, using QCD as an example. In particular, I want clarification on how to search for datasets with QCD in the name. If I need to place several calls just because the lexicon does not allow it, I will vote NO on that. I need to place one API call to find all datasets which match my pattern. To be concrete, here is an example of the API call I would expect to find ALL datasets with QCD in their path:

datasets(*QCD*)

instead of calls like that

dataset(/*QCD*/*/*) dataset(/*/*QCD*/*) dataset(/*/*/*QCD*)

In other words, do not mix input validation with API functionality.

ericvaandering commented 10 years ago

Author: valya On further thought about this issue: I think you are mixing up input validation for APIs that take fully qualified entities with validation for APIs which accept a pattern. You need to define how you accept the pattern, and that should define how you do your input validation of the passed values. Since our dataset names may be constructed from any characters/numbers, you may define the pattern as starting either with a slash or with a wild-card, followed by a combination of characters, numbers, (at most 3) slashes and stars. But you would require no white space, no semicolons, etc. Doing it this way, you preserve correct API functionality and still validate input against possible attacks.
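A minimal sketch of the relaxed check Valentin describes (an assumption for illustration, not the actual Lexicon.py code; the RE and the helper name are made up):

import re

# Accept values that start with '/' or '*', contain only a safe character
# set (no white space, no semicolons) and hold at most 3 slashes.
RELAXED_RE = re.compile(r"^[/*][a-zA-Z0-9_.\-*/]{0,499}$")

def valid_dataset_pattern(value):
    return bool(RELAXED_RE.match(value)) and value.count("/") <= 3

print(valid_dataset_pattern("*QCD*"))       # True
print(valid_dataset_pattern("/*/*QCD*/*"))  # True
print(valid_dataset_pattern("a; DROP"))     # False: bad first character, bad characters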

If, on the other hand, you preserve the requirement of slashes in a path, you'll cut your own API functionality. To demonstrate this, tell me how you should find datasets for this pattern: *QCD*Bla*Foo*. This will lead to more than 3 combinations, e.g. datasets(/*QCD*Bla*Foo*/*/*) datasets(/*QCD*/*Bla*Foo*/*) datasets(/*QCD*/*Bla*/*Foo*) datasets(/*QCD*Bla*/*Foo*/*) datasets(/*QCD*Bla*/*/*Foo*) etc.

Try to figure out the possible slash placements in a pattern like *a*b*c*de...? Will input validation enforce doing all this combinatorics for ALL possible character combinations in patterns? That seems ridiculous to me, and it is not what the API was designed for.
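A toy sketch of this counting argument (illustrative only, not DBS code): treat the two inner "/" separators of /primary/processed/tier as landing in any of the wild-card gaps of *QCD*Bla*Foo*.

from itertools import combinations_with_replacement

tokens = ["QCD", "Bla", "Foo"]
gaps = len(tokens) + 1  # a '*' gap before, between and after the tokens
# Each of the two inner '/' separators may land in any gap, possibly the same one.
placements = list(combinations_with_replacement(range(gaps), 2))
print(len(placements))  # 10 candidate slash placements for one user pattern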

ericvaandering commented 10 years ago

Author: yuyi Replying to [comment:54 valya]:

OK, I will fix the pattern. DBS will require dataset and block names to match the /*/*/* pattern from now on. Do we all agree on this?

I'm not sure it is clear. I mean, are you enforcing all three slashes to be in place, even if a slash is followed by a wild-card?

If you look at the comments Simon put on this ticket in the past three days and the example search RE Simon sent out in the patch, I would say the answer is "YES". (Simon, you can correct me if I misunderstood you.)

For example, are you enforcing a search for datasets with QCD in the name to be written as

*QCD* or /*QCD*/*/* or /*QCD* or /*/*QCD*/*

If you propose a mandatory first slash and optional second/third ones, it makes sense to me; otherwise I don't understand how you're going to check where the pattern should be present (in the primary, processed or tier part of the dataset). I would appreciate concrete examples of what would be allowed, using QCD as an example. In particular, I want clarification on how to search for datasets with QCD in the name. If I need to place several calls just because the lexicon does not allow it, I will vote NO on that. I need to place one API call to find all datasets which match my pattern. To be concrete, here is an example of the API call I would expect to find ALL datasets with QCD in their path:

datasets(*QCD*)

instead of calls like that

dataset(/*QCD*/*/*) dataset(/*/*QCD*/*) dataset(/*/*/*QCD*)

Yes, I cannot agree with you more. But we need our bosses to agree on it too.

I added Dave on the cc.

ericvaandering commented 10 years ago

Author: lat There's still a part I don't follow. Under what circumstances will DAS pass the regexp to DBS? I thought DAS just gets the list of all datasets and matches the regexp against it itself, especially now that it gets "shallow" dataset info.

Remember that, as Simon said, DBS is not a user-facing API any more as of DBS3. Unless DAS sends it a wildcard for a query, I have a hard time seeing why the DBS3 API would ever see a wildcard in a query call. And to a very large degree I thought DAS would prefer blazingly fast calls to get more content than it needs (but "shallow" content), then perform the wildcard matching itself, then request more details for only the selected objects (if it needs them) -- so DBS will very rarely see wildcard queries.

I agree we should avoid excessively complex wildcards, but my understanding was that DBS is not the place where they will be used, and thus there should be correspondingly less worrying about them and more validation of them. If you are going to match wildcards, I see little point in enforcing slash structure there, but I would do as Valentin suggested and restrict the character set (i.e., match dataset names against something like [-0-9A-Z/_*]+), maybe with a few other checks, like verifying there are no consecutive stars (or more generally, anything which will result in horrible rx backtracking behaviour and thus excessively slow matching; perhaps limiting the number of stars to 2-3). Obviously I wouldn't allow wildcards on any call that can potentially query and return large amounts of data. For example, I'd accept a wildcard on something that returns dataset names, tier names, etc., but only those names -- no wildcards on any call which actually retrieves any deeper info about those objects.
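A sketch of these guards (illustrative, not actual DBS code; the character class extends Lassi's uppercase-only example with lowercase letters, and the helper name is made up):

import re

NAME_RE = re.compile(r"^[-0-9A-Za-z/_*]+$")

def safe_wildcard(value, max_stars=3):
    # Restrict the character set, forbid consecutive stars (a backtracking
    # hazard) and cap the total number of stars.
    return (bool(NAME_RE.match(value))
            and "**" not in value
            and value.count("*") <= max_stars)

print(safe_wildcard("*QCD*"))    # True
print(safe_wildcard("/QCD**"))   # False: consecutive stars
print(safe_wildcard("QCD; --"))  # False: characters outside the allowed set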

Obviously you'll also want restrictions that prevent a client from retrieving, for example, all block names (*#*); you could for example require that if the block name part has a wildcard, then the dataset name part can't have wildcards -- but then again, I see little point in offering any wildcard query support for blocks; it's not like it makes any human sense to make wildcard queries on the block id part. Just have an API which returns shallow data on all blocks of a given dataset; I imagine that's largely sufficient for DAS (and other clients, like PhEDEx)?

Please trim the quotes of previous responses. This ticket is really unwieldy to read.

ericvaandering commented 10 years ago

Author: lat Shorter version of my long comment above: what is the minimal set of DBS3 API calls which require regexp match support for query arguments, and why? The "why" part should explain which API client needs to call it with wildcards, taking into account that it is not a user-facing API, meaning it needs to specify the tool (DAS, PhEDEx, something else) that makes the call, and why its workflow needs wildcard support in the DBS API. The "minimal set" means "specify why the API must take wildcards or the DM/WM system will break" -- not why it would be "nice" if the API supported wildcards.

ericvaandering commented 10 years ago

Author: yuyi Some DBS3 API calls let the user use wildcards; some do not. For example, when listDataset is called with detail=false, one can use a wildcard to list all the possible datasets one is looking for, such as dataset=/*QCD*; however, if it is called with detail=true, a wildcard is not allowed. All this is well defined in the APIs. I believe Valentin/DAS knows exactly when a wildcard can be used. This is not the problem here.
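As a hypothetical illustration of that rule (the URL, resource name, parameters and certificate paths are assumptions for the sketch, not taken from the ticket):

import requests

# Shallow wildcard query against a DBS3 reader: detail=false permits '*'.
base = "https://cmsweb.cern.ch/dbs/prod/global/DBSReader/datasets"
resp = requests.get(base,
                    params={"dataset": "/*QCD*", "detail": "false"},
                    cert=("usercert.pem", "userkey.pem"))
print(resp.json())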

The discussion is: when a wildcard is used, how should we do input validation? #1: Should the input validation be strict and check that all three "/" are present, something like /*QCD*/*/*? Please see Simon's patch attached to this ticket for the basic idea. Or #2: Should input validation be relaxed a bit, so that we only check a format like /*QCD* or *QCD*? See Valentin's comment on why #2 makes more sense.
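To make the two options concrete, a side-by-side sketch (the REs are simplified stand-ins for illustration, not the Lexicon.py ones):

import re

STRICT = re.compile(r"^/[\w*]+/[\w.\-*]+/[A-Z*]+$")  # 1: all three slashes required
RELAXED = re.compile(r"^[/*][\w.\-*/]*$")            # 2: loose format only

for q in ("/*QCD*/*/*", "/*QCD*", "*QCD*"):
    print(q, bool(STRICT.match(q)), bool(RELAXED.match(q)))
# Only /*QCD*/*/* passes the strict check; all three pass the relaxed one.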

BTW, the input validation I am going to add will eliminate server-attack inputs in general, so we should not worry about that for this discussion. The current issue is how far we should go on input validation beyond protecting against server attacks.

As soon as we settle on these issues, I can cut a release.

ericvaandering commented 10 years ago

Author: yuyi Please review.

I tagged DBS_3_0_11_b for improved input validation. The input validation is checked against the REs defined in Lexicon.py (see the newest patch in ticket #2089 for all the new REs).

ericvaandering commented 10 years ago

Author: yuyi Hi, Simon, Lassi:

How is the review going? Have you had time to go over the new tag yet? Thanks, Yuyi

ericvaandering commented 10 years ago

Author: metson Looks a lot better, thanks. Let's get some RPMs spun and push it to preprod!

ericvaandering commented 10 years ago

Author: valya Before going to preprod: I raised a question about requiring the slash in the datasets input field; see https://svnweb.cern.ch/trac/CMSDMWM/ticket/2280#comment:15

In my mind, /* and * make no difference for DBS, and the leading slash only complicates the usage of wild-cards on the client side. Please comment. Since all datasets/blocks start with a slash, requiring it is unnecessary.

ericvaandering commented 10 years ago

Author: giffels Replying to [comment:63 metson]:

Looks a lot better, thanks. Let's get some RPMs spun and push it to preprod!

Great. RPMs are already in comp.pre.

ericvaandering commented 10 years ago

Author: yuyi Replying to [comment:64 valya]:

Before going to preprod: I raised a question about requiring the slash in the datasets input field; see https://svnweb.cern.ch/trac/CMSDMWM/ticket/2280#comment:15

In my mind, /* and * make no difference for DBS, and the leading slash only complicates the usage of wild-cards on the client side. Please comment. Since all datasets/blocks start with a slash, requiring it is unnecessary.

Hi, Valentin:

Today is the last working day to submit for HG1109. We cannot afford to miss it because of the leading "/" in the dataset/block search string. We have come a long way to get to the point of deploying DBS3 on cmsweb. But going to preprod does not mean that we should stop discussing the issue.

All validation criteria for DBS are in Lexicon.py of WMCore. Simon has a plan to centralize all regular expressions used in WMCore projects in one place so different projects can share the same code and use the same criteria. As I said in ticket #2280, the leading "/" is not a database-side search issue. If you agree, I will open a new ticket to track the discussion on how far we should go on user input validation checks. Currently, the discussion is spread over several tickets.

Yuyi

ericvaandering commented 10 years ago

Author: yuyi Done. DBS3 is deployed on pre-prod.