dmwm / DBS

CMS Dataset Bookkeeping Service
Apache License 2.0

Input Validation #79

Closed ericvaandering closed 10 years ago

ericvaandering commented 10 years ago

Original TRAC ticket 1949, reported by giffels: implementation of input validation for DBS3 as proposed in #499.

ericvaandering commented 10 years ago

Author: giffels Currently I have implemented input validation for list/insertDataTiers and insertBlocks. Still missing are regular expressions to validate the input strings. This can easily be added using the decorator for list APIs and the validateStringInput function for insert APIs. In the future it might be reasonable to merge them into one function. I have attached a patch to this ticket. Yuyi, could you please review it and check that insertBlock and insertDataTier are working properly? I currently have some problems with the unittest environment for the server.

I have also attached the code of DBSValidation.py inline, since performance is really important here. I fear that I have not found the optimal solution for validating the JSON input. Currently I am using recursion with a maximum depth of 6, which avoids code duplication and results in short code (validateJSONInput) to validate the input keys. I would appreciate any comments that make validation faster.

Concerning performance: I got a worst-case scenario for insertBlocks from Yuyi. The file (/afs/cern.ch/user/y/yuyi/public/block-data-sample/block.json) is about 46 MB. Reading and decoding the file using cjson takes approximately 4 seconds. The current validation takes approximately 12 seconds. The block contains 1000 files with 1000 lumi sections each.

from dbs.utils.dbsExceptionHandler import dbsExceptionHandler
from dbs.utils.dbsException import dbsException,dbsExceptionCode

def SQLInputValidation(input_data):
    pass

def doSQLInputValidation(func):
    def new_func(input_data):
        SQLInputValidation(input_data)
        return func(input_data)
    return new_func

validArgs = {
    ################
    'listDataTiers':['data_tier_name']
    ################
    }

requiredInputKeys = {
    ################
    'insertDataTier':['data_tier_name'],
    ################
    'insertBlock':['file_conf_list', 'dataset_conf_list', 'block_parent_list', 'physics_group_name', 'processing_era', 'dataset', 'block', 'acquisition_era', 'primds', 'ds_parent_list', 'files', 'file_parent_list'],
    ################
    'file_conf_list':['release_version', 'pset_hash', 'lfn', 'app_name', 'output_module_label'],
    ################
    'dataset_conf_list':['release_version', 'pset_hash', 'app_name', 'output_module_label'],
    ################
    'physics_group_name':[],
    ################
    'processing_era':['processing_version'],
    ################
    'dataset':['dataset', 'is_dataset_valid', 'physics_group_name', 'processed_ds_name', 'dataset_access_type', 'data_tier_name'],
    ################
    'block': ['block_name', 'open_for_writing', 'origin_site_name'],
    ################
    'acquisition_era':['acquisition_era_name'],
    ################
    'primds':['primary_ds_type', 'primary_ds_name'],
    ################
    'files':['check_sum', 'file_lumi_list', 'event_count', 'file_type', 'logical_file_name', 'file_size'],
    ################
    'file_lumi_list':['lumi_section_num', 'run_num']
    ################
    }

@doSQLInputValidation
def validateDataTiers(input_data):
    #Reject a data_tier_name that is present but not a string
    if 'data_tier_name' in input_data and not isinstance(input_data['data_tier_name'], str):
        dbsExceptionHandler('dbsException-invalid-input', 'data_tier_name given is not valid')

    return input_data

def validateJSONInput(input_key,input_data):
    #Recursively walk the input (nesting depth is at most 6 in practice) and
    #copy only the keys listed in requiredInputKeys
    if isinstance(input_data,dict):
        return_data = {}
        for key in requiredInputKeys[input_key]:
            try:
                return_data[key] = validateJSONInput(key,input_data[key])
            except KeyError as ke:
                dbsExceptionHandler('dbsException-invalid-input', "Invalid input")

    elif isinstance(input_data,list):
        return_data = [validateJSONInput(input_key,x) for x in input_data]

    elif isinstance(input_data,str):
        #validateStringInput expects the key as well as the value
        return_data = validateStringInput(input_key,input_data)

    elif isinstance(input_data,int):
        return input_data

    else:
        dbsExceptionHandler('dbsException-invalid-input', "Invalid input")

    return return_data

def validateStringInput(input_key,input_data):
    #Needs some checks for strings
    return input_data
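
The string check above is still a stub; one possible shape for it, with an illustrative pattern only (the real checks were to come from Lexicon-style regular expressions), might be:

import re
from dbs.utils.dbsExceptionHandler import dbsExceptionHandler

# Illustrative pattern only: reject characters that have no business in DBS
# names (whitespace, quotes, semicolons, ...).
_SAFE_STRING_RE = re.compile(r'^[A-Za-z0-9_\-\./\*#]*$')

def validateStringInput(input_key, input_data):
    if not _SAFE_STRING_RE.match(input_data):
        dbsExceptionHandler('dbsException-invalid-input',
                            '%s contains illegal characters' % input_key)
    return input_data
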
ericvaandering commented 10 years ago

Author: valya General comment about JSON. Please pay attention to RAM usage, since I discovered that serialization takes a lot of RAM. If a single JSON consumes N megabytes of RAM, you need to multiply this by the average number of concurrent requests you expect. You may experiment with yajl (another C-based JSON parser). Its advantage wrt cjson is that it can read a JSON stream from a file-like object and it is compliant with the default python JSON API. In other words, if you have a file-like object, e.g. source, you load it as follows:

using python JSON: json.load(source)
using cjson: data = cjson.decode(source.read())
using yajl: yajl.load(source)

So cjson needs to read the data from source and then use it for decoding. In my benchmarks I found that yajl is faster than cjson, but the margin is not that high.

Regarding the code: I'm not sure you need to return any data in validateJSONInput. It takes time and resources to build return_data. Instead you just need to throw an exception for an invalid key/value and not copy any data.
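
A minimal sketch of the validate-without-copying approach suggested here, assuming the requiredInputKeys table and validateStringInput function from the patch above; the validateJSONInputNoCopy that shows up later in the thread may differ in detail:

from dbs.utils.dbsExceptionHandler import dbsExceptionHandler

def validateJSONInputNoCopy(input_key, input_data):
    # Walk the same structure as validateJSONInput, but only raise on bad
    # input; nothing is copied and nothing is returned.
    if isinstance(input_data, dict):
        for key in requiredInputKeys[input_key]:
            if key not in input_data:
                dbsExceptionHandler('dbsException-invalid-input', "Invalid input")
            else:
                validateJSONInputNoCopy(key, input_data[key])
    elif isinstance(input_data, list):
        for item in input_data:
            validateJSONInputNoCopy(input_key, item)
    elif isinstance(input_data, str):
        validateStringInput(input_key, input_data)
    elif not isinstance(input_data, int):
        dbsExceptionHandler('dbsException-invalid-input', "Invalid input")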

ericvaandering commented 10 years ago

Author: yuyi A few comments on validateJSONInput: the "return_data" is copied and built based on "requiredInputKeys". The optional keys of files, block, dataset and so on are not in the "requiredInputKeys" list. These data will be lost when going through the validation process.

Look at the schema diagram or the DB to get the optional keys and a more complete picture. The file generated by Matt is to give us an idea of what the worst case may be. Each data file will change from time to time, but everything is captured in the DB.

I will apply the patch to see if I can run it in the unit tests.

ericvaandering commented 10 years ago

Author: yuyi Hi, Manuel:

Did you generate your patch based on SVN HEAD? There have been a lot of changes in HEAD in the last few days. I got a lot of errors when applying your patch to my local branch, which is up to date with SVN HEAD. Can you check your patch? Yuyi

ericvaandering commented 10 years ago

Author: giffels Replying to [comment:4 yuyi]:

Hi, Manuel:

Did you generate your patch based on SVN HEAD? There have been a lot of changes in HEAD in the last few days. I got a lot of errors when applying your patch to my local branch, which is up to date with SVN HEAD. Can you check your patch? Yuyi

Hi Yuyi,

I have resolved the conflicts. Can you try again, please?

Manuel

ericvaandering commented 10 years ago

Author: lat One comment on your incoming json: it's repetitive to an extreme degree. You should probably reduce the verbosity of the incoming data; it will help validation performance / complexity too.

You might also want to try using map() and/or filter() with validation predicates instead of manual object construction -- possibly not even constructing anything at all, just verifying it, provided your verification is thorough enough. As the validator doesn't currently check any e.g. string contents, it's hard to say much about performance.

Finally, before concluding anything about your numbers, we also need those profiles, not just the total execution time. In the past I've seen python go and spend 95% of its time in the garbage collector when processing inputs like these -- the default gc settings aren't very friendly to mass object creation with no deletions in between.
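
A rough illustration (hypothetical helper names, not DBS code) of the predicate style suggested above, verifying the input without constructing anything:

def valid_file(f):
    # Each predicate only answers True/False; nothing is copied or rebuilt.
    return (isinstance(f.get('logical_file_name'), str)
            and isinstance(f.get('event_count'), int)
            and all(isinstance(l.get('run_num'), int)
                    for l in f.get('file_lumi_list', [])))

def validate_files(files):
    # map()/all() style check: fail once if any record is malformed.
    if not all(map(valid_file, files)):
        raise ValueError('invalid file record in input')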

ericvaandering commented 10 years ago

Author: giffels Replying to [comment:2 valya]:

General comment about JSON. Please pay attention to RAM usage, since I discovered that serialization takes a lot of RAM. If a single JSON consumes N megabytes of RAM, you need to multiply this by the average number of concurrent requests you expect. You may experiment with yajl (another C-based JSON parser). Its advantage wrt cjson is that it can read a JSON stream from a file-like object and it is compliant with the default python JSON API. In other words, if you have a file-like object, e.g. source, you load it as follows:

This is a worst-case-scenario JSON. I guess we will not have many requests of that type in parallel. However, we need to pay attention to RAM usage. Thanks.

Regarding the code: I'm not sure you need to return any data in validateJSONInput. It takes time and resources to build return_data. Instead you just need to throw an exception for an invalid key/value and not copy any data.

I was suggested to copy the valid parts of the input json. Therefore, I was a little bit routine-blinded. Of course copying takes a lot of time. I have evaluated the same code without copying and it takes just 6 seconds to validate the json. Thanks for your suggestions.

ericvaandering commented 10 years ago

Author: giffels Replying to [comment:3 yuyi]:

A few comments on validateJSONInput: the "return_data" is copied and built based on "requiredInputKeys". The optional keys of files, block, dataset and so on are not in the "requiredInputKeys" list. These data will be lost when going through the validation process.

Look at the schema diagram or the DB to get the optional keys and a more complete picture. The file generated by Matt is to give us an idea of what the worst case may be. Each data file will change from time to time, but everything is captured in the DB.

Sorry, I just took into account all keys that were in the json file. I will extend the list.

Thanks

Manuel

ericvaandering commented 10 years ago

Author: lat Replying to [comment:7 giffels]:

I was suggested to copy the valid parts of the input json.

Yeah, that's my strong preference, but I won't make a rule out of it :-) If there are clear reasons not to do that, that's fine. But it does require rock-solid validation in that case, e.g. making sure there are no unsolicited contents - which you don't have. So no carte blanche from me right now. And to justify that special casing is needed, we need to see performance profiles. You don't want to be embarrassed by claiming validation is the culprit when the real problem is elsewhere :-)

If you spend half your time on allocation - profiles are needed to confirm that - then there are other solutions one may try, e.g. something like import gc; gc.set_threshold(256*1024) might fix things up. It may also make things much worse; you need to experiment with a real server.

ericvaandering commented 10 years ago

Author: yuyi Replying to [comment:5 giffels]:

I have resolved the conflicts. Can you try again, please?

I had no problem applying the patch. There were a handful of bugs in the patch. The insert/list data tier APIs passed their tests after the bugs were fixed. The insert block API could not be tested because the global tag key is missing in the data. I will ask them to regenerate the data file on Monday.

ericvaandering commented 10 years ago

Author: giffels Replying to [comment:9 lat]:

Replying to [comment:7 giffels]:

I was suggested to copy the valid parts of the input json.

Yeah, that's my strong preference, but I won't make a rule out of it :-) If there are clear reasons not to do that, that's fine. But it does require rock-solid validation in that case, e.g. making sure there are no unsolicited contents - which you don't have. So no carte blanche from me right now. And to justify that special casing is needed, we need to see performance profiles. You don't want to be embarrassed by claiming validation is the culprit when the real problem is elsewhere :-)

If you spend half your time on allocation - profiles are needed to confirm that - then there are other solutions one may try, e.g. something like import gc; gc.set_threshold(256*1024) might fix things up. It may also make things much worse; you need to experiment with a real server.

I have now a real server running with cProfile. I am writing the output to a file. Additionally, I have installed the external+igprof+5.9.2-comp package. I cannot find igpython in that package, and using igprof-navigator on the output of cProfile does not work, since the output format seems not to be supported. Can you give me a hint on how to use igprof/igprof-navigator to profile our server, please?

Thanks

Manuel

P.S.: Once the validation patch is entirely reviewed by Yuyi, we will provide the required profiling.

ericvaandering commented 10 years ago

Author: lat Replying to [comment:11 giffels]:

I have now a real server running with cProfile.

Great! I look forward to the results.

I am writing the output to a file. Additionally, I have installed the external+igprof+5.9.2-comp package. I cannot find igpython in that package, and using igprof-navigator on the output of cProfile does not work, since the output format seems not to be supported. Can you give me a hint on how to use igprof/igprof-navigator to profile our server, please?

Ah, my bad, it looks like we didn't yet make a release with the various new tools in it. You can find igpython-analyse in the [http://igprof.git.sourceforge.net/git/gitweb.cgi?p=igprof/igprof;a=blob;f=src/igpython-analyse;h=f7dd7ddce23a94dcde7838e643cf48c5d971268f;hb=HEAD git repo]; click on 'raw' to download the script.

I assume you got igprof-navigator already working, but in case not, just [http://igprof.sourceforge.net/analysis.html follow the instructions]. With cProfile it works the same, except igpython-analyse replaces igprof-analyse, and cProfile replaces igprof.

ericvaandering commented 10 years ago

Author: lat I promised example code; it's still completely untested and much less complete than I would have liked, but you can find what I have in /afs/cern.ch/user/l/lat/scratch/dmwm/sitedb. It's starting to be close to what I intend to make work, though.

ericvaandering commented 10 years ago

Author: giffels Replying to [comment:12 lat]:

Replying to [comment:11 giffels]:

I have now a real server running with cProfile.

Great! I look forward to the results.

I am writing the output to a file. Additionally, I have installed the external+igprof+5.9.2-comp package. I cannot find igpython in that package, and using igprof-navigator on the output of cProfile does not work, since the output format seems not to be supported. Can you give me a hint on how to use igprof/igprof-navigator to profile our server, please?

Ah, my bad, it looks like we didn't yet make a release with the various new tools in it. You can find igpython-analyse in the [http://igprof.git.sourceforge.net/git/gitweb.cgi?p=igprof/igprof;a=blob;f=src/igpython-analyse;h=f7dd7ddce23a94dcde7838e643cf48c5d971268f;hb=HEAD git repo]; click on 'raw' to download the script.

I assume you got igprof-navigator already working, but in case not, just [http://igprof.sourceforge.net/analysis.html follow the instructions]. With cProfile it works the same, except igpython-analyse replaces igprof-analyse, and cProfile replaces igprof.

Thanks, works fine.

Replying to [comment:13 lat]: Thank you for the example code. I will have a look.

ericvaandering commented 10 years ago

Author: lat Replying to [comment:14 giffels]:

Thanks, works fine.

Great. Let us know your igprof-navigator site when you are far enough to have reasonable profiles to share.

ericvaandering commented 10 years ago

Author: yuyi Replying to [comment:10 yuyi]:

I had no problem applying the patch. There were a handful of bugs in the patch. The insert/list data tier APIs passed their tests after the bugs were fixed. The insert block API could not be tested because the global tag key is missing in the data. I will ask them to regenerate the data file on Monday.

I attached the patch that I used for testing. It worked for inserting file x lumi input data of 100x100, 500x500 and 1000x1000. The new data files in /afs/cern.ch/user/y/yuyi/public/block-data-sample have been tested to make sure they are in the correct format.

We are continuing to profile the server with the two validation approaches.

ericvaandering commented 10 years ago

Author: giffels I have run a profiling test with cProfile. After converting the output to an igprof-navigator-readable db, I do not see the validation function calls on the igprof-navigator web service. I have run:

python -u -m cProfile -o ./profile.out /home/cmsdbs/manuel/PatchTest/current/apps/dbs/lib/python*/site-packages/WMCore/WebTools/Root.py --ini=/home/cmsdbs/manuel/PatchTest/current/config/dbs/DBS.py
python igpython-analyse.py profile.out | sqlite3 profile.sql3

Even if I look at the text output of python -u -m cProfile /home/cmsdbs/manuel/PatchTest/current/apps/dbs/lib/python*/site-packages/WMCore/WebTools/Root.py --ini=/home/cmsdbs/manuel/PatchTest/current/config/dbs/DBS.py

I do not see all calls of the validation function; that is, I only see the ones called directly by wmcore using addMethod_. I have checked with pdb that the validation function is called. Is there some option to increase the depth of printed function calls in cProfile? Do you have any hint?
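
For reference, a small sketch using only the standard-library pstats module to dig deeper into a cProfile dump than the default report; it assumes the profile.out file produced by the command above:

import pstats

stats = pstats.Stats('profile.out')
stats.sort_stats('cumulative')
stats.print_stats(50)                     # top 50 entries by cumulative time
stats.print_callees('validateJSONInput')  # what the validation function calls
stats.print_callers('validateJSONInput')  # who calls it, if it appears at all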

ericvaandering commented 10 years ago

Author: lat Umm, I don't really know. I've been using cherrypy.lib.profiler, both as stand-alone (it includes miniserver you can use to browse your profile results) and more recently with igprof-navigator.

The former is pretty trivial, just use 'python -m cherrypy.lib.profiler PATH-OF-YOUR-PROFILE-DATA', and browse at localhost:8080.

If you want to do the latter, change 'import profile' to 'import cProfile as profile' in your local install of cherrypy, and use ProfileAggregator to wrap your app registrations (or Profiler to get profiles for individual HTTP requests). The latter has existed in DQM GUI / Overview code base since forever, and I now have it implemented in the REST server code I am working on, but it doesn't exist in the WMCore as it is now. Anyway, with that plus small change to igpython-analyse, I get fairly nice profiles with igprof-navigator.

Hard to say why you don't see your validation in the data. Profiling the app I develop, as far as I can tell I can see everything.
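
A minimal sketch of the per-request wrapping described above, assuming the CherryPy 3.x cherrypy.lib.profiler API; the application object and profile directory names here are hypothetical:

from cherrypy.lib import profiler

def build_profiled_app(dbs_wsgi_app, profile_dir='/tmp/dbs-profiles'):
    # aggregate=False writes one cp_*.prof file per HTTP request;
    # aggregate=True keeps a single cumulative profile instead.
    return profiler.make_app(dbs_wsgi_app, path=profile_dir, aggregate=False)

# Mount it, e.g. cherrypy.tree.graft(build_profiled_app(dbs_wsgi_app), '/dbs'),
# then browse the results afterwards with:
#   python -m cherrypy.lib.profiler /tmp/dbs-profiles   (and open localhost:8080)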

ericvaandering commented 10 years ago

Author: lat For an example profile I am working on, see http://cern.ch/lat/cmssw/igprof-navigator/sitedb-prof, for example http://cern.ch/lat/cmssw/igprof-navigator/sitedb-prof/264 (not guaranteed to stay there permanently).

ericvaandering commented 10 years ago

Author: yuyi I ran a profile on a DBS bulk insert. I attached the profile cp_F10_L10.prof here. This bulk has 10 files and each file has 10 lumis. The cumulative time is 0.140 CPU seconds. The validation part is one of the big players. Please take a look at the profile.

Will run the big files tomorrow.

ericvaandering commented 10 years ago

Author: giffels Replying to [comment:18 lat]:

Umm, I don't really know. I've been using cherrypy.lib.profiler, both as stand-alone (it includes miniserver you can use to browse your profile results) and more recently with igprof-navigator.

Might be a problem with threads. I read somewhere that cProfile by default works only on the main thread. Just to understand you correctly: are you using cProfile to obtain the profile data, or do you use cherrypy.lib.profiler for that?

The former is pretty trivial, just use 'python -m cherrypy.lib.profiler PATH-OF-YOUR-PROFILE-DATA', and browse at localhost:8080.

This does not work with the profile data I obtained with cProfile; I get an internal server error when browsing to localhost:8080.

If you want to do the latter, change 'import profile' to 'import cProfile as profile' in your local install of cherrypy, and use ProfileAggregator to wrap your app registrations (or Profiler to get profiles for individual HTTP requests). The latter has existed in DQM GUI / Overview code base since forever, and I now have it implemented in the REST server code I am working on, but it doesn't exist in the WMCore as it is now. Anyway, with that plus small change to igpython-analyse, I get fairly nice profiles with igprof-navigator.

Could you point me to some code examples for that? In particular for wrapping the app registrations with ProfileAggregator and for patching igpython-analyse.

Thanks

ericvaandering commented 10 years ago

Author: lat Replying to [comment:21 giffels]:

Might be a problem with threads. I read somewhere that cProfile by default works only on the main thread. Just to understand you correctly: are you using cProfile to obtain the profile data, or do you use cherrypy.lib.profiler for that?

I use cherrypy.lib.profiler, as-is and also modified to use cProfile. It only ever profiles one thread at a time - the profiling is per request - and I've not run into any major problems that I could see.

This does not work with the profile data I obtained with cProfile; I get an internal server error when browsing to localhost:8080.

Umm, works fine for me, for data generated with cherrypy.lib.profile. Can you give instructions on how to reproduce the failure?

Could you point me to some code examples for that? In particular for wrapping the app registrations with ProfileAggregator and for patching igpython-analyse.

I committed the igpython-analyse changes to [http://igprof.git.sourceforge.net/git/gitweb.cgi?p=igprof/igprof;a=commitdiff;h=9c402816a95f807c63b52d1490434daafdfb5393 git].

You can find examples for the server wrapper in [http://cern.ch/lat/temp/sitedb2/bin/wmc-httpd.xhtml wmc-httpd] which I am working on. See lines 429-447 for profiler instrumentation. That code uses cherrypy.lib.profile, which I modified to use cProfile (by changing "import profile" to "import cProfile as profile").

ericvaandering commented 10 years ago

Author: yuyi Replying to [comment:21 giffels]:

Could you point me to some code examples for that? In particular for wrapping the app registrations with ProfileAggregator and for patching igpython-analyse.

I got profiling working already. There are a lot of ways to do profiling. I will commit what I did to the validation branch in DBS SVN. I attached the report to this ticket yesterday. Please check it out and see what we can do to improve the validation.

ericvaandering commented 10 years ago

Author: lat Replying to [comment:20 yuyi]:

I ran a profile on a DBS bulk insert. I attached the profile cp_F10_L10.prof here. This bulk has 10 files and each file has 10 lumis. The cumulative time is 0.140 CPU seconds. The validation part is one of the big players. Please take a look at the profile.

Thanks. However... The profile you attached clearly shows that you are not using a standard software installation from RPMs. For example, there are paths such as /uscms/home/yuyi/dbs3-test/External/python/2.6.4-cmp13/lib/python2.6/sre_parse.py and /uscms/home/yuyi/dbs3-test/External/SQLAlchemy-0.5.8/build/lib/sqlalchemy/queue.py and /uscms/home/yuyi/dbs3-test/External/WMCORE/src/python/WMCore/Database/DBCore.py in it, which cannot possibly have originated from a CMS APT/RPM installation, much less from a DMWM Deploy-based installation of DBS3.

You urgently need to shift to working in the standard DMWM deployment scheme. I'll look at this profile, but I won't look at any other profiles or other data, nor otherwise help the DBS project, until you've confirmed you are working 100% in the standard environment.

Looking at your profile inside the cherrypy.lib.profile server, it says cumtime 0.140, of which only 0.032 was spent in validateJSONInput and 0.024 in validateJSONInputNoCopy (why two? is one inside the other?).

However, for example, 0.036 was spent in SequenceManager.py(increment), and a fair amount of time elsewhere in SQLAlchemy.

I'd have to see igprof-navigator version to see call tree information, but superficially the above seems pretty reasonable to me. It does not seem to agree with your conclusion however.

ericvaandering commented 10 years ago

Author: giffels Replying to [comment:24 lat]:

Looking at your profile inside the cherrypy.lib.profile server, it says cumtime 0.140, of which only 0.032 was spent in validateJSONInput and 0.024 in validateJSONInputNoCopy (why two? is one inside the other?).

No, we just want to check both methods to compare the performance. Later, only one is used.

I'd have to see igprof-navigator version to see call tree information, but superficially the above seems pretty reasonable to me. It does not seem to agree with your conclusion however.

We are currently working on that. However, I cannot run igpython-analyse on that file. Could the problem be that cherrypy was not yet patched to use cProfile?

ericvaandering commented 10 years ago

Author: lat Replying to [comment:25 giffels]:

No, we just want to check both methods to compare the performance. Later, only one is used.

OK, but then obviously you need to count the penalty for just one of them ;-)

I'd have to see igprof-navigator version to see call tree information, but superficially the above seems pretty reasonable to me. It does not seem to agree with your conclusion however.

We are currently working on that. However, I cannot run igpython-analyse on that file. Could the problem be that cherrypy was not yet patched to use cProfile?

Yes, igpython-analyse requires profiles created with cProfile. You can use the cherrypy.lib.profile server with 'profile' results, but it's much less readable than the igprof-navigator one. I'll try to get the patch to use cProfile into our standard cherrypy installation ASAP.

ericvaandering commented 10 years ago

Author: yuyi Replying to [comment:24 lat]:

You urgently need to shift to working in the standard DMWM deployment scheme. I'll look at this profile, but I won't look at any other profiles or other data, nor otherwise help the DBS project, until you've confirmed you are working 100% in the standard environment.

This is our first try at getting the profiling to work. We are working off an SVN branch instead of the trunk. We need to patch the rpm installation in order to use the branch. We are doing that this morning, so the next profile will be from there.

However, for example, 0.036 was spent in SequenceManager.py(increment), and a fair amount of time elsewhere in SQLAlchemy.

The sequence is a very simple select statement and I don't know how much we can improve it. But I will check if we can pre-fetch all the sequences; I know we already do that for some of them. SQLAlchemy is an outside library we use. Any suggestion on how to improve it would be helpful.

ericvaandering commented 10 years ago

Author: lat Replying to [comment:27 yuyi]:

This is our first try at getting the profiling to work. We are working off an SVN branch instead of the trunk. We need to patch the rpm installation in order to use the branch. We are doing that this morning, so the next profile will be from there.

Your profile appears to indicate that literally everything, including specifically at least python itself, sqlalchemy, cherrypy and wmcore, in your area originates from some kind of manual build/install. The directory structure present in your profile is highly unlikely to have originated from a CMS RPM installation - not just for DBS but for all the externals as well.

Specifically the profile contains paths such as:

/uscms/home/yuyi/dbs3-test/External/CherryPy-3.1.2/cherrypy/__init__.py
/uscms/home/yuyi/dbs3-test/External/SQLAlchemy-0.5.8/build/lib/sqlalchemy/databases/oracle.py
/uscms/home/yuyi/dbs3-test/External/SQLAlchemy-0.5.8/build/lib/sqlalchemy/engine/base.py
/uscms/home/yuyi/dbs3-test/External/SQLAlchemy-0.5.8/build/lib/sqlalchemy/engine/default.py
/uscms/home/yuyi/dbs3-test/External/SQLAlchemy-0.5.8/build/lib/sqlalchemy/engine/strategies.py
/uscms/home/yuyi/dbs3-test/External/SQLAlchemy-0.5.8/build/lib/sqlalchemy/exc.py
/uscms/home/yuyi/dbs3-test/External/SQLAlchemy-0.5.8/build/lib/sqlalchemy/pool.py
/uscms/home/yuyi/dbs3-test/External/SQLAlchemy-0.5.8/build/lib/sqlalchemy/queue.py
/uscms/home/yuyi/dbs3-test/External/SQLAlchemy-0.5.8/build/lib/sqlalchemy/types.py
/uscms/home/yuyi/dbs3-test/External/SQLAlchemy-0.5.8/build/lib/sqlalchemy/util.py
/uscms/home/yuyi/dbs3-test/External/WMCORE/src/python/WMCore/DataStructs/WMObject.py
/uscms/home/yuyi/dbs3-test/External/WMCORE/src/python/WMCore/Database/DBCore.py
/uscms/home/yuyi/dbs3-test/External/WMCORE/src/python/WMCore/Database/DBFormatter.py
/uscms/home/yuyi/dbs3-test/External/WMCORE/src/python/WMCore/Database/ResultSet.py
/uscms/home/yuyi/dbs3-test/External/WMCORE/src/python/WMCore/Lexicon.py
/uscms/home/yuyi/dbs3-test/External/python/2.6.4-cmp13/lib/python2.6/encodings/utf_8.py
/uscms/home/yuyi/dbs3-test/External/python/2.6.4-cmp13/lib/python2.6/re.py
/uscms/home/yuyi/dbs3-test/External/python/2.6.4-cmp13/lib/python2.6/sre_compile.py
/uscms/home/yuyi/dbs3-test/External/python/2.6.4-cmp13/lib/python2.6/sre_parse.py
/uscms/home/yuyi/dbs3-test/External/python/2.6.4-cmp13/lib/python2.6/threading.py
/uscms/home/yuyi/dbs3-test/External/python/2.6.4-cmp13/lib/python2.6/weakref.py

An installation originating from CMS RPMs would have paths such as:

Note for example the following differences from the directory structure in your profile:

In light of this, I am skeptical of your explanation above. You are welcome to attach new profiles for this ticket, but I will look at them if and only if you explicitly state in the accompanying comment that you generated the profile(s) in a standard installation scheme: DMWM deploy-based, as installed for example by [https://cern.ch/cms-http-group/dev-vm.html dev-vm instructions].

However, for example, 0.036 was spent in SequenceManager.py(increment), and a fair amount of time elsewhere in SQLAlchemy.

The sequence is a very simple select statement and I don't know how much we can improve it. But I will check if we can pre-fetch all the sequences; I know we already do that for some of them. SQLAlchemy is an outside library we use. Any suggestion on how to improve it would be helpful.

I only meant that the time spent in validation is reasonable compared to, for example, the time spent in sqlalchemy. If the profile reports CPU time, keep in mind it doesn't measure time spent waiting for oracle, but only the time spent on computation in the database layer, on the client side.

If you think sqlalchemy gets in your way, then get rid of it. I've never used it, so I can't comment on the value it offers or any overhead costs. I can tell from looking at profiles how much it adds to your costs; I can't easily judge the benefits.

Independent of this, as I've mentioned before, I believe you can completely eliminate all the sequence juggling business from your python code, hence shaving whatever time is needed by that particular function.

ericvaandering commented 10 years ago

Author: yuyi Replying to [comment:28 lat]:

In light of this, I am skeptical of your explanation above. You are welcome to attach new profiles for this ticket, but I will look at them if and only if you explicitly state in the accompanying comment that you generated the profile(s) in a standard installation scheme: DMWM deploy-based, as installed for example by [https://cern.ch/cms-http-group/dev-vm.html dev-vm instructions].

I think you misunderstood what I said. I told you that I did not use our rpm deployment area to generate the profile because we need to patch it with the SVN branch. I never said that I was using an rpm deployment area to generate the profile. Again, I will generate and test from our rpm deployment area the next time I post a profile. We were creating the patch yesterday morning.

Independent of this, as I've mentioned before, I believe you can completely eliminate all the sequence juggling business from your python code, hence shaving whatever time is needed by that particular function.

There is no "sequence juggling business" in my code. They are very simple code to support a complete sequence function.

ericvaandering commented 10 years ago

Author: lat Replying to [comment:29 yuyi]:

I think you misunderstood what I said.

I am not sure I did. It's very difficult for me to piece together why you would build your own version of python, sqlalchemy etc. in order to patch your own version of DBS from SVN as you mentioned above. Those things are not in SVN. The information you've provided so far does not really add up. It doesn't really help that this is not the first time I've asked to move to using RPMs; there is some resemblance here to previous situations.

Anyway, I am glad to hear your next profile will be based on RPM install area.

There is no "sequence juggling business" in my code. They are very simple code to support a complete sequence function.

What I've mentioned several times, starting at the September review, is that you don't need to retrieve sequence values into python. You can just use them directly in SQL, without ever retrieving the values into the server application. The extra roundtrip to the database, to fetch the sequence value and then use that value in an insert statement, is what I called the "juggling business". Normally you'd have SQL looking like this:

   insert into some_table (id, col1, col2)
   values (some_sequence.nextval, :col1, :col2)

Instead of what you do now:

  select some_sequence.nextval from dual

followed by:

   insert into some_table (id, col1, col2)
   values (:id_you_fetched_above, :col1, :col2)

Note that you can use "insert ... returning ..." statement to retrieve the inserted id if you need it back in the application. You still avoid the extra roundtrip to the database. If you are inserting lots of things -- like the insert block with many files etc. -- you can save potentially thousands of individual round-trips to the database to retrieve sequence values. The single insert with direct sequence use also runs more efficiently for variety of oracle technical reasons.
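
For illustration, a minimal sketch of the "insert ... returning ..." pattern using the cx_Oracle API directly; the table, column and sequence names here are hypothetical, not the DBS schema:

import cx_Oracle

def insert_block(cursor, block_name):
    # RETURNING ... INTO inserts the row and gets the generated id back in a
    # single round trip, instead of a separate "select seq.nextval from dual".
    new_id = cursor.var(cx_Oracle.NUMBER)
    cursor.execute("""
        INSERT INTO blocks (block_id, block_name)
        VALUES (seq_bk.nextval, :block_name)
        RETURNING block_id INTO :new_id""",
        block_name=block_name, new_id=new_id)
    return int(new_id.getvalue())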

ericvaandering commented 10 years ago

Author: yuyi Replying to [comment:30 lat]:

There is no "sequence juggling business" in my code. They are very simple code to support a complete sequence function.

What I've mentioned several times, starting at the September review, is that you don't need to retrieve sequence values into python. You can just use them directly in SQL, without ever retrieving the values into the server application. The extra roundtrip to the database, to fetch the sequence value and then use that value in an insert statement, is what I called the "juggling business". Normally you'd have SQL looking like this:

   insert into some_table (id, col1, col2)
   values (some_sequence.nextval, :col1, :col2)

Instead of what you do now:

  select some_sequence.nextval from dual

followed by:

   insert into some_table (id, col1, col2)
   values (:id_you_fetched_above, :col1, :col2)

Note that you can use "insert ... returning ..." statement to retrieve the inserted id if you need it back in the application. You still avoid the extra roundtrip to the database. If you are inserting lots of things -- like the insert block with many files etc. -- you can save potentially thousands of individual round-trips to the database to retrieve sequence values. The single insert with direct sequence use also runs more efficiently for variety of oracle technical reasons.

I understand your point. But the DBS case is a bit different. In most cases, we need to hold a sequence number and use it many times after it is initially generated. For example, a dataset ID (a sequence number) will be used to insert into the block table, the file table, the configuration table, and the list goes on. Using "insert .. returning .." may help a little bit, but not much, since we need the ID returned to the app anyway.

We use block inserts for files and a list of other tables so that we can insert a lot of data at one time. With a block insert, one has to build the entire data structure before the insertion, so it requires getting the sequence values into the app. In order to reduce database accesses, we set our sequence increment to a large number so that one select-sequence call can generate a bunch of IDs for a table.

So we pay attention to managing our sequences from both app and database side.
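
For illustration, a minimal sketch of the bulk "block insert" described above, with IDs taken from a pre-fetched sequence range; it assumes a cx_Oracle-style cursor, and the table and column names are made up:

def bulk_insert_files(cursor, first_id, files):
    # 'files' is a list of dicts with 'lfn' and 'dataset_id' already filled in;
    # ids come from a sequence range fetched once (large increment-by style).
    rows = [{'file_id': first_id + i,
             'lfn': f['lfn'],
             'dataset_id': f['dataset_id']}
            for i, f in enumerate(files)]
    # One executemany() round trip inserts the whole block of rows.
    cursor.executemany("""
        INSERT INTO files (file_id, logical_file_name, dataset_id)
        VALUES (:file_id, :lfn, :dataset_id)""", rows)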

ericvaandering commented 10 years ago

Author: lat Replying to [comment:31 yuyi]:

I understand your point. But the DBS case is a bit different. [...]

Since DBS performance is a concern, it's only natural we'll be guiding the design based on solid data from rigorous measurements. Once we introduce reliable profiling as a part of development workflow, we'll be able to determine pros and cons of alternate implementations more robustly. We'll investigate alternatives especially where measurements indicate much time is being spent. This is to say, I hear what you say, but please keep an open mind as we'll explore the different alternatives in due course. Based on my experience on high performance server and database code, I'd say it's quite likely we'll want to change the database interaction patterns.

ericvaandering commented 10 years ago

Author: hufnagel

We use block inserts for files and a list of other tables so that we can insert a lot of data at one time. With a block insert, one has to build the entire data structure before the insertion, so it requires getting the sequence values into the app. In order to reduce database accesses, we set our sequence increment to a large number so that one select-sequence call can generate a bunch of IDs for a table.

You can use the technique Lassi described with block inserts; that's not a problem. If you need to use the id in another call, you should be able to return the ids from the insert call with RETURNING (and maybe enough information to associate them with the rows). Or the other inserts could then retrieve the id again inside the call (another select look-up), but that could well be faster anyway. There are many ways to achieve the same result. In general, based on the optimization experience with the Tier0, anytime you return information from Oracle just to use it again in the next call, it's a warning sign. This will usually be slower (possibly much slower, up to orders of magnitude) than finding a way to let Oracle handle things internally.

A recent example from the Tier0: I had some code that retrieved file ids and then updated some properties associated with the files. I did some scale testing, which exercises this routine much more than usual, causing it to slow down the test. By moving the file id determination into the update call itself, the performance improved from 24 hours to 10 minutes (both the data retrieval and the update were already bulk calls before, btw). Unnecessary round-trips and unnecessary data passing between the database and the client application kill performance.

For your use case here, what you might want to look into is building all the data structures for all tables you need to insert into upfront, leaving out all Oracle assigned ids. Then insert everything at once with a single insert, using the Oracle 'INSERT ALL' feature. Basic example:

         INSERT ALL
           INTO table1
             (ID, PARAM1, PARAM2)
             VALUES (id, :param1, :param2)
           INTO table2
             (ID, PARAM3)
             VALUES (id, :param3)
         SELECT some_sequence.nextval AS id
         FROM some_sequence

ericvaandering commented 10 years ago

Author: lat Thanks Dirk, and of course I agree entirely. There's indeed a number of different constructs available for optimisation.

ericvaandering commented 10 years ago

Author: giffels I have compiled a list of results from our profiling. I used my rpm-based installation of DBS 3.0.11 to obtain the results, but without an Apache frontend; I hope that is okay. I patched the installation to be able to do profiling using the patch option of our setup.py script, which I inherited from Lassi's setup scripts. In addition, cherrypy was patched to use cProfile instead of Profile.

Insertion of a block containing 10 files each containing 10 lumisections: https://test-dbs3-profiling.web.cern.ch/test-dbs3-profiling/cgi-bin/igprof-navigator/insertBulkBlock-f10-L10

Insertion of a block containing 100 files each containing 100 lumisections: https://test-dbs3-profiling.web.cern.ch/test-dbs3-profiling/cgi-bin/igprof-navigator/insertBulkBlock-f100-L100

Insertion of a block containing 500 files each containing 500 lumisections: https://test-dbs3-profiling.web.cern.ch/test-dbs3-profiling/cgi-bin/igprof-navigator/insertBulkBlock-f500-L500

Insertion of a block containing 1000 files each containing 1000 lumisections: https://test-dbs3-profiling.web.cern.ch/test-dbs3-profiling/cgi-bin/igprof-navigator/insertBulkBlock-f1000-L1000

In general input validation takes about 5-10% of the time insertBulkBlock takes.

ericvaandering commented 10 years ago

Author: yuyi Replying to [comment:33 hufnagel]:

For your use case here, what you might want to look into is building all the data structures for all tables you need to insert into upfront, leaving out all Oracle assigned ids. Then insert everything at once with a single insert, using the Oracle 'INSERT ALL' feature. Basic example:

         INSERT ALL
           INTO table1
             (ID, PARAM1, PARAM2)
             VALUES (id, :param1, :param2)
           INTO table2
             (ID, PARAM3)
             VALUES (id, :param3)
         SELECT some_sequence.nextval AS id
         FROM some_sequence

Thanks for the suggestion. We will look into this, and maybe prototype the file insert case to see if we can speed up the insertion using the approach you suggested.

ericvaandering commented 10 years ago

Author: yuyi Almost Done. Need to pass all the unit tests. Set As August goal.

ericvaandering commented 10 years ago

Author: yuyi I just committed a full version of input validation for web security. The svn tag is DBS_3_0_11_a. The DAO object, business object and REST model refactoring will come soon. Please review DBS_3_0_11_a and let me know if DBS has fulfilled the input validation required for deployment on cmsweb.

ericvaandering commented 10 years ago

Author: yuyi Simon, Lassi:

Please review DBS_3_0_11_a for input validation.

Thanks, yuyi

ericvaandering commented 10 years ago

Author: metson I'll take a look at this tomorrow. Thanks for making the changes.

ericvaandering commented 10 years ago

Author: metson Sorry, I got held up in a vendor sales pitch and didn't get to this. Will go through the code tomorrow (before I leave on vacation).

ericvaandering commented 10 years ago

Author: metson Sorry if this is a bit short - trying to tie up loose ends before leaving on vacation.

>>> from WMCore.Lexicon import searchingstr
>>> candidate = "drop table files;"
>>> searchingstr(candidate)
True

e.g. querying the server with ?dataset=drop table files will pass drop table files to the DAO layer (which assumes that its input is safe and does no validation). While I think that wouldn't actually execute (the DAO does use bind variables, though because of how MySQL deals with binds I think it would execute if MySQL were used...), that string shouldn't get anywhere near the database layer, and the layers of code (REST, business, dao) make it hard for me to be totally confident that it wouldn't execute.

Somewhat unrelated to validation:

ericvaandering commented 10 years ago

Author: yuyi Replying to [comment:42 metson]:

Sorry if this is a bit short - trying to tie up loose ends before leaving on vacation.

Sorry, Simon, but I disagree with several of your comments. I hope we can discuss these before you leave for vacation.

  • dbsExceptionHandler is unnecessary and makes the code hard to follow - just raise the HTTP error in the model code. I guess that might be part of the refactor you mention above...

dbsExceptionHandler is used because it helps with logging. We don't have to repeat the logger(eCode + ": " + serverError) code everywhere we want to log an error. The dbsExceptionHandler provides the same interface for all the exceptions. In addition, HTTP errors cannot cover all the errors we have. I want to be more explicit for some errors so it helps to pinpoint a problem.

  • You use two different "paradigms" for validation, decorator and inlined code (in insertFile for instance), which makes the code hard to follow - pick one ;)

I chose two different validation methods for POST/PUT calls and GET calls because POST/PUT calls usually have much more input data to validate, so I use the inline call to loop over all the data. On the other hand, most GET calls have only a few inputs, as they are just the search criteria. Using the decorator to list all the possible values and types as input makes the validation very simple and shows exactly what needs to be validated at the top. One generic decorator can cover all the GET call validation, so there is less code.
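
A minimal sketch (not the committed DBS code; argument and API names are illustrative) of the kind of generic GET-validation decorator described here:

from dbs.utils.dbsExceptionHandler import dbsExceptionHandler

def validate_get_input(allowed):
    # 'allowed' maps argument name -> expected type, e.g. {'dataset': str};
    # GET values usually arrive as strings, so in real code they may need to
    # be coerced to the declared type before this check.
    def decorator(func):
        def wrapper(self, **kwargs):
            for key, value in kwargs.items():
                if key not in allowed:
                    dbsExceptionHandler('dbsException-invalid-input',
                                        'Unexpected argument: %s' % key)
                if not isinstance(value, allowed[key]):
                    dbsExceptionHandler('dbsException-invalid-input',
                                        'Invalid type for argument: %s' % key)
            return func(self, **kwargs)
        return wrapper
    return decorator

#Usage sketch:
#@validate_get_input({'dataset': str, 'run_num': int})
#def listFileSummaries(self, dataset=None, run_num=None): ...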

  • What's the reason to not use the "standard" RESTModel input validation? I forget if there was a reason or not...

I wrote less code if I use a decorator.

  • the validation you have is a good first step (e.g. checking type, checking strings for some potentially dangerous characters, not reflecting input back to the client) but it is insufficient. Take, for example, listFileSummaries. The function takes block_name, dataset and run_num. These are validated as string, string, int (which is good!) but that's it. I would expect the strings to then be validated as correct block/dataset names (using block and dataset from Lexicon.py). This is done for POST/PUT but seems not to be for GET?

Our understanding is that a POST/PUT call means a client wants to insert data into DBS or wants to do a search with a list input that is not allowed to use any wildcards. With a GET call, in general, a client is searching DBS with minimal information and may use wildcards in the input. So I cannot restrict the names to the format defined in Lexicon.py. Although some database-heavy GET call APIs require fully qualified input too, that is checked by the APIs themselves. Here we are doing input validation, and I understand that it focuses on whether the input can cause any harm/attack to the server.

  • In fact the validation is dangerous - it gives a false sense of security. Using the searchingstr from #2089 I can do the following:
>>> from WMCore.Lexicon import searchingstr
>>> candidate = "drop table files;"
>>> searchingstr(candidate)
True

e.g. querying the server with ?dataset=drop table files will pass drop table files to the DAO layer (which assumes that its input is safe and does no validation). While I think that wouldn't actually execute (the DAO does use bind variables, though because of how MySQL deals with binds I think it would execute if MySQL were used...), that string shouldn't get anywhere near the database layer, and the layers of code (REST, business, dao) make it hard for me to be totally confident that it wouldn't execute.

Why do we worry about mysql? We said that DBS3 will not support mysql last September, right? As you pointed out, the bind variables are used in the dao. In fact, they are the fundamental protection for us against a lot of sql attacks. There are more evil DDL statements than "drop table"; should we check for each of them in the input?

  • I ''think'' this wouldn't happen for POST type queries, since they would use validateJSONInputNoCopy, validateStringInput and the regexps from Lexicon

As I said earlier, for searching the DB, we cannot enforce LFN, dataset, block and so on to match Lexicon.

Somewhat unrelated to validation:

  • Instead of @tools.secmodv2(authzfunc=authInsert) you could set secured and security_params in the _addMethod call. Alternatively, if all the functions need the same security level, I think you could use the @tools.secmodv2 on the model's handler method - e.g. do it once for all methods.

I will look into this.

Thanks, Yuyi

ericvaandering commented 10 years ago

Author: yuyi Replying to [comment:43 yuyi]:

Why do we worry about mysql? We said that DBS3 will not support mysql last September, right? As you pointed out, the bind variables are used in the dao. In fact, they are the fundamental protection for us against a lot of sql attacks. There are more evil DDL statements than "drop table"; should we check for each of them in the input?

I talked with Dave about this. Here is what we can do to eliminate the "evil DDLs": 1) disallow white space in the searchingstr and namestr functions of Lexicon.py, or 2) check for db keywords: drop, alter, delete, insert, create, or 3) check for DBS3 table names.

I checked the names (such as LFN, block_name, dataset_name, tiers and so on) in the current production DB. None of the names in the DB contains white space. In addition, a check with Lexicon.py confirms this too. So a simple way to eliminate the DDLs is to choose #1 above. I will update my Lexicon to do so.

ericvaandering commented 10 years ago

Author: metson Replying to [comment:43 yuyi]:

dbsExceptionHandler is used because it helps with logging. We don't have to repeat the logger(eCode + ": " + serverError) code everywhere we want to log an error. The dbsExceptionHandler provides the same interface for all the exceptions. In addition, HTTP errors cannot cover all the errors we have. I want to be more explicit for some errors so it helps to pinpoint a problem.

I think you need to be careful about what you log. A user typing the wrong URL isn't something that needs to be in the logs IMHO.

I chose two different validation methods for POST/PUT calls and GET calls because POST/PUT calls usually have much more input data to validate, so I use the inline call to loop over all the data. On the other hand, most GET calls have only a few inputs, as they are just the search criteria. Using the decorator to list all the possible values and types as input makes the validation very simple and shows exactly what needs to be validated at the top. One generic decorator can cover all the GET call validation, so there is less code.

I disagree that a single decorator for GET's is sufficient. You're at best moving the problem down the stack - e.g. you'll get a database error that becomes a 404 rather than bailing in the REST layer and returning a 400. Another way to look at it is if you see some input like "Totally Wrong Dataset Name" why ask the database to look that up and add load to the Oracle back end?

  • What's the reason to not use the "standard" RESTModel input validation? I forget if there was a reason or not...

I wrote less code if I use a decorator.

Fair enough.

Our understanding is that a POST/PUT call means a client wants to insert data into DBS or wants to do a search with a list input that is not allowed to use any wildcards. With a GET call, in general, a client is searching DBS with minimal information and may use wildcards in the input. So I cannot restrict the names to the format defined in Lexicon.py. Although some database-heavy GET call APIs require fully qualified input too, that is checked by the APIs themselves. Here we are doing input validation, and I understand that it focuses on whether the input can cause any harm/attack to the server.

I think the input validation should be checking that the user input has a vague hope that the database won't reject it, which means making sure it's safe but also that it's vaguely correctly formed. Think of it as protecting the server and reducing the work that the database has to do.

Another thing to think about is DBS doesn't have a query interface anymore (that's provided by DAS) so it can be stricter than previously. It's no longer a user facing tool

Why do we worry about mysql? We said that DBS3 will not support mysql last September, right?

Right, but the code is still there, and if at some point in the future we have to add MySQL support (you never know...) you might be surprised that the protection that works for Oracle doesn't for MySQL. I'd remove the code from SVN to make it clear what is in use; restoring it from the SVN history is simple should it ever be needed.

As you pointed out, the bind variables are used in the dao. In fact, they are the fundamental protection for us against a lot of sql attacks. There are more evil DDL statements than "drop table"; should we check for each of them in the input?

You don't need to check for them all. None of the strings people will be sending to DBS should look like a SQL statement; if you ensure that dataset search strings look like dataset names with wildcards in them, you don't need to worry about drop-table-style queries (I really hope we never have a dataset called drop table!!)

  • I ''think'' this wouldn't happen for POST type queries, since they would use validateJSONInputNoCopy, validateStringInput and the regexps from Lexicon

As I said earlier, for searching the DB, we cannot enforce LFN, dataset, block and so on to match Lexicon.

You can be more restrictive than you are, you just need a different regexp :) If you know that the search is a dataset name you can restrict the input to strings like "/Higgs/*/RECO". I've attached a first pass at a Lexicon check that has search but restricts to patterns like dataset or block names. You could do similar for lfn's. I think PhEDEx may already have these regexps (I noticed an error from the datasvc that implies they're doing this sort of check) - probably worth mailing Tony and putting the same regexps into Lexicon, assuming they have them.
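
For illustration, a regexp in the spirit of this suggestion (hypothetical, not the attached Lexicon patch): a dataset search string must still look like /primary/processed/TIER, but each part may contain wildcards.

import re

SEARCH_DATASET_RE = re.compile(r'^/[\w\*\-]+/[\w\*\-]+/[A-Z\*\-]+$')

def searchdataset(candidate):
    # Follows the WMCore.Lexicon convention of raising on invalid input.
    if not SEARCH_DATASET_RE.match(candidate):
        raise AssertionError("'%s' is not a valid dataset search pattern" % candidate)
    return True

# searchdataset('/Higgs/*/RECO')      -> True
# searchdataset('drop table files;')  -> raises AssertionError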

ericvaandering commented 10 years ago

Author: metson Please Review

ericvaandering commented 10 years ago

Author: metson Oops, changed the component by accident...

ericvaandering commented 10 years ago

Author: yuyi Replying to [comment:45 metson]:

Replying to [comment:43 yuyi]:

dbsExceptionHandler is used because it helps with logging. We don't have to repeat the logger(eCode + ": " + serverError) code everywhere we want to log an error. The dbsExceptionHandler provides the same interface for all the exceptions. In addition, HTTP errors cannot cover all the errors we have. I want to be more explicit for some errors so it helps to pinpoint a problem.

I think you need to be careful about what you log. A user typing the wrong URL isn't something that needs to be in the logs IMHO.

You are right that we are not logging every error. If it is a trivial error from a user, such as a wrong url or a wrong data type for a simple API, we do not log it. Important errors are logged. So an exception handler hides these details.

I chose two different validation methods for POST/PUT calls and GET calls because POST/PUT calls usually have much more input data to validate, so I use the inline call to loop over all the data. On the other hand, most GET calls have only a few inputs, as they are just the search criteria. Using the decorator to list all the possible values and types as input makes the validation very simple and shows exactly what needs to be validated at the top. One generic decorator can cover all the GET call validation, so there is less code.

I disagree that a single decorator for GET's is sufficient. You're at best moving the problem down the stack - e.g. you'll get a database error that becomes a 404 rather than bailing in the REST layer and returning a 400. Another way to look at it is if you see some input like "Totally Wrong Dataset Name" why ask the database to look that up and add load to the Oracle back end?

  • What's the reason to not use the "standard" RESTModel input validation? I forget if there was a reason or not...

I wrote less code if I use a decorator.

Fair enough.

Our understanding is that a POST/PUT call means a client wants to insert data into DBS or wants to do a search with a list input that is not allowed to use any wildcards. With a GET call, in general, a client is searching DBS with minimal information and may use wildcards in the input. So I cannot restrict the names to the format defined in Lexicon.py. Although some database-heavy GET call APIs require fully qualified input too, that is checked by the APIs themselves. Here we are doing input validation, and I understand that it focuses on whether the input can cause any harm/attack to the server.

I think the input validation should be checking that the user input has a vague hope that the database won't reject it, which means making sure it's safe but also that it's vaguely correctly formed. Think of it as protecting the server and reducing the work that the database has to do.

I agree with you on this, but I don't know what the proper rule for a vaguely correct format would be. I have added/cc'ed Valentin here. He may have some ideas. From the DAS side, what does he think the format should be?

Another thing to think about is DBS doesn't have a query interface anymore (that's provided by DAS) so it can be stricter than previously. It's no longer a user facing tool

Yes. We can do it as long as we agree on the format.

Why do we worry about mysql? We said that DBS3 will not support mysql last September, right?

Right, but the code is still there, and if at some point in the future we have to add MySQL support (you never know...) you might be surprised that the protection that works for Oracle doesn't for MySQL. I'd remove the code from SVN to make it clear what is in use; restoring it from the SVN history is simple should it ever be needed.

Sure we can do that.

As you pointed out, bind variables are used in the DAO. In fact, they are our fundamental protection against a lot of SQL attacks. There is more evil DDL than just "drop table"; should we check for each of them in the input?

You don't need to check for them all. None of the strings people will be sending to DBS should look like a SQL statement; if you ensure that dataset search strings look like dataset names with wildcards in them, you don't need to worry about drop-table-style queries (I really hope we never have a dataset called drop table!!)
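(As a generic illustration of the bind-variable point made above, not the actual DBS DAO code; the table and column names are invented.)

```python
# The user-supplied pattern travels as a bind variable, never spliced into the
# SQL text, so a value like "x'; drop table datasets; --" stays harmless data.
def list_datasets(cursor, user_pattern):
    sql = "SELECT d.dataset FROM datasets d WHERE d.dataset LIKE :dataset"
    cursor.execute(sql, {"dataset": user_pattern.replace("*", "%")})
    return [row[0] for row in cursor.fetchall()]
```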

Do you mean the dataset search string is something like `/ABCD/*/*`? That means the user needs to know exactly where the part of the dataset name s/he knows should go: is it behind the 1st "/" or the 2nd "/"? I don't mind doing this, but we all need to agree on it. This is a rather restrictive search. I hope we don't frustrate people when we ask them to provide a dataset search string like this.

  • I *think* this wouldn't happen for POST-type queries, since they would use validateJSONInputNoCopy, validateStringInput and the regexps from Lexicon

As I said earlier, for searching the DB, we cannot enforce LFN, dataset, block and so on to match Lexicon.

You can be more restrictive than you are, you just need a different regexp :) If you know that the search is a dataset name you can restrict the input to strings like "/Higgs/*/RECO". I've attached a first pass at a Lexicon check that has search but restricts to patterns like dataset or block names. You could do similar for lfn's. I think PhEDEx may already have these regexps (I noticed an error from the datasvc that implies they're doing this sort of check) - probably worth mailing Tony and putting the same regexps into Lexicon, assuming they have them.
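(The actual pattern is in the file attached to the ticket; the regexp below is only a hypothetical illustration of such a search-aware check.)

```python
import re

# Dataset-name-shaped search strings: three '/'-separated fields, each of
# which may contain '*' wildcards. Free text like "Totally Wrong Dataset
# Name" is rejected before it ever reaches Oracle.
SEARCH_DATASET_RE = re.compile(r"^/[\w\-*]+/[\w\-*.]+/[A-Z\-*]+$")

def searchdataset(candidate):
    if not SEARCH_DATASET_RE.match(candidate):
        raise AssertionError("%s does not look like a dataset pattern" % candidate)
    return candidate

searchdataset("/Higgs/*/RECO")                 # accepted
# searchdataset("Totally Wrong Dataset Name")  # raises AssertionError
```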

Thanks. I will look into your code tomorrow at work. As I said, I don't mind doing it, as it might protect the DBS server in some way. I hope we are all on the same page on this; in particular, I'd like to hear from Valentin since DAS is an important DBS client.

ericvaandering commented 10 years ago

Author: metson Replying to [comment:48 yuyi]:

I think the input validation should be checking that the user input has a vague hope that the database won't reject it, which means making sure it's safe but also that it's vaguely correctly formed. Think of it as protecting the server and reducing the work that the database has to do.

I agree with you on this, but I don't know what the proper rule for a vaguely correct format is. I added/cc'ed Valentin here; he may have some ideas. From the DAS side, what does he think the format should be?

Yeah, I think Val's input would be useful. I don't see it as a problem, but we should double check.

Another thing to think about is DBS doesn't have a query interface anymore (that's provided by DAS) so it can be stricter than previously. It's no longer a user facing tool

Yes. We can do it as long as we agree on the format.

Well, the format of dataset names, block names and lfn's are all well defined (and have been for years).

Right, but the code is still there, and if at some point in the future we have to add MySQL support (you never know...) you might be surprised that the protection that works for Oracle doesn't for MySQL. I'd remove the code from SVN to make it clear what is in use; restoring it from the SVN history is simple should it ever be needed.

Sure we can do that.

Might be worth doing a general clean up - there's a load of java code, too. Maybe after the next release.

You don't need to check for them all. None of the strings people will be sending to DBS should look like a SQL statement; if you ensure that dataset search strings look like dataset names with wildcards in them, you don't need to worry about drop-table-style queries (I really hope we never have a dataset called drop table!!)

Do you mean the dataset search string is something like `/ABCD/*/*`? That means the user needs to know exactly where the part of the dataset name s/he knows should go: is it behind the 1st "/" or the 2nd "/"? I don't mind doing this, but we all need to agree on it. This is a rather restrictive search. I hope we don't frustrate people when we ask them to provide a dataset search string like this.

Again, remember that DBS isn't really a user-facing tool any more. The query language is managed by DAS. So long as we can agree with Valentin on a suitable search API for DAS, we can leave the rest to him. IMHO that means the DBS3 search API can be more restrictive than the DBS2 one, and we should take lessons from the previous version and its usage.

You can be more restrictive than you are, you just need a different regexp :) If you know that the search is a dataset name you can restrict the input to strings like "/Higgs/*/RECO". I've attached a first pass at a Lexicon check that has search but restricts to patterns like dataset or block names. You could do similar for lfn's. I think PhEDEx may already have these regexps (I noticed an error from the datasvc that implies they're doing this sort of check) - probably worth mailing Tony and putting the same regexps into Lexicon, assuming they have them.

Thanks. I will look into your code tomorrow at work. As I said, I don't mind doing it, as it might protect the DBS server in some way. I hope we are all on the same page on this; in particular, I'd like to hear from Valentin since DAS is an important DBS client.

You could tweak the regexp to allow a `*` search, but that's currently not in there. I think I'd prefer that DAS managed that (e.g. did `*` -> `/*/*/*`). That said, in the past such open queries caused problems, especially in DD. I don't think it's unreasonable to require some constraints on searches. The other option is significant schema testing and optimisation so that these open queries aren't an issue.

ericvaandering commented 10 years ago

Author: valya Replying to [comment:49 metson]:

Replying to [comment:48 yuyi]:

I think the input validation should be checking that the user input has a vague hope that the database won't reject it, which means making sure it's safe but also that it's vaguely correctly formed. Think of it as protecting the server and reducing the work that the database has to do.

I agree with you on this, but I don't know what the proper rule for a vaguely correct format is. I added/cc'ed Valentin here; he may have some ideas. From the DAS side, what does he think the format should be?

Yeah, I think Val's input would be useful. I don't see it as a problem, but we should double check.

Don't treat DAS as an AI system. My opinion is that the underlying data-service should perform its own syntax/semantics checking. For instance, a user should be allowed to pass a non-existent dataset path that satisfies the correct syntax, e.g. dataset=/a/b/c. If dataset=/a/b/c does not exist in DBS it is not an ERROR, since the path is correct; it is simply the fact that the dataset does not exist in DBS. And the only way to find out what exists in DBS is to query the DB, rather than tweak the lexicon.
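(A toy illustration of that split between syntax checking and existence checking; the helper name is made up.)

```python
# Syntax check only: /a/b/c is well formed even if it is not in DBS.
def dataset_is_well_formed(path):
    parts = path.split("/")
    return len(parts) == 4 and parts[0] == "" and all(parts[1:])

assert dataset_is_well_formed("/a/b/c")      # valid syntax, existence unknown
assert not dataset_is_well_formed("a/b/c")   # malformed, reject before the DB
# Whether /a/b/c actually exists is answered by querying the DB; an empty
# result set is not an error.
```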

You can be more restrictive than you are, you just need a different regexp :) If you know that the search is a dataset name you can restrict the input to strings like "/Higgs/*/RECO". I've attached a first pass at a Lexicon check that has search but restricts to patterns like dataset or block names. You could do similar for lfn's. I think PhEDEx may already have these regexps (I noticed an error from the datasvc that implies they're doing this sort of check) - probably worth mailing Tony and putting the same regexps into Lexicon, assuming they have them.

Thanks. I will look into your code tomorrow at work. As I said, I don't mind doing it, as it might protect the DBS server in some way. I hope we are all on the same page on this; in particular, I'd like to hear from Valentin since DAS is an important DBS client.

You could tweak the regexp to allow a `*` search, but that's currently not in there. I think I'd prefer that DAS managed that (e.g. did `*` -> `/*/*/*`). That said, in the past such open queries caused problems, especially in DD. I don't think it's unreasonable to require some constraints on searches. The other option is significant schema testing and optimisation so that these open queries aren't an issue.

I think we agreed that usage of wildcards should be limited to certain APIs. If the dataset API allows wildcards, I don't want to perform special tweaking to convert `*` -> `/*/*/*`. It is dangerous per se. For example, one user can type `*`, which means any dataset, while another can type `*Run`. In the latter case it is unclear to which part of the path it should be applied: should it be `/*Run*/*/*` or `/*/*Run*/*`? How about an `EC` pattern? It can be in three different places, `/*EC*/*/*`, `/*/*EC*/*` or `/*/*/*EC*` (yes, it does match the RECO tier). So it's up to DBS to apply such a wildcard to all datasets and find the correct result set, rather than being a job for Lexicon to decide about the correctness and placement of the pattern. Yes, I do understand the optimization issue, but we said before that wildcards would be limited to certain APIs and some APIs just need to accept them. For instance, the dataset API should do that, and I don't see any optimization issue here, since the DB will use an index on a single DB column.
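(To make the placement ambiguity concrete, a toy sketch: the dataset names are invented and fnmatch merely stands in for the database-side matching.)

```python
import fnmatch

# Expanding a pattern such as '*Run*' into path slots up front is ambiguous:
candidates = ["/*Run*/*/*", "/*/*Run*/*", "/*/*/*Run*"]  # which one was meant?

# Matching the whole dataset path in one go sidesteps the guessing and keeps
# the comparison on the single dataset column.
datasets = ["/Higgs/Run2011A-v1/RECO", "/MinBias/Summer11-v2/GEN-SIM"]
print([d for d in datasets if fnmatch.fnmatch(d, "*Run*")])
# -> ['/Higgs/Run2011A-v1/RECO']
```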