futuregrid / cloud-metrics

Project to create usage statistics from IaaS such as OpenStack, Eucalyptus, and Nimbus

Manual page for fg-euca-log-parser #3

Closed laszewsk closed 11 years ago

laszewsk commented 12 years ago

The fg-parse command is a template for how we can manage euca log files.

It shows how the command can be run in two modes:

a) a mode in which just the log files are managed
b) a mode in which the log files are parsed and the contents are written into a database

With increasing data volume, the second mode is to be preferred. A proper manual page that explains the usage needs to be devised for our program, and this manual page must be written before any further development continues.

We may actually have several other programs that use the data stored in the database, for example for the creation of web pages and for analytics based on that data.

We require that these programs be documented first. Also, provide a proper argparse framework.
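
A minimal sketch of such an argparse setup (the option names below are illustrative assumptions, not necessarily the flags fg-parser ends up with):

```python
# Illustrative argparse skeleton; option names are assumptions, not the final fg-parser interface.
import argparse

def main():
    parser = argparse.ArgumentParser(
        prog="fg-parser",
        description="Parse Eucalyptus cc.log files and optionally ingest them into a database.")
    parser.add_argument("-i", "--input", required=True,
                        help="directory containing the backed-up cc.log files")
    parser.add_argument("-s_date", help="start date of the period to analyze (YYYYMMDD)")
    parser.add_argument("-e_date", help="end date of the period to analyze (YYYYMMDD)")
    parser.add_argument("--db", action="store_true",
                        help="write parsed instance data into the database instead of a file")
    args = parser.parse_args()
    print(args)

if __name__ == "__main__":
    main()
```

With this in place, `fg-parser -h` prints a usage summary generated from the option definitions, which already serves as a first, minimal manual page.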

lee212 commented 12 years ago

It seems you already made draft manual notes in the FGParser.py file, and they define well what we need/want to implement. Here are my questions. Your manual page seems to describe our future plan, because most of the arguments are NOT working now. Do you want me to define and write a draft manual page describing how to use the fg-* commands with proper arguments? If we are on the same page, I will start to think about which arguments we want and need to have, and then adopt the plan into argparse.

laszewsk commented 12 years ago

I would like us to implement ASAP what we have in the manual ... ;-)

You could augment the manual with comments on what is not working yet.

The reason I did this is so you know where we are moving to. If there is a problem with what I suggested (which I forgot by now), you may want to stop by the office and discuss what to do and not do, or have a skype meeting.

fugangwang commented 12 years ago

As the final goal is to produce statistics for each month, or for any time period we may want to look into, I think the high priority is the 'from' and 'to' datetime handling. So you may want to do the following:

1> Add the argparse framework. The good thing is that the content for '-h/--help' is generated automatically from the options you define, so this serves as a simplified manual first.
2> Implement the handling of the 'from' and 'to' parameters.
3> Maybe also add a 'db' or 'file/stdout' switch, so you can have the result output directly to the screen, or write the parsed instance data into the db.
4> Test on some log files to see whether the parsed data are correct - as mentioned, the pending/extant/terminated start/end datetimes may not be correct; if so, you should fix this. If your original perl script can produce the same statistics, compare the results from the two sets of scripts.
5> Estimate how long it would take to process all the data we have so far (see the sketch below). E.g., let's say it takes 1 minute to process 100M of log files; we then have a speed of 100M/min. You can also run several small tests to see whether this is a reliable estimate. This gives us a rough running time so we can prepare to run on the whole data set during the weekend (by setting the 'from' and 'to' time period from the earliest date for which we have log data to the current date).
6> As Gregor also mentioned, the right architecture is: process the log -> store the interesting data -> other programs read the data and present it (possibly after some less computation-intensive processing). The parser script focuses on the first 2 steps (although for testing purposes it can generate charts directly). Once we have the statistics in the database, a separate program, which could be a python script again, or php+javascript from the portal, could retrieve the data and display the results.
7> Once we have the whole workflow worked out, you can then think about adding/implementing more options if desired. A more formal manual page could be produced then.
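
As a rough illustration of the estimation in step 5, a few lines of python are enough to turn a measured throughput into a total running time (the directory path and the 100M/min rate are just the example numbers from this comment):

```python
# Back-of-the-envelope runtime estimate for step 5; measure mb_per_minute on a
# small test run first, the default here is only the example figure above.
import os

def estimate_minutes(log_dir, mb_per_minute=100.0):
    total_bytes = sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, files in os.walk(log_dir)
        for name in files)
    return (total_bytes / (1024.0 * 1024.0)) / mb_per_minute

# e.g. estimate_minutes("/path/to/log/backup") -> minutes needed for a full run
```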

lee212 commented 12 years ago

Let me update the manual pages for the scripts with ONLY the currently working parameters (i.e. -i, -o, -s_date, and -e_date). I will hold the undeveloped parameters back until we have implemented them. Fugang - yes, I already applied argparse to all python scripts, and there is no perl script anymore; everything has been converted to python. Also, I added -s_date and -e_date as the from and to parameters, but I don't have functions to save the result to the database. So, instead of the mysql database, I am using a CSV file to save the data analyzed by FGParser.py (see the make_report function in the file). It seems result tables and retrieval functions still need to be added. All my current work is in the 'easy_install' branch.
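
A minimal sketch of such a CSV dump, assuming the per-user statistics are available as a dictionary (the function name, argument layout, and column set mirror the sample file shown in the next comment but are otherwise made up):

```python
# Hypothetical CSV writer for the per-user statistics; names and layout are assumptions.
import csv

def write_user_stats_csv(path, users):
    """users: dict mapping user name -> dict with count/sum/min/avg values."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "count", "sum", "min", "avg"])
        for name, stats in sorted(users.items()):
            writer.writerow([name, stats["count"], stats["sum"],
                             stats["min"], stats["avg"]])
```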

lee212 commented 12 years ago

On second thought, if someone can provide me with db tables and retrieval functions, storing the data in the db would be much easier. Currently, we have a csv file that looks like this:

user.data.of.eucalyptus.india.20111231-20111231.csv
name, count, sum, min, avg
steenoven, 5, 6332270.0, 1266454.0, 1266454.0
...

It doesn't have a date on the record, so our result table would look like this:

Date, ownerId, used seconds, (max), (min), (avg), number of running instances, instances
2012-02-21, aghegde, 6332270.0, 1266454.0, 1266454.0, 1266454.0, 2, i-3A220843;i-4B760819
...

I can see some functions in FGEucaMetricsDB.py, but I am not sure how to use them, and they look dedicated to the instance table. Once I have a new table and functions for the results, the web display (/log-analyzer/www/cgi-bin/metrics.cgi) can use the database instead of csv files.
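
One possible MySQL schema for that result table, executed through the existing db connection (table and column names here are assumptions derived from the record layout above, not anything already in FGEucaMetricsDB.py):

```python
# Hypothetical result-table schema; names are assumptions, not existing code.
USERSTATS_DDL = """
CREATE TABLE IF NOT EXISTS userstats (
    stat_date  DATE        NOT NULL,
    ownerId    VARCHAR(64) NOT NULL,
    used_sec   DOUBLE,
    max_sec    DOUBLE,
    min_sec    DOUBLE,
    avg_sec    DOUBLE,
    n_running  INT,
    instances  TEXT,
    PRIMARY KEY (stat_date, ownerId)
)
"""

def create_userstats_table(cursor):
    """cursor: a DB-API cursor obtained from the existing MySQL connection."""
    cursor.execute(USERSTATS_DDL)
```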

fugangwang commented 12 years ago

Hyungro, we are not saving the user statistics results into the db at this time. What we save is the data for the instances, in which the ownerId, the starting/ending/pending/extant/termination times, etc. are stored. Since it is time-consuming to extract these data, we run through the log once and save them into the db. Then, from this data, we can produce the statistics for instances directly, and it takes very little further processing to get the user statistics as well.

So, in your make_report(), after these calls:

parse_file(input_dir + "/" + filename, instances.add, debug=False, progress=True)
instances.calculate_delta()

and before this one:

instances.calculate_user_stats(users)

you could add:

instances.write_to_db()

so the processed instances are stored in the db as well.

Then, the next time the report is generated, instead of using parse_file(), we could call instances.read_from_db() to populate the instances data.

See also the two tests: test_sql_read() and test_sql_write()
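
Putting those pieces together, the resulting make_report() flow might look roughly like this (the use_db switch and the exact signature are assumptions; parse_file, calculate_delta, write_to_db, read_from_db, and calculate_user_stats are the functions named above):

```python
# Sketch only: function names come from the comment above, the signature and
# the use_db flag are illustrative assumptions.
def make_report(input_dir, filename, instances, users, use_db=False):
    if use_db:
        # reuse instance data stored during a previous run
        instances.read_from_db()
    else:
        parse_file(input_dir + "/" + filename, instances.add,
                   debug=False, progress=True)
        instances.calculate_delta()
        instances.write_to_db()   # store processed instances for later reuse
    instances.calculate_user_stats(users)
```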

fugangwang commented 12 years ago

In my opinion, it is not worth storing the user statistics in a table, since they are very dynamic and the values would change every time new data comes in. In contrast, once an instance has reached the end of its life cycle, the associated data is permanent. And, as already discussed, the time-consuming part is getting the instance data; once we have it in the db, it is quite simple to generate statistics along the user dimension.

lee212 commented 12 years ago

I see your point, Fugang. It does make sense. Maybe I had a different approach in mind: since I realized what information the log files contain, such as instance id, start time, user id, etc., I assumed the analyzed data, in some form, would be used to produce any kind of report or graph.

About the db functions, they seem easy to use, and I will try them again. Actually, the last time I tested the write_to_db() function I got unknown errors due to abnormal log lines, so I just skipped using the function.

Also, I am not sure why we keep the backup log files even though we have a database for the data. I feel like we need only one resource, either the backup log files or the database. If we want to use the database, which I prefer, I guess the backup files can be abandoned.

fugangwang commented 12 years ago

If the error is related to the instance data, then please report it so we can test the db interface further. We only tested it with some limited data anyway.

The log backup should not be abandoned, since it contains all the original info, which may not be captured in the db but may be needed in the future. After a log file has been processed and the instance info has been populated into the db, we can load that portion of the info from the db, while the log file itself could be archived (I agree it should perhaps be moved to a separate directory, but then we may need to change the log decompressor/merger as well).
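
A small sketch of what such an archival step could look like (the helper and the directory layout are assumptions; as noted, the decompressor/merger would have to be adjusted accordingly):

```python
# Hypothetical helper: move a processed log file to an archive directory
# instead of deleting it. Directory layout is an assumption.
import os
import shutil

def archive_log(log_path, archive_dir):
    if not os.path.isdir(archive_dir):
        os.makedirs(archive_dir)
    shutil.move(log_path, os.path.join(archive_dir, os.path.basename(log_path)))
```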

lee212 commented 12 years ago

Quick question. If we don't save the analyzed data (the output of fg-parser), do we re-analyze the data from the instance table every time we want to populate graphs and reports? What about month-range and year-range data? Let's say a user clicks a web graph page to display utilization data between 2011 and 2012 - do we then run fg-parser.py to analyze the instance data from the database on the fly? Even if that does not take long, I don't feel it is a good way to implement this. We can provide users with static graph web pages built from pre-generated gif/png images, but that is not a dynamic graph that handles user requests. This question keeps bothering me and tells me that I need to save the analyzed data somewhere, in a generalized form that makes it usable in diverse ways. In other words, I feel we need to separate the parser, the display (built from the parsed data), and the database that stores the data.

laszewsk commented 12 years ago

That's the reason why we ask how long it takes ;-)

At this time we can just recalculate ... at least for the things we want.

fugangwang commented 12 years ago

In a web environment, you could load all the instance data (only once per analysis session) from the db and process it on the client side, e.g., in php or javascript. The result data would be stored in memory temporarily. We don't have to run the parser again; we just want to present the data from another perspective. Given an array/list of instance data with the starting/ending times and the ownerId, it is quite simple to generate, dynamically, any statistics a user may want. In other words, we have a rich client, not a thin client that simply displays existing data/graphs. If you then find such processing too costly in time or space, you could of course store some of the results, or even all of them if you can enumerate all the possible requests from the users. As for your last point, that is true. As Gregor also mentioned, multiple programs may use the database. In one of my previous comments I also described the process as processing -> store -> presenting. Currently the parser has a kind of hybrid role because a> it is easy to debug/test with a small amount of data, and b> it can generate statistics in a command line environment.
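
To illustrate the rich-client idea: once the instance records (ownerId plus start/end datetimes) are loaded from the db, per-user statistics for an arbitrary time window can be computed in memory with a few lines (the record layout here is an assumption for the example):

```python
# Per-user usage for an arbitrary window, computed from in-memory instance data.
# instances: iterable of (owner, start, end) tuples with datetime start/end.
from collections import defaultdict

def user_seconds(instances, t_from, t_to):
    totals = defaultdict(float)
    for owner, start, end in instances:
        overlap = (min(end, t_to) - max(start, t_from)).total_seconds()
        if overlap > 0:
            totals[owner] += overlap
    return dict(totals)
```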

lee212 commented 12 years ago

If we are going to publish our fg-euca-log-parser to the public via pypi, we may want to make our program simpler, I guess. By a simpler program, I mean we might not ask a user to keep all the backup cc.log files as we do. Also, we might want to give the user an option whether to use a database or not. I don't want to change our original plans and implementations, but I want to make sure we all agree on this view for our future implementation.

fugangwang commented 12 years ago

Whether to keep the log files or not is a deployment/admin issue. The parser program itself makes no assumption about this. We make recommendations on how to prepare the log files (with tools for that as well), and the user is free to remove the processed logs. We are not forcing the use of the db either, but it is recommended. The parser program itself ran fine and spat out results before we added the database layer. However, by introducing the db layer we store the extracted relevant info so we don't have to compute it again and again. Furthermore, the results can be consumed by multiple entities, whether it is a command line tool, a web site, or something integrated into the drupal portal. Essentially we could also use other data persistence mechanisms, files for instance; however, in that case you have to implement the CRUD operations yourself for the file-based solution.

laszewsk commented 12 years ago

Our code that we push into pypi is meant for analysing all log files gathered by eucalyptus. It has the following components:

a) a program that gathers all log files into a backup and manages the names of the log files (done)
b) a database framework that allows ingestion of any log information into the database, while data already included in the database is either ignored or left unchanged (done for print_ccInstances; others, e.g. the resources, need to follow - see the sketch below)
c) a framework that produces useful statistics from the log files
d) a framework to display the statistics by user and in reports
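
A sketch of the idea behind item b), ingesting records idempotently so that re-processing a log neither duplicates nor changes existing rows (table and column names are placeholders, not the actual FGEucaMetricsDB schema; it assumes a primary key on instanceId so that MySQL's INSERT IGNORE simply skips rows that are already present):

```python
# Placeholder names; assumes a primary key on instanceId so that re-ingesting
# the same instance is silently ignored.
def ingest_instance(cursor, instance_id, owner, t_start, t_end):
    cursor.execute(
        "INSERT IGNORE INTO instance (instanceId, ownerId, t_start, t_end) "
        "VALUES (%s, %s, %s, %s)",
        (instance_id, owner, t_start, t_end))
```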

It will be up to the user of this program to decide whether he wants to delete log files, compress them, or remove portions of the log files to save space. Our own use case is motivated by keeping, at this time, all log files unchanged.

By the way:

a) the current redesigned program is extremely simple
b) the current redesigned program we provide allows ingesting data into the db at any time and in any order
c) the current redesigned program we provide also allows the analysis of data outside of the database if desired

A program that does not have the capability to analyse all log files is useless for us and cannot be part of our development nor of our release target.

If a user would like to work with only a subset of the data, the user can always do that by deleting files and rerunning our programs. If you find this important, you could provide documentation about it, or write a 10-line python script on top of our framework that deletes log files older than, say, x days and reruns the analysis. However, we will keep all of our log files for now.

I am not sure whether you also noticed that much of the processing of the data could be moved into simple sql queries; however, for the analysis of the instances, our data structure is quite convenient and one can do this in memory. There is a function that rereads that information from the db ... The result is that the analysis of various time intervals can be done very quickly in memory.
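
As an example of the kind of aggregation that could be pushed into SQL rather than done in python (table and column names are placeholders for the actual instance table):

```python
# Per-user usage computed directly in the database; names are placeholders.
PER_USER_USAGE_SQL = """
SELECT ownerId,
       COUNT(*)                                   AS instance_count,
       SUM(TIMESTAMPDIFF(SECOND, t_start, t_end)) AS used_seconds
FROM instance
WHERE t_start >= %s AND t_end <= %s
GROUP BY ownerId
"""

def per_user_usage(cursor, t_from, t_to):
    cursor.execute(PER_USER_USAGE_SQL, (t_from, t_to))
    return cursor.fetchall()
```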

laszewsk commented 12 years ago

I agree with Fugang.

lee212 commented 11 years ago

We put the manual pages on github. For this command, the page is here: http://futuregrid.github.com/cloud-metrics/man/fg-parser.html