Chicago / open-data-etl-utility-kit

Use Pentaho's open source data integration tool (Kettle) to create Extract-Transform-Load (ETL) processes to update a Socrata open data portal. Documentation is available at http://open-data-etl-utility-kit.readthedocs.io/en/stable

Issue 3 #30

Closed jefw closed 8 years ago

jefw commented 8 years ago

Per Issue 3, created A_DatasetLogs.bat, A_ETL_Runtimes.bat, A_RunETL.bat, A_TodayLogs.bat to mimic the functionality of the corresponding shell scripts. Care has been taken to avoid dependencies, other than a modern Windows command interpreter. However, without a set of example log files, testing has been very limited; please construct your test plan accordingly.

jefw commented 8 years ago

Signed the agreement via clahub.

levyj commented 8 years ago

Thank you! We will take a look at these.

As they say, no good deed goes unpunished. Any chance you want to take a swing at updating the corresponding page in the documentation? If not, that is totally fine.

jefw commented 8 years ago

Yes, no problem. Just glancing at the docs made me realize A_RunETL.bat needs changes anyway. I only took it as far as capturing and outputting the command, not actually running it. There's no real equivalent to eval in CMD.EXE, so I may have to resort to making a temporary file.

jefw commented 8 years ago

Okay - I've updated the docs and the A_RunETL.bat to run the commands. Out of curiosity, what do you intend for the RunETL script to do in case multiple jobs are matched? I imagine it would be a fairly common failure mode for the user to accidentally press enter prematurely, maybe giving something like:

$ sh RunETL.sh t [oops I hit enter while aiming for the "3" on the keypad]

In this case the script would match any jobs with a "t" - which could be a lot of stray jobs.

The shell script would compound all this onto one line, due to the use of eval, and the command would almost certainly fail to run.

The batch file, however, would run each job one by one, because the tasks are parsed out in a loop and the call happens inside that loop.

Let me know what the desired behavior is, or raise a new issue, and I can tackle this if you want.

levyj commented 8 years ago

Reasonable point. The desired behavior would be to fail without running any jobs.
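A minimal sketch of that fail-fast behavior in a shell script, counting matches before anything runs. The function name `select_single_job` and its calling convention (pattern first, candidate job names after) are assumptions for illustration and are not taken from RunETL.sh.

```shell
#!/bin/sh
# Hypothetical helper (not from RunETL.sh): print the one job whose name
# contains the pattern in "$1", or fail if zero or several jobs match.
# The remaining arguments are the candidate job names.
select_single_job() {
    pattern=$1
    shift
    matches=0
    found=
    for job in "$@"; do
        case $job in
            # The quoted expansion is matched literally, not as a glob.
            *"$pattern"*)
                matches=$((matches + 1))
                found=$job
                ;;
        esac
    done
    if [ "$matches" -ne 1 ]; then
        echo "Expected exactly 1 matching job, found $matches; nothing was run." >&2
        return 1
    fi
    printf '%s\n' "$found"
}
```

A caller would run the job only when the function succeeds, so a stray argument such as "t" that matches several jobs stops before any ETL starts.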

I have not really thought through ease of implementation in either a shell script or batch file but I suppose a good validity check would be to make sure the parameter is nine characters (good) or four alphanumeric dash four-alphanumeric (better). Nine with the middle one being a dash would be almost as good.

All that said, I have done much worse things by mistake than running some ETLs needlessly, so we can also live with the risk if necessary. ☺

Thanks.

From: Jef Waltman [mailto:notifications@github.com] Sent: Friday, October 16, 2015 10:03 AM To: Chicago/open-data-etl-utility-kit open-data-etl-utility-kit@noreply.github.com Cc: Levy, Jonathan Jonathan.Levy@cityofchicago.org Subject: Re: [open-data-etl-utility-kit] Issue 3 (#30)


This e-mail, and any attachments thereto, is intended only for use by the addressee(s) named herein and may contain legally privileged and/or confidential information. If you are not the intended recipient of this e-mail (or the person responsible for delivering this document to the intended recipient), you are hereby notified that any dissemination, distribution, printing or copying of this e-mail, and any attachment thereto, is strictly prohibited. If you have received this e-mail in error, please respond to the individual sending the message, and permanently delete the original and any copy of any e-mail and printout thereof.

jefw commented 8 years ago

Added validation requiring the Dataset4x4 argument to be (exactly) 4 alphanum dash 4 alphanum per your suggestion. Hooray regex. I think this is a reasonable safeguard. It's not so much running an ETL that I would worry about, it's other stuff that may be lurking in the crontab or task scheduler.
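For the shell side, a check of the same shape could look like the sketch below. It is not the code from this pull request: the function name `is_valid_4x4` is hypothetical, and it assumes `grep -E` is available.

```shell
#!/bin/sh
# Hypothetical check (not the PR's code): succeed only when "$1" is exactly
# four alphanumerics, a dash, and four alphanumerics, e.g. f7f2-ggz5.
# grep tests each line separately, so this assumes a single-line argument.
is_valid_4x4() {
    printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9]{4}-[A-Za-z0-9]{4}$'
}
```

A script could then guard its work with something like `is_valid_4x4 "$1" || exit 1`, which rejects the stray "t" from the earlier example before any job is matched.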

tomschenkjr commented 8 years ago

A_DatasetLogs.bat - looks great.

A_ETLRuntimes.bat - something is not quite right. Here is a zipped folder of sample logs (sorry, these would have been helpful to you earlier); the commands didn't quite work against them. The parsing seems to stop after the header.

A_TodayLogs.bat - terrific.

For A_ETLRuntimes.bat

Shell output (correct):

$ A_ETLRuntimes.sh f7f2-ggz5
INFO  01-08 06:00:40,695 - Kitchen - Processing ended after 19 seconds.
INFO  02-08 06:00:39,526 - Kitchen - Processing ended after 18 seconds.

Batch file output:

> A_ETLRuntimes.bat k7hf-8y75
WARN  01-08 00:45:20,949 - Unable to load Hadoop Configuration from "file:///pat
h/to/directory/data-integration/plugins/pentaho-big-data-plugin/hadoop-configura
tions/mapr". For more information enable debug logging.
INFO  01-08 00:45:21,130 - Kitchen - Logging is at level : Detailed logging
INFO  01-08 00:45:21,131 - Kitchen - Start of run.
INFO  01-08 00:45:21,240 - Standard_ETL - Start of job execution
INFO  01-08 00:45:21,242 - Standard_ETL - exec(0, 0, START.0)
INFO  01-08 00:45:21,246 - START - StartinNot enough storage is available to pro
cess this command.

In another example:

$ A_ETLRuntimes.sh k7hf-8y75
INFO  01-08 00:45:31,837 - Kitchen - Processing ended after 10 seconds.
INFO  01-08 01:45:31,532 - Kitchen - Processing ended after 10 seconds.
INFO  01-08 02:45:35,323 - Kitchen - Processing ended after 13 seconds.
INFO  01-08 03:45:31,021 - Kitchen - Processing ended after 10 seconds.
INFO  01-08 04:45:33,491 - Kitchen - Processing ended after 9 seconds.
INFO  01-08 05:45:35,705 - Kitchen - Processing ended after 10 seconds.
INFO  01-08 06:45:30,327 - Kitchen - Processing ended after 9 seconds.
INFO  01-08 07:45:30,401 - Kitchen - Processing ended after 9 seconds.
INFO  01-08 08:45:30,755 - Kitchen - Processing ended after 9 seconds.
INFO  01-08 09:45:30,992 - Kitchen - Processing ended after 11 seconds.
INFO  01-08 10:45:30,064 - Kitchen - Processing ended after 10 seconds.
INFO  01-08 11:45:32,956 - Kitchen - Processing ended after 13 seconds.
INFO  01-08 12:45:29,006 - Kitchen - Processing ended after 10 seconds.
INFO  01-08 13:45:31,072 - Kitchen - Processing ended after 12 seconds.
INFO  01-08 14:45:34,327 - Kitchen - Processing ended after 13 seconds.
INFO  01-08 15:45:32,042 - Kitchen - Processing ended after 13 seconds.
INFO  01-08 16:45:36,774 - Kitchen - Processing ended after 15 seconds.
INFO  01-08 17:45:30,050 - Kitchen - Processing ended after 11 seconds.
INFO  01-08 18:45:30,543 - Kitchen - Processing ended after 10 seconds.
INFO  01-08 19:45:30,866 - Kitchen - Processing ended after 11 seconds.
INFO  01-08 20:45:30,196 - Kitchen - Processing ended after 9 seconds.
INFO  01-08 21:45:27,508 - Kitchen - Processing ended after 9 seconds.
INFO  01-08 22:45:27,525 - Kitchen - Processing ended after 9 seconds.
INFO  01-08 23:45:30,555 - Kitchen - Processing ended after 10 seconds.
INFO  02-08 00:45:31,668 - Kitchen - Processing ended after 9 seconds.
INFO  02-08 01:47:31,540 - Kitchen - Processing ended after 2 minutes and 14 sec
onds (134 seconds total).
INFO  02-08 02:45:29,161 - Kitchen - Processing ended after 10 seconds.
INFO  02-08 03:45:28,886 - Kitchen - Processing ended after 10 seconds.
INFO  02-08 04:45:32,704 - Kitchen - Processing ended after 9 seconds.
INFO  02-08 05:45:35,135 - Kitchen - Processing ended after 10 seconds.
INFO  02-08 06:45:30,211 - Kitchen - Processing ended after 10 seconds.
INFO  02-08 07:45:28,853 - Kitchen - Processing ended after 10 seconds.
INFO  02-08 08:45:28,631 - Kitchen - Processing ended after 9 seconds.
INFO  02-08 09:45:32,719 - Kitchen - Processing ended after 14 seconds.
INFO  02-08 10:45:26,688 - Kitchen - Processing ended after 10 seconds.
INFO  02-08 11:45:28,690 - Kitchen - Processing ended after 10 seconds.
INFO  02-08 12:45:28,745 - Kitchen - Processing ended after 10 seconds.
INFO  02-08 13:45:28,885 - Kitchen - Processing ended after 10 seconds.
INFO  02-08 14:45:27,470 - Kitchen - Processing ended after 10 seconds.

The bash script has the same Hadoop error as above. The grep/find doesn't seem to be working, but it is unclear to me why.

jefw commented 8 years ago

Okay - I'll take a look and debug using your sample logs. I'm pretty booked for the next couple of weeks, but will work on it as time allows.

jefw commented 8 years ago

@tomschenkjr - can you double check that those sample files are still available on FileTea? I get a blank page when following the URL you posted.

tomschenkjr commented 8 years ago

@jefw -- Ok, adjusted the link. That should work.

jefw commented 8 years ago

@tomschenkjr - Retrieved in good order - thanks.

jefw commented 8 years ago

Okay - fixed in my branch. I was using copy to get file content into the find command, but this was stopping after the first file. Switched to type instead. I have a dim recollection that type is limited to files < 2GB but couldn't quickly verify this. The .bat now outputs the same as the .sh when tested with k7hf-8y75. Please re-test and advise.
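For comparison, the shell-side pattern of streaming every log file into one filter can be sketched as follows. This is an illustration only: the function name `etl_runtimes` and the search string are assumptions based on the sample output earlier in this thread, not the contents of the actual A_ETLRuntimes.sh.

```shell
#!/bin/sh
# Hypothetical helper (not the real A_ETLRuntimes.sh): print the
# "Processing ended" lines from every log file passed as an argument.
# cat streams all the files in order into a single filter, which is the
# role type plays in the batch file. With no arguments, cat reads stdin.
etl_runtimes() {
    cat -- "$@" | grep 'Kitchen - Processing ended'
}
```

Because the filter sees one continuous stream, lines that only appear at the top of each log, such as the Hadoop warning, are dropped along with everything else that does not match.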

I am not sure what you mean about the Hadoop errors. These appear in the log files themselves, and should be excluded by grep/find.

tomschenkjr commented 8 years ago

Thanks, I'll test to confirm.

For the Hadoop ... when I was running the command, it was displaying a Hadoop error that was contained in the logs -- which just happened to be the first line.

Tom Schenk Jr.

Chief Data Officer

Department of Innovation and Technology

City of Chicago

(312) 744-2770

tom.schenk@cityofchicago.org

data.cityofchicago.org


From: Jef Waltman notifications@github.com Sent: Tuesday, December 8, 2015 1:37 PM To: Chicago/open-data-etl-utility-kit Cc: Schenk, Tom Subject: Re: [open-data-etl-utility-kit] Issue 3 (#30)


tomschenkjr commented 8 years ago

Success :+1:

Thank you.

jefw commented 8 years ago

Awesome!