GoogleCloudPlatform / data-science-on-gcp

Source code accompanying book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017
Apache License 2.0

Ch.4 - Time Correction - input/output error when running df02.py #162

Open jgammerman opened 1 year ago

jgammerman commented 1 year ago

Hi! I'm getting the following error when running df02.py (df01.py worked fine) - any advice please?

(beam_env) jgammerman@cloudshell:~/data-science-on-gcp/04_streaming/transform (peppy-booth-371612)$ python3 ./df02.py
Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 624, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "/home/jgammerman/beam_env/lib/python3.9/site-packages/apache_beam/transforms/core.py", line 1879, in <lambda>
  File "/home/jgammerman/data-science-on-gcp/04_streaming/transform/./df02.py", line 39, in <lambda>
  File "/home/jgammerman/data-science-on-gcp/04_streaming/transform/./df02.py", line 24, in addtimezone
  File "/home/jgammerman/beam_env/lib/python3.9/site-packages/timezonefinder/timezonefinder.py", line 260, in __init__
  File "/home/jgammerman/beam_env/lib/python3.9/site-packages/timezonefinder/timezonefinder.py", line 92, in __init__
OSError: [Errno 5] Input/output error: '/home/jgammerman/beam_env/lib/python3.9/site-packages/timezonefinder/poly_zone_ids.bin'

This is followed by more output (omitted for brevity), which ends as follows:

RuntimeError: OSError: [Errno 5] Input/output error: '/home/jgammerman/beam_env/lib/python3.9/site-packages/timezonefinder/poly_zone_ids.bin' [while running 'Map(<lambda at df02.py:39>)']
Exception ignored in: <function AbstractTimezoneFinder.__del__ at 0x7f1142a9a1f0>
Traceback (most recent call last):
  File "/home/jgammerman/beam_env/lib/python3.9/site-packages/timezonefinder/timezonefinder.py", line 97, in __del__
AttributeError: poly_zone_ids
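A note on reading this error: Errno 5 is EIO on Linux, which the operating system reports when a read fails at the file or disk level, so the failure may lie with the file or the environment rather than with the pipeline logic. The trailing AttributeError in __del__ is most likely a side effect of __init__ failing before that attribute was set. One way to check whether the data file itself can be read is to read it end to end outside of Beam. The sketch below is only a diagnostic under that assumption; `check_readable` is a hypothetical helper, not part of timezonefinder or the book's code.

```python
def check_readable(path, chunk_size=1 << 20):
    """Read a file from start to end in chunks.

    Returns (bytes_read, error), where error is None if the whole file
    could be read, or the OSError raised while opening or reading it.
    """
    total = 0
    try:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:  # empty bytes means end of file
                    break
                total += len(chunk)
    except OSError as exc:
        # Covers EIO (errno 5) as well as missing files, permissions, etc.
        return total, exc
    return total, None
```

Calling `check_readable` on the poly_zone_ids.bin path from the traceback and getting back an OSError with errno 5 would point at the file or the underlying storage. A clean read would suggest the error is intermittent or tied to resource pressure while the pipeline runs.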

jgammerman commented 1 year ago

Update: I've run df02.py twice more, and now I'm getting what looks like an out-of-memory error after nearly an hour of running:

Bus error (core dumped)

Same story with df03.py, but df04.py seemed to work okay.

Is anyone else having this problem? And is it supposed to take 45-60 mins to run each file?
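On the run time question: the first traceback reaches TimezoneFinder.__init__ from inside addtimezone, which is called by the Map lambda at df02.py line 39. That suggests, though the thread does not confirm it, that a new finder (and its data files) may be loaded for every element. A minimal sketch of building the finder once per process and reusing it follows. `get_finder` and `addtimezone_cached` are hypothetical names for illustration, not the code in df02.py, and the sketch assumes the timezonefinder package is installed.

```python
import functools


@functools.lru_cache(maxsize=1)
def get_finder():
    """Create the TimezoneFinder on first use and reuse it afterwards."""
    # Imported lazily so the package is only loaded where the function runs.
    import timezonefinder
    return timezonefinder.TimezoneFinder()


def addtimezone_cached(lat, lon):
    """Return (lat, lon, timezone name) using the shared finder."""
    lat = float(lat)
    lon = float(lon)
    return lat, lon, get_finder().timezone_at(lng=lon, lat=lat)
```

With this shape, the data files are opened once per Python process instead of once per record. Whether that accounts for the 45-60 minute runs reported here is untested.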

luisandrecunha commented 10 months ago

@jgammerman were you able to fix the long run times of the Beam pipelines? I'm seeing the same thing with df04.py. I even reduced the number of flights being transformed to 100, and still no luck!

jgammerman commented 10 months ago

No, unfortunately not, Luis. I just moved on.
