USF-IMARS / imars_dags

:leaves: USF IMaRS Airflow DAGs

test that ingest-ftp works for new digitalGlobe files #86

Closed 7yl4r closed 5 years ago

7yl4r commented 6 years ago

@sebastiandig :

I thought I had everything set to go for Sunday, but it looks like the ingest_ftp dag did not run.

I would suggest logging into the "test" airflow server and manually triggering a DagRun of the ingest_ftp dag whenever you are ready for it to run. Once it starts you should be able to see the status there & look at the log to see if it is working as expected.

It should trigger next Sunday, but I guess we will see.

sebastiandig commented 6 years ago

I did notice that the files were still in the ftp-ingest folder. I will do a test; however, I would need the credentials again to log into Airflow.

sebastiandig commented 6 years ago

I tried to run 'ingest-ftp' by clicking 'trigger dag'. So far I have not seen any file movement, but the dag run status shows one running. I figure this might take a while, so I'll be patient.

sebastiandig commented 6 years ago

image

This may or may not be a problem. The second-to-last line says it is skipping the file path, and I wanted clarification on what that means. If it is skipping the file, I noticed that the pattern doesn't exactly match the way I've been naming the files. The only differences are that I added seconds after %M, and that the website doesn't allow '-' between the year, month, and day, so they turn into underscores. Lastly, if I name them with wv3 instead of wv2, would that have an effect? I hope this isn't an issue, but if it is, I hope it is an easy fix.

7yl4r commented 6 years ago

Yes that is an issue that I will fix right now.

wv2 vs wv3 are different, but I prefer to keep it that way. I will need to add a line to handle wv3. Another small fix.

So I think this DAG should run, "succeed", and load 0 files. It should be faster but I guess it is hashing the files before it realizes they don't match the pattern; that's a bad design on my part.
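A rough sketch of the cheaper ordering, checking the filename before reading any bytes. Everything here is an assumption for illustration: the `INGEST_NAME` pattern is only modeled on names like wv2_2018_10_08T125418_monroe_058613071_10_0.zip, `files_to_load` and `sha256_of` are hypothetical helpers, and plain SHA-256 stands in for whatever hash imars-etl actually computes.

```python
import hashlib
import re
from pathlib import Path

# Hypothetical pattern (not taken from imars-etl): accepts wv2 or wv3,
# underscores in the date, and seconds in the time.
INGEST_NAME = re.compile(
    r"^wv[23]_\d{4}_\d{2}_\d{2}T\d{6}_[a-z0-9]+_\d+_\d+_\d+\.zip$"
)


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large zips never sit in memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_to_load(directory):
    """Yield (path, hash) pairs, hashing only files whose name matches."""
    for path in sorted(Path(directory).iterdir()):
        if not INGEST_NAME.match(path.name):
            continue  # skipped before any bytes are read
        yield path, sha256_of(path)
```

With this ordering a run over files that all fail the pattern should finish almost immediately, since the expensive read only happens for files that will actually be loaded.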

7yl4r commented 6 years ago

I think cfdcf6e has things finally fixed up.

sebastiandig commented 6 years ago

Should I delete the running dag and restart to update it or does it do that automatically?

7yl4r commented 6 years ago

The server software will update next time puppet runs (every 30min).

To play it safe I suggest we let that DagRun run its course and then start a new one after the server has updated.

sebastiandig commented 6 years ago

Roger that.

7yl4r commented 6 years ago

Actually I have found a bug in the filename parser and need a bit more time to fix.

tylar@tylardesk:~/imars-etl$ nosetests3 imars_etl/drivers/imars_objects/load_test.py
E
======================================================================
ERROR: Load zip_wv2_ftp_ingest
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/tylar/imars-etl/imars_etl/drivers/imars_objects/load_test.py", line 39, in test_load_zip_wv2_ftp_ingest
    load_file(**test_args),
  File "/home/tylar/imars-etl/imars_etl/drivers/imars_objects/load_file.py", line 16, in load_file
    ul_target = format_filepath(**kwargs)
  File "/home/tylar/imars-etl/imars_etl/filepath/format_filepath.py", line 73, in format_filepath
    raise k_err
  File "/home/tylar/imars-etl/imars_etl/filepath/format_filepath.py", line 66, in format_filepath
    (fullpath).format(**args_dict)
KeyError: 'area_short_name'
-------------------- >> begin captured logging << --------------------
imars_etl.filepath.format_filepath._format_filepath_template: INFO: placing None (#6)...
imars_etl.filepath.format_filepath.format_filepath: INFO: formatting imars-obj path 
>>'/srv/imars-objects/{area_short_name}/zip_wv2_ftp_ingest/wv2_2017_03_01T223344_rb_123456789_10_0.zip'
imars_etl.filepath.format_filepath.format_filepath: ERROR: cannot guess an argument required to make path.  pass this argument manually using --json 
--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------
Ran 1 test in 0.002s

FAILED (errors=1)

I'll let you know when it is fixed.
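For reference, the KeyError in the traceback is what Python's `str.format` raises when a template names a field that is missing from the arguments. A minimal sketch of checking for that up front, using the template from the captured log; `missing_fields` and `try_format` are hypothetical helpers and not part of imars-etl:

```python
from string import Formatter

# Template copied from the captured logging output.
TEMPLATE = (
    "/srv/imars-objects/{area_short_name}/zip_wv2_ftp_ingest/"
    "wv2_2017_03_01T223344_rb_123456789_10_0.zip"
)


def missing_fields(template, values):
    """Return the template field names that have no value supplied."""
    required = {name for _, name, _, _ in Formatter().parse(template) if name}
    return sorted(required - set(values))


def try_format(template, values):
    """Format the template, naming every missing field instead of only the first."""
    missing = missing_fields(template, values)
    if missing:
        raise ValueError(f"cannot format path, missing: {missing}")
    return template.format(**values)
```

Calling `missing_fields(TEMPLATE, {})` reports `area_short_name`, which is the argument the parser failed to pull out of the filename in the failing test.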

7yl4r commented 6 years ago

The latest imars-etl v05.5 should be good and on the servers in ~30 min.

sebastiandig commented 6 years ago

Excellent, I hope it works. I'll check the status tomorrow.

7yl4r commented 5 years ago

I have made a bunch of changes to imars-etl, so I am going to do some manual testing and document it here.


Load 1 file manually:

[airflow@imars-airflow16 ~]$ python3 -m imars_etl -vvv load         --product_id 6         --json '{"status_id":3}'        -f /srv/imars-objects/ftp-ingest/wv2_2018_10_08T125418_monroe_058613071_10_0.zip --nohash
[2018-11-06 16:18:54,175] {settings.py:174} INFO - setting.configure_orm(): Using pool settings. pool_size=1000, pool_recycle=3600
[2018-11-06 16:18:54,657] {Load.py:143} INFO - ------- loading file wv2_2018_10_08T125418_monroe_058613071_10_0.zip ----------------

[2018-11-06 16:18:54,668] {parse_filepath.py:152} INFO - params parsed from fname: 
    {'dt_Y': 2018, 'dt_M': 1, 'dt_d': 8, 'dt_H': 1254, 'dt_S': 8, 'dt_m': 10, 'order_id': 58613071, 'area_short_name': 'monroe'}
[2018-11-06 16:18:54,668] {timestrings.py:24} DEBUG - partial datetime found
[2018-11-06 16:18:54,668] {timestrings.py:30} DEBUG - time str is just right.
[2018-11-06 16:18:54,670] {BaseHookHandler.py:102} INFO - getting hook for conn_id 'fallback_chain.local_metadb.imars_metadata'
[2018-11-06 16:18:54,673] {BaseHookHandler.py:117} WARNING - Chained connection 'local_metadb' not found.
[2018-11-06 16:18:54,681] {base_hook.py:83} INFO - Using connection to: 192.168.1.41
9
[2018-11-06 16:18:54,835] {BaseHookHandler.py:102} INFO - getting hook for conn_id 'fallback_chain.local_metadb.imars_metadata'
[2018-11-06 16:18:54,845] {BaseHookHandler.py:117} WARNING - Chained connection 'local_metadb' not found.
[2018-11-06 16:18:54,852] {base_hook.py:83} INFO - Using connection to: 192.168.1.41
monroe
[2018-11-06 16:18:54,857] {timestrings.py:30} DEBUG - time str is just right.
[2018-11-06 16:18:54,857] {unify_metadata.py:85} INFO - added metadata:
{('order_id', 58613071), ('area_id', 9), ('status_id', 3), ('time', '2018-10-08T12:54:18.000000'), ('area_short_name', 'monroe'), ('date_time', datetime.datetime(2018, 10, 8, 12, 54, 18))}

[2018-11-06 16:18:54,858] {BaseHookHandler.py:102} INFO - getting hook for conn_id 'fallback_chain.local_tmp.imars_objects.imars_http'
[2018-11-06 16:18:54,861] {BaseHookHandler.py:117} WARNING - Chained connection 'local_tmp' not found.
[2018-11-06 16:18:54,870] {BaseHookHandler.py:117} WARNING - Chained connection 'imars_http' not found.
[2018-11-06 16:18:54,871] {FSHookWrapper.py:102} INFO - placing zip_wv2_ftp_ingest (#6)...
[2018-11-06 16:18:54,871] {FSHookWrapper.py:73} INFO - formatting FS path 
>>'/srv/imars-objects/{area_short_name}/zip_wv2_ftp_ingest/wv2_%Y-%m-%dT%H%M%S_{area_short_name}.zip'
[2018-11-06 16:43:57,372] {BaseHookHandler.py:102} INFO - getting hook for conn_id 'fallback_chain.local_metadb.imars_metadata'
[2018-11-06 16:43:57,388] {BaseHookHandler.py:117} WARNING - Chained connection 'local_metadb' not found.
[2018-11-06 16:43:57,403] {base_hook.py:83} INFO - Using connection to: 192.168.1.41
[2018-11-06 16:43:57,439] {dbapi_hook.py:262} INFO - Done loading. Loaded a total of 1 rows

# check for file in object store:
[airflow@imars-airflow16 ~]$ ls -lh /srv/imars-objects/monroe/zip_wv2_ftp_ingest/
total 19G
-rw-r--r--. 1 airflow airflow 19G Nov  6 16:43 wv2_2018-10-08T125418_monroe.zip

# check for file in metadata db
MariaDB [imars_product_metadata]> SELECT * FROM file WHERE filepath LIKE '%wv2_2018-10-08T125418_monroe.zip';
+--------+-------------------------------------------------------------------------------+----------------------------+-------------+------------+---------+-----------+------+-----------+
| id     | filepath                                                                      | date_time                  | is_day_pass | product_id | area_id | status_id | uuid | multihash |
+--------+-------------------------------------------------------------------------------+----------------------------+-------------+------------+---------+-----------+------+-----------+
| 250759 | /srv/imars-objects/monroe/zip_wv2_ftp_ingest/wv2_2018-10-08T125418_monroe.zip | 2018-10-08 12:54:18.000000 |        NULL |          6 |       9 |         3 | NULL | NULL      |
+--------+-------------------------------------------------------------------------------+----------------------------+-------------+------------+---------+-----------+------+-----------+

# check original file removed
[airflow@imars-airflow16 ~]$ ls -lh /srv/imars-objects/ftp-ingest/wv2_2018_10_08T125418_monroe_058613071_10_0.zip
-rw-r--r--. 1 1000 1000 19G Oct 31 00:59 /srv/imars-objects/ftp-ingest/wv2_2018_10_08T125418_monroe_058613071_10_0.zip

So the file loaded mostly okay, but: filepath shouldn't have the /srv/imars-objects/ prefix anymore. Also: Should the original be deleted? I thought so at first, but on second thought maybe it shouldn't.

...oh wait. The file extracts just fine so I guess that /srv/imars-objects prefix is fine for now and USF-IMARS/imars-etl#27 has not been resolved.
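To make the renaming in the log easier to follow, here is a small sketch of how the ftp-ingest filename could map onto the object-store path. It assumes the template is filled in two passes (strftime for the % directives, str.format for the {} fields), which is my reading of the log and not confirmed imars-etl internals; `INGEST_NAME`, `parse_ingest_name`, and `object_path` are hypothetical names.

```python
import re
from datetime import datetime

# Hypothetical pattern for names like
# wv2_2018_10_08T125418_monroe_058613071_10_0.zip
INGEST_NAME = re.compile(
    r"^wv[23]_(?P<stamp>\d{4}_\d{2}_\d{2}T\d{6})"
    r"_(?P<area_short_name>[^_]+)_(?P<order_id>\d+)_\d+_\d+\.zip$"
)

# Template copied from the FSHookWrapper log line.
TEMPLATE = (
    "/srv/imars-objects/{area_short_name}/zip_wv2_ftp_ingest/"
    "wv2_%Y-%m-%dT%H%M%S_{area_short_name}.zip"
)


def parse_ingest_name(filename):
    """Pull the timestamp, area, and order id out of an ftp-ingest filename."""
    match = INGEST_NAME.match(filename)
    if match is None:
        raise ValueError(f"filename does not match ingest pattern: {filename}")
    return {
        "date_time": datetime.strptime(match["stamp"], "%Y_%m_%dT%H%M%S"),
        "area_short_name": match["area_short_name"],
        "order_id": int(match["order_id"]),
    }


def object_path(template, date_time, area_short_name):
    """Fill the % directives with strftime, then the {} fields with str.format."""
    return date_time.strftime(template).format(area_short_name=area_short_name)
```

Feeding the parsed values for the file loaded above through `object_path` gives the same path that shows up in the `file` table.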

7yl4r commented 5 years ago

Testing out the flow with xargs:

-bash-4.2$ ls /srv/imars-objects/ftp-ingest/wv3_2018_11_14T1024*
/srv/imars-objects/ftp-ingest/wv3_2018_11_14T102400_monroe_058691888_10_0.zip
/srv/imars-objects/ftp-ingest/wv3_2018_11_14T102415_monroe_058691890_10_0.zip

-bash-4.2$ ls /srv/imars-objects/ftp-ingest/wv3_2018_11_14T1024*zip | xargs -n 1 -i sh -c 'python3 -m imars_etl -v load --product_id 47 {} && rm {}'
[2018-11-14 20:29:12,847] {settings.py:174} INFO - setting.configure_orm(): Using pool settings. pool_size=1000, pool_recycle=3600
imars_etl.imars_etl.Load.Load: INFO     ------- loading file wv3_2018_11_14T102400_monroe_058691888_10_0.zip ----------------

imars_etl.imars_etl.Load.hashcheck: INFO     computing hash of file...
imars_etl.imars_etl.BaseHookHandler: INFO     getting hook for conn_id 'fallback_chain.local_metadb.imars_metadata'
imars_etl.imars_etl.BaseHookHandler: WARNING  Chained connection 'local_metadb' not found.
imars_etl.imars_etl.filepath.parse_filepath: INFO     params parsed from fname: 
    {'dt_d': 14, 'dt_H': 1024, 'dt_S': 0, 'dt_M': 0, 'dt_Y': 2018, 'order_id': 58691888, 'area_short_name': 'monroe', 'dt_m': 11}
imars_etl.imars_etl.BaseHookHandler: INFO     getting hook for conn_id 'fallback_chain.local_metadb.imars_metadata'
imars_etl.imars_etl.BaseHookHandler: WARNING  Chained connection 'local_metadb' not found.
imars_etl.imars_etl.BaseHookHandler: INFO     getting hook for conn_id 'fallback_chain.local_metadb.imars_metadata'
imars_etl.imars_etl.BaseHookHandler: WARNING  Chained connection 'local_metadb' not found.
imars_etl.imars_etl.Load.unify_metadata: INFO     added metadata:
{('area_short_name', 'monroe'), ('area_id', 9), ('order_id', 58691888), ('time', '2018-11-14T10:24:00.000000'), ('date_time', datetime.datetime(2018, 11, 14, 10, 24))}

imars_etl.imars_etl.BaseHookHandler: INFO     getting hook for conn_id 'fallback_chain.local_tmp.imars_objects.imars_http'
imars_etl.imars_etl.BaseHookHandler: WARNING  Chained connection 'local_tmp' not found.
imars_etl.imars_etl.BaseHookHandler: WARNING  Chained connection 'imars_http' not found.
imars_etl.imars_etl.object_storage.hook_wrappers.FSHookWrapper: INFO     placing zip_wv3_ftp_ingest (#47)...
imars_etl.imars_etl.object_storage.hook_wrappers.FSHookWrapper: INFO     formatting FS path 
>>'/srv/imars-objects/{area_short_name}/zip_wv3_ftp_ingest/wv3_%Y-%m-%dT%H%M%S_{area_short_name}.zip'
imars_etl.imars_etl.BaseHookHandler: INFO     getting hook for conn_id 'fallback_chain.local_metadb.imars_metadata'
imars_etl.imars_etl.BaseHookHandler: WARNING  Chained connection 'local_metadb' not found.
[2018-11-14 20:29:47,609] {settings.py:174} INFO - setting.configure_orm(): Using pool settings. pool_size=1000, pool_recycle=3600
imars_etl.imars_etl.Load.Load: INFO     ------- loading file wv3_2018_11_14T102415_monroe_058691890_10_0.zip ----------------

imars_etl.imars_etl.Load.hashcheck: INFO     computing hash of file...
imars_etl.imars_etl.BaseHookHandler: INFO     getting hook for conn_id 'fallback_chain.local_metadb.imars_metadata'
imars_etl.imars_etl.BaseHookHandler: WARNING  Chained connection 'local_metadb' not found.
imars_etl.imars_etl.filepath.parse_filepath: INFO     params parsed from fname: 
    {'dt_H': 1024, 'dt_M': 1, 'area_short_name': 'monroe', 'order_id': 58691890, 'dt_m': 11, 'dt_S': 5, 'dt_d': 14, 'dt_Y': 2018}
imars_etl.imars_etl.BaseHookHandler: INFO     getting hook for conn_id 'fallback_chain.local_metadb.imars_metadata'
imars_etl.imars_etl.BaseHookHandler: WARNING  Chained connection 'local_metadb' not found.
imars_etl.imars_etl.BaseHookHandler: INFO     getting hook for conn_id 'fallback_chain.local_metadb.imars_metadata'
imars_etl.imars_etl.BaseHookHandler: WARNING  Chained connection 'local_metadb' not found.
imars_etl.imars_etl.Load.unify_metadata: INFO     added metadata:
{('area_short_name', 'monroe'), ('area_id', 9), ('order_id', 58691890), ('time', '2018-11-14T10:24:15.000000'), ('date_time', datetime.datetime(2018, 11, 14, 10, 24, 15))}

imars_etl.imars_etl.BaseHookHandler: INFO     getting hook for conn_id 'fallback_chain.local_tmp.imars_objects.imars_http'
imars_etl.imars_etl.BaseHookHandler: WARNING  Chained connection 'local_tmp' not found.
imars_etl.imars_etl.BaseHookHandler: WARNING  Chained connection 'imars_http' not found.
imars_etl.imars_etl.object_storage.hook_wrappers.FSHookWrapper: INFO     placing zip_wv3_ftp_ingest (#47)...
imars_etl.imars_etl.object_storage.hook_wrappers.FSHookWrapper: INFO     formatting FS path 
>>'/srv/imars-objects/{area_short_name}/zip_wv3_ftp_ingest/wv3_%Y-%m-%dT%H%M%S_{area_short_name}.zip'
imars_etl.imars_etl.BaseHookHandler: INFO     getting hook for conn_id 'fallback_chain.local_metadb.imars_metadata'
imars_etl.imars_etl.BaseHookHandler: WARNING  Chained connection 'local_metadb' not found.

-bash-4.2$ ls /srv/imars-objects/ftp-ingest/wv3_2018_11_14T1024*zip
ls: cannot access /srv/imars-objects/ftp-ingest/wv3_2018_11_14T1024*zip: No such file or directory

-bash-4.2$ python3 -m imars_etl select 'product_id=47 AND date_time="2018-11-14T10:24:15"'
[2018-11-14 20:33:02,947] {settings.py:174} INFO - setting.configure_orm(): Using pool settings. pool_size=1000, pool_recycle=3600
(250761, '/srv/imars-objects/monroe/zip_wv3_ftp_ingest/wv3_2018-11-14T102415_monroe.zip', datetime.datetime(2018, 11, 14, 10, 24, 15), None, 47, 9, None, None, 'QmbPEDEptxf5k4BpVRPJWXjyoViucFJPeUw2coCtamb7Mn')

-bash-4.2$ ls /srv/imars-objects/monroe/zip_wv3_ftp_ingest/
wv3_2018-11-14T102400_monroe.zip  wv3_2018-11-14T102415_monroe.zip

All looks good. The file object and metadata were loaded, and the original was removed. status_id isn't right in the database, but that is an easy fix later.
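If this flow moves out of the shell, a Python counterpart of the xargs one-liner might look like the sketch below. The CLI arguments are copied from the command above; `load_and_remove` is a hypothetical helper, and the `run` parameter exists only so the loader can be swapped out for a dry run.

```python
import glob
import os
import subprocess
import sys


def load_and_remove(pattern, product_id, run=subprocess.run):
    """Load each matching file and delete the original only if the load succeeded."""
    loaded = []
    for path in sorted(glob.glob(pattern)):
        cmd = [
            sys.executable, "-m", "imars_etl", "-v", "load",
            "--product_id", str(product_id), path,
        ]
        result = run(cmd)
        if result.returncode == 0:
            # Same guard as `load ... && rm ...` in the shell version.
            os.remove(path)
            loaded.append(path)
    return loaded
```

Usage would be something like `load_and_remove("/srv/imars-objects/ftp-ingest/wv3_2018_11_14T1024*zip", 47)`. Globbing in Python also avoids piping `ls` output through xargs, which can misbehave on filenames containing whitespace.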

7yl4r commented 5 years ago

basic ingest is working.

files in db:

MariaDB [imars_product_metadata]> SELECT filepath FROM file WHERE product_id=47 ORDER BY date_time DESC LIMIT 3;
+-------------------------------------------------------------------------------+
| filepath                                                                      |
+-------------------------------------------------------------------------------+
| /srv/imars-objects/monroe/zip_wv3_ftp_ingest/wv3_2018-11-14T102758_monroe.zip |
| /srv/imars-objects/monroe/zip_wv3_ftp_ingest/wv3_2018-11-14T102740_monroe.zip |
| /srv/imars-objects/monroe/zip_wv3_ftp_ingest/wv3_2018-11-14T102723_monroe.zip |
+-------------------------------------------------------------------------------+

example extract with imars-etl:

python -m imars_etl extract 'product_id=47 AND date_time="2018-11-14T102758"'