GlobalFishingWatch / anchorages_pipeline

Python pipeline for anchorages
Apache License 2.0

Improve memory efficiency #88

Closed smpiano closed 2 years ago

smpiano commented 2 years ago

From slack:

The launch commands changed. It’s still a two-step process, though. First:

docker-compose run thin_port_messages \
--job_name porteventstest \
--input_table pipe_production_v20201001.position_messages_ \
--anchorage_table anchorages.named_anchorages_v20201104 \
--start_date 2018-01-01 \
--end_date 2018-01-07 \
--output_table machine_learning_dev_ttl_120d.port_visit_msgs_v20220927_ \
--project world-fishing-827 \
--max_num_workers 100 \
--staging_location gs://machine-learning-dev-ttl-30d/anchorages/portevents/output/staging \
--temp_location gs://machine-learning-dev-ttl-30d/anchorages/temp \
--setup_file ./setup.py \
--runner DataflowRunner \
--disk_size_gb 100 \
--region us-central1 \
--sdk_container_image gcr.io/world-fishing-827/pipe-anchorage/worker:tim_test \
--experiments=use_runner_v2

This step generates the internal tables (the thinned port messages that the next step reads). Then:
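The trailing underscore on `position_messages_` and `port_visit_msgs_v20220927_` suggests date-sharded tables, one per day. That is an assumption, as are the `YYYYMMDD` suffix and the inclusive end date; none of it is confirmed in this thread. Under those assumptions, a minimal sketch of the tables this step would touch (`daily_shards` is a hypothetical helper, not part of the pipeline):

```python
from datetime import date, timedelta


def daily_shards(table_prefix, start_date, end_date):
    """List one table name per day, from start_date to end_date inclusive.

    ASSUMPTION: shards are named <prefix>YYYYMMDD and the end date is inclusive.
    """
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    shards = []
    for offset in range((end - start).days + 1):
        day = start + timedelta(days=offset)
        shards.append(f"{table_prefix}{day:%Y%m%d}")
    return shards


shards = daily_shards(
    "machine_learning_dev_ttl_120d.port_visit_msgs_v20220927_",
    "2018-01-01",
    "2018-01-07",
)
for name in shards:
    print(name)
```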

docker-compose run port_visits \
--job_name portmessagestest \
--thinned_message_table machine_learning_dev_ttl_120d.port_visit_msgs_v20220927_ \
--end_date 2018-01-07 \
--vessel_id_table pipe_production_v20201001.segment_info \
--anchorage_table anchorages.named_anchorages_v20201104 \
--output_table machine_learning_dev_ttl_120d.port_visits_v20220927_ \
--project world-fishing-827 \
--max_num_workers 100 \
--staging_location gs://machine-learning-dev-ttl-30d/anchorages/portevents/output/staging \
--temp_location gs://machine-learning-dev-ttl-30d/anchorages/temp \
--setup_file ./setup.py \
--runner DataflowRunner \
--disk_size_gb 100 \
--region us-central1 \
--sdk_container_image gcr.io/world-fishing-827/pipe-anchorage/worker:tim_test \
--experiments=use_runner_v2 \
--bad_segs "(SELECT DISTINCT seg_id FROM world-fishing-827.gfw_research.pipe_v20201001_segs WHERE overlapping_and_short)"

This generates the actual visits. As before, it regenerates the whole table from scratch every day, because vessel IDs change.
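Both steps share most of their Dataflow flags, and the second step must read the table the first one writes. A rough sketch of composing the two commands from one set of shared options follows. `build_command` and `COMMON_OPTIONS` are hypothetical names for illustration, and the sketch assumes the pipeline's argument parser accepts `--flag value` and `--flag=value` interchangeably, as argparse does. It only prints the commands; it does not launch anything.

```python
import shlex

# Flags that appear in both commands, copied from this thread.
COMMON_OPTIONS = {
    "project": "world-fishing-827",
    "max_num_workers": "100",
    "staging_location": "gs://machine-learning-dev-ttl-30d/anchorages/portevents/output/staging",
    "temp_location": "gs://machine-learning-dev-ttl-30d/anchorages/temp",
    "setup_file": "./setup.py",
    "runner": "DataflowRunner",
    "disk_size_gb": "100",
    "region": "us-central1",
    "sdk_container_image": "gcr.io/world-fishing-827/pipe-anchorage/worker:tim_test",
    "experiments": "use_runner_v2",
}

# Written by step 1 and read by step 2, so define it once.
THINNED_TABLE = "machine_learning_dev_ttl_120d.port_visit_msgs_v20220927_"
ANCHORAGE_TABLE = "anchorages.named_anchorages_v20201104"


def build_command(service, step_options):
    """Return the argv list for `docker-compose run <service> --flag value ...`."""
    argv = ["docker-compose", "run", service]
    # Step-specific flags first, then the shared ones; each flag appears once.
    for name, value in {**step_options, **COMMON_OPTIONS}.items():
        argv += [f"--{name}", str(value)]
    return argv


thin_cmd = build_command("thin_port_messages", {
    "job_name": "porteventstest",
    "input_table": "pipe_production_v20201001.position_messages_",
    "anchorage_table": ANCHORAGE_TABLE,
    "start_date": "2018-01-01",
    "end_date": "2018-01-07",
    "output_table": THINNED_TABLE,
})

visits_cmd = build_command("port_visits", {
    "job_name": "portmessagestest",
    "thinned_message_table": THINNED_TABLE,
    "end_date": "2018-01-07",
    "vessel_id_table": "pipe_production_v20201001.segment_info",
    "anchorage_table": ANCHORAGE_TABLE,
    "output_table": "machine_learning_dev_ttl_120d.port_visits_v20220927_",
    "bad_segs": "(SELECT DISTINCT seg_id FROM world-fishing-827.gfw_research.pipe_v20201001_segs"
                " WHERE overlapping_and_short)",
})

# Print shell-quoted commands for review.
for cmd in (thin_cmd, visits_cmd):
    print(shlex.join(cmd))
```

To actually launch the steps in order, each list could be passed to `subprocess.run(cmd, check=True)`, so that the second step only starts if the first one succeeds.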

NOTE: tests are still failing.