GlobalFishingWatch / anchorages_pipeline

Python pipeline for anchorages
Apache License 2.0

Cache port lookups #103

Open bitsofbits opened 5 months ago

bitsofbits commented 5 months ago

This builds on the changes in #102 by also caching port location information in the thinned messages so that this does not need to be recalculated each time. The hope is that this will improve performance of port_visits.
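The core idea is memoization: a port lookup is deterministic for a given location, so it can be computed once and stored (here, alongside the thinned messages) instead of being recomputed on every run. A minimal Python sketch of that idea, with entirely made-up names and a toy anchorage table (not the pipeline's actual code):

```python
from functools import lru_cache

# Toy anchorage table: location key -> port label (hypothetical data).
PORTS = {
    "cell_a": "PORT_OF_OAKLAND",
    "cell_b": "PORT_OF_SEATTLE",
}

lookup_calls = 0  # counts how often the expensive lookup actually runs

@lru_cache(maxsize=None)
def lookup_port(cell_id):
    """Return the port label for a location, computing it at most once."""
    global lookup_calls
    lookup_calls += 1
    return PORTS.get(cell_id)

# Repeated messages from the same location hit the cache after the first call.
for _ in range(1000):
    lookup_port("cell_a")

print(lookup_calls)  # the underlying lookup ran only once
```

Persisting the looked-up port in the thinned messages achieves the same effect across pipeline runs, not just within one process.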

bitsofbits commented 5 months ago

I tested that both stages run and that their results are consistent with the output of #102, using:

 docker-compose run thin_port_messages \
        --job_name thinportmessagestest \
        --input_table 'pipe_ais_v3_alpha_internal.messages_segmented_2021*' \
        --anchorage_table anchorages.named_anchorages_v20240117 \
        --start_date 2021-01-01 \
        --end_date 2021-09-30 \
        --output_table machine_learning_dev_ttl_120d.port_visit_msgs_performance_fix1_ \
        --project world-fishing-827 \
        --max_num_workers 20 \
        --staging_location gs://machine-learning-dev-ttl-120d-central1/anchorages/portevents/output/staging \
        --temp_location gs://machine-learning-dev-ttl-120d-central1/anchorages/temp \
        --setup_file ./setup.py \
        --runner DataflowRunner \
        --disk_size_gb 100 \
        --region us-central1 \
        --sdk_container_image gcr.io/world-fishing-827/pipe-anchorage/worker:tim_test \
        --experiments=use_runner_v2 \
        --labels=environment=test \
        --labels=resource_creator=tim \
        --labels=project=core_pipeline \
        --labels=version=tag-possible-gaps \
        --labels=step=port-visits \
        --labels=stage=productive \
        --ssvid_filter='(select ssvid from `machine_learning_dev_ttl_120d.ssvid_sample_for_port_visits`)'
docker-compose run port_visits \
        --job_name portvisittest \
        --thinned_message_table machine_learning_dev_ttl_120d.port_visit_msgs_performance_fix1_ \
        --end_date 2021-09-30 \
        --vessel_id_table pipe_ais_v3_alpha_published.segment_info \
        --anchorage_table anchorages.named_anchorages_v20240117 \
        --output_table machine_learning_dev_ttl_120d.port_visits_performance_fix1_ \
        --project world-fishing-827 \
        --max_num_workers 100 \
        --staging_location gs://machine-learning-dev-ttl-120d-central1/anchorages/portevents/output/staging \
        --temp_location gs://machine-learning-dev-ttl-120d-central1/anchorages/temp \
        --setup_file ./setup.py \
        --runner DataflowRunner \
        --disk_size_gb 100 \
        --region us-central1 \
        --sdk_container_image gcr.io/world-fishing-827/pipe-anchorage/worker:tim_test \
        --experiments=use_runner_v2 \
        --labels=environment=test \
        --labels=resource_creator=tim \
        --labels=project=core_pipeline \
        --labels=version=tag-possible-gaps \
        --labels=step=port-visits \
        --labels=stage=productive \
        --bad_segs "(SELECT DISTINCT seg_id FROM pipe_ais_v3_alpha_published.segs_activity WHERE overlapping_and_short)"

Then I checked that the output tables matched the earlier tables using:

SELECT * FROM
((  SELECT * EXCEPT (events) FROM `machine_learning_dev_ttl_120d.port_visits_performance_fix1_*`
    WHERE DATE(end_timestamp) > '2012-1-1'
    EXCEPT DISTINCT
    SELECT * EXCEPT (events) FROM `machine_learning_dev_ttl_120d.port_visits_gapfix3c_*`
    WHERE DATE(end_timestamp) > '2012-1-1'
    )
UNION ALL
(   SELECT * EXCEPT (events) FROM `machine_learning_dev_ttl_120d.port_visits_gapfix3c_*`
    WHERE DATE(end_timestamp) > '2012-1-1'
    EXCEPT DISTINCT
    SELECT * EXCEPT (events) FROM `machine_learning_dev_ttl_120d.port_visits_performance_fix1_*`
    WHERE DATE(end_timestamp) > '2012-1-1'
    ))
ORDER BY start_timestamp
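The query above computes a symmetric difference: rows that appear in one table but not the other, in either direction, so an empty result means the two runs produced identical visits. The same check can be sketched in Python with sets (toy rows, made up for illustration):

```python
# Toy illustration of the symmetric-difference check done by the SQL above.
# Each tuple stands in for a (non-events) port-visit row.
new_run = {("ssvid1", "2021-01-05"), ("ssvid2", "2021-02-10")}
old_run = {("ssvid1", "2021-01-05"), ("ssvid2", "2021-02-10")}

# Mirrors the two EXCEPT DISTINCT arms joined by UNION ALL.
diff = (new_run - old_run) | (old_run - new_run)
print(sorted(diff))  # empty when the runs agree
```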

bitsofbits commented 5 months ago

The above is a very small test, so comparing total runtimes isn't informative, but comparing the runtime of CreateInOutEvents, which should be the expensive part, gives 6 s for the new code vs. 289 s for the old code, a roughly 48x speedup. We're unlikely to see that much speedup in practice, but it seems promising.
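As a quick sanity check of the quoted ratio, using the two stage timings reported above:

```python
# Speedup of CreateInOutEvents: old runtime / new runtime, from the
# numbers quoted in the comment above.
old_s, new_s = 289, 6
speedup = old_s / new_s
print(round(speedup, 1))  # about 48x
```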