Improve memory efficiency

Updates Manifest.in.
Removes old files required for custom Airflow.
Integrates Tim's work on improving memory efficiency.
Uses the new thin_port_messages command in place of port_events.
Removes the port_state tables.
Updates main.py to show the available commands.
Queries the position_messages table using partitioned data.

From Slack:

The launch commands changed. It’s still a two-step process though. First:

docker-compose run thin_port_messages \
--job_name porteventstest \
--input_table pipe_production_v20201001.position_messages_ \
--anchorage_table anchorages.named_anchorages_v20201104 \
--start_date 2018-01-01 \
--end_date 2018-01-07 \
--output_table machine_learning_dev_ttl_120d.port_visit_msgs_v20220927_ \
--project world-fishing-827 \
--max_num_workers 100 \
--staging_location gs://machine-learning-dev-ttl-30d/anchorages/portevents/output/staging \
--temp_location gs://machine-learning-dev-ttl-30d/anchorages/temp \
--setup_file ./setup.py \
--runner DataflowRunner \
--disk_size_gb 100 \
--region us-central1 \
--sdk_container_image gcr.io/world-fishing-827/pipe-anchorage/worker:tim_test \
--experiments=use_runner_v2

This step generates the internal tables. Then:

docker-compose run port_visits \
--job_name portmessagestest \
--thinned_message_table machine_learning_dev_ttl_120d.port_visit_msgs_v20220927_ \
--end_date 2018-01-07 \
--vessel_id_table pipe_production_v20201001.segment_info \
--anchorage_table anchorages.named_anchorages_v20201104 \
--output_table machine_learning_dev_ttl_120d.port_visits_v20220927_ \
--project world-fishing-827 \
--max_num_workers 100 \
--staging_location gs://machine-learning-dev-ttl-30d/anchorages/portevents/output/staging \
--temp_location gs://machine-learning-dev-ttl-30d/anchorages/temp \
--setup_file ./setup.py \
--runner DataflowRunner \
--disk_size_gb 100 \
--region us-central1 \
--sdk_container_image gcr.io/world-fishing-827/pipe-anchorage/worker:tim_test \
--experiments=use_runner_v2 \
--bad_segs "(SELECT DISTINCT seg_id FROM world-fishing-827.gfw_research.pipe_v20201001_segs WHERE overlapping_and_short)"

This step generates the actual visits. It regenerates the whole table from scratch every day, as before, because vessel IDs change.
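
The link between the two steps is that the table written by thin_port_messages (--output_table) is the table read by port_visits (--thinned_message_table). A minimal Python sketch of that wiring follows. The helper names build_command and build_two_step are hypothetical and not part of the repository. Only flags that appear in the commands above are used, and the staging, temp, container image and bad_segs flags are left out for brevity.

```python
import shlex

# Flags shared by both steps, copied from the commands above.
COMMON_FLAGS = {
    "anchorage_table": "anchorages.named_anchorages_v20201104",
    "project": "world-fishing-827",
    "max_num_workers": "100",
    "runner": "DataflowRunner",
    "region": "us-central1",
}


def build_command(service, **flags):
    """Return an argv list for `docker-compose run <service> --flag value ...`."""
    argv = ["docker-compose", "run", service]
    for name, value in flags.items():
        argv += [f"--{name}", str(value)]
    return argv


def build_two_step(start_date, end_date, thinned_table, visits_table):
    """Hypothetical helper: build both commands around one intermediate table."""
    thin = build_command(
        "thin_port_messages",
        job_name="porteventstest",
        input_table="pipe_production_v20201001.position_messages_",
        start_date=start_date,
        end_date=end_date,
        output_table=thinned_table,  # written by step one
        **COMMON_FLAGS,
    )
    visits = build_command(
        "port_visits",
        job_name="portmessagestest",
        thinned_message_table=thinned_table,  # read by step two
        end_date=end_date,
        vessel_id_table="pipe_production_v20201001.segment_info",
        output_table=visits_table,
        **COMMON_FLAGS,
    )
    return thin, visits


if __name__ == "__main__":
    commands = build_two_step(
        "2018-01-01",
        "2018-01-07",
        "machine_learning_dev_ttl_120d.port_visit_msgs_v20220927_",
        "machine_learning_dev_ttl_120d.port_visits_v20220927_",
    )
    for argv in commands:
        # shlex.join quotes each argument so the line can be pasted into a shell.
        print(shlex.join(argv))
```

Passing the intermediate table name once and reusing it in both commands avoids the two steps drifting apart when the table suffix is bumped.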

NOTE: tests are still failing.

Improve memory efficiency #88