GlobalFishingWatch / anchorages_pipeline

Python pipeline for anchorages
Apache License 2.0

Adds Docker and requirements for scheduler and worker #80

Closed smpiano closed 2 years ago

smpiano commented 2 years ago

DRAFT:

  1. port-events was tested and the job ran well, but the resulting data looks odd compared to the current pipeline (fewer rows for the same date). Job: https://console.cloud.google.com/dataflow/jobs/us-central1/2022-03-10_13_23_42-8687066028995394056?project=world-fishing-827
select count(*) from `scratch_matias_ttl_60_days.proto_raw_port_events_20220308`
--943669
select count(*) from `pipe_production_v20201001.proto_raw_port_events_20220308`
--983562
  2. If port-visits uses the pipeline's events_table instead of the one generated in my scratch dataset, it fails due to JSON errors:
    gs://scratch-matias/dataflow/dataflow_temp/bq_load/a21df9a6c1c646a392b44574908800b3/world-fishing-827.scratch_matias_ttl_60_days.proto_port_visits/d17e1b98-dfda-4b06-b6b0-c970d58b6362: Error while reading data, error message: JSON parsing error in row starting at position 672: Missing required fields: confidence, duration_hrs, ssvid.

    job: https://console.cloud.google.com/dataflow/jobs/us-central1/2022-03-10_19_39_26-5403262428179266717?project=world-fishing-827
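
For anyone reproducing the two checks above, here is a minimal sketch in Python. It quantifies the row-count gap between the scratch and production tables and flags rows that lack the required fields named in the load error. The helper names `relative_gap` and `missing_required` are hypothetical and not part of this repo, and the sample row is made up for illustration; it is not taken from the failing load file.

```python
import json

# Required field names copied from the BigQuery load error quoted above.
REQUIRED_FIELDS = ("confidence", "duration_hrs", "ssvid")


def relative_gap(scratch_count, production_count):
    """Return (absolute gap, gap as a fraction of the production count)."""
    gap = production_count - scratch_count
    return gap, gap / production_count


def missing_required(json_line, required=REQUIRED_FIELDS):
    """List the required fields that are absent or null in one NDJSON row."""
    row = json.loads(json_line)
    return [name for name in required if row.get(name) is None]


# Counts taken from the two queries above.
gap, fraction = relative_gap(943669, 983562)
print(f"scratch is short by {gap} rows ({fraction:.2%} of production)")

# Hypothetical row, for illustration only.
sample = '{"ssvid": "123456789", "confidence": null}'
print(missing_required(sample))
```

With the counts above, the scratch table is 39893 rows short, about 4.06% of the production count. Running `missing_required` over the temp file from the error message would be one way to see whether every row lacks these fields or only some of them.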

Related to: https://globalfishingwatch.atlassian.net/browse/PIPELINE-809

smpiano commented 2 years ago

After applying @andres-arana's changes:

Create Port Visits: https://console.cloud.google.com/dataflow/jobs/us-central1/2022-03-16_08_24_22-993121888725945578?project=world-fishing-827 Output:

Create Port Events: https://console.cloud.google.com/dataflow/jobs/us-central1/2022-03-16_08_39_52-15649841229488053454?project=world-fishing-827 Output: