censoredplanet / censoredplanet-analysis

Analysis of the CensoredPlanet data.

Hard to read the pipeline #120

Open fortuna opened 2 years ago

fortuna commented 2 years ago

I noticed that we are ingesting files of different types with the same functions in the pipeline. Furthermore, the types are not only different but sometimes only slightly different, which causes even more confusion. Add to that the fact that all the different data types are simply called "row", and it becomes a huge effort to make sense of how the data is flowing.

We have to clean that up so we can make sense of what's going on, especially with Satellite, which has many types.

Split flows

First and foremost, this call has to change: https://github.com/censoredplanet/censoredplanet-analysis/blob/777f21b4b1615c7cf2b291d7387775597b3ae5e6/pipeline/beam_tables.py#L610

We should not create a single PCollection with different types. Instead, each file should be its own PCollection.

Then process_satellite_lines should be removed in favor of separate flows that process and join the different datasets. The partition logic can then go away.

As a rule of thumb, consider all logic selection based on filenames like this harmful: https://github.com/censoredplanet/censoredplanet-analysis/blob/777f21b4b1615c7cf2b291d7387775597b3ae5e6/pipeline/metadata/flatten_satellite.py#L211

A similar practice that is also harmful is detecting the source file based on the presence of certain fields: https://github.com/censoredplanet/censoredplanet-analysis/blob/777f21b4b1615c7cf2b291d7387775597b3ae5e6/pipeline/metadata/satellite.py#L124

Both practices can be replaced by creating a separate PCollection for each input type.

This cleanup can be incremental. For instance, it seems that extracting the blockpage logic is quite easy. Just call _process_satellite_blockpages on a PCollection with the blockpage files only.
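To make that concrete, here is a rough sketch (not tied to the current code): it assumes _process_satellite_blockpages can take a PCollection of (filename, line) pairs, and the import path and file patterns are placeholders.

```python
import apache_beam as beam

# Import path assumed for illustration; the function is the one referenced above.
from pipeline.metadata.satellite import _process_satellite_blockpages

with beam.Pipeline() as p:
  # Each input type gets its own PCollection; no filename-based partitioning.
  blockpage_lines = (
      p | 'read blockpages' >> beam.io.ReadFromTextWithFilename(
          'gs://<scans-bucket>/CP_Satellite-*/blockpages.json'))
  results_lines = (
      p | 'read results' >> beam.io.ReadFromTextWithFilename(
          'gs://<scans-bucket>/CP_Satellite-*/results.json'))

  # Blockpage rows are produced from the blockpage files only.
  blockpage_rows = _process_satellite_blockpages(blockpage_lines)
```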

Then you can extract each tag input file into its own flow.

I have the impression that this cleanup will speed up the pipeline, as you can use more lightweight workers for many parts of the flow instead of having to load all the data for all the flows. It will also shuffle less data for joins (each flow can be sorted separately).
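On the join side, once each dataset is its own flow, a join becomes an explicit keyed CoGroupByKey over just the two flows involved, instead of a shuffle of one big mixed collection. A toy sketch (the 'ip' key and row shapes are illustrative, not the real schema):

```python
import apache_beam as beam

def _key_by_ip(row):
  # Assumes each row dict carries the resolver IP to join on.
  return (row['ip'], row)

with beam.Pipeline() as p:
  answers = p | 'answers' >> beam.Create(
      [{'ip': '1.1.1.1', 'domain': 'example.com'}])
  tags = p | 'tags' >> beam.Create(
      [{'ip': '1.1.1.1', 'asn': 13335}])

  # Only these two keyed flows get shuffled for this join.
  joined = ({
      'answers': answers | 'key answers' >> beam.Map(_key_by_ip),
      'tags': tags | 'key tags' >> beam.Map(_key_by_ip),
  } | 'join on ip' >> beam.CoGroupByKey())
```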

Define and clarify row types

Another significant improvement is to name each data type and create type aliases for the existing Row (e.g. SatelliteScan or ResolverTags). We should not see the bare Row type anywhere. Note that they can all still be a generic dictionary type, but the type annotations will help us understand what goes where. We should also rename the variables to reflect their types.
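As a sketch of what I mean (alias names from above; the helper is illustrative):

```python
from typing import Any, Dict

# Still plain dicts at runtime, but the annotations document what flows where.
SatelliteScan = Dict[str, Any]   # one Satellite measurement row
ResolverTags = Dict[str, Any]    # metadata tags for a resolver IP

def add_resolver_tags(scan: SatelliteScan, tags: ResolverTags) -> SatelliteScan:
  """Attach resolver metadata to a scan row (illustrative helper)."""
  merged: SatelliteScan = dict(scan)
  merged.update(tags)
  return merged
```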

/cc @ohnorobo @avirkud

TODOs

fortuna commented 2 years ago

Reference for types: https://docs.censoredplanet.org/dns.html

For inputs, we can define types like BlockpagesEntry or ResultsEntry for blockpages.json and results.json, for example. Then we can think of names for the intermediate data.
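If we want to go a step further than bare aliases, the raw input types could be TypedDicts. The fields below are placeholders only; the real ones should come from the schema linked above.

```python
from typing import Any, Dict, List, TypedDict

class BlockpagesEntry(TypedDict, total=False):
  """One line of blockpages.json (placeholder fields)."""
  ip: str
  domain: str
  response: str

class ResultsEntry(TypedDict, total=False):
  """One line of results.json (placeholder fields)."""
  vp: str
  test_url: str
  response: List[Dict[str, Any]]
```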

fortuna commented 2 years ago

I talked to @ohnorobo. We'll update the pipeline to remove the duplication of observations, which should significantly improve pipeline performance and address some of the modeling concerns. She described the change in a TODO here: https://github.com/censoredplanet/censoredplanet-analysis/blob/13a163ada70d851e84f8779d7e2cd2d13d53a04f/pipeline/metadata/satellite.py#L471-L492