GSS-Cogs / DataEngineering_Airflow_Alpha


Pipeline Dev: Code Alternatives #25

Open RedWalters opened 1 year ago

RedWalters commented 1 year ago

This issue is a log of code or processes in the pipeline that I think could be either punched up or replaced, whether to help future-proof the pipeline or simply to improve performance/quality.

If anyone else has suggestions for parts of the code that could do with a punch-up, feel free to add them below. My code can be very rough around the edges (or outright bad haha), so any contribution is welcome.

RedWalters commented 1 year ago

To fix an issue with URI matching, I currently use a BashOperator to find and replace the incorrect URI base with the correct one. The find-and-replace solution was taken directly from the old Jenkins pipelines, so it's not necessarily a bad thing, but because the BashOperator needs its own task, it not only slows the process slightly but also increases the number of tasks unnecessarily, which makes the pipelines harder to read.

A Python replacement that reads in the ttl file, makes the necessary edits, and saves it back to file would be good, as we could hide this in the pipeline code without giving it its own task. RDFLib (which is already used in places in the pipeline) might be able to do this.
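As a rough starting point, here is a minimal sketch of the read/edit/save idea. It does a plain text find-and-replace using only the standard library, which mirrors the existing Bash approach rather than going through RDFLib. The function name `replace_uri_base` and its parameters are hypothetical and not taken from the pipeline code, and the actual URI bases would need to be filled in.

```python
from pathlib import Path


def replace_uri_base(ttl_path, old_base, new_base):
    """Replace every occurrence of old_base with new_base in a Turtle file.

    Hypothetical helper (not from the existing pipeline). The file is
    rewritten in place only if at least one match is found.
    Returns the number of occurrences replaced.
    """
    path = Path(ttl_path)
    text = path.read_text(encoding="utf-8")

    # Count matches first so we can skip the write when nothing changes.
    count = text.count(old_base)
    if count:
        path.write_text(text.replace(old_base, new_base), encoding="utf-8")
    return count
```

Something like this could be called from inside an existing Python task, so the fix would not need a task of its own. An RDFLib version would parse the file into a graph and rewrite the URIs there, which is stricter but probably slower on large files, so it would be worth comparing the two.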