NASA-PDS / web-analytics


Implement Additional Data Pipeline Pre-processing #41

Open · kaipak opened this issue 2 months ago

kaipak commented 2 months ago

Checked for duplicates

Yes - I've already checked

🧑‍🔬 User Persona(s)

Data Scientist, Data Engineer

💪 Motivation

Some processing already occurs on the log files, including standardization to Apache CLF; however, additional pre-processing would help simplify the data pipeline further downstream and reduce complexity and errors. For example, the datetime string is currently in a format that is not standard for typical data science tools such as SQL and Python. This requires complex string-matching and date-manipulation logic that may have to be applied in several places further down the pipeline.

Additionally, some logs exhibit errors such as malformed datetime strings and missing fields. Fixing these issues at the data source rather than further downstream would simplify the pipeline and lead to a more robust system.
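To illustrate the point above, here is a minimal Python sketch, assuming the logs carry the standard Apache CLF timestamp layout (e.g. `10/Oct/2023:13:55:36 -0700`; the value shown is illustrative, not taken from the project's logs). The CLF form requires a custom format string, whereas an ISO 8601 rendering is understood directly by Python, SQL engines, and pandas:

```python
from datetime import datetime

# Example CLF-style timestamp (illustrative value only)
clf_value = "10/Oct/2023:13:55:36 -0700"

# The CLF form needs a custom format string; it is not parseable as-is by
# fromisoformat(), most SQL date functions, or default dataframe parsers.
parsed = datetime.strptime(clf_value, "%d/%b/%Y:%H:%M:%S %z")

# An ISO 8601 rendering, by contrast, parses directly in those tools.
print(parsed.isoformat())  # 2023-10-10T13:55:36-07:00
```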

📖 Additional Details

Recommendations:

⚙️ Engineering Details

For datetime:

Regex currently used to convert the date:

WHEN REGEXP_LIKE(datetime, '\[\d{2}\/\w{3}\/\d{4}:.+\]') THEN REGEXP_EXTRACT(datetime, '\d{2}\/\w{3}\/\d{4}')
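One possible shape for the pre-processing step is sketched below. This is only a sketch under assumptions, not the project's actual implementation: it assumes the step runs in Python, that ISO 8601 is the preferred target format, and the function name is hypothetical. Normalizing the timestamp once at the source would let downstream SQL use ordinary date functions instead of the regex above, and it gives malformed or missing values an explicit path instead of failing mid-query:

```python
from datetime import datetime
from typing import Optional

# Apache CLF timestamp layout, e.g. "10/Oct/2023:13:55:36 -0700"
CLF_DATETIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"

def normalize_clf_datetime(raw: Optional[str]) -> Optional[str]:
    """Convert a bracketed CLF timestamp to ISO 8601.

    Returns None for missing or malformed values so the caller can decide
    whether to drop, repair, or quarantine the record.
    """
    if not raw:
        return None
    try:
        parsed = datetime.strptime(raw.strip("[]"), CLF_DATETIME_FORMAT)
    except ValueError:
        # Malformed timestamp: surface it rather than guessing a value
        return None
    return parsed.isoformat()
```

Whether records that come back as None are dropped, repaired, or written to an error table is left as an explicit pipeline decision rather than being decided implicitly by a failed regex match downstream.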
jordanpadams commented 2 months ago

@kaipak

datetime string is currently in a format that is not standard to typical data science tools

can you describe what format is preferred for the datetime?

some logs exhibit errors such as incorrectly formed datetime strings, and missing fields

do you have a couple of examples of this?

also, what happens when these things occur? Are those records just ignored?