NASA-PDS / web-analytics


Implement Additional Data Pipeline Pre-processing #41

Open · kaipak opened this issue 2 months ago

kaipak commented 2 months ago

Checked for duplicates

Yes - I've already checked

🧑‍🔬 User Persona(s)

Data Scientist, Data Engineer

💪 Motivation

Some processing already occurs on the log files, including standardization to Apache CLF; however, additional pre-processing would help simplify the data pipeline further downstream and reduce complexity and errors. For example, the datetime string is currently in a format that is not standard for typical data science tools such as SQL and Python. This requires complex string-matching and date-manipulation logic that may have to be applied in several places further down the pipeline.

Additionally, some logs exhibit errors such as malformed datetime strings and missing fields. Fixing these issues at the data source rather than further downstream would simplify the pipeline and lead to a more robust system.
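To illustrate the point above, here is a minimal Python sketch, assuming the logs carry the standard Apache CLF timestamp layout (e.g. `10/Oct/2023:13:55:36 -0700`; the value shown is illustrative, not taken from the project's logs). The CLF form requires a custom format string, whereas an ISO 8601 rendering is understood directly by Python, SQL engines, and pandas:

```python
from datetime import datetime

# Example CLF-style timestamp (illustrative value only)
clf_value = "10/Oct/2023:13:55:36 -0700"

# The CLF form needs a custom format string; it is not parseable as-is by
# fromisoformat(), most SQL date functions, or default dataframe parsers.
parsed = datetime.strptime(clf_value, "%d/%b/%Y:%H:%M:%S %z")

# An ISO 8601 rendering, by contrast, parses directly in those tools.
print(parsed.isoformat())  # 2023-10-10T13:55:36-07:00
```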

📖 Additional Details

Recommendations:

⚙️ Engineering Details

For datetime:

Regex currently used to convert the date:

WHEN REGEXP_LIKE(datetime, '\[\d{2}\/\w{3}\/\d{4}:.+\]') THEN REGEXP_EXTRACT(datetime, '\d{2}\/\w{3}\/\d{4}')
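One possible shape for the pre-processing step is sketched below. This is only a sketch under assumptions, not the project's actual implementation: it assumes the step runs in Python, that ISO 8601 is the preferred target format, and the function name is hypothetical. Normalizing the timestamp once at the source would let downstream SQL use ordinary date functions instead of the regex above, and it gives malformed or missing values an explicit path instead of failing mid-query:

```python
from datetime import datetime
from typing import Optional

# Apache CLF timestamp layout, e.g. "10/Oct/2023:13:55:36 -0700"
CLF_DATETIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"

def normalize_clf_datetime(raw: Optional[str]) -> Optional[str]:
    """Convert a bracketed CLF timestamp to ISO 8601.

    Returns None for missing or malformed values so the caller can decide
    whether to drop, repair, or quarantine the record.
    """
    if not raw:
        return None
    try:
        parsed = datetime.strptime(raw.strip("[]"), CLF_DATETIME_FORMAT)
    except ValueError:
        # Malformed timestamp: surface it rather than guessing a value
        return None
    return parsed.isoformat()
```

Whether records that come back as None are dropped, repaired, or written to an error table is left as an explicit pipeline decision rather than being decided implicitly by a failed regex match downstream.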
jordanpadams commented 2 months ago

@kaipak

datetime string is currently in a format that is not standard to typical data science tools

can you describe what format is preferred for the datetime?

some logs exhibit errors such as incorrectly formed datetime strings, and missing fields

do you have a couple of examples of this?

also, what happens when these things occur? Are those records just ignored?