Open kaipak opened 2 months ago
@kaipak
datetime string is currently in a format that is not standard to typical data science tools
Can you describe the preferred datetime format?
some logs exhibit errors such as incorrectly formed datetime strings, and missing fields
Do you have a couple of examples of this?
Also, what happens when these errors occur? Are those records just ignored?
Checked for duplicates
Yes - I've already checked
🧑🔬 User Persona(s)
Data Scientist, Data Engineer
💪 Motivation
The log files already undergo some processing, including standardization to the Apache Common Log Format (CLF). However, additional pre-processing would simplify the data pipeline further downstream and reduce complexity and errors. For example, the datetime string is currently in a format that is not standard for typical data science tools such as SQL and Python. Handling it requires complex string-matching logic and date manipulation that might have to be repeated in several places further down the pipeline.
Some logs also contain errors such as malformed datetime strings and missing fields. Fixing these issues at the data source rather than further downstream would simplify the pipeline and make the system more robust.
📖 Additional Details
Recommendations:
⚙️ Engineering Details
For datetime, convert the current Apache CLF timestamp to an ISO 8601 string:
Current: [20/Jul/2023:00:13:09 -0700]
Preferred: 2023-07-23T00:13:09Z
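As a rough illustration of the conversion, the following is a minimal sketch and not the pipeline's actual implementation. It assumes the CLF timestamp always carries a numeric UTC offset, that month abbreviations are in English (`%b` is locale-dependent), and that the desired output is normalized to UTC, which means the wall-clock digits shift by the offset. The function name `clf_to_iso8601` is hypothetical. Malformed strings return `None` here; whether such records should be dropped, logged, or repaired is an open question for this issue.

```python
from datetime import datetime, timezone
from typing import Optional

# strptime pattern for the Apache CLF timestamp, e.g. 20/Jul/2023:00:13:09 -0700
CLF_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def clf_to_iso8601(raw: str) -> Optional[str]:
    """Convert a CLF timestamp to an ISO 8601 UTC string.

    Hypothetical helper: returns None when the input is malformed,
    so the caller can decide how to handle the bad record.
    """
    try:
        # Drop the surrounding square brackets, if present, before parsing.
        parsed = datetime.strptime(raw.strip("[]"), CLF_FORMAT)
    except ValueError:
        return None
    # Assumption: normalize to UTC and use the "Z" suffix.
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


print(clf_to_iso8601("[20/Jul/2023:00:13:09 -0700]"))  # 2023-07-20T07:13:09Z
print(clf_to_iso8601("[20/Jul/2023 00:13:09]"))        # None (malformed)
```

Using `datetime.strptime` instead of a regex keeps the parsing rules in one format string and rejects out-of-range values such as a 32nd day or a 25th hour.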
Regex currently used to convert the date: