Closed tomreitz closed 1 year ago
Based on our conversation today, I'm proposing that
lightbeam
's structured output is enabled with a cli flag --results-file ./results.json
{
"timestamp": "2023-05-15 02:04:18Z00",
"api_url": "https://edfi3.somedistrict.org/api/",
"source_dir": "/path/to/some/dir/",
"resources": {
"studentSchoolAssociations": {
"records_processed": 100,
"records_skipped" :10,
"records_attempted": 90,
"records_success": 75,
"records_failed": 15,
"failed_statuses": {
"409": {
"messages": [
{
"message": "Invalid StudentReference",
"count": 6,
"line_numbers": [3, 7, 8, 31, 64, 88]
},
{
"message": "Invalid SchoolReference",
"count": 9,
"lines": [5, 9, 12, 13, 26, 27, 29, 30, 56]
},
]
},
...
}
},
...
}
"total_records_processed": 100,
"total_records_skipped": 10,
"total_records_attempted": 90,
"total_records_success": 75,
"total_records_failed": 15
}
Is this too verbose? i.e., do we need the total_
across all resources?
We also discussed the exit codes lightbeam
should return in different circumstances, to trigger the AirFlow BashOperator task to raise an AirflowSkipException
or AirflowException
(per the Airflow docs). Is it correct that
total_records_skipped
== total_records_processed
, lightbeam should return exit code 99? (so the Airflow task skips)total_records_failed
>0 lightbeam
should return exit code 1? (so the Airflow task fails)One other issue has to do with reporting line numbers: lightbeam
can process either a single JSONL file (source_dir/students.jsonl
) or an entire directory of JSONL files (source_dir/students/*.jsonl
). In the latter case, how would line numbers be determined/interpreted?
Finally, one tiny question, should line numbers be 0-based or 1-based?
Some thoughts on this one as well:
timestamp
to a more descriptive name that says whether that time is when the run started vs ended.total_records_attempted
. Is there any case where a record that is not skipped is not attempted?{
"message": "Invalid StudentReference",
"total_count": 10,
"files": [
{
"file": "{FILENAME1}",
"count": 6,
"line_numbers": [3, 7, 8, 31, 64, 88]
},
{
"file": "{FILENAME2}",
"count": 4,
"line_numbers": [2, 6, 10, 35]
},
]
}
To use lightbeam in an automated data pipeline it would be useful to have it generate structured output noting the number of records sent and skipped by resource and status code, the run timestamp, and possibly other information. Open questions include:
Tagging @jayckaiser for thoughts.