bxparks / bigquery-schema-generator

Generates the BigQuery schema from newline-delimited JSON or CSV data records.
Apache License 2.0

[FEATURE] Starting with an existing schema, exclude rows that do not match the existing schema #98

Closed yusuffgur closed 9 months ago

yusuffgur commented 1 year ago

Current Behavior

Whether it starts from an existing schema or not, when the script encounters a record whose type differs from the schema, it logs the offending line with errors like:

Error Log

INFO:root:Problem on line 47730: Ignoring field with mismatched type: old=(hard,dimensionValue,REPEATED,RECORD); new=(hard,dimensionValue,REPEATED,STRING)
INFO:root:Problem on line 47732: Ignoring field with mismatched type: old=(hard,dimensionValue,REPEATED,STRING); new=(hard,dimensionValue,REPEATED,RECORD)

Expected Behavior

For example, our file contains about 100,000 rows, of which only about 100 do not match the existing schema. But when the non-matching lines appear consecutively, the script flags only the first of them as problematic, and then marks the first matching line after the run as problematic, even though that line actually matches the existing schema.

Suggested solution

Add a new feature that checks an input file against a schema file, excludes the rows that do not match the schema, and writes them to a separate JSON/CSV file.
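As a rough illustration of the requested feature, a standalone script could validate each NDJSON record against an existing BigQuery schema and split matches from mismatches. This is a minimal sketch; the `matches_schema` helper and its type-checking rules are simplified placeholders and not part of bigquery-schema-generator (real BigQuery type coercion is more permissive than these strict `isinstance` checks):

```python
import json

# Map BigQuery types to the Python types a parsed JSON value may have.
# Simplified: real BigQuery coercion rules are more permissive.
TYPE_CHECKS = {
    "STRING": str,
    "INTEGER": int,
    "FLOAT": (int, float),
    "BOOLEAN": bool,
    "RECORD": dict,
}

def matches_schema(record, schema):
    """Return True if every field present in the record conforms to the schema."""
    fields = {f["name"]: f for f in schema}
    for name, value in record.items():
        field = fields.get(name)
        if field is None:
            return False  # field not in the existing schema
        if field.get("mode") == "REPEATED":
            if not isinstance(value, list):
                return False
            values = value
        else:
            values = [value]
        for v in values:
            if field["type"] == "RECORD":
                # A RECORD element must be a JSON object matching the subfields.
                if not isinstance(v, dict) or not matches_schema(v, field.get("fields", [])):
                    return False
            elif not isinstance(v, TYPE_CHECKS[field["type"]]):
                return False
    return True

def split_records(lines, schema):
    """Split NDJSON lines into (matching, rejected) record lists."""
    good, bad = [], []
    for line in lines:
        record = json.loads(line)
        (good if matches_schema(record, schema) else bad).append(record)
    return good, bad
```

A downstream script would then write `good` back out as NDJSON for loading and `bad` to a separate file for inspection, which is the split the feature request describes.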

bxparks commented 1 year ago

Can you attach a minimal data sample that illustrates the issue, say something with 3-4 records? I don't think I understand the expected behavior, and this part of the codebase is tricky; I don't remember all the edge cases since I no longer use this project personally.

yusuffgur commented 1 year ago

For example, records 2-5 here are problematic, but the script reports only record 2 and record 6 (which is not problematic):

{"landingPageClicks":2.0,"costInLocalCurrency":3.211797061516033,"impressions":331.0,"clicks":2.0,"totalEngagements":3.0,"account_id":0000000,"account_name":"dummy","date":"2021-07-10","dimensionValue":[{"servingHoldReasons":["STOPPED","CAMPAIGN_STOPPED"],"lastModifiedAt":1656605739000.0,"content":{"reference":"urn:li:share:6803578573446815744"},"createdAt":1622099537000.0,"review":{"status":"APPROVED"},"id":"urn:li:sponsoredCreative:133406534","lastModifiedBy":"urn:li:system:0","createdBy":"urn:li:person:ghhgfhfgh","isTest":false,"isServing":false,"campaign":"urn:li:sponsoredCampaign:3423525","intendedStatus":"PAUSED","account":"urn:li:sponsoredAccount:234578"}],"dimension":"creative"}
{"landingPageClicks":17.0,"pivotValues_":[{"message":"Call to downstream service failed. Downstream Service Exception: Cannot fetch this creative because the referenced post does not exist.","status":400.0}],"costInLocalCurrency":25.41813751523781,"impressions":2300.0,"clicks":17.0,"totalEngagements":45.0,"account_id":0000000,"account_name":"dummy","date":"2021-06-03","dimensionValue":["urn:li:sponsoredCreative:133406804"],"dimension":"creative"}
{"landingPageClicks":11.0,"pivotValues_":[{"message":"Call to downstream service failed. Downstream Service Exception: Cannot fetch this creative because the referenced post does not exist.","status":400.0}],"costInLocalCurrency":8.094764692519716,"impressions":602.0,"clicks":11.0,"totalEngagements":15.0,"account_id":0000000,"account_name":"dummy","date":"2021-06-27","dimensionValue":["urn:li:sponsoredCreative:133406804"],"dimension":"creative"}
{"landingPageClicks":10.0,"pivotValues_":[{"message":"Call to downstream service failed. Downstream Service Exception: Cannot fetch this creative because the referenced post does not exist.","status":400.0}],"costInLocalCurrency":15.095423445027421,"impressions":999.0,"clicks":10.0,"totalEngagements":19.0,"account_id":0000000,"account_name":"dummy","date":"2021-06-06","dimensionValue":["urn:li:sponsoredCreative:133406804"],"dimension":"creative"}
{"landingPageClicks":19.0,"pivotValues_":[{"message":"Call to downstream service failed. Downstream Service Exception: Cannot fetch this creative because the referenced post does not exist.","status":400.0}],"costInLocalCurrency":26.39521675040559,"impressions":2982.0,"clicks":19.0,"totalEngagements":45.0,"account_id":0000000,"account_name":"dummy","date":"2021-07-19","dimensionValue":["urn:li:sponsoredCreative:133406804"],"dimension":"creative"}
{"landingPageClicks":5.0,"costInLocalCurrency":7.54,"impressions":430.0,"clicks":5.0,"totalEngagements":12.0,"account_id":0000000,"account_name":"dummy","date":"2021-12-25","dimensionValue":[{"servingHoldReasons":["CAMPAIGN_STOPPED","CAMPAIGN_TOTAL_BUDGET_HOLD"],"lastModifiedAt":1656583173000.0,"content":{"reference":"urn:li:share:6879344597777084416"},"createdAt":1640163564000.0,"review":{"status":"APPROVED"},"id":"urn:li:sponsoredCreative:157081644","lastModifiedBy":"urn:li:system:0","createdBy":"urn:li:person:ghhgfhfgh","isTest":false,"isServing":false,"campaign":"urn:li:sponsoredCampaign:235325","intendedStatus":"ACTIVE","account":"urn:li:sponsoredAccount:37865567"}],"dimension":"creative"}
bxparks commented 12 months ago

Thanks for that sample data. It helped track down a latent bug in the handling of multiple type mismatches, which has probably existed since the very beginning of the script. The new code fixes the problem of multiple warning messages: it now prints only the first mismatch, ignores that particular column for all subsequent records, and omits the problematic column from the resulting schema.
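The behavior described above could be sketched as follows. This is a hypothetical simplification, not the actual code in bigquery-schema-generator: the idea is that the generator remembers which columns have already mismatched, logs only the first conflict, and drops those columns from the final schema.

```python
import logging

class MismatchTracker:
    """Log the first type mismatch per column, then ignore that column."""

    def __init__(self):
        self.ignored = set()  # column names with a known type conflict

    def check(self, name, old_type, new_type, line_number):
        """Return True if the column is still usable for this record."""
        if name in self.ignored:
            return False  # already flagged; stay silent on later records
        if old_type != new_type:
            logging.info(
                "Problem on line %d: Ignoring field with mismatched type: "
                "old=(%s,%s); new=(%s,%s)",
                line_number, name, old_type, name, new_type)
            self.ignored.add(name)
            return False
        return True

    def prune(self, schema):
        """Remove ignored columns so they never reach the final schema."""
        return [f for f in schema if f["name"] not in self.ignored]
```

With this shape, consecutive mismatching records no longer produce a cascade of warnings, and a matching record after the run is never misreported, which addresses the symptom described in the original report.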

Your proposed solution is unfortunately out of scope for bigquery-schema-generator. I understand that it is what you eventually want to do, but I have made the conscious decision to keep bigquery-schema-generator strictly a schema generator, not a data cleanser. There are too many ways people want to sanitize, filter, and massage their data sets, and I don't want to be in the business of supporting those endless variations.

bxparks commented 9 months ago

I finally got around to releasing v1.6.1 yesterday. PyPI apparently changed its authentication system, so I had to update the release process. I had released v1.6.0 on GitHub back on April 1, 2023, but apparently never got around to pushing it to PyPI, so the PyPI releases jump from v1.5.1 to v1.6.1.

This release fixes the problem of multiple warning messages. The tool has no opinions on what to do about those misbehaving records; every project will want to do something different with them. Some will want to drop those records. Some will want to ignore just those columns instead of the entire record. Some will want to convert the problematic field values into something else, using rules appropriate for the specific project rather than rules contained in the dataset itself.

Each downstream project needs to figure out how to do its own data cleansing. The bigquery-schema-generator tool is not a data cleanser, so that feature is out of scope. This project can be used as a Python library, which may be helpful in some of those data cleansing scripts.