GoogleCloudPlatform / document-ai-samples

Sample applications and demos for Document AI, the end-to-end document processing platform on Google Cloud
https://cloud.google.com/document-ai
Apache License 2.0

Error while reading data, error message - Fraud Detection #897

Open puranjay123 opened 2 months ago

puranjay123 commented 2 months ago

Issue: BigQuery Table Not Populating After Uploading Invoice

Description:

While running the fraud detection use case, I followed all the steps in the README. After uploading an invoice to the GCS bucket, the BigQuery tables are not being populated, and I see the following errors:

Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.
Error while reading data, error message: JSON processing encountered too many errors, giving up. Rows: 1; errors: 1; max bad: 0; error percent: 0
Error while reading data, error message: JSON parsing error in row starting at position 0: No such field: invoice_date.
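
The message points at the errors[] collection; for reference, here is a minimal sketch of how that collection can be surfaced from the Python client (assuming job is the LoadJob returned by the load call shown further down):

try:
    job.result()  # raises if the load job failed
except Exception:
    for err in job.errors or []:
        print(err)  # each entry is a dict with "reason", "location", and "message"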

Steps to Reproduce:

  1. Follow the README instructions to set up the fraud detection use case.
  2. Upload an invoice file to the specified GCS bucket (see the upload sketch after this list).
  3. Observe that BigQuery tables are not populated and the error message above is displayed.
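
For step 2, the upload is just a copy into the bucket; here is a sketch using the Python storage client, with placeholder bucket and file names (substitute the bucket created during setup):

from google.cloud import storage

storage_client = storage.Client()
# "YOUR_FRAUD_DETECTION_BUCKET" and "invoice.pdf" are placeholders, not from the sample.
bucket = storage_client.bucket("YOUR_FRAUD_DETECTION_BUCKET")
bucket.blob("invoice.pdf").upload_from_filename("invoice.pdf")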

What I’ve Tried:

I suspect the issue lies with the ALLOW_FIELD_ADDITION option in the BigQuery load job configuration: the table schema does not seem to be updated to match the incoming data. Below is the relevant code that seems to be causing the issue:

import json

from google.cloud import bigquery

bq_client = bigquery.Client()


def write_to_bq(dataset_name, table_name, entities_extracted_dict):
    """Write the extracted entities for one document to BigQuery."""
    dataset_ref = bq_client.dataset(dataset_name)
    table_ref = dataset_ref.table(table_name)

    # The load API expects an iterable of rows, so wrap the single dict.
    row_to_insert = [entities_extracted_dict]

    # Round-trip through a JSON string so all values are JSON-serializable.
    json_data = json.dumps(row_to_insert, sort_keys=False)
    json_object = json.loads(json_data)

    # Allow the load job to add new columns and to relax REQUIRED columns
    # to NULLABLE when the incoming rows differ from the table schema.
    schema_update_options = [
        bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
        bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION,
    ]
    source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON

    job_config = bigquery.LoadJobConfig(
        schema_update_options=schema_update_options,
        source_format=source_format,
    )

    job = bq_client.load_table_from_json(json_object, table_ref, job_config=job_config)
    print(job.result())  # Waits for the load job to complete.
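
A quick way to check whether the live table already has an invoice_date column (a sketch reusing bq_client and table_ref from above):

table = bq_client.get_table(table_ref)
print([field.name for field in table.schema])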

Request for Help: Could anyone help me resolve the schema update issue? Specifically, the ALLOW_FIELD_ADDITION option doesn't seem to take effect, and the table is rejecting new fields (such as invoice_date) from the uploaded JSON data.
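
In case it is useful, this is the variant of the job config I would try next. It is only a sketch based on my reading of the docs: schema update options are honored only for jobs that append to the table (or overwrite a partition), and autodetect=True is my assumption about what load_table_from_json needs in order to infer the new column; neither setting comes from the sample itself.

job_config = bigquery.LoadJobConfig(
    schema_update_options=[
        bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
        bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION,
    ],
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    # Schema updates are only applied on append (or partition-overwrite) jobs.
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    # Let BigQuery infer a schema for the incoming rows, including new fields.
    autodetect=True,
)
job = bq_client.load_table_from_json(json_object, table_ref, job_config=job_config)
print(job.result())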