aws-solutions / qnabot-on-aws

AWS QnABot is a multi-channel, multi-language conversational interface (chatbot) that responds to your customers' questions, answers, and feedback. The solution allows you to deploy a fully functional chatbot across multiple channels including chat, voice, SMS, and Amazon Alexa.
https://aws.amazon.com/solutions/implementations/aws-qnabot
Apache License 2.0

Import function fails to handle partial JSON objects and large files in v6.1.0 #766

Closed by t-jones 2 months ago

t-jones commented 2 months ago

Describe the bug
In v6.1.0, if an import file is larger than 20000 bytes and hence requires multiple reads from S3, JSON objects at the beginning of each chunk after the first are truncated and fail to parse, which causes an unknown number of qids to be dropped when importing a QnA export. Also, the new code reads at most 15 chunks of import data from S3; at ~20 KB per chunk, that is only ~300 KB, which is much too small. These values should be configurable, or the threshold needs to be much higher.

Replacing this code with the v6.0.1 version fixes the issue. This is a regression from the previous behavior.
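
For illustration, here is a minimal sketch of one way a chunked, ranged S3 read can avoid truncating objects at chunk boundaries: carry the trailing partial line into the next chunk, and keep reading until the total size reported by ContentRange is reached instead of stopping after a fixed chunk count. This assumes the export is newline-delimited JSON (one object per line), consistent with the per-chunk parse failures described above; names like `importFromS3` and `onRecord` are illustrative, not QnABot's actual code.

```javascript
// Sketch only: carry-over chunked read, not the solution's real implementation.
const { S3Client, GetObjectCommand } = require("@aws-sdk/client-s3");

const CHUNK_SIZE = 20000; // should be configurable, or much larger
const s3 = new S3Client({});

async function importFromS3(bucket, key, onRecord) {
    let offset = 0;
    let carry = "";        // partial line left over from the previous chunk
    let total = Infinity;  // learned from ContentRange on the first read

    while (offset < total) {
        const end = offset + CHUNK_SIZE - 1;
        const resp = await s3.send(new GetObjectCommand({
            Bucket: bucket,
            Key: key,
            Range: `bytes=${offset}-${end}`,
        }));
        // ContentRange looks like "bytes 20000-39999/840237"
        total = Number(resp.ContentRange.split("/")[1]);
        const text = carry + (await resp.Body.transformToString());

        const lines = text.split("\n");
        // The last element may be a truncated object; hold it for the next chunk
        carry = lines.pop();
        for (const line of lines) {
            if (line.trim()) onRecord(JSON.parse(line));
        }
        offset = end + 1;
    }
    if (carry.trim()) onRecord(JSON.parse(carry)); // final record
}
```

Because the loop is bounded by the object's actual size rather than a hard-coded 15 iterations, this pattern would also address the ~300 KB ceiling.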

To Reproduce

  1. Create or procure a large export file. The issue was reproduced here with a 1 MB file containing 195 qids (a sketch for generating such a file follows this list).
  2. Import the file.
  3. Compare the number of successfully imported qids to the actual number in the file. In this case, only 54 of the 195 qids were successfully imported.
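
A quick sketch for generating a synthetic export large enough to force multiple ranged reads. It assumes a line-delimited format with one qna object per line; the field names here are illustrative, not necessarily QnABot's exact export schema.

```javascript
// Sketch only: writes ~1 MB of 195 synthetic qna records, one JSON object per line.
const fs = require("fs");

const lines = [];
for (let i = 1; i <= 195; i++) {
    lines.push(JSON.stringify({
        qid: `test.${i}`,
        type: "qna",
        q: [`Test question ${i}?`],
        // pad the answer so 195 records total roughly 1 MB
        a: `Test answer ${i}. ` + "x".repeat(5000),
    }));
}
fs.writeFileSync("large-export.json", lines.join("\n"));
```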

Expected behavior
All qids should be imported.


Additional context
There is a log line like:

2024-09-20T22:42:41.438Z    80513f94-33be-4d7a-ab5b-b879ddec27cf    INFO    ContentRange:  bytes 20001-40001/840237

indicating that this file requires multiple reads. Also, the first parse attempt in each new chunk always fails, since the chunk begins partway through a JSON object. And at 840,237 bytes, with only 15 reads of ~20 KB each (~300 KB), less than half of this file is actually processed.


fhoueto-amz commented 2 months ago

Thanks, we will look into this and get back to you.

tmekari commented 2 months ago

This has been addressed in our v6.1.1 release. Thanks for bringing this to our attention. Please feel free to reach out if you have any other issues!