Closed — wiktorflorian closed this 1 month ago
Hi @wiktorflorian, thank you for raising your pull request. Check your changes at the URL.
Opened #1449
What is the strategy for handling records with invalid characters? Do you skip them, or the entire file? Is there a counter that reports at the end how many records were corrupted and therefore skipped?
As I wrote above, problematic records are currently excluded, so every problematic row is skipped.
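The skip-and-count approach could be sketched roughly like this (a hypothetical helper, not the PR's actual code; `load_records` and the sample payload are assumptions):

```python
import json

def load_records(content_str):
    """Parse newline-delimited JSON, skipping corrupt rows.

    Returns (records, skipped) so the caller can report how many
    rows were dropped. Hypothetical helper for illustration only.
    """
    records, skipped = [], 0
    for line in content_str.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            skipped += 1  # corrupt record: count it and move on
    return records, skipped

records, skipped = load_records('{"ok": 1}\nnot json\n{"ok": 2}\n')
print(len(records), skipped)  # 2 1
```

Returning the skip count (or logging it) would answer the question above about knowing how many records were dropped.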
Closes #1434
There's an issue reading JSON files from S3. The current solution excludes problematic records.
Some problematic characters were observed in dataset/part-00000.json.
Fastest error recreation:

```python
json_data = json.loads(content_str)
```

Error:

```
JSONDecodeError: Extra data: line 2 column 1 (char 46115)
```
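For context, `Extra data` is what `json.loads` raises when the payload contains more than one top-level JSON document, e.g. a JSON Lines file where each row is its own object. A minimal sketch reproducing it (the sample `content_str` is an assumption, not the real S3 content):

```python
import json

# Two JSON documents separated by a newline (JSON Lines layout),
# which json.loads cannot parse as a single document.
content_str = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'

try:
    json.loads(content_str)
except json.JSONDecodeError as exc:
    # Message starts with "Extra data: line 2 column 1 ..."
    print(exc)

# Parsing one line at a time succeeds for well-formed rows.
records = [json.loads(line) for line in content_str.splitlines() if line.strip()]
print(len(records))  # 2
```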