cyfronet-fid / eosc-search-service

GNU General Public License v3.0

feat: [#1434] read and transform dump #1448

Closed · wiktorflorian closed this 1 month ago

wiktorflorian commented 1 month ago

Closes #1434

There's an issue reading JSON files from S3. The current solution excludes problematic records.

Some problematic characters were observed in dataset/part-00000.json.

Fastest way to recreate the error:

import boto3


def connect_to_s3(access_key: str, secret_key: str, endpoint: str) -> boto3.client:
    """Connect to S3 and return a client."""
    session = boto3.session.Session()
    s3_client = session.client(
        service_name="s3",
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        endpoint_url=endpoint,
    )
    return s3_client

s3_client = connect_to_s3(S3_ACCESS_KEY, S3_SECRET_KEY, str(S3_ENDPOINT))
key = 'test/dataset/part-00000.json'
s3_object = s3_client.get_object(Bucket=S3_BUCKET, Key=key)
s3_object_body = s3_object.get('Body')
content = s3_object_body.read()
content_str = content.decode('utf-8')

and then either:

import json

json_data = json.loads(content_str)

or

json_objects = content_str.splitlines()
for line in json_objects:
    json_data = json.loads(line)

Error: JSONDecodeError: Extra data: line 2 column 1 (char 46115)
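One plausible reading of this error (an assumption, not verified against the dump): json.loads(content_str) fails on a newline-delimited file because it parses only the first record and reports everything after it as extra data, while the per-line variant can fail on records containing unescaped control characters, since str.splitlines then cuts a single record into several invalid pieces. A small diagnostic sketch, reusing content_str from the snippet above, that reports which lines fail and shows their raw prefix:

import json

for line_no, line in enumerate(content_str.splitlines(), start=1):
    try:
        json.loads(line)
    except json.JSONDecodeError as exc:
        # Print the parse error and the raw line prefix to spot stray control characters.
        print(f"line {line_no}: {exc}")
        print(repr(line[:120]))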

github-actions[bot] commented 1 month ago

Hi @wiktorflorian, thank you for raising your pull request. Check your changes at the URL.

wiktorflorian commented 1 month ago

Opened #1449

wiktorflorian commented 1 month ago

What is the strategy for handling records with invalid characters? Do you skip them? Or maybe you skip the entire file? Do we have some counter that will tell at the end how many records were corrupted and therefore skipped?

As I wrote above, problematic records are currently excluded, so every problematic row is skipped. A minimal sketch of that behaviour is below.
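For reference, a minimal sketch of the skip-and-count approach, assuming line-delimited records; load_records and the logging setup are hypothetical names for illustration, not the ones used in this PR:

import json
import logging

logger = logging.getLogger(__name__)


def load_records(content_str: str) -> list:
    """Parse line-delimited JSON, skipping rows that fail to decode."""
    records = []
    skipped = 0
    for line in content_str.splitlines():
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            skipped += 1  # problematic row is excluded, matching the PR's behaviour
    if skipped:
        logger.warning("Skipped %d corrupted record(s)", skipped)
    return records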

Michal-Kolomanski commented 1 month ago

Opened https://github.com/cyfronet-fid/eosc-search-service/issues/1450