clinical-biomarkers / biomarker-partnership

CFDE Biomarker Partnership
https://hivelab.biochemistry.gwu.edu/biomarker-partnership

Large dataset format conversion speed #121

Closed seankim658 closed 2 months ago

seankim658 commented 2 months ago

I'm currently rewriting the format conversion code so it is better optimized for larger files. There is a known issue in the JSON to TSV conversion: for very large files, the large-string processing consumes a significant amount of system resources. Each successive string operation on the TSV content variable gets slower as the string grows, so total run time rises steeply with file size.

If you run into this issue before the rewrite is finished, you can add this code to the top of the main processing loop (adjust the write checkpoint interval as desired):

if top_level_entry_idx % 10_000 == 0:
    # Append the accumulated TSV content to the target file, then reset the buffer.
    with open(target_filepath, 'a') as f:
        f.write(tsv_content)
    misc_fns.print_and_log(f'Write checkpoint hit at row {top_level_entry_idx}, dumping...', 'info')
    tsv_content = ''
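
For anyone who wants to try the checkpoint idea outside the converter, here is a minimal standalone sketch of the same pattern. It is not the repository's code: `write_tsv_with_checkpoints`, `rows`, and `checkpoint_every` are hypothetical names, and it assumes each row is already a flat sequence of values. It also assumes the target file should start empty, and it collects lines in a list rather than one growing string.

```python
def write_tsv_with_checkpoints(rows, target_filepath, checkpoint_every=10_000):
    """Write rows as TSV, flushing the in-memory buffer every `checkpoint_every` rows.

    Hypothetical helper for illustration; `rows` is any iterable of flat sequences.
    """
    # Assumption: start from an empty file so append-mode checkpoints
    # do not add to output left over from a previous run.
    open(target_filepath, 'w').close()

    buffer = []  # list of finished lines; joined once per flush
    for idx, row in enumerate(rows):
        # Checkpoint: dump what has accumulated so far and reset the buffer.
        if idx % checkpoint_every == 0 and buffer:
            with open(target_filepath, 'a') as f:
                f.write(''.join(buffer))
            buffer = []
        buffer.append('\t'.join(str(value) for value in row) + '\n')

    # Final flush: rows gathered after the last checkpoint are still in memory.
    if buffer:
        with open(target_filepath, 'a') as f:
            f.write(''.join(buffer))
```

Whichever form is used, content gathered after the last checkpoint still has to be written once the loop ends, and that final write needs to append rather than overwrite the file.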