Have tried:
[x] Making sure that no duplicate DataFrames are created during processing
[x] Converting from TSV to Parquet before processing
Yet to try:
[ ] Decompress the .gz file and use polars scan_csv to take advantage of LazyFrame (lazy evaluation)
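
A rough sketch of what the item above could look like, not a tested solution. The file name title.basics.tsv.gz and the column names in the usage comment are assumptions about the IMDb dump, and the helper names decompress_gz and scan_tsv are made up for illustration. Reading every column as a string (infer_schema_length=0) is a cautious default so that odd rows do not break type inference; casts can be added later.

```python
import gzip
import shutil
from pathlib import Path


def decompress_gz(src: Path, dst: Path) -> Path:
    """Stream-decompress a .gz file to dst without loading it into memory."""
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst


def scan_tsv(path: Path):
    """Return a polars LazyFrame over an uncompressed TSV file (hypothetical helper)."""
    import polars as pl  # imported here so decompress_gz is usable without polars

    return pl.scan_csv(
        path,
        separator="\t",
        quote_char=None,        # assumption: quotes in titles are literal, not field quoting
        null_values="\\N",      # assumption: the dump marks missing values as \N
        infer_schema_length=0,  # read every column as a string; cast later if needed
    )


# Usage sketch (file and column names are assumptions):
#   import polars as pl
#   tsv = decompress_gz(Path("title.basics.tsv.gz"), Path("title.basics.tsv"))
#   movies = (
#       scan_tsv(tsv)
#       .filter(pl.col("titleType") == "movie")
#       .select(["tconst", "primaryTitle"])
#       .collect()  # nothing is read from disk until this call
#   )
```

Because the filter and select are declared before collect(), polars can skip unneeded columns and rows while reading, which is the main reason to try the lazy route.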
Easy way out:
Take the last tconst entry in the SQL DB and process the IMDb dataset only from that tconst to the end of the dataset.
(Significantly reduces the amount of data to process, but relies on the IMDb dataset keeping the same order.)
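
A minimal sketch of the "easy way out", under several assumptions: sqlite3 stands in for the SQL DB, the table is called titles (a hypothetical name), "last entry" means the most recently inserted row, and the TSV is already decompressed. The function names are made up for illustration. Rows are yielded only after the stored tconst has been seen; if it is never found, the sketch raises instead of silently returning nothing, since that is the case where the ordering assumption has broken.

```python
import csv
import sqlite3
from typing import Iterator, Optional


def last_stored_tconst(conn: sqlite3.Connection) -> Optional[str]:
    """Return the tconst of the most recently inserted row, or None if the table is empty."""
    row = conn.execute(
        "SELECT tconst FROM titles ORDER BY rowid DESC LIMIT 1"  # 'titles' is hypothetical
    ).fetchone()
    return row[0] if row else None


def rows_after(tsv_path: str, last_tconst: Optional[str]) -> Iterator[dict]:
    """Yield TSV rows that come after last_tconst (all rows if last_tconst is None)."""
    with open(tsv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        seen = last_tconst is None
        for row in reader:
            if seen:
                yield row
            elif row["tconst"] == last_tconst:
                seen = True  # everything after this row is new
        if not seen:
            # The stored tconst is missing: the file order or contents changed.
            raise LookupError(f"{last_tconst!r} not found in {tsv_path}")
```

The file is still scanned from the top, but the skipped rows are only compared on one field, so the expensive processing and inserts run on new rows alone.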