abhishekkrthakur / csv_test


Maybe Least Memory-Intensive Method? #2

Closed · ManishSahu53 closed this issue 3 years ago

ManishSahu53 commented 3 years ago
import os
import csv

path_data = 'csv_test/data/'

## Writing data to CSV
def writedata(csvobject, row):
    """
        Write data to CSV
    """
    pamwriter = csv.writer(csvobject, delimiter=',')
    pamwriter.writerow(row)

## To get all CSV files
path_csv = []
for root, dirs, files in os.walk(path_data):
    for file in files:
        if file.endswith('.csv'):
            path_csv.append(os.path.join(root, file))

## Creating merged file ('a' appends, so a rerun duplicates rows)
path_merge = open('merged.csv', 'a')

## Reading and Writing header columns (once, from the first file)
with open(path_csv[0]) as f:
    lines = csv.reader(f)
    for line in lines:
        writedata(path_merge, line)
        break

## Reading and Writing data to merged CSV
for temp_path_csv in path_csv:
    index = 0
    with open(temp_path_csv) as f:
        lines = csv.reader(f)
        for line in lines:
            index += 1

            # Skip header line
            if index == 1:
                continue

            writedata(path_merge, line)
    print('Writing {} finished'.format(temp_path_csv))

path_merge.close()
ManishSahu53 commented 3 years ago

CPU times: user 1min 15s, sys: 14.1 s, total: 1min 30s
Wall time: 1min 40s

ManishSahu53 commented 3 years ago

We could also use multiple threads to speed this up.
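For instance, a minimal sketch of what that might look like, assuming the path_csv list collected above (merged_threaded.csv and max_workers=4 are illustrative choices). Note that it gives up the low-memory property, since each worker reads a whole file at once:

from concurrent.futures import ThreadPoolExecutor

def read_rows(path):
    # Reading is I/O-bound, so threads can overlap despite the GIL.
    with open(path) as f:
        lines = f.readlines()
    return lines[1:]  # drop each file's header line

with open('merged_threaded.csv', 'w') as out:
    # Write the header once, taken from the first file.
    with open(path_csv[0]) as f:
        out.write(f.readline())
    with ThreadPoolExecutor(max_workers=4) as pool:
        # map() yields results in input order, so output order stays deterministic.
        for rows in pool.map(read_rows, path_csv):
            out.writelines(rows)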

abhishekkrthakur commented 3 years ago

Thanks! Can you think of a way to do this without importing csv? ;)

ManishSahu53 commented 3 years ago

Yes, I used the csv module so that if a text column contains a "," my alignment won't be disturbed. Without it:

CPU times: user 7.59 s, sys: 4.16 s, total: 11.8 s
Wall time: 28.9 s

%%time
import os

path_data = 'csv_test/data/'

# Writing data to CSV (plain write; each row is already a text line)
def writedata(csvobject, row):
    """
        Write data to CSV
    """
    csvobject.write(row)

# To get all CSV files
path_csv = []
for root, dirs, files in os.walk(path_data):
    for file in files:
        if file.endswith('.csv'):
            path_csv.append(os.path.join(root, file))

## Creating merged file
path_merge = open('merged_no_csv.csv', 'a')

## Reading and Writing header columns (once, from the first file)
with open(path_csv[0]) as f:
    for line in f:
        writedata(path_merge, line)
        break

# Reading and Writing data to merged CSV
for temp_path_csv in path_csv:
    index = 0
    with open(temp_path_csv) as f:
        for line in f:
            index += 1

            # Skip header line
            if index == 1:
                continue

            writedata(path_merge, line)

path_merge.close()
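Since each line is copied verbatim, quoted commas survive without the csv module. Taking it a step further, a hedged sketch (merged_shutil.csv is an illustrative name): skip each header, then let shutil.copyfileobj move the rest in fixed-size chunks, so memory stays constant even for very long lines:

import shutil

with open('merged_shutil.csv', 'w') as out:
    # Header once, from the first file.
    with open(path_csv[0]) as f:
        out.write(f.readline())
    for p in path_csv:
        with open(p) as f:
            f.readline()  # skip this file's header
            shutil.copyfileobj(f, out)  # chunked copy of the remainder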
ashishu007 commented 3 years ago
import os
from datetime import datetime

t1 = datetime.now()

data_dir = './data'

csvs = [c for c in os.listdir(data_dir) if c.endswith('.csv')]  # keep only CSV files

# get the header
with open(os.path.join(data_dir, csvs[0])) as f:
    data = f.readlines()
headers = [data[0]]

# append the rows (minus the header) from every csv to the list
for csv_file_name in csvs:
    with open(os.path.join(data_dir, csv_file_name)) as f:
        data = f.readlines()
    headers.extend(data[1:])

# save the header list as .csv file
with open('./output/final.csv', 'w') as f:
    f.write(''.join(headers))

t2 = datetime.now()

t3 = t2 - t1

seconds = t3.total_seconds()
print(f'{seconds}: total seconds\n')
hours = seconds // 3600
minutes = (seconds % 3600) // 60
seconds = seconds % 60
print(f'{hours}: hour\t{minutes}: minutes\t{seconds}: seconds')

Takes around 1 min. 12 secs.

ManishSahu53 commented 3 years ago

@ashishu007, I think you could hit an out-of-memory error because of the headers variable, since it accumulates every row from every file before writing.
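A hedged sketch of how the same snippet could stream instead of accumulating (reusing data_dir and csvs from above, with the same output path):

# Streaming variant: write rows as they are read, so memory use stays flat.
with open('./output/final.csv', 'w') as out:
    with open(os.path.join(data_dir, csvs[0])) as f:
        out.write(f.readline())  # header once
    for csv_file_name in csvs:
        with open(os.path.join(data_dir, csv_file_name)) as f:
            f.readline()  # skip this file's header
            for line in f:
                out.write(line)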

ashishu007 commented 3 years ago

@ManishSahu53, probably yes. I used a 16GB RAM system, so it was smooth for me. Thanks for pointing it out :)

abhishekkrthakur commented 3 years ago

I added a couple of solutions to the issues: one pandas-based and one in pure Python. :) Please take a look.
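(For illustration only, not necessarily what was added to the issues: a chunked pandas merge might look like the sketch below, where chunksize and merged_pandas.csv are assumptions. Chunking keeps memory bounded at roughly one chunk at a time.)

import pandas as pd

first = True
for p in path_csv:  # path_csv as collected in the earlier comments
    # Stream each CSV in fixed-size chunks so only one chunk is in memory.
    for chunk in pd.read_csv(p, chunksize=100_000):
        chunk.to_csv('merged_pandas.csv', mode='a', header=first, index=False)
        first = False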