edschofield opened this issue 9 years ago

If you have a table of observations, such as species and query sequences, as a DataFrame called `samples`, then two lines of Pandas can give you a contingency table. Pandas also provides this as a function called `pd.crosstab()`, but it's about 3 times slower than those two lines (as of Pandas v0.16.0)...

It takes about 1 hour on my laptop to process a 13 GB CSV file of samples this way. It's a little cumbersome using the `chunksize` argument of `pd.read_csv()`. Perhaps this can be generalised nicely to huge datasets using `cytoolz` without sacrificing performance too much?! :-)
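A minimal sketch of the kind of two-line `groupby`/`unstack` contingency table described here, with toy data standing in for the `samples` DataFrame (the column names and the CSV file name are illustrative assumptions, not from the original post):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the `samples` DataFrame of observations:
N = 1000
samples = pd.DataFrame({
    'species': np.random.choice(['a', 'b', 'c'], size=N),
    'query': np.random.choice(['q1', 'q2', 'q3', 'q4'], size=N),
})

# Two lines: group by both columns, then pivot the counts into a table.
grouped = samples.groupby(['species', 'query']).size()
table = grouped.unstack(fill_value=0)

# The built-in alternative, reportedly ~3x slower at the time:
table2 = pd.crosstab(samples['species'], samples['query'])

# For a huge CSV, the same table can be accumulated chunk by chunk
# ('samples.csv' is a hypothetical file):
# total = None
# for chunk in pd.read_csv('samples.csv', chunksize=10**6):
#     part = chunk.groupby(['species', 'query']).size().unstack(fill_value=0)
#     total = part if total is None else total.add(part, fill_value=0)
```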
@edschofield thanks!

Can you provide a few example lines of the original .csv used to produce the DataFrame? That way I can get an idea of how to produce it with toolz.
The original data file is in the USEARCH cluster format for sequence analysis (see here: http://drive5.com/usearch/manual/opt_uc.html). An example is here:
A very poor script to create a contingency table is provided here: http://drive5.com/python/uc2otutab_py.html; this takes over 3 days to process a 13 GB file on a beefy machine.
Only the last two columns from the sample file are interesting. The second-last column can be cleaned up by stripping its "barcode label" as follows:
```python
def clean(s):
    # Keep the text between the first '=' and the first ';'.
    start = s.index('=') + 1
    end = s.index(';')
    return s[start:end]
```
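For example, assuming a field shaped like the USEARCH barcode labels (the exact label text below is an assumption):

```python
>>> clean('barcodelabel=Sample1;size=100;')
'Sample1'
```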
I realised today that the whole solution can be boiled down to something extremely simple using either `collections.Counter` or `toolz.frequencies` together with `zip()` or similar:
```python
import collections

def parse_data(f):
    # Yield (cleaned second-last column, last column) pairs, one per line.
    for line in f:
        cols = line.split('\t')
        yield clean(cols[-2]), cols[-1].strip()

with open('readmap.uc') as f:
    counts = collections.Counter(parse_data(f))
```
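The `toolz.frequencies` variant mentioned above is a drop-in replacement (it returns a plain dict of counts rather than a `Counter`):

```python
import toolz

with open('readmap.uc') as f:
    counts = toolz.frequencies(parse_data(f))
```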
This version only takes 6min 43s on my laptop and is more elegant than using `groupby` and `unstack` with Pandas.
Here's another simple toy example (on Py3, where `zip()` returns an iterator):
```python
import string
import collections

import numpy as np
import pandas as pd

categories1 = ['red', 'green', 'blue', 'black', 'yellow', 'pink', 'purple', 'chocolate']
categories2 = list(string.ascii_lowercase)

N = 10**6
data1 = np.random.choice(categories1, size=N)
data2 = np.random.choice(categories2, size=N)
samples = pd.DataFrame({'category1': data1, 'category2': data2})
```
Then you can create a contingency table elegantly and memory-efficiently using this line:
```python
counts = collections.Counter(zip(samples['category1'], samples['category2']))
```
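If you then want the result back as a Pandas contingency table, one way (a sketch, not part of the original post) is to unstack the `Counter`:

```python
# Tuple keys become a MultiIndex, which unstacks into a 2-D table:
table = pd.Series(counts).unstack(fill_value=0)
```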
Using `collections.Counter(zip(data1, data2))` on the NumPy arrays directly is more elegant still but, bizarrely, more than 3 times slower than going through Pandas ...
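A quick way to reproduce that comparison (a sketch using the standard `timeit` module; the exact numbers will vary by machine):

```python
import timeit

t_numpy = timeit.timeit(lambda: collections.Counter(zip(data1, data2)), number=3)
t_pandas = timeit.timeit(
    lambda: collections.Counter(zip(samples['category1'], samples['category2'])),
    number=3,
)
print(f'NumPy arrays: {t_numpy:.1f}s; Pandas columns: {t_pandas:.1f}s')
```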
@edschofield

> This version only takes 6min 43s on my laptop and is more elegant than using `groupby` and `unstack` with Pandas.

Toldja! ;) But this is nevertheless a great use case to add to the streaming chapter!

> Using `collections.Counter(zip(data1, data2))` on the NumPy arrays directly is more elegant still but, bizarrely, more than 3 times slower than going through Pandas

That is bizarre! Definitely something to investigate further...!