edschofield opened this issue 9 years ago

If you have a table of observations, such as species and query sequences, as a DataFrame called `samples`, then two lines of Pandas can give you a contingency table. Pandas also provides this as a function called `pd.crosstab()`, but it's about 3 times slower than those two lines (as of Pandas v0.16.0)...

It takes about 1 hour on my laptop to process a 13 GB CSV file of samples this way. It's a little cumbersome using the `chunksize` argument of `pd.read_csv()`. Perhaps this can be generalised nicely to huge datasets using `cytoolz` without sacrificing performance too much?! :-)
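A minimal sketch of the kind of two-line `groupby`/`unstack` contingency table described here, with toy data standing in for the `samples` DataFrame (the column names and the CSV file name are illustrative assumptions, not from the original post):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the `samples` DataFrame of observations:
N = 1000
samples = pd.DataFrame({
    'species': np.random.choice(['a', 'b', 'c'], size=N),
    'query': np.random.choice(['q1', 'q2', 'q3', 'q4'], size=N),
})

# Two lines: group by both columns, then pivot the counts into a table.
grouped = samples.groupby(['species', 'query']).size()
table = grouped.unstack(fill_value=0)

# The built-in alternative, reportedly ~3x slower at the time:
table2 = pd.crosstab(samples['species'], samples['query'])

# For a huge CSV, the same table can be accumulated chunk by chunk
# ('samples.csv' is a hypothetical file):
# total = None
# for chunk in pd.read_csv('samples.csv', chunksize=10**6):
#     part = chunk.groupby(['species', 'query']).size().unstack(fill_value=0)
#     total = part if total is None else total.add(part, fill_value=0)
```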
@edschofield thanks!

Can you provide a few example lines of the original .csv used to produce the DataFrame? That way I can get an idea of how to produce it with toolz.
The original data file is in the USEARCH cluster format for sequence analysis (see here: http://drive5.com/usearch/manual/opt_uc.html). An example is here:
A very poor script to create a contingency table is provided here: http://drive5.com/python/uc2otutab_py.html; this takes over 3 days to process a 13 GB file on a beefy machine.
Only the last two columns from the sample file are interesting. The second-last column can be cleaned up by stripping its "barcode label" as follows:
```python
def clean(s):
    # Keep the text between the first '=' and the first ';'.
    start = s.index('=') + 1
    end = s.index(';')
    return s[start:end]
```
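For example, assuming a field shaped like the USEARCH barcode labels (the exact label text below is an assumption):

```python
>>> clean('barcodelabel=Sample1;size=100;')
'Sample1'
```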
I realised today that the whole solution can be boiled down to something extremely simple using either `collections.Counter` or `toolz.frequencies` together with `zip()` or similar:
```python
import collections

def parse_data(f):
    # Yield (cleaned second-last column, last column) pairs, one per line.
    for line in f:
        cols = line.split('\t')
        yield clean(cols[-2]), cols[-1].strip()

with open('readmap.uc') as f:
    counts = collections.Counter(parse_data(f))
```
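The `toolz.frequencies` variant mentioned above is a drop-in replacement (it returns a plain dict of counts rather than a `Counter`):

```python
import toolz

with open('readmap.uc') as f:
    counts = toolz.frequencies(parse_data(f))
```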
This version only takes 6min 43s on my laptop and is more elegant than using `groupby` and `unstack` with Pandas.
Here's another simple toy example (on Py3, where `zip()` returns an iterator):
```python
import string
import collections

import numpy as np
import pandas as pd

categories1 = ['red', 'green', 'blue', 'black', 'yellow', 'pink', 'purple', 'chocolate']
categories2 = list(string.ascii_lowercase)

N = 10**6
data1 = np.random.choice(categories1, size=N)
data2 = np.random.choice(categories2, size=N)
samples = pd.DataFrame({'category1': data1, 'category2': data2})
```
Then you can create a contingency table elegantly and memory-efficiently using this line:
```python
counts = collections.Counter(zip(samples['category1'], samples['category2']))
```
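If you then want the result back as a Pandas contingency table, one way (a sketch, not part of the original post) is to unstack the `Counter`:

```python
# Tuple keys become a MultiIndex, which unstacks into a 2-D table:
table = pd.Series(counts).unstack(fill_value=0)
```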
Using `collections.Counter(zip(data1, data2))` on the NumPy arrays directly is more elegant still but, bizarrely, more than 3 times slower than going through Pandas ...
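A quick way to reproduce that comparison (a sketch using the standard `timeit` module; the exact numbers will vary by machine):

```python
import timeit

t_numpy = timeit.timeit(lambda: collections.Counter(zip(data1, data2)), number=3)
t_pandas = timeit.timeit(
    lambda: collections.Counter(zip(samples['category1'], samples['category2'])),
    number=3,
)
print(f'NumPy arrays: {t_numpy:.1f}s; Pandas columns: {t_pandas:.1f}s')
```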
@edschofield

> This version only takes 6min 43s on my laptop and is more elegant than using `groupby` and `unstack` with Pandas.

Toldja! ;) But this is nevertheless a great use case to add to the streaming chapter!

> Using `collections.Counter(zip(data1, data2))` on the NumPy arrays directly is more elegant still but, bizarrely, more than 3 times slower than going through Pandas

That is bizarre! Definitely something to investigate further...!