acorg / pymds

Metric multidimensional scaling in python
MIT License
4 stars 1 forks source link

Implement class for handling HI tables #2

Open davipatti opened 6 years ago

davipatti commented 6 years ago

I foresee something along the lines of:

Create pymds/antigenic.py with:

class Table(object):

    def __init___(self, table_data):
        self.dm = make_distance_matrix(table_data)

Where make_distance_matrix returns an instance of pymds.mds.DistanceMatrix

limh25 commented 3 months ago

Hi @davipatti

This feature would be very useful. For example, the exact way the table h3map2004_hitable.csv is preprocessed, and returns similar-looking map as shown: Example map in Racmacs

If the bandwidth now allowing, would you please tell me the exact preprocessing steps to reproduce the map like above?

Thanks.

Best, Hansaim

limh25 commented 3 months ago

Just to add to my comment above, I tried to convert the h3map2004_hitable.csv into distance matrix and visualized using pymds. The results look like something is wrong. It's perhaps due to some missing steps in the preprocessing, so the above functionality is indeed necessary to use pymds.

example_distance_file = './h3map2004_hitable.csv'
censored_distance_value = 1.0  # +- directly to the distance
missing_value = 0  # for non-metric MDS -- 0s considered missing value
zero_distance = 1e-6  # nonzero value to replace known 0s
minimum_column_basis = 6.0
missing_value_indicators = ['*', '-', 'NA', 'Unknown']
serum_values = defaultdict(list)
with open(example_distance_file, mode='r') as inf:
    lines = csv.reader(inf)
    antisera = next(lines)
    if antisera[0] == '':
        antisera = antisera[1:]
    antigens = []
    antigen2titers = {}
    for line in lines:
        if len(line) == 0:
            continue
        antigen = line[0].strip()
        antigens.append(antigen)
        titers = line[1:]
        for idx,t in enumerate(titers):
            ## should never skip a value in this loop
            # at this point, titer values
            serum = antisera[idx]
            if t in missing_value_indicators:
                t = missing_value
            elif '>' in t:
                pass
            elif '<' in t:
                pass
            else:
                #known, uncensored, valid titer
                t = float(t)
                serum_values[serum].append(t)  # to be used for column basis -- only valid titers
            titers[idx] = t

        antigen2titers[antigen] = titers  # all titers for each antigen - censored titers stay censored
    column_bases = {s:np.nanmax(serum_values[s]) for s in serum_values}
    submatrix = np.zeros((len(antigens), len(antisera)))
    for row_idx, antigen in enumerate(antigens):
        row = []  # cleaned, distances
        for col_idx, t in enumerate(antigen2titers[antigen]):
            col_basis = column_bases.get(antisera[col_idx], minimum_column_basis)
            if t == missing_value:
                dist = t
            elif '>' in str(t):
                # titer greater than x
                # distance shorter than log(x)
                t = float(t[1:])
                dist = (np.log2(col_basis) - np.log2(t)) - censored_distance_value  # for censored >x data
                if dist < zero_distance:
                    dist = zero_distance
            elif '<' in str(t):
                t = float(t[1:])
                dist = (np.log2(col_basis) - np.log2(t)) + censored_distance_value # for censored <x data
                if dist < zero_distance:
                    dist = zero_distance
            else:
                dist = np.log2(col_basis) - np.log2(t)
                if dist < zero_distance:
                    dist = zero_distance
            row.append(dist)
        submatrix[row_idx,:] = row
    # print(submatrix.shape)
    # print(len(antigens), len(antisera))

    antigens_antisera = antigens+antisera
    matrix = np.zeros((len(antigens_antisera), len(antigens_antisera)))
    matrix[0:len(antigens), len(antigens):] = submatrix
    matrix[len(antigens):, 0:len(antigens)] = submatrix.T

original = DistanceMatrix(matrix).optimize(n=2)
shrunk = DistanceMatrix(matrix*0.65).optimize(n=2)
original.plot(c='black', edgecolor='white', s=50)
original.plot_lines_to(shrunk, linewidths=0.5, colors='black')

image