Open davipatti opened 6 years ago
Hi @davipatti
This feature would be very useful.
For example, the exact way the table h3map2004_hitable.csv
is preprocessed, and returns similar-looking map as shown:
Example map in Racmacs
If the bandwidth now allowing, would you please tell me the exact preprocessing steps to reproduce the map like above?
Thanks.
Best, Hansaim
Just to add to my comment above, I tried to convert the h3map2004_hitable.csv
into distance matrix and visualized using pymds. The results look like something is wrong.
It's perhaps due to some missing steps in the preprocessing, so the above functionality is indeed necessary to use pymds.
example_distance_file = './h3map2004_hitable.csv'
censored_distance_value = 1.0 # +- directly to the distance
missing_value = 0 # for non-metric MDS -- 0s considered missing value
zero_distance = 1e-6 # nonzero value to replace known 0s
minimum_column_basis = 6.0
missing_value_indicators = ['*', '-', 'NA', 'Unknown']
serum_values = defaultdict(list)
with open(example_distance_file, mode='r') as inf:
lines = csv.reader(inf)
antisera = next(lines)
if antisera[0] == '':
antisera = antisera[1:]
antigens = []
antigen2titers = {}
for line in lines:
if len(line) == 0:
continue
antigen = line[0].strip()
antigens.append(antigen)
titers = line[1:]
for idx,t in enumerate(titers):
## should never skip a value in this loop
# at this point, titer values
serum = antisera[idx]
if t in missing_value_indicators:
t = missing_value
elif '>' in t:
pass
elif '<' in t:
pass
else:
#known, uncensored, valid titer
t = float(t)
serum_values[serum].append(t) # to be used for column basis -- only valid titers
titers[idx] = t
antigen2titers[antigen] = titers # all titers for each antigen - censored titers stay censored
column_bases = {s:np.nanmax(serum_values[s]) for s in serum_values}
submatrix = np.zeros((len(antigens), len(antisera)))
for row_idx, antigen in enumerate(antigens):
row = [] # cleaned, distances
for col_idx, t in enumerate(antigen2titers[antigen]):
col_basis = column_bases.get(antisera[col_idx], minimum_column_basis)
if t == missing_value:
dist = t
elif '>' in str(t):
# titer greater than x
# distance shorter than log(x)
t = float(t[1:])
dist = (np.log2(col_basis) - np.log2(t)) - censored_distance_value # for censored >x data
if dist < zero_distance:
dist = zero_distance
elif '<' in str(t):
t = float(t[1:])
dist = (np.log2(col_basis) - np.log2(t)) + censored_distance_value # for censored <x data
if dist < zero_distance:
dist = zero_distance
else:
dist = np.log2(col_basis) - np.log2(t)
if dist < zero_distance:
dist = zero_distance
row.append(dist)
submatrix[row_idx,:] = row
# print(submatrix.shape)
# print(len(antigens), len(antisera))
antigens_antisera = antigens+antisera
matrix = np.zeros((len(antigens_antisera), len(antigens_antisera)))
matrix[0:len(antigens), len(antigens):] = submatrix
matrix[len(antigens):, 0:len(antigens)] = submatrix.T
original = DistanceMatrix(matrix).optimize(n=2)
shrunk = DistanceMatrix(matrix*0.65).optimize(n=2)
original.plot(c='black', edgecolor='white', s=50)
original.plot_lines_to(shrunk, linewidths=0.5, colors='black')
I foresee something along the lines of:
Create
pymds/antigenic.py
with:Where
make_distance_matrix
returns an instance ofpymds.mds.DistanceMatrix