Open HamzaGatech opened 3 months ago
Hi Hamza,
Thanks for the detailed post and I think you should treat the kraken results as the "ground truth". But your visualizations do look correct except 2014-winter
is the only sample that slightly resembles the kraken2 plots, right?
Kostas made a comment that there's a mistake waiting to be found in lab02 that your figure speaks to. Do we expect this much Bacilli signal? What if you remove this class and re-factor everything to be from 0 to 1 and re-plot? Some things to think about and include in your report. And we will be addressing this when we meet again Monday after spring break.
So I followed the instructions for vsearch, got all.mothur file obtained a csv file and plotted for classes and orders using the python code. However, I feel there is something wrong as they don't match with Kraken results. so where did I go wrong? top10_classes_plot.pdf top20_orders_plot.pdf
here is the code I used which is already on one of the issues
import pandas as pd from warnings import simplefilter
simplefilter(action="ignore", category=pd.errors.PerformanceWarning)
df = pd.read_csv("all.mothur", sep="\t", index_col=0, header=0)
df = df.loc[:, (df.sum(axis=0) > 1)]
detected_OTUs = df.columns # these are the column names we've kept after cleaning up the dataframe tax_legend = pd.read_csv("SILVA_taxonomy_legend.tsv", sep="\t", index_col=0, header=0).T # reads in the taxonomic info tax_legend = tax_legend[tax_legend.columns.intersection(detected_OTUs)] df = df.reindex(sorted(df.columns), axis=1) tax_legend = tax_legend.reindex(sorted(tax_legend.columns), axis=1)
all_classes = tax_legend.loc['rank3'].unique().tolist() df_classes = pd.DataFrame(index=df.index)
for classes in all_classes: tax_sum = df[tax_legend.columns[tax_legend.loc["rank3"] == classes]].sum(axis=1) df_classes[classes] = tax_sum
all_orders = tax_legend.loc['rank4'].unique().tolist() df_orders = pd.DataFrame(index=df.index)
for order in all_orders: tax_sum = df[tax_legend.columns[tax_legend.loc["rank4"] == order]].sum(axis=1) df_orders[order] = tax_sum
df_top10_class = pd.DataFrame(index=df.index) df_top20_order = pd.DataFrame(index=df.index)
top10 = df_classes.mean().sort_values()[-11:].index.tolist() top20 = df_orders.mean().sort_values()[-21:].index.tolist()
for order in top20: df_top20_order[order] = df_orders[order]
for clss in top10: df_top10_class[clss] = df_classes[clss]
df_top20_order["other"] = df_orders.sum(axis=1) - df_top20_order.sum(axis=1) df_top10_class["other"] = df_classes.sum(axis=1) - df_top10_class.sum(axis=1)
df_top20_order_norm = df_top20_order.div(df_top20_order.sum(axis=1), axis=0) df_top10_class_norm = df_top10_class.div(df_top10_class.sum(axis=1), axis=0)
print(df_top10_class_norm.head()) print(df_top20_order_norm.head())
For Plots
import pandas as pd import matplotlib.pyplot as plt
def plot_data(filename, plot_title, pdf_output):
Read the data, assuming that the delimiter is a comma
plot_data('top10_classes.csv', 'Top 10 Classes Fractional Read Abundance', 'top10_classes_plot.pdf') plot_data('top20_orders.csv', 'Top 20 Orders Fractional Read Abundance', 'top20_orders_plot.pdf')