blindner6 / CEE6720_BIO6720

Course exercises for CEE/BIO6720.
1 stars 0 forks source link

Plots for Classes and Orders using all.mothur file #8

Open HamzaGatech opened 3 months ago

HamzaGatech commented 3 months ago

So I followed the instructions for vsearch, got all.mothur file obtained a csv file and plotted for classes and orders using the python code. However, I feel there is something wrong as they don't match with Kraken results. so where did I go wrong? top10_classes_plot.pdf top20_orders_plot.pdf

here is the code I used which is already on one of the issues

import pandas as pd from warnings import simplefilter

simplefilter(action="ignore", category=pd.errors.PerformanceWarning)

df = pd.read_csv("all.mothur", sep="\t", index_col=0, header=0)

df = df.loc[:, (df.sum(axis=0) > 1)]

detected_OTUs = df.columns # these are the column names we've kept after cleaning up the dataframe tax_legend = pd.read_csv("SILVA_taxonomy_legend.tsv", sep="\t", index_col=0, header=0).T # reads in the taxonomic info tax_legend = tax_legend[tax_legend.columns.intersection(detected_OTUs)] df = df.reindex(sorted(df.columns), axis=1) tax_legend = tax_legend.reindex(sorted(tax_legend.columns), axis=1)

all_classes = tax_legend.loc['rank3'].unique().tolist() df_classes = pd.DataFrame(index=df.index)

for classes in all_classes: tax_sum = df[tax_legend.columns[tax_legend.loc["rank3"] == classes]].sum(axis=1) df_classes[classes] = tax_sum

all_orders = tax_legend.loc['rank4'].unique().tolist() df_orders = pd.DataFrame(index=df.index)

for order in all_orders: tax_sum = df[tax_legend.columns[tax_legend.loc["rank4"] == order]].sum(axis=1) df_orders[order] = tax_sum

df_top10_class = pd.DataFrame(index=df.index) df_top20_order = pd.DataFrame(index=df.index)

top10 = df_classes.mean().sort_values()[-11:].index.tolist() top20 = df_orders.mean().sort_values()[-21:].index.tolist()

for order in top20: df_top20_order[order] = df_orders[order]

for clss in top10: df_top10_class[clss] = df_classes[clss]

df_top20_order["other"] = df_orders.sum(axis=1) - df_top20_order.sum(axis=1) df_top10_class["other"] = df_classes.sum(axis=1) - df_top10_class.sum(axis=1)

df_top20_order_norm = df_top20_order.div(df_top20_order.sum(axis=1), axis=0) df_top10_class_norm = df_top10_class.div(df_top10_class.sum(axis=1), axis=0)

print(df_top10_class_norm.head()) print(df_top20_order_norm.head())

For Plots

import pandas as pd import matplotlib.pyplot as plt

def plot_data(filename, plot_title, pdf_output):

Read the data, assuming that the delimiter is a comma

data = pd.read_csv(filename, index_col=0)

# Plot a horizontal bar plot with stacked bars for each sample
ax = data.plot(kind='barh', stacked=True, figsize=(12, 8), width=0.8, colormap='tab20')

# Set the labels and title of the plot
ax.set_xlabel('Fractional Read Abundance')
ax.set_ylabel('Samples')
ax.set_title(plot_title)

# Invert y-axis to have the first sample at the top
ax.invert_yaxis()

# Legend configuration
ax.legend(title='Taxonomic Classification', bbox_to_anchor=(1.05, 1), loc='upper left')

# Tight layout for better spacing
plt.tight_layout()

# Save the plot to a PDF
plt.savefig(pdf_output)

plt.close()

plot_data('top10_classes.csv', 'Top 10 Classes Fractional Read Abundance', 'top10_classes_plot.pdf') plot_data('top20_orders.csv', 'Top 20 Orders Fractional Read Abundance', 'top20_orders_plot.pdf')

blindner6 commented 3 months ago

Hi Hamza,

Thanks for the detailed post and I think you should treat the kraken results as the "ground truth". But your visualizations do look correct except 2014-winter is the only sample that slightly resembles the kraken2 plots, right?

Kostas made a comment that there's a mistake waiting to be found in lab02 that your figure speaks to. Do we expect this much Bacilli signal? What if you remove this class and re-factor everything to be from 0 to 1 and re-plot? Some things to think about and include in your report. And we will be addressing this when we meet again Monday after spring break.