Open shernandez1217 opened 2 months ago
Hey Sabrina,
Thanks for the comment! Here is one way you might prepare dataframes in Python to make a heatmap or stacked bar plots for this part of the exercise. You can use this code just to clean up your data and then export it for use in R or Excel, or you can prepare the data here and then plot it in Python using the approach of your choice:
Step 01: load your software and read in your data from vsearch:
#####################################################
# always at the beginning we need to import some software:
import pandas as pd
from warnings import simplefilter # turns off an annoying warning
simplefilter(action="ignore", category=pd.errors.PerformanceWarning)
#####################################################
df = pd.read_csv("all.mothur", sep="\t", index_col=0, header=0)  # read in the vsearch results, using the first column as the index and the first row as the header
Step 02: clean your dataframe:
df = df.loc[:, (df.sum(axis=0) > 1)]
# this will remove all OTUs which were observed only once or less in the timeseries
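If you want to see what this filter does before running it on the real table, here's a minimal sketch on a toy count table. The OTU names, sample names, and counts are made up for illustration.

```python
import pandas as pd

# Toy count table (hypothetical values): rows are samples, columns are OTUs.
toy = pd.DataFrame(
    {"OTU_1": [5, 3, 0], "OTU_2": [0, 1, 0], "OTU_3": [0, 0, 0], "OTU_4": [1, 1, 0]},
    index=["sample_A", "sample_B", "sample_C"],
)

# Column sums give each OTU's total count across all samples.
totals = toy.sum(axis=0)

# Keep only the OTUs whose total is greater than 1, as in Step 02.
filtered = toy.loc[:, totals > 1]

print(filtered.columns.tolist())
```

OTU_2 (total of 1) and OTU_3 (total of 0) are dropped, while OTU_4 (total of 2) is kept.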
Step 03: get the taxonomic legend provided to you and re-format your dataframe:
detected_OTUs = df.columns # these are the column names we've kept after cleaning up the dataframe of OTUs with a sum of 0 or 1
tax_legend = pd.read_csv("SILVA_taxonomy_legend.tsv", sep="\t", index_col=0, header=0).T # reads in the taxonomic info
tax_legend = tax_legend[tax_legend.columns.intersection(detected_OTUs)]
df = df.reindex(sorted(df.columns), axis=1)
tax_legend = tax_legend.reindex(sorted(tax_legend.columns), axis=1)
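One thing worth checking here: the lines above trim tax_legend to the detected OTUs but do not trim df to the legend, so any OTU without a taxonomy entry would be left out of the class and order sums later on. A small sketch of how you might check for that, using toy tables with made-up names:

```python
import pandas as pd

# Toy inputs (hypothetical names and values).
toy_counts = pd.DataFrame(
    {"OTU_2": [1, 2], "OTU_1": [3, 4], "OTU_9": [5, 6]},
    index=["s1", "s2"],
)
toy_legend = pd.DataFrame(
    {
        "OTU_1": ["Bacteria", "Proteobacteria"],
        "OTU_2": ["Bacteria", "Firmicutes"],
        "OTU_3": ["Archaea", "Euryarchaeota"],
    },
    index=["rank1", "rank2"],
)

# OTUs that have counts but no taxonomy entry.
missing = toy_counts.columns.difference(toy_legend.columns)
print(list(missing))

# Keep only OTUs present in both tables, in one shared sorted order.
shared = sorted(toy_counts.columns.intersection(toy_legend.columns))
aligned_counts = toy_counts[shared]
aligned_legend = toy_legend[shared]
```

If `missing` is empty on your real data, the two tables already line up and nothing else is needed.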
Step 04: use the taxonomic legend to summarize the abundance of taxonomic groups (at the class and order level):
all_classes = tax_legend.loc['rank3'].unique().tolist()
df_classes = pd.DataFrame(index=df.index)
for classes in all_classes:
    # sum the counts of every OTU assigned to this class, per sample
    tax_sum = df[tax_legend.columns[tax_legend.loc["rank3"] == classes]].sum(axis=1)
    df_classes[classes] = tax_sum
all_orders = tax_legend.loc['rank4'].unique().tolist()
df_orders = pd.DataFrame(index=df.index)
for order in all_orders:
    # sum the counts of every OTU assigned to this order, per sample
    tax_sum = df[tax_legend.columns[tax_legend.loc["rank4"] == order]].sum(axis=1)
    df_orders[order] = tax_sum
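As an alternative to the loops, the same per-rank sums can be sketched with a pandas groupby. This is just another way to do it, not a required change; the toy tables and taxon names below are made up, and it assumes the legend has a "rank3" row as above.

```python
import pandas as pd

# Toy inputs (hypothetical values): counts are samples x OTUs,
# the legend is ranks x OTUs.
toy_counts = pd.DataFrame(
    {"OTU_1": [4, 0], "OTU_2": [1, 2], "OTU_3": [3, 3]},
    index=["s1", "s2"],
)
toy_legend = pd.DataFrame(
    {
        "OTU_1": ["Gammaproteobacteria"],
        "OTU_2": ["Bacilli"],
        "OTU_3": ["Gammaproteobacteria"],
    },
    index=["rank3"],
)

# Series mapping each OTU to its class.
class_of_otu = toy_legend.loc["rank3"]

# Transpose so OTUs are rows, group the rows by class, sum, and transpose back.
by_class = toy_counts.T.groupby(class_of_otu).sum().T

print(by_class)
```

The result has one column per class and one row per sample, like df_classes.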
Step 05: take just the top 10 and top 20 classes and orders, respectively (add everything else into a column called "other"):
df_top10_class= pd.DataFrame(index=df.index)
df_top20_order= pd.DataFrame(index=df.index)
# sort by mean abundance (ascending) and keep the last 10 / 20 names
top10 = df_classes.mean().sort_values()[-10:].index.tolist()
top20 = df_orders.mean().sort_values()[-20:].index.tolist()
for order in top20:
    df_top20_order[order] = df_orders[order]
for clss in top10:
    df_top10_class[clss] = df_classes[clss]
df_top20_order["other"] = df_orders.sum(axis=1) - df_top20_order.sum(axis=1)
df_top10_class["other"] = df_classes.sum(axis=1) - df_top10_class.sum(axis=1)
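If you end up doing this for several ranks, the "top N plus other" step can be wrapped in a small helper. Here's a rough sketch; `top_n_with_other` is a hypothetical name of my own, the toy values are made up, and it assumes no taxon is itself literally named "other".

```python
import pandas as pd

def top_n_with_other(table, n):
    """Keep the n columns with the highest mean and lump the rest into 'other'."""
    top = table.mean().sort_values(ascending=False).head(n).index
    out = table[top].copy()
    # Whatever is not in the top n ends up in a single extra column.
    out["other"] = table.sum(axis=1) - out.sum(axis=1)
    return out

# Toy table (hypothetical values): rows are samples, columns are taxa.
toy = pd.DataFrame(
    {"A": [10, 8], "B": [1, 0], "C": [5, 7], "D": [2, 1]},
    index=["s1", "s2"],
)

top2 = top_n_with_other(toy, 2)
print(top2)
```

Each row of the result still adds up to the original total for that sample, which is a quick way to confirm nothing was lost.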
Step 06: normalize each sample so that the counts add up to 1
df_top20_order_norm = df_top20_order.div(df_top20_order.sum(axis=1), axis=0)
df_top10_class_norm = df_top10_class.div(df_top10_class.sum(axis=1), axis=0)
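A quick sanity check you might run after normalizing, sketched on a toy table with made-up values: every sample's row should sum to 1. Note that a sample with zero total counts would give NaN (0 divided by 0), so this sketch drops empty samples first.

```python
import numpy as np
import pandas as pd

# Toy table (hypothetical values); s3 is an empty sample.
toy = pd.DataFrame(
    {"A": [2, 1, 0], "B": [6, 1, 0], "other": [2, 2, 0]},
    index=["s1", "s2", "s3"],
)

# Drop samples with no counts at all, since 0 / 0 would produce NaN.
nonempty = toy.loc[toy.sum(axis=1) > 0]

# Divide each row by its own total, as in Step 06.
rel = nonempty.div(nonempty.sum(axis=1), axis=0)

# Every remaining sample should now sum to 1.
rows_ok = np.allclose(rel.sum(axis=1), 1.0)
print(rows_ok)
```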
Step 07: visualize! Either take your dataframes out of Python into R or Excel by writing them to a CSV, or stay in Python and make your plots:
# If you want to leave python:
df_top10_class_norm.to_csv("top10_classes.csv")
df_top20_order_norm.to_csv("top20_orders.csv")
# if you want to stay in python, take a look at these two dataframes and decide how you want to plot them:
df_top10_class_norm.head()
df_top20_order_norm.head()
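If you stay in Python, one option is a stacked bar plot drawn straight from the dataframe with pandas and matplotlib. Here's a minimal sketch on a toy table of relative abundances; the class names, sample names, values, and output filename are all made up, and you would swap in df_top10_class_norm or df_top20_order_norm for the toy table.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Toy relative abundances (hypothetical values): rows are samples and sum to 1.
toy_rel = pd.DataFrame(
    {
        "Gammaproteobacteria": [0.5, 0.3, 0.2],
        "Bacilli": [0.3, 0.4, 0.3],
        "other": [0.2, 0.3, 0.5],
    },
    index=["t1", "t2", "t3"],
)

# One bar per sample, with the taxa stacked on top of each other.
ax = toy_rel.plot(kind="bar", stacked=True, figsize=(8, 4), width=0.9)
ax.set_xlabel("Sample")
ax.set_ylabel("Relative abundance")
ax.legend(title="Class", bbox_to_anchor=(1.02, 1), loc="upper left")

plt.tight_layout()
plt.savefig("toy_stacked_bar.png", dpi=150)
```

In a notebook you can leave out the `matplotlib.use("Agg")` line and the plot should show up inline.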
Hi,
When getting to Step 2 and running that line of code, I get the following error:
TypeError: '>' not supported between instances of 'str' and 'int'
I am not sure if this is because my all.mothur was incorrectly written, or due to some other issue? Thanks!
Hey Jake,
Did you sort this out? Can I see the output of tail -n 1 all.mothur?
Here it is!
I am going to rerun step04.sbatch for vsearch, since I think the mothur files got messed up along the way, and then try the above code again.
Yeah I’m thinking the same thing based off what you shared above. Make sure the .fna files for each sample aren’t empty and check your Slurm log files to make sure the process didn’t fail.
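For anyone else who hits that TypeError: one way it can come up is when a column of the table contains text, so pandas reads the column as strings and the column sum is no longer a number. Here's a rough sketch of how you might locate the offending cells. The malformed table below is made up for illustration and is not the actual file from this thread.

```python
import io
import pandas as pd

# Hypothetical malformed table: one cell holds text instead of a count.
raw = "sample\tOTU_1\tOTU_2\ns1\t3\t0\ns2\t1\toops\n"
bad = pd.read_csv(io.StringIO(raw), sep="\t", index_col=0, header=0)

# Columns that pandas did not parse as numbers.
non_numeric = bad.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Coerce everything to numbers; cells that cannot be parsed become NaN.
coerced = bad.apply(pd.to_numeric, errors="coerce")

# Cells that held a value before coercion but are NaN afterwards are the problem cells.
problem_cells = coerced.isna() & bad.notna()
n_problem_cells = int(problem_cells.sum().sum())
print(n_problem_cells)
```

If `non_numeric` is not empty on the real all.mothur, that points to the file itself rather than to the Step 02 code.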
Hi Blake,
I am having issues visualizing the vsearch data in Python after completing the last step of sorting the df and tax_legend. Could you provide some code that will help with visualizing this dataset?
Thanks!