blindner6 / CEE6720_BIO6720

Course exercises for CEE/BIO6720.

Vsearch data visualization #3

Open shernandez1217 opened 2 months ago

shernandez1217 commented 2 months ago

Hi Blake,

I am having issues visualizing the vsearch data in Python after completing the last step, sorting the df and tax_legend. Could you provide some code that will help with visualizing this dataset?

Thanks!

blindner6 commented 2 months ago

Hey Sabrina,

Thanks for the comment! Here is one way you might prepare dataframes in Python to make a heatmap or stacked bar plots for this part of the exercise. You can use this code just to clean up your data and export it for use in R or Excel, or you can use it to prepare the data and then plot it in Python using the approach of your choice:

Step 01: load your software and read in your data from vsearch:

#####################################################
# always at the beginning we need to import some software:
import pandas as pd
from warnings import simplefilter # turns off an annoying warning
simplefilter(action="ignore", category=pd.errors.PerformanceWarning)
#####################################################
df = pd.read_csv("all.mothur", sep="\t", index_col=0, header=0) # read in the vsearch results, using the first column as the index and the first row as the header

Step 02: clean your dataframe:

df = df.loc[:, (df.sum(axis=0) > 1)]
# this will remove all OTUs which were observed only once or less in the timeseries
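
To see what that filter does, here is a minimal sketch on a toy table. The sample and OTU names are made up for illustration and are not from the exercise data:

```python
import pandas as pd

# Toy stand-in for the vsearch table: rows are samples, columns are OTUs.
# All names and counts here are hypothetical.
toy = pd.DataFrame(
    {"OTU_a": [0, 0, 0], "OTU_b": [1, 0, 0], "OTU_c": [3, 2, 0]},
    index=["sample1", "sample2", "sample3"],
)

# Keep only the columns (OTUs) whose total count across all samples exceeds 1.
filtered = toy.loc[:, toy.sum(axis=0) > 1]
print(filtered.columns.tolist())
```

OTU_a (never observed) and OTU_b (observed once) are dropped, and only OTU_c survives.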

Step 03: get the taxonomic legend provided to you and re-format your dataframe:

detected_OTUs = df.columns # these are the column names we've kept after cleaning up the dataframe of OTUs with a sum of 0 or 1 
tax_legend = pd.read_csv("SILVA_taxonomy_legend.tsv", sep="\t", index_col=0, header=0).T # reads in the taxonomic info
tax_legend = tax_legend[tax_legend.columns.intersection(detected_OTUs)]
df = df.reindex(sorted(df.columns), axis=1)
tax_legend = tax_legend.reindex(sorted(tax_legend.columns), axis=1)

Step 04: use the taxonomic legend to summarize the abundance of taxonomic groups (at the class and order level):

all_classes = tax_legend.loc['rank3'].unique().tolist()
df_classes = pd.DataFrame(index=df.index)

for classes in all_classes:
    tax_sum = df[tax_legend.columns[tax_legend.loc["rank3"] == classes]].sum(axis=1)
    df_classes[classes]=tax_sum

all_orders = tax_legend.loc['rank4'].unique().tolist()
df_orders = pd.DataFrame(index=df.index)

for order in all_orders:
    tax_sum = df[tax_legend.columns[tax_legend.loc["rank4"] == order]].sum(axis=1)
    df_orders[order]=tax_sum
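
If you prefer, the same per-rank summing can probably be written with a groupby instead of a loop. This is only a sketch on made-up data (the OTU, sample, and class names are hypothetical), assuming the legend has one column per OTU and a "rank3" row as above:

```python
import pandas as pd

# Hypothetical counts table: rows are samples, columns are OTUs.
counts = pd.DataFrame(
    {"OTU_1": [5, 0], "OTU_2": [1, 2], "OTU_3": [0, 4]},
    index=["sample1", "sample2"],
)
# Hypothetical legend: one column per OTU, one row per taxonomic rank.
legend = pd.DataFrame(
    {"OTU_1": ["ClassA"], "OTU_2": ["ClassA"], "OTU_3": ["ClassB"]},
    index=["rank3"],
)

# Transpose so OTUs are rows, group the rows by their class label, sum,
# and transpose back so samples are rows again.
by_class = counts.T.groupby(legend.loc["rank3"]).sum().T

# Loop version, mirroring the step above, for comparison.
by_class_loop = pd.DataFrame(index=counts.index)
for name in legend.loc["rank3"].unique().tolist():
    by_class_loop[name] = counts[legend.columns[legend.loc["rank3"] == name]].sum(axis=1)

print(by_class)
```

Both versions give the same numbers here; the loop is easier to read, the groupby is shorter.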

Step 05: take just the top 10 and top 20 classes and orders, respectively (add everything else into a column called "other"):

df_top10_class= pd.DataFrame(index=df.index)
df_top20_order= pd.DataFrame(index=df.index)

top10=df_classes.mean().sort_values()[-10:].index.tolist() # the 10 classes with the highest mean abundance
top20=df_orders.mean().sort_values()[-20:].index.tolist() # the 20 orders with the highest mean abundance

for order in top20:
    df_top20_order[order] = df_orders[order]

for clss in top10:
    df_top10_class[clss] = df_classes[clss]

df_top20_order["other"] = df_orders.sum(axis=1) - df_top20_order.sum(axis=1)
df_top10_class["other"] = df_classes.sum(axis=1) - df_top10_class.sum(axis=1)
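
Here is a minimal sketch of the same "top N plus other" idea on a toy table, using Series.nlargest to pick the top groups. The column names, counts, and the choice of N = 2 are made up for illustration:

```python
import pandas as pd

# Hypothetical per-group counts: rows are samples, columns are taxonomic groups.
toy = pd.DataFrame(
    {"A": [10, 12], "B": [1, 0], "C": [5, 7], "D": [0, 2]},
    index=["s1", "s2"],
)

top_n = 2
# Names of the top_n groups ranked by mean count across samples.
top_names = toy.mean().nlargest(top_n).index.tolist()

top = toy[top_names].copy()
# Everything that is not in the top_n groups is lumped into "other".
top["other"] = toy.sum(axis=1) - top.sum(axis=1)
print(top)
```

A useful check is that each row of the reduced table still sums to the same total as the full table, so no counts were lost.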

Step 06: normalize each sample so that the counts add up to 1:

df_top20_order_norm = df_top20_order.div(df_top20_order.sum(axis=1), axis=0) 
df_top10_class_norm = df_top10_class.div(df_top10_class.sum(axis=1), axis=0)
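
A quick sanity check after normalizing is that every row sums to 1. The sketch below uses made-up numbers and also shows what happens to a sample with no counts at all, which ends up as NaN rather than 0:

```python
import numpy as np
import pandas as pd

# Hypothetical counts; sample "s2" has no reads at all.
toy = pd.DataFrame({"A": [2, 0], "B": [6, 0]}, index=["s1", "s2"])

# Divide each row by its own total.
norm = toy.div(toy.sum(axis=1), axis=0)

# Rows with a zero total become NaN (0 divided by 0), so drop them before checking.
rows_sum_to_one = np.allclose(norm.dropna().sum(axis=1), 1.0)
empty_samples = norm.index[norm.isna().all(axis=1)].tolist()
print(rows_sum_to_one, empty_samples)
```

If empty_samples is not empty for your real data, it is worth going back to see why those samples have no counts.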

Step 07: visualize! Either take your dataframes out of Python into R or Excel by writing them to a CSV, or stay in Python and make your plots:

# If you want to leave python:
df_top10_class_norm.to_csv("top10_classes.csv")
df_top20_order_norm.to_csv("top20_orders.csv")
# if you want to stay in python, take a look at these two dataframes and decide how you want to plot them:
print(df_top10_class_norm.head())
print(df_top20_order_norm.head())
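
If you stay in Python, one possible approach is matplotlib. The sketch below is not the only way to do it: it uses a small made-up table in place of df_top10_class_norm (the taxon names, sample names, and output file names are hypothetical) and draws a stacked bar plot and a heatmap:

```python
import matplotlib
matplotlib.use("Agg")  # render to files without needing a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for df_top10_class_norm: rows are samples, columns are taxa,
# and each row sums to 1.
rel_abund = pd.DataFrame(
    {"ClassA": [0.5, 0.2, 0.1], "ClassB": [0.3, 0.5, 0.3], "other": [0.2, 0.3, 0.6]},
    index=["t1", "t2", "t3"],
)

# Stacked bar plot: one bar per sample, one colored segment per taxon.
fig, ax = plt.subplots(figsize=(6, 4))
rel_abund.plot(kind="bar", stacked=True, ax=ax, width=0.9)
ax.set_xlabel("sample")
ax.set_ylabel("relative abundance")
ax.legend(title="taxon", bbox_to_anchor=(1.02, 1), loc="upper left")
fig.tight_layout()
fig.savefig("stacked_bar.png", dpi=150)

# Heatmap: taxa on the y-axis, samples on the x-axis.
fig2, ax2 = plt.subplots(figsize=(6, 4))
image = ax2.imshow(rel_abund.T.values, aspect="auto", cmap="viridis")
ax2.set_xticks(range(len(rel_abund.index)))
ax2.set_xticklabels(rel_abund.index)
ax2.set_yticks(range(len(rel_abund.columns)))
ax2.set_yticklabels(rel_abund.columns)
fig2.colorbar(image, ax=ax2, label="relative abundance")
fig2.tight_layout()
fig2.savefig("heatmap.png", dpi=150)
```

To use it on the real data, swap rel_abund for df_top10_class_norm or df_top20_order_norm from the steps above.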

jacobphaneuf commented 2 months ago

Hi,

When getting to Step 2 and running that line of code, I get the following error:

TypeError: '>' not supported between instances of 'str' and 'int'

I am not sure if this is because my all.mothur was incorrectly written, or due to some other issue? Thanks!

blindner6 commented 2 months ago

Hey Jake,

Did you sort this out? Can I see the output of tail -n 1 all.mothur?
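
One way this TypeError can arise is when at least one column of the table was read in as text rather than numbers, so the column sum becomes a string. Here is a rough sketch for finding such columns and the offending cells; the toy table and its "oops" value are made up to stand in for a damaged all.mothur:

```python
import pandas as pd

# Hypothetical table where one column contains text instead of counts.
toy = pd.DataFrame(
    {"OTU_a": [1, 2, 3], "OTU_b": ["4", "oops", "6"]},
    index=["s1", "s2", "s3"],
)

# Columns that pandas did not read as a numeric type.
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

# For each such column, list the cells that cannot be converted to a number.
bad_cells = {}
for col in non_numeric_cols:
    converted = pd.to_numeric(toy[col], errors="coerce")
    bad_cells[col] = toy.loc[converted.isna(), col].to_dict()

print(non_numeric_cols)
print(bad_cells)
```

If this turns up whole rows or columns of text in your real file, the file itself is probably malformed rather than the code.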

jacobphaneuf commented 2 months ago

[Screenshot 2024-03-13 at 19 59 16]

Here it is!

jacobphaneuf commented 2 months ago

I think the mothur files got messed up along the way, so I am going to rerun step04.sbatch for vsearch and then try the above code again.

blindner6 commented 2 months ago

Yeah, I’m thinking the same thing based on what you shared above. Make sure the .fna files for each sample aren’t empty and check your Slurm log files to make sure the process didn’t fail.
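
A minimal sketch of the empty-file check in Python, assuming the per-sample files end in .fna and sit together in one directory. The helper name find_empty_files and the sample file names are made up, and the demo runs in a temporary directory rather than on real data:

```python
import tempfile
from pathlib import Path


def find_empty_files(directory, pattern="*.fna"):
    """Return the sorted names of files matching `pattern` that are 0 bytes."""
    return sorted(
        path.name for path in Path(directory).glob(pattern) if path.stat().st_size == 0
    )


# Demo on a throwaway directory with one normal and one empty file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sampleA.fna").write_text(">seq1\nACGT\n")
    (Path(tmp) / "sampleB.fna").write_text("")
    empty = find_empty_files(tmp)

print(empty)
```

On the cluster you would call find_empty_files on whatever directory holds your own .fna files.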
