Open xfy9 opened 2 years ago
Hi,
You need to perform multiple sequence alignment after the clustering.
Best, Lei
On Tue, Mar 1, 2022 at 7:49 AM xfy9 @.***> wrote:
Hi, I have ran cdhit.sh and generated hmdb_cluster30.clstr,hmdb_cluster60.clstr,hmdb_cluster90.clstr,hmdb_cluster9060.clstr,hmdb_cluster906030.clstr, but I am confused about how to use them. Can you tell me how to use them. And I also confued clustalo.sh , I want to know what file that HMDB_clustered_fasta stored, how to get them. I'd appreciate it if you helped me.
— Reply to this email directly, view it on GitHub https://github.com/XieResearchGroup/DISAE/issues/11, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZSBCSSUQJ53IQSZZSPAMDU5YG6PANCNFSM5PT6SS5A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
You are receiving this because you are subscribed to this thread.Message ID: @.***>
Hi, You may need a script as below to retrieve clusters from the cluster906030 file. I will add the script to our repo later. Thank you for pointing this out.
import os
import json
class ClstrAnalyzer:
def __init__(self, handler):
""" Analyze the cluster file output from cd-hit
Args:
handler (File object): the file handler of the .clstr file
"""
self.handler = handler
def get_clusters(self, id_sep="..."):
""" Get the clusters in the file.
Args:
id_sep (str): the string used to separate the id from the cluster line. E.g.
in case of "0 23217aa, >HMPREF1486_05052... *", the separator should
be "...", while in case of "0 1364aa, >A0A1I7RXC9_BURXY/84-1447|A0A1I7
RXC9_BURXY|84-1447|A0A1I7RXC9.1... *", the separator should be "|".
Default is "...".
Return (iterator):
The method returns a iterator of the entry ids in lists.
"""
ids = list()
for line in self.handler:
if line.startswith(">"):
if len(ids) > 0:
yield ids
ids = list()
else:
ids.append(line.split(">")[1].split(id_sep)[0])
if len(ids) > 0:
yield ids # the last cluster
hmp_root = os.path.join("..", "Data", "HMP_metagenome")
saving_root = os.path.join(hmp_root, "HMP_clustered_fasta")
filtered_hmp_fastas = os.path.join(hmp_root, "filtered_fasta.json")
clusters_file = os.path.join(
hmp_root,
"HMP_clustered_corpora",
"hmp_cdhit_clusters906030",
"filtered_hmp_cluster906030.clstr",
)
with open(filtered_hmp_fastas, "r") as f:
fastas = json.load(f)
short_keys = {}
for k in fastas.keys():
short = k.split()[0]
short_keys[short] = k
written = set()
with open(clusters_file, "r") as f:
clusters = ClstrAnalyzer(f).get_clusters(id_sep="...")
for i, cluster in enumerate(clusters):
with open(os.path.join(saving_root, f"cluster{i}.fasta"), "w") as outf:
for prot_id in cluster:
if prot_id not in written:
full_key = short_keys[prot_id]
outf.write(">"+full_key+"\n")
outf.write(fastas[full_key]+"\n")
written.add(prot_id)
Hi, You may need a script as below to retrieve clusters from the cluster906030 file. I will add the script to our repo later. Thank you for pointing this out.
import os import json class ClstrAnalyzer: def __init__(self, handler): """ Analyze the cluster file output from cd-hit Args: handler (File object): the file handler of the .clstr file """ self.handler = handler def get_clusters(self, id_sep="..."): """ Get the clusters in the file. Args: id_sep (str): the string used to separate the id from the cluster line. E.g. in case of "0 23217aa, >HMPREF1486_05052... *", the separator should be "...", while in case of "0 1364aa, >A0A1I7RXC9_BURXY/84-1447|A0A1I7 RXC9_BURXY|84-1447|A0A1I7RXC9.1... *", the separator should be "|". Default is "...". Return (iterator): The method returns a iterator of the entry ids in lists. """ ids = list() for line in self.handler: if line.startswith(">"): if len(ids) > 0: yield ids ids = list() else: ids.append(line.split(">")[1].split(id_sep)[0]) if len(ids) > 0: yield ids # the last cluster hmp_root = os.path.join("..", "Data", "HMP_metagenome") saving_root = os.path.join(hmp_root, "HMP_clustered_fasta") filtered_hmp_fastas = os.path.join(hmp_root, "filtered_fasta.json") clusters_file = os.path.join( hmp_root, "HMP_clustered_corpora", "hmp_cdhit_clusters906030", "filtered_hmp_cluster906030.clstr", ) with open(filtered_hmp_fastas, "r") as f: fastas = json.load(f) short_keys = {} for k in fastas.keys(): short = k.split()[0] short_keys[short] = k written = set() with open(clusters_file, "r") as f: clusters = ClstrAnalyzer(f).get_clusters(id_sep="...") for i, cluster in enumerate(clusters): with open(os.path.join(saving_root, f"cluster{i}.fasta"), "w") as outf: for prot_id in cluster: if prot_id not in written: full_key = short_keys[prot_id] outf.write(">"+full_key+"\n") outf.write(fastas[full_key]+"\n") written.add(prot_id)
Hi,thank you for your patience to answer me. But I have a problem with the script you provided. How do I generate filtered_fasta.json, I have found it in thesis-works and MicrobiomeMeta, but I am not find the method that generate filtered_fasta.json. I want to konw the content of filtered_fasta.json, and how do I generate filtered_fasta.json. I would be very grateful if you answer my questions.
It is a json file with protein IDs as keys and fasta sequences as values. "Filtered" simply means we did a filtration for the sequences to keep only the sequences we want for our experiments. You should have your own script to generate that data in practice.
Hi, I have ran cdhit.sh and generated hmdb_cluster30.clstr,hmdb_cluster60.clstr,hmdb_cluster90.clstr,hmdb_cluster9060.clstr,hmdb_cluster906030.clstr, but I am confused about how to use them. Can you tell me how to use them. And I also confued clustalo.sh , I want to know what file that HMDB_clustered_fasta stored, how to get them. I'd appreciate it if you helped me.