XieResearchGroup / DISAE

MSA-Regularized Protein Sequence Transformer toward Predicting Genome-Wide Chemical-Protein Interactions: Application to GPCRome Deorphanization
Other
11 stars 4 forks source link

Question about cdhit.sh generate .clstr file and clustalo.sh #11

Open xfy9 opened 2 years ago

xfy9 commented 2 years ago

Hi, I have ran cdhit.sh and generated hmdb_cluster30.clstr,hmdb_cluster60.clstr,hmdb_cluster90.clstr,hmdb_cluster9060.clstr,hmdb_cluster906030.clstr, but I am confused about how to use them. Can you tell me how to use them. And I also confued clustalo.sh , I want to know what file that HMDB_clustered_fasta stored, how to get them. I'd appreciate it if you helped me.

lxie21 commented 2 years ago

Hi,

You need to perform multiple sequence alignment after the clustering.

Best, Lei

On Tue, Mar 1, 2022 at 7:49 AM xfy9 @.***> wrote:

Hi, I have ran cdhit.sh and generated hmdb_cluster30.clstr,hmdb_cluster60.clstr,hmdb_cluster90.clstr,hmdb_cluster9060.clstr,hmdb_cluster906030.clstr, but I am confused about how to use them. Can you tell me how to use them. And I also confued clustalo.sh , I want to know what file that HMDB_clustered_fasta stored, how to get them. I'd appreciate it if you helped me.

— Reply to this email directly, view it on GitHub https://github.com/XieResearchGroup/DISAE/issues/11, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZSBCSSUQJ53IQSZZSPAMDU5YG6PANCNFSM5PT6SS5A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you are subscribed to this thread.Message ID: @.***>

thomasly commented 2 years ago

Hi, You may need a script as below to retrieve clusters from the cluster906030 file. I will add the script to our repo later. Thank you for pointing this out.

import os
import json

class ClstrAnalyzer:
    def __init__(self, handler):
        """ Analyze the cluster file output from cd-hit

        Args:
            handler (File object): the file handler of the .clstr file
        """
        self.handler = handler

    def get_clusters(self, id_sep="..."):
        """ Get the clusters in the file.

        Args:
            id_sep (str): the string used to separate the id from the cluster line. E.g.
                in case of "0   23217aa, >HMPREF1486_05052... *", the separator should
                be "...", while in case of "0   1364aa, >A0A1I7RXC9_BURXY/84-1447|A0A1I7
                RXC9_BURXY|84-1447|A0A1I7RXC9.1... *", the separator should be "|".
                Default is "...".

        Return (iterator):
            The method returns a iterator of the entry ids in lists.
        """
        ids = list()
        for line in self.handler:
            if line.startswith(">"):
                if len(ids) > 0:
                    yield ids
                ids = list()
            else:
                ids.append(line.split(">")[1].split(id_sep)[0])
        if len(ids) > 0:
            yield ids  # the last cluster

hmp_root = os.path.join("..", "Data", "HMP_metagenome")
saving_root = os.path.join(hmp_root, "HMP_clustered_fasta")
filtered_hmp_fastas = os.path.join(hmp_root, "filtered_fasta.json")
clusters_file = os.path.join(
    hmp_root,
    "HMP_clustered_corpora",
    "hmp_cdhit_clusters906030",
    "filtered_hmp_cluster906030.clstr",
)

with open(filtered_hmp_fastas, "r") as f:
    fastas = json.load(f)

short_keys = {}
for k in fastas.keys():
    short = k.split()[0]
    short_keys[short] = k

written = set()
with open(clusters_file, "r") as f:
    clusters = ClstrAnalyzer(f).get_clusters(id_sep="...")
    for i, cluster in enumerate(clusters):
        with open(os.path.join(saving_root, f"cluster{i}.fasta"), "w") as outf:
            for prot_id in cluster:
                if prot_id not in written:
                    full_key = short_keys[prot_id]
                    outf.write(">"+full_key+"\n")
                    outf.write(fastas[full_key]+"\n")
                    written.add(prot_id)
xfy9 commented 2 years ago

Hi, You may need a script as below to retrieve clusters from the cluster906030 file. I will add the script to our repo later. Thank you for pointing this out.

import os
import json

class ClstrAnalyzer:
    def __init__(self, handler):
        """ Analyze the cluster file output from cd-hit

        Args:
            handler (File object): the file handler of the .clstr file
        """
        self.handler = handler

    def get_clusters(self, id_sep="..."):
        """ Get the clusters in the file.

        Args:
            id_sep (str): the string used to separate the id from the cluster line. E.g.
                in case of "0 23217aa, >HMPREF1486_05052... *", the separator should
                be "...", while in case of "0 1364aa, >A0A1I7RXC9_BURXY/84-1447|A0A1I7
                RXC9_BURXY|84-1447|A0A1I7RXC9.1... *", the separator should be "|".
                Default is "...".

        Return (iterator):
            The method returns a iterator of the entry ids in lists.
        """
        ids = list()
        for line in self.handler:
            if line.startswith(">"):
                if len(ids) > 0:
                    yield ids
                ids = list()
            else:
                ids.append(line.split(">")[1].split(id_sep)[0])
        if len(ids) > 0:
            yield ids  # the last cluster

hmp_root = os.path.join("..", "Data", "HMP_metagenome")
saving_root = os.path.join(hmp_root, "HMP_clustered_fasta")
filtered_hmp_fastas = os.path.join(hmp_root, "filtered_fasta.json")
clusters_file = os.path.join(
    hmp_root,
    "HMP_clustered_corpora",
    "hmp_cdhit_clusters906030",
    "filtered_hmp_cluster906030.clstr",
)

with open(filtered_hmp_fastas, "r") as f:
    fastas = json.load(f)

short_keys = {}
for k in fastas.keys():
    short = k.split()[0]
    short_keys[short] = k

written = set()
with open(clusters_file, "r") as f:
    clusters = ClstrAnalyzer(f).get_clusters(id_sep="...")
    for i, cluster in enumerate(clusters):
        with open(os.path.join(saving_root, f"cluster{i}.fasta"), "w") as outf:
            for prot_id in cluster:
                if prot_id not in written:
                    full_key = short_keys[prot_id]
                    outf.write(">"+full_key+"\n")
                    outf.write(fastas[full_key]+"\n")
                    written.add(prot_id)

Hi,thank you for your patience to answer me. But I have a problem with the script you provided. How do I generate filtered_fasta.json, I have found it in thesis-works and MicrobiomeMeta, but I am not find the method that generate filtered_fasta.json. I want to konw the content of filtered_fasta.json, and how do I generate filtered_fasta.json. I would be very grateful if you answer my questions.

thomasly commented 2 years ago

It is a json file with protein IDs as keys and fasta sequences as values. "Filtered" simply means we did a filtration for the sequences to keep only the sequences we want for our experiments. You should have your own script to generate that data in practice.