core genes list annotations

apoorva004 commented 2 years ago

Hi, I have generated the core genes list (pangenome_matrix_t0_core_list) during the genome analysis of my bacterial strains. I would like to know if there is any way to annotate these genes and classify them into functional categories. I am interested in the functional aspects of these clusters. Thanks.

eead-csic-compbio commented 2 years ago

Hi @apoorva004, I can see at least two options, both of which require downloading the Pfam-A database with the script install.pl:

1. You can annotate the Pfam protein domains of selected clusters with annotate_cluster.pl, as explained in section 4.7, "Annotating a sequence cluster", of the get_homologues-est manual.
2. If you re-run get_homologues.pl with option -D, Pfam domains will be called in all input sequences, so you can compute the functional enrichment of gene sets as explained in section 4.9.5, "Calculating Pfam enrichment of cluster sets". In my experience the core set might not be particularly enriched, but accessory sets usually show significantly increased or reduced numbers of selected Pfam domains.
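To make the idea of domain enrichment in option 2 concrete, here is a minimal Python sketch of a one-sided over-representation test based on the hypergeometric distribution. It is only an illustration of the general technique, not the code or the exact statistic used by the get_homologues scripts; the function name and the counts in the usage lines are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """Return P(X >= k) for a hypergeometric draw.

    N: total number of sequences in the background (hypothetical input)
    K: background sequences carrying the Pfam domain of interest
    n: number of sequences in the cluster set being tested
    k: sequences in that set carrying the domain
    """
    upper = min(n, K)      # cannot observe more hits than n or K
    total = comb(N, n)     # number of ways to draw the set from the background
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical counts, for illustration only: 40 of 5000 background
# sequences carry the domain, and 6 of 200 accessory-set sequences do.
p_value = hypergeom_enrichment_p(k=6, n=200, K=40, N=5000)
print(f"one-sided enrichment p-value: {p_value:.3g}")
```

A small p-value here would suggest that the domain is over-represented in the tested set. When many domains are tested, the p-values would also need a multiple-testing correction, which this sketch leaves out.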

Hope this helps. Any other ideas, @vinuesa? Bruno

vinuesa commented 2 years ago

Hi @apoorva004, with the tools currently distributed in the get_homologues distro, following the suggestions by @eead-csic-compbio is the best you can do to get some functional annotation based on Pfam domain composition. If you would like the "classic" one-letter functional categories, you may want to download the latest COG data from https://ftp.ncbi.nih.gov/pub/COG/COG2020/data/ and run BLAST against it.

eead-csic-compbio commented 2 years ago

Just to add to @vinuesa's suggestion: you could use the file https://ftp.ncbi.nih.gov/pub/COG/COG2020/data/cog-20.fa.gz and BLAST it against your clusters with the script make_nr_pangenome_matrix.pl and option -f. Hope this helps, Bruno
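Once each cluster has a best COG hit, the one-letter categories can be tallied with a few lines of Python. The sketch below assumes a tab-delimited COG definition table whose first column is the COG ID and whose second column holds the category letters (my understanding of the cog-20.def.tab layout, which should be checked against the README in the COG2020 directory). It also assumes you have already built a cluster-to-COG mapping from your BLAST results. The function names, cluster names and COG IDs are invented for illustration.

```python
from collections import Counter


def load_cog_categories(lines):
    """Parse 'COG ID <tab> category letters <tab> ...' lines into a dict."""
    cog2cat = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[0].startswith("COG"):
            cog2cat[fields[0]] = fields[1]
    return cog2cat


def tally_categories(cluster2cog, cog2cat):
    """Count one-letter categories over clusters; multi-letter COGs count once per letter."""
    counts = Counter()
    for cog_id in cluster2cog.values():
        letters = cog2cat.get(cog_id, "")
        if not letters:
            counts["unassigned"] += 1
            continue
        for letter in letters:
            counts[letter] += 1
    return counts


# Toy data with invented IDs; real input would come from the COG
# definition table and from your own BLAST best hits.
toy_def_lines = ["COG9001\tJ\ttoy entry", "COG9002\tKL\ttoy entry"]
toy_hits = {"cluster_1": "COG9001", "cluster_2": "COG9002", "cluster_3": "COG9999"}

category_counts = tally_categories(toy_hits, load_cog_categories(toy_def_lines))
for category, count in sorted(category_counts.items()):
    print(category, count)
```

Counting a multi-letter COG once per letter is a design choice; depending on the analysis you may prefer to keep only the first letter.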

eead-csic-compbio / get_homologues

core genes list annotations #97