chrisquince / DESMAN

De novo Extraction of Strains from MetAgeNomes

How can I obtain preidentified 982 SCSGs? #33

Open kmin940 opened 5 years ago

kmin940 commented 5 years ago

Hi, could you answer these questions? It would be a great help to me.

In the $DESMAN/complete_example directory, there is a file named EColi_core_ident95.txt. How can I obtain this kind of data for other microorganisms?

Also, is there a way to get pre-identified sequences for each of the 982 single-copy core COGs from NCBI or any other site? The COG database does not seem to be maintained. (wget https://www.dropbox.com/s/f6ojp1qt4fz5lzn/Hits.tar.gz)

I want to have these data for different microorganisms. Thank you very much.

chrisquince commented 5 years ago

Hi,

The way to do this is to download some genomes from your species of interest; you probably need at least 50. Then determine which genes are single copy and core. There are many ways to do that, but I reannotated the genomes to ORFs using Prodigal and then assigned the ORFs to COGs with RPS-BLAST, exactly as was done for contigs. Then just set some threshold: say, if 97% of the genomes from the species have that gene and it is single copy, then you use it for the core. It is quite straightforward really.
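To illustrate the thresholding step only, here is a minimal Python sketch. It is not code from DESMAN. It assumes you have already parsed the RPS-BLAST results into a dictionary mapping each genome to the list of COG IDs assigned to its ORFs (one entry per ORF). The function name, the input layout, and the toy data are all hypothetical, and "single copy and core" is read here as "present exactly once in at least the given fraction of genomes".

```python
from collections import Counter


def single_copy_core_cogs(genome_cogs, min_fraction=0.97):
    """Return the sorted COG IDs that occur exactly once in at least
    `min_fraction` of the genomes.

    genome_cogs: dict mapping genome name -> list of COG IDs, one per ORF
                 (hypothetical layout, assumed for this sketch).
    """
    if not genome_cogs:
        return []

    # For each COG, count the genomes in which it appears exactly once.
    single_copy_counts = Counter()
    for cogs in genome_cogs.values():
        for cog, copies in Counter(cogs).items():
            if copies == 1:
                single_copy_counts[cog] += 1

    n_genomes = len(genome_cogs)
    return sorted(
        cog
        for cog, n in single_copy_counts.items()
        if n / n_genomes >= min_fraction
    )


# Toy data, made up for illustration only.
toy_genomes = {
    "g1": ["COG0001", "COG0002", "COG0003"],
    "g2": ["COG0001", "COG0002", "COG0002"],  # COG0002 is duplicated here
    "g3": ["COG0001", "COG0003"],             # COG0002 is missing here
    "g4": ["COG0001", "COG0002", "COG0003"],
}

print(single_copy_core_cogs(toy_genomes, min_fraction=0.75))
```

With only four toy genomes a 97% cutoff would require a gene to be single copy in all of them, so the example call uses 0.75; with 50 or more real genomes the default of 0.97 matches the threshold suggested above.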

Best, Chris

kmin940 commented 5 years ago

I see! Thank you very much for your clear answer! I will give it a try :) Thank you again!

marcomeola commented 3 years ago

I suggest downloading the species-specific table from here: https://www.ncbi.nlm.nih.gov/research/cog/