UNF-PIPE / Tha-pipe

A phylogenetics pipeline

protein id #1

Closed numfar closed 12 years ago

numfar commented 12 years ago

What are the protein IDs used in the cluster file? How and where can they be used to find the protein sequence, or the DNA sequence and its coordinates?

simfor commented 12 years ago

The protein ID points to an entry at NCBI (http://www.ncbi.nlm.nih.gov/protein/[the protein id]). That entry contains the sequence, so we could retrieve the sequences from there instead of requiring the user to have them locally (retrieving them remotely will of course be slower). Maybe we should let the user choose between local and remote?
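A minimal sketch of that local-or-remote choice, in Python rather than the Perl used in the repository. The URL pattern is the one quoted above; the function names, the `<protein id>.fasta` file naming, and the injected `fetch` callable are assumptions made for illustration only.

```python
from pathlib import Path
from typing import Callable, Optional

# URL pattern as quoted in the comment above.
NCBI_PROTEIN_URL = "http://www.ncbi.nlm.nih.gov/protein/{protein_id}"


def protein_url(protein_id: str) -> str:
    """Build the NCBI protein page URL for a protein ID from the cluster file."""
    return NCBI_PROTEIN_URL.format(protein_id=protein_id)


def get_protein_record(
    protein_id: str,
    local_dir: Optional[Path] = None,
    fetch: Optional[Callable[[str], str]] = None,
) -> str:
    """Return the record text, preferring a local copy over a remote fetch.

    Assumption: local files are named "<protein_id>.fasta" (hypothetical).
    `fetch` is any callable that takes a URL and returns text; it is passed
    in so that the remote step stays optional and replaceable.
    """
    if local_dir is not None:
        candidate = Path(local_dir) / f"{protein_id}.fasta"
        if candidate.is_file():
            return candidate.read_text()
    if fetch is None:
        raise FileNotFoundError(
            f"{protein_id} not found locally and no remote fetcher was given"
        )
    return fetch(protein_url(protein_id))
```

With `local_dir` set and no `fetch`, this behaves as "local only"; with only `fetch` set, it is "remote only". What the fetcher actually downloads and how it parses the result is left open here.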

There should be genome coordinates in there somewhere as well, but I haven't found them.

numfar commented 12 years ago

Sounds like a good idea to let the user choose between local and remote.

In the NCBI entry there is a part called "CDS". Its third row (at least in this entry, for example: http://www.ncbi.nlm.nih.gov/protein/253771531) contains information about the genome coordinates.

It looks like this: "/coded_by="NC_012947.1:102527..102802"". I assume that the coordinates are the numbers 102527..102802.
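As a rough illustration, a line like that can be split into accession and coordinates with a regular expression. This is a Python sketch, not code from the repository; the names `CodedBy` and `parse_coded_by` are made up here. It also accepts the `complement(...)` wrapper that GenBank location syntax uses for reverse-strand features, and it assumes a single contiguous range (no `join(...)`).

```python
import re
from typing import NamedTuple


class CodedBy(NamedTuple):
    accession: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: int  # +1 forward, -1 when wrapped in complement(...)


# Matches e.g. /coded_by="NC_012947.1:102527..102802"
# and          /coded_by="complement(NC_012947.1:102527..102802)"
_CODED_BY = re.compile(
    r'/coded_by="(?P<comp>complement\()?'
    r'(?P<acc>[A-Za-z0-9_.]+):(?P<start>\d+)\.\.(?P<end>\d+)\)?"'
)


def parse_coded_by(line: str) -> CodedBy:
    """Extract accession, start, end and strand from a /coded_by qualifier."""
    m = _CODED_BY.search(line)
    if m is None:
        raise ValueError(f"no simple /coded_by qualifier found in: {line!r}")
    strand = -1 if m["comp"] else 1
    return CodedBy(m["acc"], int(m["start"]), int(m["end"]), strand)


hit = parse_coded_by('/coded_by="NC_012947.1:102527..102802"')
print(hit.accession, hit.start, hit.end)
```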

simfor commented 12 years ago

You've found it!

http://www.ncbi.nlm.nih.gov/nuccore/NC_012947.1 is the link to the whole genome, and 102527..102802 are the coordinates. That means we can get this information remotely as well. I'm not sure how to implement it, though. The script could look for the genome in a local folder specified by the user and, if it is not there, download the genome from NCBI.

Is there a BioPerl function for looking up a coordinate range in a genome/FASTA file?
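This does not answer the BioPerl question, but the lookup itself is simple enough to sketch in plain Python. The sketch assumes a FASTA file with a single record and treats the coordinates as 1-based and inclusive, as in GenBank locations; `read_single_fasta` and `extract_region` are hypothetical names.

```python
from typing import Iterable

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def read_single_fasta(lines: Iterable[str]) -> str:
    """Concatenate the sequence lines of a single-record FASTA file."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and the header line
        parts.append(line)
    return "".join(parts)


def extract_region(genome: str, start: int, end: int, strand: int = 1) -> str:
    """Return genome[start..end] using 1-based, inclusive coordinates.

    With strand == -1 the reverse complement of the region is returned.
    """
    if not 1 <= start <= end <= len(genome):
        raise ValueError(f"coordinates {start}..{end} outside 1..{len(genome)}")
    region = genome[start - 1:end]
    if strand == -1:
        region = region.translate(_COMPLEMENT)[::-1]
    return region


# Toy data, not a real genome.
toy = read_single_fasta([">toy", "ACGTAC", "GTTT"])
print(extract_region(toy, 2, 4))
```

For the coordinates discussed here, `extract_region(genome, 102527, 102802)` would return 276 bases (102802 - 102527 + 1), provided the genome is at least that long.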

pappewaio commented 12 years ago

Yes, the information is indeed there. I tried to download it with a Perl script, but for some reason JavaScript on the NCBI page blocked access to the information. Working around the JavaScript would have taken a lot of time, so I looked for alternatives instead. I found the same type of information as in your link http://www.ncbi.nlm.nih.gov/nuccore/NC_012947.1 on the NCBI FTP site: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Bartonella_bacilliformis_KC583_uid58533/NC_008783.gbk. I downloaded that file and wrote a Perl script that stores the feature information in a hash; see the repository.
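For readers without the repository at hand, here is a rough idea of what collecting feature information from a `.gbk` file into a hash can look like. This is a Python sketch with dictionaries, not the Perl script mentioned above, and it only covers the simple layout of the GenBank feature table (feature key indented by five spaces, qualifiers starting with `/`). The sample record and its locus tag are invented.

```python
from typing import Dict, Iterable, List


def parse_features(lines: Iterable[str]) -> List[Dict]:
    """Collect the entries of a GenBank FEATURES table as dictionaries."""
    features: List[Dict] = []
    in_features = False
    current = None
    last_key = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if not in_features or not line.strip():
            continue
        if not line.startswith(" "):
            break  # next top-level section, e.g. ORIGIN
        stripped = line.strip()
        indent = len(line) - len(line.lstrip())
        if indent == 5:
            # New feature: "<key> <location>"
            parts = stripped.split(None, 1)
            location = parts[1] if len(parts) > 1 else ""
            current = {"type": parts[0], "location": location, "qualifiers": {}}
            features.append(current)
            last_key = None
        elif current is None:
            continue
        elif stripped.startswith("/"):
            name, _, value = stripped[1:].partition("=")
            current["qualifiers"][name] = value.strip('"')
            last_key = name
        elif last_key is not None:
            # Continuation of a wrapped qualifier value; joined with a space,
            # which suits free text (a /translation would need no separator).
            current["qualifiers"][last_key] += " " + stripped.strip('"')
        else:
            current["location"] += stripped  # wrapped location
    return features


# Invented sample record, for illustration only.
SAMPLE = """\
LOCUS       EXAMPLE
FEATURES             Location/Qualifiers
     source          1..110000
                     /organism="Example organism"
     CDS             102527..102802
                     /locus_tag="EXAMPLE_0001"
                     /product="hypothetical protein"
ORIGIN
"""

for feature in parse_features(SAMPLE.splitlines()):
    print(feature["type"], feature["location"], feature["qualifiers"])
```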

I would say this issue is closed for now.