Closed: numfar closed this issue 12 years ago
The protein ID points to an entry at NCBI (http://www.ncbi.nlm.nih.gov/protein/[the protein id]). This entry contains the sequence, so we could retrieve the sequences from there instead of requiring the user to have them locally (retrieving them remotely will of course be slower). Maybe allow the user to choose between local and remote?
There should be genome coordinates there somewhere as well, but I haven't found them.
Sounds like a good idea to let the user choose between local and remote.
In the NCBI entry there is a part called "CDS"; its third row (at least in, for example, this entry: http://www.ncbi.nlm.nih.gov/protein/253771531) contains information about the genome coordinates.
It looks like this: /coded_by="NC_012947.1:102527..102802". I assume that the coordinates are the numbers 102527..102802.
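To make the idea concrete, here is a minimal Python sketch (not part of the project's Perl code) of how a /coded_by value like the one quoted above could be split into a genome accession and coordinates. The function name `parse_coded_by` and the returned keys are made up for illustration, and the sketch assumes a single contiguous range; `join(...)` locations are not handled.

```python
import re

# Matches e.g. NC_012947.1:102527..102802, optionally wrapped in complement(...).
# The optional < and > allow for partial-location markers.
CODED_BY = re.compile(
    r"(?P<comp>complement\()?"
    r"(?P<acc>[A-Za-z0-9_.]+):"
    r"<?(?P<start>\d+)\.\.>?(?P<end>\d+)\)?"
)


def parse_coded_by(value):
    """Split a coded_by value into accession, 1-based start/end, and strand.

    Hypothetical helper: assumes one contiguous range, no join(...).
    """
    match = CODED_BY.search(value)
    if match is None:
        raise ValueError("unrecognised coded_by value: %r" % value)
    return {
        "accession": match.group("acc"),
        "start": int(match.group("start")),
        "end": int(match.group("end")),
        "strand": "-" if match.group("comp") else "+",
    }


if __name__ == "__main__":
    print(parse_coded_by('/coded_by="NC_012947.1:102527..102802"'))
```

The accession part (NC_012947.1) is what identifies the genome record, and the two numbers are the coordinates within it.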
You've found it!
http://www.ncbi.nlm.nih.gov/nuccore/NC_012947.1 is the link to the whole genome and 102527..102802 are the coordinates. That means we can get this info remotely as well. Not sure how to implement it though. The script could look for the genome in a local folder specified by the user. If it's not there, it downloads the genome from NCBI.
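The "look locally first, download if missing" idea could be sketched roughly as follows. This is a Python illustration under assumptions, not the project's implementation: the function names, the `<accession>.fasta` file naming, and the choice of NCBI's E-utilities `efetch` endpoint as the download source are all assumptions made for the example. The downloader is passed in as a parameter so the lookup logic can be tried without network access.

```python
import os
import urllib.parse
import urllib.request

# Assumption: NCBI E-utilities efetch is used as the remote source.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fetch_fasta_from_ncbi(accession):
    """Download one nucleotide record as FASTA text (needs network access)."""
    query = urllib.parse.urlencode(
        {"db": "nuccore", "id": accession, "rettype": "fasta", "retmode": "text"}
    )
    with urllib.request.urlopen(EFETCH_URL + "?" + query) as response:
        return response.read().decode("utf-8")


def get_genome_path(accession, genome_dir, fetch=fetch_fasta_from_ncbi):
    """Return the path of the genome file, downloading it only if it is missing.

    Hypothetical layout: genomes are stored as <genome_dir>/<accession>.fasta.
    """
    path = os.path.join(genome_dir, accession + ".fasta")
    if not os.path.exists(path):
        os.makedirs(genome_dir, exist_ok=True)
        with open(path, "w") as handle:
            handle.write(fetch(accession))
    return path
```

A call such as `get_genome_path("NC_012947.1", "genomes")` would then reuse the local copy on every run after the first, which keeps the slower remote retrieval to a one-time cost.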
Is there a BioPerl function for finding a coordinate in a genome/fasta file?
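Independent of which library ends up being used, the operation being asked about is small. Below is a plain Python sketch (not BioPerl) of pulling a region out of a FASTA file by coordinates. It assumes the coordinates are 1-based and inclusive, as in the 102527..102802 example, and that only the first record in the file is wanted; the function names are made up for illustration.

```python
# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def read_first_fasta_record(path):
    """Return the sequence of the first record in a FASTA file as one string."""
    parts = []
    headers_seen = 0
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                headers_seen += 1
                if headers_seen > 1:
                    break  # stop at the second record
            elif line.strip():
                parts.append(line.strip())
    return "".join(parts)


def extract_region(sequence, start, end, strand="+"):
    """Return sequence[start..end] using 1-based, inclusive coordinates.

    For strand "-" the reverse complement of the region is returned.
    """
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("coordinates %d..%d are out of range" % (start, end))
    region = sequence[start - 1:end]
    if strand == "-":
        region = region.translate(COMPLEMENT)[::-1]
    return region
```

With a genome loaded via `read_first_fasta_record`, `extract_region(genome, 102527, 102802)` would return the 276 bases covered by the coordinates above.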
Yes, the information is indeed there. I tried to download it with a Perl script, but for some reason JavaScript on the NCBI page blocked access. Working around the JavaScript would have taken a lot of time, so I looked for alternatives instead. I found the same type of information as in your link http://www.ncbi.nlm.nih.gov/nuccore/NC_012947.1 on the NCBI FTP site: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Bartonella_bacilliformis_KC583_uid58533/NC_008783.gbk. I downloaded that and created a Perl script which stores the feature information in a hash; see the repository.
I'd say this issue is closed for now.
What are the protein IDs used in the cluster file? (How/where can you use them to find the protein sequence, or the DNA sequence and coordinates?)
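For readers who want to see the general shape of such a lookup table, here is a rough Python sketch of reading CDS features from a GenBank flat file into a dictionary keyed by protein ID. It is not the Perl script from the repository. It assumes the standard flat-file layout (feature key starting in column 6, location and qualifiers starting in column 22) and ignores locations that wrap across several lines; the function name is made up for illustration.

```python
def parse_cds_features(lines):
    """Map protein_id -> location string for every CDS feature.

    `lines` is any iterable of GenBank flat-file lines, e.g. an open file.
    Simplification: a location that continues on a second line is truncated.
    """
    features = {}
    in_features = False
    current_location = None  # location of the CDS being read, else None
    for line in lines:
        if line[:1].strip():
            # Text in column 1 marks a section header (LOCUS, FEATURES, ORIGIN, ...).
            in_features = line.startswith("FEATURES")
            current_location = None
            continue
        if not in_features:
            continue
        key = line[5:20].strip()   # feature key column, empty on qualifier lines
        value = line[20:].strip()  # location or qualifier text
        if key:
            # A new feature starts; remember its location only if it is a CDS.
            current_location = value if key == "CDS" else None
        elif current_location is not None and value.startswith("/protein_id="):
            protein_id = value.split("=", 1)[1].strip('"')
            features[protein_id] = current_location
    return features
```

Passing an open handle on a downloaded .gbk file (for example NC_008783.gbk) to `parse_cds_features` would give a dictionary in which each protein ID from the file maps to its coordinate string.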