NCBI-Hackathons / EDirectCookbook

MIT License

Get coding sequences for a gene id #61

Open snk5040 opened 2 years ago

snk5040 commented 2 years ago

Hi everyone,

I would like to use a list of gene IDs to get, in FASTA format, the protein sequences encoded by those genes and the corresponding mRNA sequences without introns.

So far, with this command, I can get the protein sequence:

os.system('esearch -db gene -query "'+ "102888688" + ' [ID]" | elink -target protein -name gene_protein_refseq -cmd neighbor | xtract -pattern LinkSet -block IdList -element Id -block LinkSetDb -element Id | efetch -db protein -format fasta')

With this command I can get the mRNA with introns, which I don't want:

os.system('elink -db gene -id ' + "102888688" + ' -target nuccore -name gene_nuccore_refseqrna | efetch -format fasta')

vkkodali commented 2 years ago

You are better off doing this sort of thing using NCBI Datasets. That said, you can do this using Entrez Direct (EDirect) as follows:

$ elink -db gene -id 102888688 -target nuccore -name gene_nuccore_refseqrna | efetch -format fasta | head -n2 
>NM_001290175.1 Pteropus alecto interferon induced with helicase C domain 1 (IFIH1), mRNA
AGAGCTGCGTCGCGAGAGAGCAGAGGCGGCTCCCTAGTCCCGGCCCCCGCGAGCACCGTAGAGTCAGAGG
$ elink -db gene -id 102888688 -target protein -name gene_protein_refseq | efetch -format fasta | head -n2 
>NP_001277104.1 interferon-induced helicase C domain-containing protein 1 [Pteropus alecto]
MSNEYSADKRFRYLISCFRARVKMYIQVEPVLDYLTFLSADMKEQIQRTATTMGNINAAEQLLSTLEKGV

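Your command to get the mRNA is correct. What makes you say that the output sequence has introns? The record it returns (NM_001290175.1) is a RefSeq mRNA, which represents the spliced transcript.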
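
Since the original question starts from a list of gene IDs and calls EDirect from Python, here is a minimal sketch of how the two `elink | efetch` pipelines shown in this thread could be wrapped in a loop. The helper names `build_pipeline` and `fetch_fasta` and the `LINKS` table are hypothetical and not part of EDirect; the link names and flags are copied from the commands in this thread. It assumes EDirect is installed and on `PATH`, and that Python 3.8+ is available (for `shlex.join`).

```python
import shlex
import subprocess

# Target database and link name for each sequence type,
# copied from the elink commands shown in this thread.
LINKS = {
    "mrna": ("nuccore", "gene_nuccore_refseqrna"),
    "protein": ("protein", "gene_protein_refseq"),
}


def build_pipeline(gene_id, kind, fmt="fasta"):
    """Build the 'elink | efetch' shell pipeline for one gene ID (hypothetical helper)."""
    target, link_name = LINKS[kind]
    elink = ["elink", "-db", "gene", "-id", str(gene_id),
             "-target", target, "-name", link_name]
    efetch = ["efetch", "-format", fmt]
    # shlex.join quotes each argument safely before the two commands are piped together.
    return " | ".join(shlex.join(cmd) for cmd in (elink, efetch))


def fetch_fasta(gene_id, kind, fmt="fasta"):
    """Run the pipeline and return its stdout as text.

    Needs EDirect on PATH and network access. check=True raises only if
    the last command in the pipe (efetch) exits with a non-zero status.
    """
    result = subprocess.run(build_pipeline(gene_id, kind, fmt), shell=True,
                            check=True, capture_output=True, text=True)
    return result.stdout


# Example usage (not run automatically because it contacts NCBI):
# for gid in ["102888688"]:
#     print(fetch_fasta(gid, "mrna"), end="")
#     print(fetch_fasta(gid, "protein"), end="")
```

Building the command as a string first keeps it easy to print and compare against what you would type in a terminal. If only the coding region is wanted rather than the full mRNA with UTRs, passing `fmt="fasta_cds_na"` for the `"mrna"` case may be worth trying; that is an untested suggestion here, so check the output against the record.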

snk5040 commented 2 years ago

Great, thanks