bioperl / bioperl-live

Core BioPerl 1.x code
http://bioperl.org

How do I grab the corresponding sequence from the mitochondrial genome? #373

Closed CC3729 closed 1 year ago

CC3729 commented 1 year ago

Hello everyone. I have some NCBI accession numbers for mitochondrial genomes, and with BioPerl I can get the complete sequence of each genome. But how do I extract a specific region from the genome, for example when I only need certain CDS or tRNA sequences such as COI or cytb? I would like to know which BioPerl functions I should use to solve this. Thank you!

carandraug commented 1 year ago

It would be easier with an example, but if I understand correctly, you load your sequence and then get its features. For example, you can get all CDS features with @cds = $seq->get_SeqFeatures("CDS"). You can list all features and their tags with:

# Print the location, primary tag and source of every feature on $seq
foreach my $feat ( $seq->get_SeqFeatures() ) {
    print "Feature from ", $feat->start, " to ",
          $feat->end, " Primary tag ", $feat->primary_tag,
          ", produced by ", $feat->source_tag(), "\n";

    # Print every tag (qualifier) attached to the feature with its values
    foreach my $tag ( $feat->get_all_tags() ) {
        print "Feature has tag ", $tag, " with values, ",
              join(' ', $feat->get_tag_values($tag)), "\n";
    }
}

You can get the sequence for the feature with $feat->seq(). This returns a sequence object, so call ->seq on it if you need the plain string.

See the documentation for Bio::SeqFeatureI and Bio::Seq.
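
To pick out one gene rather than every feature, the CDS features can be filtered by their gene tag. The following is only a minimal sketch: the toy sequence, the hand-made features and the helper name cds_by_gene are hypothetical, and it assumes the record stores the gene name in a "gene" tag. Real GenBank records spell the name in different ways (or only carry a "product" tag), so check the tag values of your own records first.

```perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqFeature::Generic;

# Toy 20 bp sequence standing in for a parsed GenBank record (hypothetical data).
my $seq = Bio::Seq->new(
    -id  => 'toy_mito',
    -seq => 'AAAACCCCGGGGTTTTAAAA',
);

# Two hand-made CDS features; the gene names are illustrative only.
$seq->add_SeqFeature(
    Bio::SeqFeature::Generic->new(
        -start => 5, -end => 8, -strand => 1,
        -primary_tag => 'CDS', -tag => { gene => 'COX1' },
    )
);
$seq->add_SeqFeature(
    Bio::SeqFeature::Generic->new(
        -start => 9, -end => 12, -strand => 1,
        -primary_tag => 'CDS', -tag => { gene => 'CYTB' },
    )
);

# Return the CDS features whose first "gene" tag value matches $name.
# Matching is case-insensitive, which is an assumption about how
# consistently the records are annotated.
sub cds_by_gene {
    my ($seq, $name) = @_;
    my @hits;
    for my $feat ( $seq->get_SeqFeatures ) {
        next unless $feat->primary_tag eq 'CDS' && $feat->has_tag('gene');
        my ($gene) = $feat->get_tag_values('gene');
        push @hits, $feat if lc($gene) eq lc($name);
    }
    return @hits;
}

# Print the sequence string of every matching feature.
print $_->seq->seq, "\n" for cds_by_gene($seq, 'cytb');
```

With a real record, $seq would come from Bio::SeqIO or a database fetch instead of being built by hand; the filtering step stays the same.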

CC3729 commented 1 year ago

Thank you for your reply. The code you provided lets me grab the sequences of all features, but what I want is to extract the same gene from multiple mitochondrial genomes. For example, I have two NCBI accession numbers for mitochondrial genomes, AY687385 and AY962573. I would like to extract the cytb gene from both genomes and write the results to a file in FASTA format like the following:

>AY687385_cytb_14227..15372
ATTTTATGGCCACCAACCTACGAAAAAACCCACCCAATAATTAAAATCATTAACAACTCACTAATTGACC
TACCAAGTCCATCCAACATTTCCATTTGATGAAACTTTGGATCATTATTAGGAGCCTGCTTAATACTACA
AATCATCACAGGCCTATTCCTAGCCATACATTACTCACCAAACATCTCAACAGCATTCTCGTCAATCGCC
CACATTACCCGAGATGTACAATACGGTTGACTAATCCGCAACATACACGCTAACGGAGCCTCACTATTCT
TCATGTGTATCTACCTACATATTGGACGAGGCCTATACTATGGATCCTACCTTTATAAACAAACCTGAAA
CATTGGGGTAATCCTCCTACTACTAACCATAGCCACTGCATTCATGGGCTATGTCCTACCATGAGGACAA
ATATCATTCTGAGGCGCCACAGTTATTACAAACCTACTCTCAGCTGTACCATATATCGGTACTACAATAG
TGCAATGAGTATGAGGTGGTTTCTCCGTAGACAACGCCACTTTAACACGATTCTTTACCCTACATTTTTT
ACTACCATTCATAATCCTAGGCCTAACCATAATTCACTTACTTTTATTACACGAAACAGGATCAAACAAC
CCAACAGGACTTAACTCAAACATTGACAAAATCCCATTCCATCCTTACTTCTCATACAAAGATCTCCTAG
GATTCATAATAACACTTACCCTACTTCTATCCATCGCCATATTTTACCCAAACCTATTAGGAGACCCAGA
TAACTTCACACCAGCCAACCCACTATCCACCCCACCCCACATCAAACCAGAATGATATTTCCTATTCGCC
TACGCTATCCTACGATCTATCCCTAACAAATTAGGAGGCGTACTAGCCCTACTACTCTCCATCTTAGTAT
TATTTATCTTACCCCTACTACACACATCAAAACAACGAACACTAACATTCCGCCCTATCACCCAAACACT
ATTCTGATTATTTGTGGCTAACCTTATAGTATTAACATGAATCGGAGGAAAACCAGTAGAAAACCCATTC
ATCATTATCGGCCAAGCATCCTCCATCCTTTACTTTTTAATCCTACTAGTATTAATACCAATCTCAAACA
TAATTGAAAATAAAACAACCAATTAA

>AY962573_cytb_14221..15366
ATTTTATGGCCACCAATCTACGAAAAAACCCACCCAATAATTAAAATCATTAACAACTCACTAATTGACC
TACCAAGTCCATCCAACATTTCCATTTGATGAAACTTTGGATCATTATTAGGAGCCTGCTTAATACTCCA
GATCATCACAGGCCTATTCCTAGCCATACATTACTCACCAAACATCTCAACAGCATTCTCATCAATCGCC
CACATTACCCGAGATGTACAATACGGTTGACTAATCCGCGACATACACGCTGACGGAGCCTCACTATTCT
TCATGTGTATCTACCTACATATTGGACGAGGCCTATACTACGGATCCTACCTTTATAAACAAACTTGAAA
CATCGGTGTAATCCTCCTACTACTAACCATAGCCACTGCATTCATGGGTTATGTCCTACCATGAGGACAA
ATATCATTCTGAGGGGCCACAGTTATTACAAACCTACTCTCAGCTATTCCATATATCGGTACTACAATAG
TGCAATGAGTATGAGGTGGTTTCTCCGTAGACAACGCCACTTTAACACGATTCTTTACCCTACATTTTTT
ACTACCATTCATAATCCTAGGCCTAACCATAATTCACTTACTTTTATTACACGAAACAGGATCAAACAAC
CCAACAGGACTTAACTCAAACATTGACAAAATTCCATTCCACCCTTACTTCTCATACAAAGATCTCCTAG
GATTCATAATAACACTTACCCTGCTTCTATCCATCGCCATATTTTACCCAAACCTACCAGGAGACCCAGA
TAACTTCACCCCAGCCAACCCACTATCCACCCCACCCCACATCAAACCAGAGTGATATTTCCTATTCGCC
TACGCTATCCTACGATCTATCCCTAACAAATTAGGAGGCGTACTAGCCCTACTACTCTCCATCTTAGTAT
TATTTATCCTACCCCTACTACACACATCAAAACAACGAACACTAACATTTCGACCTATCACCCAAACACT
ATTCTGACTATTGGTAGCTAACCTTATAGTATTAACATGAATGGGGGGAAAACCAGTAGAAAACCCATTC
ATCACTATCGGCCAAACATCCTCCATCCTTCACTTTTTAATCCTACTAGTATTAATACCAATCTCAAACA
TAATCGAAAATAAAACAATCAATTAA

hyphaltip commented 1 year ago

Also see the spliced_seq() function, which splices out introns if you give it a multi-location feature (e.g. an mRNA or CDS with multiple locations): https://metacpan.org/pod/Bio::SeqFeatureI#spliced_seq. If you just want the sequence from the start to the end of the gene, use the $feature->seq() function: https://metacpan.org/pod/Bio::SeqFeatureI#seq
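
Putting the suggestions in this thread together, something along these lines might produce the requested FASTA file. It is a rough sketch, not a tested solution for these particular records: the helper name feature_to_fasta_seq is made up for illustration, the header layout simply copies the example above, and it assumes that each record annotates the gene as a CDS with a "gene" tag whose value matches "cytb" when case is ignored. It also assumes Bio::DB::GenBank is installed (recent BioPerl releases ship it in a separate distribution) and that NCBI is reachable.

```perl
use strict;
use warnings;
use Bio::DB::GenBank;
use Bio::SeqIO;

# Gene to extract; the spelling used in the "gene" tag is an assumption.
my $gene = 'cytb';

# Hypothetical helper: turn one feature into a sequence object that is
# named accession_gene_start..end, ready to be written as FASTA.
sub feature_to_fasta_seq {
    my ($acc, $gene, $feat) = @_;
    # spliced_seq() joins the parts of a multi-location feature; for a
    # single-location feature it gives the same result as seq().
    my $sub = $feat->spliced_seq;
    $sub->display_id( sprintf '%s_%s_%d..%d',
                      $acc, $gene, $feat->start, $feat->end );
    $sub->desc('');    # keep the FASTA header to the ID only
    return $sub;
}

# Write FASTA records to standard output.
my $out = Bio::SeqIO->new( -fh => \*STDOUT, -format => 'fasta' );

# Accession numbers are taken from the command line.
my $db;
for my $acc (@ARGV) {
    $db ||= Bio::DB::GenBank->new;
    my $seq = $db->get_Seq_by_acc($acc);

    for my $feat ( grep { $_->primary_tag eq 'CDS' } $seq->get_SeqFeatures ) {
        next unless $feat->has_tag('gene');
        my ($name) = $feat->get_tag_values('gene');
        next unless lc($name) eq lc($gene);
        $out->write_seq( feature_to_fasta_seq( $acc, $gene, $feat ) );
    }
}
```

Saved under a name of your choice (extract_gene.pl here is only a placeholder), it could be run as: perl extract_gene.pl AY687385 AY962573 > cytb.fasta. If nothing is written, print the tags of the CDS features with the loop shown earlier to see how the gene is actually named in those records.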