bioperl / bioperl-live

Core BioPerl 1.x code
http://bioperl.org
298 stars 182 forks source link

Bio::SeqFeature::Tools::Unflattener does not convert pseudogene correctly #124

Open sufenhu9 opened 9 years ago

sufenhu9 commented 9 years ago

When I use Bio::SeqFeature::Tools::Unflattener to convert GenBank flat-feature-list to containment hierarchy,

$unflattener->unflatten_seq(-seq=>$seq, -use_magic=>1);

Everything is fine except these genes has pseudo tag.

  1. Gene and CDS with pseudo tag have been convert to pseudogene and pseudoCDS, but they are flat. That is means the pseudoCDS is not the child of the pseudogene.
  2. Gene and tRNA with pseudo tag have been convert to pseudogene and pseudotRNA, but they are not nested either. The converted pseudotRNA are not the child of the pseudogene.

Like to know if there is any other parameter, or any other method that I can use to convert both gene and pseudogene correctly.

sample file

http://www.ncbi.nlm.nih.gov/nuccore/FR823391 http://www.ncbi.nlm.nih.gov/nuccore/GL636509

sample codes

#!/usr/bin/perl
use strict;
use Bio::SeqIO;
use Bio::Seq::SeqBuilder;
use Bio::Species;
use Bio::Annotation::SimpleValue;
use Bio::SeqFeature::Tools::Unflattener;

my $seq_in = Bio::SeqIO->new( '-file' => "<$ARGV[0]", '-format' => 'Genbank' );
my $unflattener = Bio::SeqFeature::Tools::Unflattener->new;

while ( my $bioperlSeq = $seq_in->next_seq() ) {
    if ( !( $bioperlSeq->molecule =~ /rna/i ) ) {
        $unflattener->error_threshold(1);
        $unflattener->report_problems(*STDERR);
        $unflattener->unflatten_seq(
            -seq       => $bioperlSeq,
            -use_magic => 1
        );

        my @topSeqFeatures = $bioperlSeq->get_SeqFeatures;

        $bioperlSeq->remove_SeqFeatures;

        foreach my $bioperlFeatureTree (@topSeqFeatures) {
            my $type = $bioperlFeatureTree->primary_tag();
            if ( ( $bioperlFeatureTree->has_tag("locus_tag") ) ) {
                my ($cID) = $bioperlFeatureTree->get_tag_values("locus_tag");
                print STDERR "\nprocessing $cID...\n";
            }

            if (   $type =~ /pseudogene/i
                || $type =~ /pseudotRNA/i
                || $type =~ /pseudoCDS/i )
            {
                foreach my $subPseuFeat ( $bioperlFeatureTree->get_SeqFeatures )
                {
                    my $subPseuType = $subPseuFeat->primary_tag();
                    print STDERR "  subPseuType = $subPseuType\n";
                    my @pseuLocs = $subPseuFeat->location;
                    foreach my $pseuLoc (@pseuLocs) {
                        my $pseuLocStart = $pseuLoc->start();
                        my $pseuLocEnd   = $pseuLoc->end();
                        print "    pseuLoc location : $pseuLocStart ... $pseuLocEnd\n";
                    }
                }
            }
            print STDERR "\$type = $type\n";
            foreach my $subFeat ( $bioperlFeatureTree->get_SeqFeatures ) {
                my $subType = $subFeat->primary_tag();
                print STDERR "  subType = $subType\n";
                foreach my $subSubFeat ( $subFeat->get_SeqFeatures ) {
                    my $subSubType = $subSubFeat->primary_tag();
                    print STDERR "      subSubType = $subSubType\n";
                    if ( $subSubType eq "CDS" || $subSubType eq "exon" ) {
                        my @CDSLocs = $subSubFeat->location;
                        foreach my $CDSLoc (@CDSLocs) {
                            my $subSubStart = $CDSLoc->start();
                            my $subSubEnd   = $CDSLoc->end();
                            print STDERR
"            location: $subSubStart .... $subSubEnd\n";
                        }
                    }
                }
            }
        }
    }
}

I am using bioperl_live/1.6.9, $ perl -MBio::Root::Version -e 'print $Bio::Root::Version::VERSION,"\n"' 1.0069

Everything is fine when I use bioperl version 1.4. Not sure what is changed for Bio::SeqFeature::Tools::Unflattener between 1.4 and 1.6.9.

The different outputs with above code in different bioperl version 1.6.9 and 1.4.

####### BioPerl 1.6.9 #########
processing NCLIV_047911...
$type = pseudogene
processing NCLIV_047911...
$type = pseudoCDS

####### BioPerl 1.4 #########
processing NCLIV_047911...
$type = gene
subType = mRNA
subSubType = CDS
location: 2628722 .... 2629917
subSubType = exon
location: 2628722 .... 2629792
subSubType = exon
location: 2629795 .... 2629917

Thanks.

cjfields commented 9 years ago

Relevant feature in the GenBank file is:

...
     gene            complement(2628722..2629917)
                     /locus_tag="NCLIV_047911"
                     /pseudo
     CDS             complement(join(2628722..2629792,2629795..2629917))
                     /locus_tag="NCLIV_047911"
                     /pseudo
                     /codon_start=1
                     /product="hypothetical protein"
                     /db_xref="PSEUDO:CBZ54361.1"
...

not sure why it has changed, @cmungall any ideas?

cjfields commented 9 years ago

May try running git bisect to see where this changed. My feeling is a conversion to add the pseudo tag to the feature, though it's doing the right thing, actually clobbers the unflattening.

cjfields commented 9 years ago

git bisect indicates the change occurred way back in 2004 in 3a404dfbc73fec21247022d27d29a5c64ac428aa from @cmungall

I'll open a branch for testing fixes.

cmungall commented 9 years ago

I'm afraid the exact rationale escapes me at the moment. I assume it was to do with an asymmetry in how pseudogene models were typically encoded in genbank records at the time.

For this particular example, the pseudogene mirrors the structure of a gene precisely, even having a CDS. It looks like the code should be simplified here so that pseudogenes are structured symmetrically to genes. But it's not clear what the consequences of making this change would be for other pseudogene records in genbank. Also, there is a secondary issue that the pseudogene hierarchy doesnt mirror the gene one entirely in SO (e.g. no pseudoCDS)

sufenhu9 commented 9 years ago

Thanks for working on this issue. Yes, there is no pseudoCDS in SO so far. There is pseudogenic_exon instead.

For the change in BioPerl 1.6.9

   # PSEUDOGENES, PSEUDOEXONS AND PSEUDOINTRONS
   # these are indicated with the /pseudo tag
   # these are mapped to a different type; they should NOT
   # be treated as normal genes
   foreach my $sf (@all_seq_features) {
       if ($sf->has_tag('pseudo')) {
           my $type = $sf->primary_tag;
           # SO type is typically the same as the normal
           # type but preceeded by "pseudo"
           if ($type eq 'misc_RNA' || $type eq 'mRNA') { 
            # dgg: see TypeMapper; both pseudo mRNA,misc_RNA should be pseudogenic_transcript
               $sf->primary_tag("pseudotranscript");
           }
           else {
               $sf->primary_tag("pseudo$type");
           }
       }
   }

I propose the following,

   foreach my $sf (@all_seq_features) {
       if ($sf->has_tag('pseudo')) {
           my $type = $sf->primary_tag;
           if ($type eq 'gene') { 
               $sf->primary_tag("pseudogene");
           } elsif ($type eq 'CDS') {
               $sf->primary_tag("pseudogenic_exon");
           } elsif ($type eq 'mRNA') {
               $sf->primary_tag("pseudogenic_transcript");
           }
           else {
               $sf->primary_tag("pseudogenic_$type");
           }
       }
   }

After this, you can unflat the structure as pseudogene -> pseudogenic_transcript -> pseudogenic_exon pseudogene -> pseudogenic_tRNA -> pseudogenic_exon

and so on.