Open sufenhu9 opened 9 years ago
Relevant feature in the GenBank file is:
...
     gene            complement(2628722..2629917)
                     /locus_tag="NCLIV_047911"
                     /pseudo
     CDS             complement(join(2628722..2629792,2629795..2629917))
                     /locus_tag="NCLIV_047911"
                     /pseudo
                     /codon_start=1
                     /product="hypothetical protein"
                     /db_xref="PSEUDO:CBZ54361.1"
...
Not sure why it has changed; @cmungall, any ideas?
You could try running git bisect to see where this changed. My feeling is that a conversion that adds the pseudo tag to the feature, though it does the right thing, actually clobbers the unflattening.
git bisect indicates the change occurred way back in 2004, in 3a404dfbc73fec21247022d27d29a5c64ac428aa from @cmungall.
I'll open a branch for testing fixes.
I'm afraid the exact rationale escapes me at the moment. I assume it was to do with an asymmetry in how pseudogene models were typically encoded in genbank records at the time.
For this particular example, the pseudogene mirrors the structure of a gene precisely, even having a CDS. It looks like the code should be simplified here so that pseudogenes are structured symmetrically to genes. But it's not clear what the consequences of making this change would be for other pseudogene records in GenBank. There is also a secondary issue: the pseudogene hierarchy in SO doesn't entirely mirror the gene one (e.g. there is no pseudoCDS).
Thanks for working on this issue. Yes, there is no pseudoCDS in SO so far. There is pseudogenic_exon instead.
The relevant code in BioPerl 1.6.9 is:
    # PSEUDOGENES, PSEUDOEXONS AND PSEUDOINTRONS
    # these are indicated with the /pseudo tag
    # these are mapped to a different type; they should NOT
    # be treated as normal genes
    foreach my $sf (@all_seq_features) {
        if ($sf->has_tag('pseudo')) {
            my $type = $sf->primary_tag;
            # SO type is typically the same as the normal
            # type but preceeded by "pseudo"
            if ($type eq 'misc_RNA' || $type eq 'mRNA') {
                # dgg: see TypeMapper; both pseudo mRNA,misc_RNA should be pseudogenic_transcript
                $sf->primary_tag("pseudotranscript");
            }
            else {
                $sf->primary_tag("pseudo$type");
            }
        }
    }
I propose the following:
    foreach my $sf (@all_seq_features) {
        if ($sf->has_tag('pseudo')) {
            my $type = $sf->primary_tag;
            if ($type eq 'gene') {
                $sf->primary_tag("pseudogene");
            } elsif ($type eq 'CDS') {
                $sf->primary_tag("pseudogenic_exon");
            } elsif ($type eq 'mRNA') {
                $sf->primary_tag("pseudogenic_transcript");
            }
            else {
                $sf->primary_tag("pseudogenic_$type");
            }
        }
    }
After this, you can unflatten the structure as
    pseudogene -> pseudogenic_transcript -> pseudogenic_exon
    pseudogene -> pseudogenic_tRNA -> pseudogenic_exon
and so on.
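To make the difference between the two naming rules concrete, here is a minimal standalone sketch that compares them on a few feature types. It does not use BioPerl at all; the function names `legacy_pseudo_type` and `proposed_pseudo_type` are hypothetical helpers introduced only for illustration, and the rules are copied from the two snippets quoted in this thread.

```perl
use strict;
use warnings;

# Rule quoted from BioPerl 1.6.9 in this thread: mRNA and misc_RNA become
# "pseudotranscript", everything else gets a plain "pseudo" prefix.
# (Hypothetical helper name, not part of BioPerl.)
sub legacy_pseudo_type {
    my ($type) = @_;
    return 'pseudotranscript' if $type eq 'mRNA' || $type eq 'misc_RNA';
    return "pseudo$type";
}

# Rule proposed in this thread: explicit names for gene, CDS and mRNA,
# and a "pseudogenic_" prefix for everything else.
my %PROPOSED = (
    gene => 'pseudogene',
    CDS  => 'pseudogenic_exon',
    mRNA => 'pseudogenic_transcript',
);

# (Hypothetical helper name, not part of BioPerl.)
sub proposed_pseudo_type {
    my ($type) = @_;
    return $PROPOSED{$type} // "pseudogenic_$type";
}

# Print both mappings side by side for a few GenBank feature keys.
for my $type (qw(gene mRNA CDS tRNA)) {
    printf "%-5s legacy=%-17s proposed=%s\n",
        $type, legacy_pseudo_type($type), proposed_pseudo_type($type);
}
```

Note that under the proposed rule a /pseudo misc_RNA would fall through to the generic prefix, whereas the 1.6.9 code treats it like mRNA; whether that matters for unflattening is not settled in this thread.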
When I use Bio::SeqFeature::Tools::Unflattener to convert a GenBank flat feature list to a containment hierarchy, everything is fine except for genes that have the /pseudo tag.
I would like to know whether there is any other parameter or method I can use to convert both genes and pseudogenes correctly.
sample files
http://www.ncbi.nlm.nih.gov/nuccore/FR823391
http://www.ncbi.nlm.nih.gov/nuccore/GL636509
sample codes
I am using bioperl_live/1.6.9:
    $ perl -MBio::Root::Version -e 'print $Bio::Root::Version::VERSION,"\n"'
    1.0069
Everything is fine when I use BioPerl version 1.4. I am not sure what changed in Bio::SeqFeature::Tools::Unflattener between 1.4 and 1.6.9.
The outputs of the above code differ between BioPerl 1.6.9 and 1.4.
Thanks.