SBRG / bigg_models

The BiGG Models website server
http://bigg.ucsd.edu

Duplicate genes during genome loading #57

Open zakandrewking opened 9 years ago

zakandrewking commented 9 years ago

This happens a lot during genome loading:

INFO:root:Loading genome from genbank file (7 of 150) AM946981.gb
WARNING:root:Duplicate genes B21_02159 on chromosome 8943870
WARNING:root:Duplicate genes B21_02376 on chromosome 8943870

jslu9 commented 9 years ago

The warning appears whenever a gene has multiple CDS entries. Component loading parses the file CDS by CDS, so the same locus tag is encountered more than once. B21_02159 has two CDS entries:

 gene            2275802..2277159
                 /gene="ybl104"
                 /locus_tag="B21_02159"
 CDS             join(2275802..2276260,2276260..2277159)
                 /gene="ybl104"
                 /locus_tag="B21_02159"
                 /ribosomal_slippage
                 /codon_start=1
                 /transl_table=11
                 /product="ISEcB1 transposase"
                 /protein_id="CBY77859.1"
                 /db_xref="GI:313848701"
                 /db_xref="EnsemblGenomes:B21_02159"
                 /db_xref="EnsemblGenomes:CBY77859"
                 /db_xref="GOA:E5QQH5"
                 /db_xref="InterPro:IPR001584"
                 /db_xref="InterPro:IPR009057"
                 /db_xref="InterPro:IPR011991"
                 /db_xref="InterPro:IPR012337"
                 /db_xref="InterPro:IPR025948"
                 /db_xref="UniProtKB/TrEMBL:E5QQH5"
 CDS             2275802..2276323
                 /gene="ybl104"
                 /locus_tag="B21_02159"
                 /codon_start=1
                 /transl_table=11
                 /product="ISEcB1 protein A"
                 /protein_id="CAQ32676.2"
                 /db_xref="GI:313848700"
                 /db_xref="EnsemblGenomes:B21_02159"
                 /db_xref="EnsemblGenomes:CAQ32676"
                 /db_xref="GOA:C5W711"
                 /db_xref="InterPro:IPR009057"
                 /db_xref="InterPro:IPR011991"
                 /db_xref="UniProtKB/TrEMBL:C5W711"
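
To make the cause concrete, here is a minimal sketch of grouping CDS entries by locus tag before loading, so that a gene with several CDS entries is seen once. It is not the bigg_models loader: the tuple layout and the function name `group_cds_by_locus_tag` are assumptions for illustration, and in practice the features would come from a GenBank parser rather than a hand-written list. The coordinates are copied from the excerpt above, with the `join(...)` location collapsed to its outer bounds.

```python
from collections import defaultdict

# Hypothetical, simplified records: (feature_type, locus_tag, start, end).
features = [
    ("gene", "B21_02159", 2275802, 2277159),
    ("CDS", "B21_02159", 2275802, 2277159),  # join(...) collapsed to outer bounds
    ("CDS", "B21_02159", 2275802, 2276323),
]


def group_cds_by_locus_tag(features):
    """Collect the (start, end) span of every CDS under its locus tag."""
    grouped = defaultdict(list)
    for feature_type, locus_tag, start, end in features:
        if feature_type == "CDS":
            grouped[locus_tag].append((start, end))
    return dict(grouped)


cds_by_gene = group_cds_by_locus_tag(features)

# Locus tags with more than one CDS are the ones that trigger the warning.
duplicates = {tag: spans for tag, spans in cds_by_gene.items() if len(spans) > 1}
print(duplicates)
```

Grouping first means the loader could create one gene per locus tag and decide separately what to do with the extra CDS entries.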

zakandrewking commented 9 years ago

The only downside to loading CDSs this way is that leftpos and rightpos correspond to a single CDS, not to the whole gene, so they will need to be fixed eventually.

But things work pretty well right now.
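
One possible way to fix the positions, sketched here under the assumption that a gene's extent can be approximated by the outer bounds of all its CDS spans (the `gene` feature itself would be more authoritative where it exists). The helper name `gene_span` is hypothetical; `leftpos` and `rightpos` follow the names used above.

```python
def gene_span(cds_spans):
    """Return (leftpos, rightpos) covering every CDS span of one gene."""
    if not cds_spans:
        raise ValueError("gene has no CDS spans")
    leftpos = min(start for start, _ in cds_spans)
    rightpos = max(end for _, end in cds_spans)
    return leftpos, rightpos


# The two CDS spans of B21_02159 from the GenBank excerpt in this thread,
# with the join(...) location collapsed to its outer bounds.
b21_02159_cds = [(2275802, 2277159), (2275802, 2276323)]

leftpos, rightpos = gene_span(b21_02159_cds)
print(leftpos, rightpos)  # 2275802 2277159
```

For B21_02159 this reproduces the `gene` feature's range, 2275802..2277159.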

zakandrewking commented 9 years ago

Also, we only look at CDSs right now, so genes without a CDS entry never get a match. For example:

WARNING:root:Gene not in genbank file: YPL276W from model iMM904
WARNING:root:Gene not in genbank file: YPL275W from model iMM904
WARNING:root:Gene not in genbank file: 57733_AT1 from model RECON1
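
A rough sketch of one way to also match genes that have no CDS: index `gene` features first and let CDS entries fill in only the locus tags that have no `gene` feature. Everything here is an assumption for illustration. The function name `index_genes`, the tuple layout, the locus tags, and the coordinates are made up and do not come from any real GenBank file or from the bigg_models code.

```python
def index_genes(features):
    """Map locus_tag -> (leftpos, rightpos), preferring gene features."""
    index = {}
    # First pass: gene features give whole-gene coordinates.
    for feature_type, locus_tag, start, end in features:
        if feature_type == "gene":
            index[locus_tag] = (start, end)
    # Second pass: a CDS only fills in a locus tag with no gene feature.
    # If several such CDS entries share a tag, the first one wins.
    for feature_type, locus_tag, start, end in features:
        if feature_type == "CDS" and locus_tag not in index:
            index[locus_tag] = (start, end)
    return index


# Hypothetical features: GENE_B has a gene feature but no CDS,
# GENE_C has a CDS but no gene feature.
features = [
    ("gene", "GENE_A", 100, 400),
    ("CDS", "GENE_A", 100, 250),
    ("gene", "GENE_B", 500, 650),
    ("CDS", "GENE_C", 700, 900),
]

index = index_genes(features)

# Hypothetical model gene list; only GENE_D is absent from the file.
model_genes = ["GENE_A", "GENE_B", "GENE_D"]
missing = [gene for gene in model_genes if gene not in index]
print(missing)
```

With this approach, a "Gene not in genbank file" warning would be raised only for genes that appear in neither a `gene` nor a CDS feature.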