dieterich-lab / scimodom

Sci-ModoM: A quantitative database of transcriptome-wide high-throughput RNA modification sites
https://dieterich-lab.github.io/scimodom/
GNU Affero General Public License v3.0

Missing annotation for some records #57

Closed eboileau closed 7 months ago

eboileau commented 9 months ago

A clear and concise description of what the bug is.

Records located in intronic regions of a gene on the opposite strand are not annotated. I suspect that such cases are simply misannotated, i.e. the bedRMod record itself is wrong, e.g. 10:7596572-7596573 or 10:10426859-10426860.

id chrom      start        end name  score strand gene_name_gc gene_id_gc gene_biotype_gc feature_gc
10    7596572    7596573  m6A   1000      +         None       None            None       None
10   10426859   10426860  m6A   1000      +         None       None            None       None

These are most likely part of some contig, but we do not include contigs in Sci-ModoM.

With the current search query, these records are lost in the INNER JOIN with GenomicAnnotation, because their data.id values do not exist in GenomicAnnotation. We would need a LEFT OUTER JOIN to recover them, but first we need to sort out the performance bottlenecks.
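To illustrate the difference between the two joins, here is a minimal sketch using the standard-library sqlite3 module. The table layout, column names, and the second (annotated) row are simplified assumptions for illustration only; this is not the actual Sci-ModoM schema or search query.

```python
import sqlite3

# Hypothetical, stripped-down stand-ins for the data and
# genomic_annotation tables (not the real Sci-ModoM schema).
con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE data (
        id INTEGER PRIMARY KEY,
        chrom TEXT, start_pos INTEGER, end_pos INTEGER, strand TEXT
    );
    CREATE TABLE genomic_annotation (
        data_id INTEGER NOT NULL,
        feature VARCHAR(32) NOT NULL
    );
    -- Record 1 has no matching annotation row; record 2 (made up) has one.
    INSERT INTO data VALUES (1, '10', 7596572, 7596573, '+');
    INSERT INTO data VALUES (2, '1', 100, 101, '-');
    INSERT INTO genomic_annotation VALUES (2, 'Exonic');
    """
)

# INNER JOIN: record 1 is dropped because no annotation row matches it.
inner_rows = con.execute(
    "SELECT d.id, a.feature FROM data d "
    "JOIN genomic_annotation a ON a.data_id = d.id ORDER BY d.id"
).fetchall()

# LEFT OUTER JOIN: record 1 is kept, with NULL (None) for the annotation.
left_rows = con.execute(
    "SELECT d.id, a.feature FROM data d "
    "LEFT OUTER JOIN genomic_annotation a ON a.data_id = d.id ORDER BY d.id"
).fetchall()

print(inner_rows)
print(left_rows)
```

With the outer join, unannotated records come back with None in the annotation columns, which matches the None values shown in the table above.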

Note that such records cannot be entered in GenomicAnnotation with an empty feature, because the column is not nullable: feature: Mapped[str] = mapped_column(String(32), nullable=False).
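A rough sketch of what the constraint means in practice, again with sqlite3 and a hypothetical table reduced to two columns. It assumes that nullable=False translates to a NOT NULL column, which is how SQLAlchemy normally emits it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical reduced table; only the NOT NULL constraint on feature matters here.
con.execute(
    "CREATE TABLE genomic_annotation ("
    "data_id INTEGER NOT NULL, feature VARCHAR(32) NOT NULL)"
)

try:
    # Attempt to store a record without any feature.
    con.execute("INSERT INTO genomic_annotation VALUES (?, ?)", (1, None))
    rejected = False
except sqlite3.IntegrityError as exc:
    # The database refuses the row because feature is NULL.
    rejected = True
    print(f"insert rejected: {exc}")

row_count = con.execute("SELECT COUNT(*) FROM genomic_annotation").fetchone()[0]
```

So a placeholder row is not an option without either relaxing the constraint or introducing a dedicated feature value, both of which would be schema decisions.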

Output or error messages.

No response

Additional context

No response

What browser were you using?

Firefox

What version of Sci-ModoM were you using?

dev

eboileau commented 7 months ago

e.g.

10      7596572 7596573 m6A     1000    +       10      ensembl_havana  gene    7559270 7666998 .       -       .       gene_id "ENSG00000123243"; gene_version "15"; gene_name "ITIH5"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
10     7596572 7596573 m6A     1000    +       10      ensembl_havana  transcript      7559270 7666966 .       -       .       gene_id "ENSG00000123243"; gene_version "15"; transcript_id "ENST00000397146"; transcript_version "7"; gene_name "ITIH5"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "ITIH5-202"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS31139"; tag "basic"; tag "Ensembl_canonical"; tag "MANE_Select"; transcript_support_level "1 (assigned to previous version 6)";
10     7596572 7596573 m6A     1000    +       10      ensembl_havana  transcript      7562424 7619660 .       -       .       gene_id "ENSG00000123243"; gene_version "15"; transcript_id "ENST00000613909"; transcript_version "4"; gene_name "ITIH5"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "ITIH5-209"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS31140"; tag "basic"; transcript_support_level "1";
10     7596572 7596573 m6A     1000    +       10      havana  transcript      7571405 7640779 .       -       .       gene_id "ENSG00000123243"; gene_version "15"; transcript_id "ENST00000434980"; transcript_version "5"; gene_name "ITIH5"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "ITIH5-203"; transcript_source "havana"; transcript_biotype "protein_coding_CDS_not_defined"; transcript_support_level "2";
10     7596572 7596573 m6A     1000    +       10      ensembl transcript      7571405 7666998 .       -       .       gene_id "ENSG00000123243"; gene_version "15"; transcript_id "ENST00000397145"; transcript_version "6"; gene_name "ITIH5"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "ITIH5-201"; transcript_source "ensembl"; transcript_biotype "protein_coding"; tag "basic"; transcript_support_level "2";
10     7596572 7596573 m6A     1000    +       10      havana  transcript      7572772 7622477 .       -       .       gene_id "ENSG00000123243"; gene_version "15"; transcript_id "ENST00000476417"; transcript_version "5"; gene_name "ITIH5"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "ITIH5-207"; transcript_source "havana"; transcript_biotype "retained_intron"; transcript_support_level "2";
10     7596572 7596573 m6A     1000    +       10      havana  transcript      7576892 7617266 .       -       .       gene_id "ENSG00000123243"; gene_version "15"; transcript_id "ENST00000461751"; transcript_version "1"; gene_name "ITIH5"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "ITIH5-204"; transcript_source "havana"; transcript_biotype "nonsense_mediated_decay"; tag "cds_start_NF"; tag "mRNA_start_NF"; transcript_support_level "5";

If a record does not intersect any annotation on the "correct" strand (intergenic records aside), there is not much we can do.
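For reference, a minimal sketch of the strand check using the coordinates from the output above. The Interval type and overlaps helper are hypothetical names for illustration, not part of the Sci-ModoM code base, and I am assuming the record is 0-based half-open (BED-like) while the GTF gene start is 1-based.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    strand: str


def overlaps(a: Interval, b: Interval, same_strand: bool = True) -> bool:
    """Return True if a and b overlap, optionally requiring the same strand."""
    if a.chrom != b.chrom:
        return False
    if same_strand and a.strand != b.strand:
        return False
    return a.start < b.end and b.start < a.end


# m6A record on the + strand, as in the bedRMod data.
record = Interval("10", 7596572, 7596573, "+")
# ITIH5 gene on the - strand; GTF start 7559270 converted to 0-based.
itih5 = Interval("10", 7559269, 7666998, "-")

stranded_hit = overlaps(record, itih5)                       # strand-aware
unstranded_hit = overlaps(record, itih5, same_strand=False)  # strand ignored

print(stranded_hit, unstranded_hit)
```

The record only hits ITIH5 when strand is ignored, so a strand-aware annotation leaves it without gene or feature.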