Clinical-Genomics / schug

All you need to know about genes, transcripts and exons
https://clinical-genomics.github.io/schug/

Exons not showing on gene report #74

Open Jakob37 opened 1 day ago

Jakob37 commented 1 day ago

Describe the bug

I am test running Chanjo2 in preparation for further demonstrating and discussing (finally, hopefully) putting it into production.

Reports look fine, and MANE info looks good. But I no longer see any exon information:

mane_report_1

Clicking into a gene, it says "no exons stats available for this transcript"

no_exon_stats

Looking in MariaDB, the exons do appear to be loaded:

MariaDB [chanjo2]> select * from exons limit 10;
+----+------------+---------+---------+--------------------+-----------------+-----------------------+-----------------+--------+
| id | chromosome | start   | stop    | rank_in_transcript | ensembl_id      | ensembl_transcript_id | ensembl_gene_id | build  |
+----+------------+---------+---------+--------------------+-----------------+-----------------------+-----------------+--------+
|  1 | 1          | 3069168 | 3069296 |                  1 | ENSE00002048533 | ENST00000511072       | ENSG00000142611 | GRCh38 |
|  2 | 1          | 3186125 | 3186474 |                  2 | ENSE00001754112 | ENST00000511072       | ENSG00000142611 | GRCh38 |
|  3 | 1          | 3244087 | 3244137 |                  3 | ENSE00003480863 | ENST00000511072       | ENSG00000142611 | GRCh38 |
|  4 | 1          | 3385149 | 3385286 |                  4 | ENSE00002034212 | ENST00000511072       | ENSG00000142611 | GRCh38 |
|  5 | 1          | 3396491 | 3396593 |                  5 | ENSE00003700221 | ENST00000511072       | ENSG00000142611 | GRCh38 |
|  6 | 1          | 3402791 | 3402998 |                  6 | ENSE00003696962 | ENST00000511072       | ENSG00000142611 | GRCh38 |
|  7 | 1          | 3404739 | 3404886 |                  7 | ENSE00003700688 | ENST00000511072       | ENSG00000142611 | GRCh38 |
|  8 | 1          | 3405495 | 3405648 |                  8 | ENSE00003700645 | ENST00000511072       | ENSG00000142611 | GRCh38 |
|  9 | 1          | 3411384 | 3412800 |                  9 | ENSE00003695658 | ENST00000511072       | ENSG00000142611 | GRCh38 |
| 10 | 1          | 3414560 | 3414647 |                 10 | ENSE00003701451 | ENST00000511072       | ENSG00000142611 | GRCh38 |
+----+------------+---------+---------+--------------------+-----------------+-----------------------+-----------------+--------+
10 rows in set (0.001 sec)

This confuses me a bit, as it apparently worked for me back when reviewing this PR: https://github.com/Clinical-Genomics/chanjo2/pull/369, and I don't think much has changed since then 🤔

Additional context

I have tested this both in Chanjo v2.0 and v2.1. I am running this with Scout v4.90.1.

Might very well be something messed up on our side here. Unsure what though. Debugging pointers are welcome!

northwestwitch commented 1 day ago

Weird, perhaps something happened after that PR, I'll look into it today!

northwestwitch commented 1 day ago

Ah but wait, exons did NOT show in MANE report, only on gene report, but I guess that's what you mean?

Jakob37 commented 1 day ago

but I guess that's what you mean?

Not 100% sure what I mean 🤔

But I realized now when looking at some more genes that the exons are there and thus Chanjo2 seems to be doing its job:

exons_there_now

Looks like I was just unlucky opening ACTB (screenshot above), which for some reason did not have exons:

no_exon_stats

northwestwitch commented 1 day ago

Ah good! 😄 Have a great day!

Jakob37 commented 1 day ago

Ah good! 😄 Have a great day!

Looks like something in my db, either on Ensembl's part, or maybe more likely that my db is out of sync somehow:

MariaDB [chanjo2]> select * from transcripts where ensembl_id="ENST00000414620";
+--------+------------+---------+---------+-----------------+-------------+------------------+--------------+--------------------+---------------------------+-----------------+--------+
| id     | chromosome | start   | stop    | ensembl_id      | refseq_mrna | refseq_mrna_pred | refseq_ncrna | refseq_mane_select | refseq_mane_plus_clinical | ensembl_gene_id | build  |
+--------+------------+---------+---------+-----------------+-------------+------------------+--------------+--------------------+---------------------------+-----------------+--------+
| 669677 | 7          | 5529282 | 5562790 | ENST00000414620 | NULL        | NULL             | NULL         | NULL               | NULL                      | ENSG00000075624 | GRCh38 |
+--------+------------+---------+---------+-----------------+-------------+------------------+--------------+--------------------+---------------------------+-----------------+--------+
1 row in set (0.030 sec)

MariaDB [chanjo2]> select * from exons WHERE ensembl_transcript_id="ENST00000414620";
Empty set (0.000 sec)

You too 😊

Jakob37 commented 1 day ago

Actually, digging around into this a bit more, things still seem weird, but on a more upstream level (i.e. Schug or ENSEMBL).

Take the two transcripts I mentioned above.

actb actn2

First I thought I might have truncated the exon file which I loaded manually. So I reran the Schug download steps for exons and transcripts.

curl localhost:8037/exons/ensembl_exons/?build=38 > ensembl_exons.tsv
curl localhost:8037/transcripts/ensembl_transcripts/?build=38 > ensembl_transcripts.tsv

These completed fine, with the same md5sums as those I previously loaded into Chanjo2.

Now looking for the transcript IDs above:

$ grep ENST00000414620 ensembl_transcripts.tsv
7       ENSG00000075624 ENST00000414620 5529282 5562790
$ grep ENST00000414620 ensembl_exons.tsv
(no output)
$ grep ENST00000366578 ensembl_transcripts.tsv
1       ENSG00000077522 ENST00000366578 236686499       236764631       NM_001103                       NM_001103.4
$ grep ENST00000366578 ensembl_exons.tsv
1       ENSG00000077522 ENST00000366578 ENSE00003612377 236718894       236719013                                       1       3
1       ENSG00000077522 ENST00000366578 ENSE00001820573 236686499       236686799       236686499       236686673                       1       1
1       ENSG00000077522 ENST00000366578 ENSE00003611529 236717858       236717972                                       1       2
1       ENSG00000077522 ENST00000366578 ENSE00003535405 236720105       236720191                                       1       4
1       ENSG00000077522 ENST00000366578 ENSE00003553097 236725933       236726020                                       1       5
...
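
To see how widespread this is, rather than grepping one ID at a time, the transcript IDs in the two downloads can be diffed. A minimal sketch, assuming both files are tab-separated with the transcript ID in the third column (as the grep output above suggests); the function names are made up for illustration and are not part of Schug or Chanjo2:

```python
from pathlib import Path


def transcript_ids(tsv_path, column=2):
    """Collect Ensembl transcript IDs from one column of a tab-separated file."""
    ids = set()
    with open(tsv_path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            # Skip short rows and any header line (anything that is not an ENST ID).
            if len(fields) > column and fields[column].startswith("ENST"):
                ids.add(fields[column])
    return ids


def transcripts_without_exons(transcripts_tsv, exons_tsv):
    """Return transcript IDs found in the transcripts file but not in the exons file."""
    return sorted(transcript_ids(transcripts_tsv) - transcript_ids(exons_tsv))


if __name__ == "__main__":
    # File names as used in the curl commands above.
    transcripts_file = Path("ensembl_transcripts.tsv")
    exons_file = Path("ensembl_exons.tsv")
    if transcripts_file.exists() and exons_file.exists():
        missing = transcripts_without_exons(transcripts_file, exons_file)
        print(f"{len(missing)} transcripts have no exon rows")
        for transcript_id in missing[:20]:
            print(transcript_id)
```

If ENST00000414620 shows up in that list together with many others, the problem is probably not specific to one gene.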

Next, I went to Ensembl for these transcripts. Here I find exons with exon IDs for both.

ENST00000414620

https://www.ensembl.org/Homo_sapiens/Transcript/Exons?db=core;g=ENSG00000075624;r=7:5526409-5563902;t=ENST00000414620

ensembl_41420

ENST00000366578

http://www.ensembl.org/Homo_sapiens/Transcript/Exons?db=core;g=ENSG00000077522;r=1:236664141-236764631;t=ENST00000366578

366578

So it seems that there are exons, but I don't get them through schug? Could you check if you have exons for the corresponding transcript?

northwestwitch commented 1 day ago

Mmm I confirm that it might be a schug/Ensembl thing. The exons are available in build 37, but not 38:

image

northwestwitch commented 1 day ago

Moving this issue to schug then. I'll look into it!

Jakob37 commented 1 day ago

Moving this issue to schug then. I'll look into it!

Thanks 🙏

northwestwitch commented 1 day ago

I've sent the following email to Ensembl, let's see what they reply!

Hello, I'm trying to figure out why we have a bug in our software that downloads transcript data from Ensembl BioMart (human data).

Specifically, we are missing the 4 exons belonging to this transcript: ENST00000414620

The exons are there if you look at the web page: https://www.ensembl.org/Homo_sapiens/Transcript/Exons?db=core;g=ENSG00000075624;r=7:5529282-5562790;t=ENST00000414620

But they are not downloaded when fetching all exons via BioMart. The URL used in BioMart is the following:

https://www.ensembl.org/biomart/martservice?query=%3C?xml%20version=%221.0%22%20encoding=%22UTF-8%22?%3E%3C!DOCTYPE%20Query%3E%3CQuery%20%20virtualSchemaName%20=%20%22default%22%20formatter%20=%20%22TSV%22%20header%20=%20%221%22%20uniqueRows%20=%20%220%22%20count%20=%20%22%22%20datasetConfigVersion%20=%20%220.6%22%20completionStamp%20=%20%221%22%3E%3CDataset%20name%20=%20%22hsapiens_gene_ensembl%22%20interface%20=%20%22default%22%20%3E%3CFilter%20name%20=%20%22chromosome_name%22%20value%20=%20%221,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,MT%22/%3E%3CAttribute%20name%20=%20%22chromosome_name%22%20/%3E%3CAttribute%20name%20=%20%22ensembl_gene_id%22%20/%3E%3CAttribute%20name%20=%20%22ensembl_transcript_id%22%20/%3E%3CAttribute%20name%20=%20%22ensembl_exon_id%22%20/%3E%3CAttribute%20name%20=%20%22exon_chrom_start%22%20/%3E%3CAttribute%20name%20=%20%22exon_chrom_end%22%20/%3E%3CAttribute%20name%20=%20%225_utr_start%22%20/%3E%3CAttribute%20name%20=%20%225_utr_end%22%20/%3E%3CAttribute%20name%20=%20%223_utr_start%22%20/%3E%3CAttribute%20name%20=%20%223_utr_end%22%20/%3E%3CAttribute%20name%20=%20%22strand%22%20/%3E%3CAttribute%20name%20=%20%22rank%22%20/%3E%3C/Dataset%3E%3C/Query%3E

Which is basically using the following attributes:

attributes = [
    "chromosome_name",
    "ensembl_gene_id",
    "ensembl_transcript_id",
    "ensembl_exon_id",
    "exon_chrom_start",
    "exon_chrom_end",
    "5_utr_start",
    "5_utr_end",
    "3_utr_start",
    "3_utr_end",
    "strand",
    "rank",
]

and all chromosomes as filters.

I noticed that when I include the Ensembl gene ID (ENSG00000075624) among the filters, the 4 exons are downloaded.

They are also downloaded when I use the Biomart genome build 37.

Thank you so much for your help!
Chiara
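
For anyone who wants to reproduce the two cases from the email, here is a rough sketch that rebuilds the BioMart query URL with and without a gene filter. The XML attributes, dataset, chromosome filter and attribute list are copied from the URL above; the filter name `ensembl_gene_id` for the gene-restricted variant is an assumption (the email does not name the filter), and the helper names are hypothetical. This is not how Schug builds its queries, only an illustration, and it makes no network request itself:

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

BIOMART_URL = "https://www.ensembl.org/biomart/martservice"
CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y", "MT"]
ATTRIBUTES = [
    "chromosome_name", "ensembl_gene_id", "ensembl_transcript_id",
    "ensembl_exon_id", "exon_chrom_start", "exon_chrom_end",
    "5_utr_start", "5_utr_end", "3_utr_start", "3_utr_end",
    "strand", "rank",
]


def build_query_xml(gene_id=None):
    """Build the BioMart exon query, optionally restricted to one gene."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default", "formatter": "TSV", "header": "1",
        "uniqueRows": "0", "count": "", "datasetConfigVersion": "0.6",
        "completionStamp": "1",
    })
    dataset = ET.SubElement(
        query, "Dataset", {"name": "hsapiens_gene_ensembl", "interface": "default"}
    )
    ET.SubElement(
        dataset, "Filter", {"name": "chromosome_name", "value": ",".join(CHROMOSOMES)}
    )
    if gene_id:
        # Assumed filter name; adjust if BioMart expects something else.
        ET.SubElement(dataset, "Filter", {"name": "ensembl_gene_id", "value": gene_id})
    for name in ATTRIBUTES:
        ET.SubElement(dataset, "Attribute", {"name": name})
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>' + body


def build_query_url(gene_id=None):
    """Return the full martservice URL with the query percent-encoded."""
    return f"{BIOMART_URL}?query={quote(build_query_xml(gene_id))}"


if __name__ == "__main__":
    print(build_query_url())                   # all chromosomes, no gene filter
    print(build_query_url("ENSG00000075624"))  # same query restricted to one gene
```

Downloading both URLs and grepping each result for ENST00000414620 should show whether the exons only appear in the gene-filtered response.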