Open akeeeshi opened 5 years ago
I'm surprised by your finding. Based on the query below, this appears to be an historical artifact that affected a small number of transcripts loaded before Aug 2016.
I have historical "txinfo" files that contain the data that were actually loaded and the gene is indeed blank in some cases. (BTW, I wish this were NULL and not blank, but that's an aside.)
The txinfo files are made by merging several files from NCBI; one of those merge steps adds gene symbols to the txinfo. Unfortunately, NCBI's file releases were not coordinated back then, so snapshots of those files were sometimes inconsistent. For example, one version of a transcript might be in one file and a different version in another. So, my best guess for what happened here is that these particular transcripts did not have transcript-gene associations at the time. The data loader will handle cases where the transcript exists and the associated gene symbol changes, so I think the cases you found were first created when the gene symbol didn't exist, and they persist because the transcript was not reloaded before it became deprecated. Furthermore, because it didn't happen at all between 2016 and 2018, I think something about the process got fixed. This is all a guess at the mechanism.
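To make the guessed mechanism concrete, here is a minimal Python sketch. It is not the UTA loader: the function names (`build_txinfo`, `load`), the in-memory dictionaries, and the `NM_000000.x` / `GENE1` values are all hypothetical. It assumes a left-join style merge that falls back to a blank symbol, and a loader that only touches transcripts present in the snapshot being loaded.

```python
def build_txinfo(transcripts, gene_map):
    """Merge transcript accessions with a transcript-to-gene map (left join).

    Accessions missing from gene_map get a blank symbol, mimicking
    source-file snapshots that were not released in sync.
    """
    return {ac: gene_map.get(ac, "") for ac in transcripts}


def load(db, txinfo):
    """Insert new transcripts and overwrite the symbol of reloaded ones."""
    for ac, hgnc in txinfo.items():
        db[ac] = hgnc
    return db


db = {}

# Snapshot 1 (hypothetical): the transcript file already lists NM_000000.1,
# but the gene file does not yet associate it with a gene.
load(db, build_txinfo(["NM_000000.1"], {}))

# Snapshot 2 (hypothetical): the gene file has caught up, but the transcript
# was superseded by NM_000000.2, so the .1 row is never reloaded.
gene_map = {"NM_000000.1": "GENE1", "NM_000000.2": "GENE1"}
load(db, build_txinfo(["NM_000000.2"], gene_map))

# The stale row keeps its blank symbol.
blank = sorted(ac for ac, hgnc in db.items() if hgnc == "")
print(blank)
```

Under these assumptions the blank symbol survives only for transcripts that drop out of later snapshots, which would be consistent with a small, fixed set of affected rows.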
I am actively thinking about completely rewriting UTA to streamline loading so that it's easier to keep UTA up-to-date.
We could update transcripts to add symbols where missing. Would that be helpful?
Thanks, Reece
anonymous@uta/uta=> select min(added),max(added),count(distinct added) n_dates, count(*) as n_transcripts, hgnc = '' as hgnc_is_blank from uta_20180821.transcript group by 5;
┌────────────────────────────┬────────────────────────────┬─────────┬───────────────┬───────────────┐
│ min │ max │ n_dates │ n_transcripts │ hgnc_is_blank │
├────────────────────────────┼────────────────────────────┼─────────┼───────────────┼───────────────┤
│ 2014-02-11 00:00:18.453854 │ 2018-08-22 08:52:41.710398 │ 22 │ 249873 │ f │
│ 2014-02-11 00:00:18.453854 │ 2016-08-26 16:50:08.119785 │ 4 │ 36 │ t │
└────────────────────────────┴────────────────────────────┴─────────┴───────────────┴───────────────┘
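The backfill suggested above ("update transcripts to add symbols where missing") could look roughly like the sketch below. It runs against an in-memory SQLite database purely so that it is self-contained; UTA itself is PostgreSQL. The `symbol_source` table, the `ac` column name, and all accessions and symbols are assumptions made for illustration, not the actual UTA schema or migration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Toy stand-in for the transcript table; only hgnc appears in the thread.
    CREATE TABLE transcript (ac TEXT PRIMARY KEY, hgnc TEXT NOT NULL);
    INSERT INTO transcript VALUES ('NM_000000.1', ''), ('NM_000001.1', 'GENE2');

    -- Hypothetical lookup built from a current transcript-to-gene source.
    CREATE TABLE symbol_source (ac TEXT PRIMARY KEY, hgnc TEXT NOT NULL);
    INSERT INTO symbol_source VALUES ('NM_000000.1', 'GENE1'), ('NM_000001.1', 'OTHER');
""")

# Fill in symbols only where the current value is blank and a replacement
# exists; rows that already have a symbol are left untouched.
cur = conn.execute("""
    UPDATE transcript
       SET hgnc = (SELECT s.hgnc FROM symbol_source s WHERE s.ac = transcript.ac)
     WHERE hgnc = ''
       AND EXISTS (SELECT 1 FROM symbol_source s WHERE s.ac = transcript.ac)
""")
conn.commit()

updated = cur.rowcount
rows = dict(conn.execute("SELECT ac, hgnc FROM transcript ORDER BY ac"))
print(updated, rows)
```

Restricting the update to blank rows keeps the change small and easy to audit: the number of rows touched should match the count of blank-`hgnc` transcripts that have a symbol available.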
Updating the transcripts to add the symbols would be awesome! Would this be part of a new UTA release coming down the road?
As a side note, please let us know if there is any way we can be helpful if you choose to rewrite (QA, testing, etc.). This project has been immensely valuable to our organization, so we want to contribute back in whatever way seems most helpful.
Definitely part of a new release and carried into future releases.
Thank you for the offer. I will take you up on PRs and offers to help eventually.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been stalled for 7 days with no activity.
This issue was closed by stalebot. It has been reopened to give more time for community review. See biocommons coding guidelines for stale issue and pull request policies. This resurrection is expected to be a one-time event.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.
Our team recently noticed that for a small subset of transcripts within UTA the hgnc field is empty. See the entry below comparing the transcript records for the BRAF and MFSD11 genes.
Upon investigating further, our team found another ~36 transcripts that have this issue, using a SQL query filtering on hgnc = ''.
I was wondering what your thoughts were on what the genesis of this discrepancy could be? It seems as if RefSeq is up to date in terms of associating this transcript with the MFSD11 gene. I'm trying my best to ascertain whether this is something where the source of the UTA data would need to be fixed, or an issue with the process by which the UTA db is built.
As a note, we also checked this issue in older versions of UTA and they appear to occur there as well.
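For anyone who wants to reproduce this kind of check, here is a minimal sketch of the lookup. It uses an in-memory SQLite table with made-up rows so that it runs on its own; the `ac` column name and the accessions are assumptions, and against a real UTA instance the same `WHERE` clause would be run on the release schema's transcript table. The filter also matches NULL, in case a given release stores missing symbols that way rather than as an empty string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transcript (ac TEXT PRIMARY KEY, hgnc TEXT);
    INSERT INTO transcript VALUES
        ('NM_000000.1', ''),       -- blank symbol
        ('NM_000001.1', 'GENE2'),  -- symbol present
        ('NM_000002.1', NULL);     -- missing symbol stored as NULL
""")

# Treat empty string and NULL alike when looking for missing gene symbols.
missing = [ac for (ac,) in conn.execute(
    "SELECT ac FROM transcript WHERE hgnc IS NULL OR hgnc = '' ORDER BY ac"
)]
print(missing)
```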