af-lab / histone-catalogue

Core histone catalogue --- Live manuscript

Treatment of genes with multiple transcripts and encoded proteins #17

aflaus closed 7 years ago

aflaus commented 8 years ago

Table S11 currently lists HIST1H2BD, HIST1H2BK and HIST2H2BF with two transcripts/exons each. On manual inspection, the encoded proteins are identical except for NM_001161334/NP_001154806.

This is a problem for alignments and values derived from them (e.g. homology). The most correct solution is to include both. How hard is this to implement?

[Image: h2bmultipletranscripts-alignedproteins]

carandraug commented 8 years ago

That most correct solution is quite a lot of work, since in most cases we are using a hash (a dict in Python) where the gene symbols are the keys. This works because all canonical histones are supposed to have only one product (or none in the case of pseudogenes). Supporting more than one product per gene would be a very large rewrite of the underlying code.

Anyway, I thought our argument was going to be that those are annotation errors (and possibly experimental mistakes, since we do have EST matches but the histone sequences are too similar).
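To illustrate the constraint, here is a minimal Python sketch (not the project's actual code) of what happens when products are stored one per gene symbol, compared with a list per gene symbol. The accessions `NP_EXAMPLE_1` and `NP_EXAMPLE_2` are made-up placeholders, and the variable names are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical records: (gene symbol, protein accession).
# The accessions are made-up placeholders, not real RefSeq IDs.
records = [
    ("HIST2H2BF", "NP_EXAMPLE_1"),
    ("HIST2H2BF", "NP_EXAMPLE_2"),
]

# One value per gene symbol: the second record silently replaces the first.
single = {}
for symbol, protein in records:
    single[symbol] = protein

# One list per gene symbol: every product is kept, in input order.
multi = defaultdict(list)
for symbol, protein in records:
    multi[symbol].append(protein)

print(single)
print(dict(multi))
```

Every piece of code that reads `single[symbol]` as one protein would have to learn to handle a list instead, which is where the rewrite cost comes from.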

aflaus commented 8 years ago

I thought you might say that. Can't we just loop over the predicted proteins associated with the gene ID and add each one to the alignment?

Anyway, we need a deterministic rule that picks one transcript/protein ID predictably and reproducibly. For example, the one with "transcript variant 1" or "isoform a" in the description. Or even the one with the lowest ID, since there is often a correlation between wackiness and later identification.
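A rough sketch of such a rule in Python, assuming each product is available as an (accession, description) pair. The function name `pick_canonical` and the example accessions are hypothetical, and this is only one possible ordering of the criteria mentioned above.

```python
import re

def pick_canonical(products):
    """Pick one (accession, description) pair by a fixed, reproducible rule.

    Preference order (an assumption, not an agreed policy):
    1. description mentions "transcript variant 1" or "isoform a";
    2. lowest numeric part of the accession;
    3. accession string, as a final tie-breaker.
    """
    def rank(product):
        accession, description = product
        text = description.lower()
        preferred = bool(re.search(r"\b(transcript variant 1|isoform a)\b", text))
        # Drop any ".version" suffix, then keep only the digits.
        digits = re.sub(r"\D", "", accession.split(".")[0])
        number = int(digits) if digits else 0
        return (not preferred, number, accession)

    return min(products, key=rank)

# Made-up accessions and descriptions, for illustration only.
example = [
    ("NM_900000002.1", "example gene, transcript variant 2"),
    ("NM_900000007.1", "example gene, transcript variant 1"),
]
print(pick_canonical(example))
```

Because the choice depends only on the records themselves and not on their input order, repeated runs against the same data give the same pick.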

carandraug commented 8 years ago

David will change the code to include both transcripts in the alignments and in the list of changes between isoforms.

Andrew will write a paragraph about why we think these two transcripts are wrong and why they need our help.

carandraug commented 8 years ago

I finally got to the bottom of this. It is fixed now. The alignments and the table with descriptions will mention both protein products.

Andrew, I'm assigning the rest to you.

aflaus commented 8 years ago

In table 5 we have 3 H2B genes with two transcripts. You have nicely suffixed these with a number. I adjusted the figure legend to reflect this.

I'm going to ignore the need for a specifier in table S2 linking the IDs with the variations in table 5. Since there is only 1 case that is distinctive, anyone interested can just work it out for themselves ...

One final request: please can we also fill in the gene type/name/UID in the currently empty fields of rows with multiple transcripts?

aflaus commented 8 years ago

Just a thought: to be consistent with the suggestion of Talbert et al 2012 (their table 1) we should use a "." and not a "-" for the isoform subtypes of H2B in table 5. For example, we should write HIST2H2BF.1 and HIST2H2BF.2.

Could you change this please. It should be trivial :-)

carandraug commented 7 years ago

I have made both changes: using a dot instead of a dash, and repeating the histone type/gene symbol/gene UID information for genes with multiple transcripts.
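For anyone reading later, the two changes amount to something like the following Python sketch. It is an illustration only, not the code in the repository: the function name `label_rows`, the record layout, and all accessions and UIDs are hypothetical placeholders.

```python
def label_rows(genes):
    """Build one table row per protein product.

    Genes with several products get ".1", ".2", ... appended to the symbol,
    and the type/symbol/UID columns are repeated on every row.
    """
    rows = []
    for gene in genes:
        products = gene["products"]
        for index, protein in enumerate(products, start=1):
            if len(products) == 1:
                label = gene["symbol"]
            else:
                label = f'{gene["symbol"]}.{index}'
            rows.append((gene["type"], gene["symbol"], gene["uid"], label, protein))
    return rows

# Placeholder data: the UIDs and accessions are made up.
genes = [
    {"type": "H2B", "symbol": "HIST2H2BF", "uid": 0,
     "products": ["NP_EXAMPLE_1", "NP_EXAMPLE_2"]},
    {"type": "H2B", "symbol": "EXAMPLE1", "uid": 1,
     "products": ["NP_EXAMPLE_3"]},
]
for row in label_rows(genes):
    print("\t".join(str(field) for field in row))
```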