PROconsortium / PRoteinOntology


Implement use of sequence file for Interactive Sequence Alignment #214

Open nataled opened 3 years ago

nataled commented 3 years ago

For the alignment, we want sequences for all eligible terms, and these sequences should reflect certain sequence modifications such as removed signal peptides. This information is currently provided by /data/pir/projects/pro/current/for_release/files_for_julie/seqforpro.seq, but that file does not cover all of the sequences we need.

Note: this issue is dependent upon the PRO_sequences.fa file created in response to issue #193 and that will henceforth be available with each release in the appropriate internal pre-release folder /data/pir/projects/pro/releaseNN

Assignment to @hongzhanhuang to ensure this file is copied to the correct folder.

nataled commented 3 years ago

@Julie-Cowart will need to verify that the file generated by Hongzhan (seqforpro.seq) is not used elsewhere. Will also need to follow HOW that file is used in the database.

nataled commented 3 years ago

Examples are needed!

Julie-Cowart commented 3 years ago

Issues this change should address (with examples):

Discussion points

nataled commented 3 years ago

Options for display after alignment:
1) Put gaps where sequence is removed - consider that large N-terminal removals will need long right-scrolls to see anything.
2) Show only alignment as would normally be viewed.
3) Just label the alignment positions as if full sequence is there.

Examples of a deletion within the sequence: https://proconsortium.org/cgi-bin/textsearch_pro?searchtype2=&searchtype=&tmp=&search.x=17&search.y=6&field0=ALLFLDS&query0=presenilin-1&andor1=AND&field1=ALLFLDS&query1=variant&andor2=AND&field2=ALLFLDS&query2=del

Suggest that we try aligning three types of sequence: 1) full UniProtKB sequence (as before) 2) with dashes in gaps 3) without dashes in gaps (pre-removal)
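A minimal sketch of how the three candidate inputs could be generated for one term, assuming the removed regions are known as 1-based inclusive ranges on the full UniProtKB sequence. The function name `sequence_variants` and the reading of option 3 as "removed residues dropped, no placeholder" are assumptions for illustration, not the PRO codebase.

```python
def sequence_variants(full_seq, removed):
    """Build the three candidate alignment inputs for one term.

    full_seq: the full UniProtKB sequence.
    removed:  list of 1-based inclusive (start, end) ranges that the
              proteoform lacks (hypothetical input format).
    """
    keep = [True] * len(full_seq)
    for start, end in removed:
        for i in range(start - 1, end):
            keep[i] = False

    # Option 2: same length as the full sequence, dashes where residues are removed.
    dashed = "".join(c if k else "-" for c, k in zip(full_seq, keep))
    # Option 3 (as interpreted here): removed residues dropped entirely.
    spliced = "".join(c for c, k in zip(full_seq, keep) if k)

    return {"full": full_seq, "dashed": dashed, "spliced": spliced}


variants = sequence_variants("MKTAYIAK", [(1, 2), (5, 5)])
print(variants["dashed"])   # --TA-IAK
print(variants["spliced"])  # TAIAK
```

Keeping all three in one structure makes it easy to feed each to the aligner in turn and compare the resulting displays side by side.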

Julie-Cowart commented 3 years ago

So before I can even get to how best to display the gaps, I find that the alignment itself doesn't work in some cases, particularly large cleavages. https://proconsortium.org/app_test/entry/PR:000050266/ doesn't align the way it should. (Note that I had to change the code to force the current term to be shown with its children, which I think is a separate bug, but the fix needs testing.) The 4 terms are all just subsequences of the same UniProt entry, so no alignment should really be needed: the proteoform definition should be all we need. It looks like the seq object with gaps displays the way we would want, but the seqs that come back after alignment have no leading or trailing gaps and may not align correctly.

PR:000050266 UniProtKB:P0DTC2, 13-1273 last two children are wrong and leading 1-12 aren't displayed as missing
PR:000050267 UniProtKB:P0DTC2, 13-685 shows self with proper gaps
PR:000050268 UniProtKB:P0DTC2, 686-1273 shows self and child with proper alignment but no gap 1-12
PR:000050269 UniProtKB:P0DTC2, 816-1273 shows self with proper gap

Julie-Cowart commented 3 years ago

We decided to try leaving the alignment as is and just introducing the gaps from cleavages, etc. in the decoration. This is now implemented and needs testing. It should address subsequence variants (like SigPep- or InitMet-) and variants with deletions.
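The idea of gapping in the decoration rather than in the alignment can be sketched roughly as follows. This is an illustration under assumptions, not the MSA view code: the helper names are hypothetical, the aligner is assumed to return the full-length sequence with `-` for alignment gaps, and the proteoform is assumed to be written as in the examples above (`UniProtKB:P0DTC2, 13-1273`).

```python
import re

# Matches the "UniProtKB:<accession>, <start>-<end>" form seen in the examples above.
SUBSEQ_RE = re.compile(r"^UniProtKB:([A-Z0-9-]+),\s*(\d+)-(\d+)$")


def parse_subsequence(proteoform):
    """Return (accession, start, end) or None if the string is not a simple range."""
    m = SUBSEQ_RE.match(proteoform.strip())
    if not m:
        return None
    return m.group(1), int(m.group(2)), int(m.group(3))


def mask_aligned_row(aligned_row, kept_ranges, gap="-"):
    """Replace residues outside the retained ranges with the gap character.

    aligned_row: full-length sequence as aligned (may contain alignment gaps).
    kept_ranges: 1-based inclusive (start, end) ranges in ungapped coordinates.
    """
    out = []
    pos = 0  # ungapped position of the residue currently being examined
    for ch in aligned_row:
        if ch == gap:
            out.append(gap)  # alignment gap: keep the column, do not advance
            continue
        pos += 1
        keep = any(start <= pos <= end for start, end in kept_ranges)
        out.append(ch if keep else gap)
    return "".join(out)


print(parse_subsequence("UniProtKB:P0DTC2, 13-1273"))  # ('P0DTC2', 13, 1273)
print(mask_aligned_row("MK-TAYI", [(3, 5)]))           # ---TAY-
```

Because the mask is applied after alignment, the column count never changes, so rows for sibling terms stay in register. A deletion variant would simply pass two kept ranges, one on each side of the deleted stretch.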

nataled commented 3 years ago

Some initial tests done.

1) This fails: PR:000050278 - sequence that isn't in seqforpro.seq or didn't have a clear xref but is in PRO_sequences.fa (there are about 70 such cases)

For the above we'll work out a regular upload of just the missing or problematic cases.
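One way to find the cases for such an upload is to compare the identifiers in PRO_sequences.fa against the identifiers that already resolve to a sequence. The sketch below assumes the file is standard FASTA and that the first whitespace-delimited token of each header is the PRO ID; that header layout, and both function names, are assumptions to check against the real file.

```python
def fasta_ids(fasta_text):
    """Return the first token of every FASTA header line, in file order."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                ids.append(header.split()[0])
    return ids


def missing_from_current(fasta_text, current_ids):
    """IDs present in the FASTA text but absent from the currently loaded set."""
    current = set(current_ids)
    return [pro_id for pro_id in fasta_ids(fasta_text) if pro_id not in current]


# Toy data; the sequences are placeholders, not real PRO sequences.
example = ">PR:000050278 example header\nMKT\n>PR:000050266\nAAA\n"
print(missing_from_current(example, {"PR:000050266"}))  # ['PR:000050278']
```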

2) This works, but with mild (possibly ignorable) issues: PR:000050266 - summary alignment shows gap when none exists at the zoomed view

For the above we'll not worry about it. Will revisit if there's a user complaint, but the expected number of cases where we have 'adjacent' proteoforms is low, and the extra gap can't be predicted since it's based on sequence length of parent.

3) Though this page times out, it does point out that there are cases where a PRO-proteoform-std line exists that is not also reflected on the definition line. https://proconsortium.org/app_test/entry/PR:P20039. (I note that the current web site just returns a 'No sequence found' error). The stanza (trimmed) is:

[Term]
id: PR:P20039
name: HLA class II histocompatibility antigen, DRB1-11 beta chain (human)
def: "An HLA class II histocompatibility antigen, DRB1 beta chain (human) that belongs to allele group 11." 
comment: Category=organism-seqgroup. Note: Prototype allele is HLA-DRB*11:01. Sequence removed from UniProtKB.
synonym: "UniProtKB:P01911, Lys-5/Ala-100/Leu-162/Thr-262, CHEBI:29952|Thr-13/Ser-29, CHEBI:46217|Ala-14/Met-171, CHEBI:30015|Trp-38/Ala-87, CHEBI:29972|Gln-39/Ser-66, CHEBI:46858|Pro-40/Arg-42, CHEBI:29999|Lys-41/Ala-169, CHEBI:30013|Ile-96, CHEBI:29997|Gln-99, CHEBI:29958|Val-115, CHEBI:29947|Gln-125/Gln-178, CHEBI:29979" EXACT PRO-proteoform-std [PRO:DNx]
is_a: PR:P01911 ! HLA class II histocompatibility antigen, DRB1 beta chain (human)

If not already done, this means that we should probably also have a fall-back where if there is no xref and no proteoform indicated within the definition, then look at the PRO-proteoform-std line (giving preference to EXACT versions when available).
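A rough sketch of that fall-back, assuming the stanza is available as a list of raw OBO lines and that the xref and definition-line proteoform have already been extracted elsewhere. The function names are hypothetical, and trying the xref before the definition is an assumption; only the position of the PRO-proteoform-std line as the last resort comes from the text above.

```python
import re

# synonym: "<text>" <SCOPE> PRO-proteoform-std [...]
SYNONYM_RE = re.compile(
    r'^synonym:\s*"(?P<text>(?:[^"\\]|\\.)*)"\s+'
    r'(?P<scope>EXACT|BROAD|NARROW|RELATED)\s+PRO-proteoform-std\b'
)


def proteoform_std(stanza_lines):
    """Return the PRO-proteoform-std synonym text, preferring EXACT scope."""
    first_other = None
    for line in stanza_lines:
        m = SYNONYM_RE.match(line)
        if not m:
            continue
        if m.group("scope") == "EXACT":
            return m.group("text")
        if first_other is None:
            first_other = m.group("text")
    return first_other


def choose_sequence_source(xref, def_proteoform, stanza_lines):
    """Pick where the sequence description comes from, in fall-back order."""
    if xref:
        return ("xref", xref)
    if def_proteoform:
        return ("definition", def_proteoform)
    std = proteoform_std(stanza_lines)
    if std:
        return ("PRO-proteoform-std", std)
    return (None, None)


# Shortened synonym text for illustration only.
stanza = [
    "id: PR:P20039",
    'synonym: "UniProtKB:P01911, Lys-5" RELATED PRO-proteoform-std [PRO:DNx]',
    'synonym: "UniProtKB:P01911, Ala-100" EXACT PRO-proteoform-std [PRO:DNx]',
]
print(choose_sequence_source(None, None, stanza))
# ('PRO-proteoform-std', 'UniProtKB:P01911, Ala-100')
```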

4) There are a few cases where there are TrEMBL entries that can be described in terms of Swiss-Prot entries. These are usually (always?) sequence variants. In such cases, there is both an xref and a proteoform indicated on the definition line. Using the xref will require alignment, while the proteoform might not. Here are two examples:

PR:A2AHT3 UniProtKB:A2AHT3 UniProtKB:Q9QXT8-3, Gly-41, CHEBI:29952
PR:B5B3R8 UniProtKB:B5B3R8 UniProtKB:P02662, Glu-99, CHEBI:29958|Glu-207, CHEBI:29947

Julie-Cowart commented 3 years ago

For 1 it would work if we were processing the PRO-proteoform-std line and know how to handle the fact that it has more than one reference separated by the ;. Is the sequence the same either way? If so we could make this work but unless this also applies to several of the other special cases then maybe not bother and just have the PRO_sequences.fa fallback. We can review each of the 70 cases more closely if needed.

For 2 agreed.

For 3 I will work on using the PRO-proteoform-std line instead of the definition line. We currently use just the definition. We can either use just the PRO-proteoform-std line instead, or use both with one as the fall-back for the other (and in that case, which should have precedence?). To do this properly at OBO file parse time, the easiest approach is to replace the definition parsing with the synonym parsing; the data format in the db then stays the same and is just loaded from a different place. If we want to use both mechanisms, then I may leave the parsing as is and only parse the PRO-proteoform-std line as needed in the MSA view codebase.

For 4 I didn't know there were such cases, but it appears to show in the MSA properly. I agree that the only difference is that the proteoform version may result in less need for alignment when shown with siblings that share the same UniProt term, but this is largely only a potential performance issue.

Julie-Cowart commented 3 years ago

The changes to the MSA that show gaps for terms with protein features that indicate a cleaved product are now in production (see https://proconsortium.org/app/entry/PR:000050266/). This issue can stay open to address using the PRO-proteoform-std line instead of the definition line, and also the use of the sequence file for the 70 or so special cases.