biocommons / hgvs

Python library to parse, format, validate, normalize, and map sequence variants. `pip install hgvs`
https://hgvs.readthedocs.io/
Apache License 2.0

feat(#741): Pretty print #742

Open andreasprlic opened 3 weeks ago

andreasprlic commented 3 weeks ago

This draft PR adds a context-visualization framework on top of the tooling we already have in hgvs. Unless somebody suggests a better name, I have called it "pretty_print" for now.

Features:

1) It provides a datacompiler class that gathers all the data needed for the visualization (no need for uta_align here). This could be reused, e.g., by a future effort to develop alternative visualizations such as SVG graphics, or hooked up to a web site.
2) For now, only a text-based rendering framework has been added. It offers a set of elemental "renderer" objects, each responsible for providing the text of one line in the final output.
3) There are several unit tests covering variants with various features. They are also the reason why this PR is still a draft: the fixture that gets compiled for these tests (tests/cache-py3.hdp) becomes really big, and suggestions for what to do about this would be helpful. Perhaps disable most of the tests? Or offer an ipython runbook with examples? Any feedback or suggestion is welcome here.
4) There is a configuration class through which some display parameters can be modified.
5) In a different thread we were discussing improvements around repeats. I wrote a basic repeat-detection script and added its result to the visualization here as an FYI, so we can see how well it works.
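
To make the split between compiled data, per-line renderers, and configuration concrete, here is a minimal sketch of how such a design could look. It is not the code in this PR: `DisplayData`, `DisplayConfig`, `Renderer`, the three renderer classes, and `display` are all hypothetical names chosen for illustration, and the data object is filled in by hand rather than compiled from UTA or seqrepo.

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol


@dataclass
class DisplayData:
    """Hypothetical stand-in for what a data compiler would collect."""
    hgvs_g: str
    ref_seq: str        # reference sequence of the displayed window
    region_start: int   # 0-based offset of the highlighted region in ref_seq
    region_end: int     # exclusive end offset of the highlighted region


@dataclass
class DisplayConfig:
    """Hypothetical display options."""
    show_legend: bool = True


class Renderer(Protocol):
    """Each renderer produces the text of exactly one output line."""
    legend: str

    def render(self, data: DisplayData) -> str: ...


class VariantRenderer:
    legend = "hgvs_g    : "

    def render(self, data: DisplayData) -> str:
        return data.hgvs_g


class SeqRenderer:
    legend = "seq    -> : "

    def render(self, data: DisplayData) -> str:
        return data.ref_seq


class RegionRenderer:
    legend = "region    : "

    def render(self, data: DisplayData) -> str:
        width = data.region_end - data.region_start
        marker = "|" if width < 2 else "|" + "-" * (width - 2) + "|"
        return " " * data.region_start + marker


def display(data: DisplayData, renderers: List[Renderer],
            config: Optional[DisplayConfig] = None) -> str:
    """Join the lines of all renderers, optionally prefixed by their legend."""
    config = config or DisplayConfig()
    lines = []
    for renderer in renderers:
        prefix = renderer.legend if config.show_legend else ""
        lines.append(prefix + renderer.render(data))
    return "\n".join(lines)


# Hand-built data; offsets 20..28 cover the ATTAATTA stretch of the window.
data = DisplayData(
    hgvs_g="NC_000005.10:g.123346517_123346518insATTA",
    ref_seq="ATAAAGCTTTTCCAAATGTTATTAATTACTGGCATTGCTTTTTGCCAA",
    region_start=20,
    region_end=28,
)
text = display(data, [VariantRenderer(), SeqRenderer(), RegionRenderer()])
print(text)
```

Because every renderer only sees the compiled data object, a different front end (SVG, web) could in principle consume the same object without touching the text renderers.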

Examples:

In [1]: hgvs_g = "NC_000005.10:g.123346517_123346518insATTA"

In [2]: var_g = parse(hgvs_g)

In [3]: print(pretty.display(var_g))
hgvs_g    : NC_000005.10:g.123346517_123346518insATTA
hgvs_c    : NM_001166226.1:c.*1_*2insTAAT
hgvs_p    : NP_001159698.1:p.?
          :   123,346,500         123,346,520         123,346,540
chrom pos :   |    .    |    .    |    .    |    .    |    .
seq    -> : ATAAAGCTTTTCCAAATGTTATTAATTACTGGCATTGCTTTTTGCCAA
region    :                     |------|
tx seq <- : TATTTCGAAAAGGTTTACAATAATTAATGACCGTAACGAAAAACGGTT
tx pos    :  |    .    |    .   |   |    .    |    .    |
          :  *20       *10      *1  2880      2870      2860
aa seq <- :                      TerAsnSerAlaAsnSerLysAlaLeu
aa pos    :                         |||            ...
          :                         960
ref>alt   : ATTA[2]>ATTA[3]
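
For readers wondering how a `ref>alt` line such as `ATTA[2]>ATTA[3]` could be derived, here is a rough sketch of one simple approach. It is not the repeat-detection script from this PR: `smallest_unit` and `describe_insertion` are hypothetical helpers, and the sketch assumes the insertion has already been shifted to the edge of the repeat tract (as in the example above), so it does not handle insertions that fall in the middle of a repeat unit.

```python
from typing import Optional


def smallest_unit(seq: str) -> str:
    """Return the shortest substring that, repeated, reproduces seq."""
    n = len(seq)
    for size in range(1, n + 1):
        if n % size == 0 and seq[:size] * (n // size) == seq:
            return seq[:size]
    return seq  # only reached for the empty string


def describe_insertion(ref_window: str, offset: int, inserted: str) -> Optional[str]:
    """Describe an insertion at 0-based `offset` as unit[n]>unit[m].

    Returns None if the inserted sequence is empty or the reference has no
    adjacent copy of the repeat unit.
    """
    unit = smallest_unit(inserted)
    if not unit:
        return None
    k = len(unit)

    # Count copies of the unit directly to the right of the insertion point.
    right = 0
    i = offset
    while ref_window[i:i + k] == unit:
        right += 1
        i += k

    # Count copies of the unit directly to the left of the insertion point.
    left = 0
    i = offset
    while i - k >= 0 and ref_window[i - k:i] == unit:
        left += 1
        i -= k

    ref_count = left + right
    if ref_count == 0:
        return None
    alt_count = ref_count + len(inserted) // k
    return f"{unit}[{ref_count}]>{unit}[{alt_count}]"


window = "ATAAAGCTTTTCCAAATGTTATTAATTACTGGCATTGCTTTTTGCCAA"
print(describe_insertion(window, 20, "ATTA"))
```

A fuller implementation would also need to try rotations of the unit (or normalize the variant first) and decide how to report partial copies.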

Here is also a screenshot showing how the text can be color-coded on the terminal:

[Image: Screenshot 2024-06-30 at 22 53 19]

andreasprlic commented 2 days ago

Note: this PR adds two new unit tests. They pull a lot of data from UTA and seqrepo, and as such they are slow. Should I tag them and exclude them from the CI runs? Also, I did not want to upload an updated cache file for these tests, because the size of the test cache would have jumped from 1 MB to ~600 MB.
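
One possible way to tag slow tests and keep them out of default runs, sketched with the standard library only. The environment variable `HGVS_RUN_SLOW_TESTS`, the class name, and the test name are hypothetical, and the project may prefer its own test markers instead; this just illustrates the opt-in idea.

```python
import os
import unittest

# Hypothetical opt-in switch: slow tests run only when this variable is "1".
RUN_SLOW = os.environ.get("HGVS_RUN_SLOW_TESTS") == "1"


class TestPrettyPrintSlow(unittest.TestCase):
    @unittest.skipUnless(
        RUN_SLOW, "set HGVS_RUN_SLOW_TESTS=1 to run tests that need UTA/seqrepo data"
    )
    def test_large_variant_display(self):
        # Placeholder body; the real test would build and check a display.
        self.assertTrue(True)


# Run the case programmatically to show that it is reported as skipped
# (not failed) when the switch is off.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestPrettyPrintSlow)
result = unittest.TestResult()
suite.run(result)
print(f"run={result.testsRun} skipped={len(result.skipped)}")
```

With this pattern, CI would skip the slow tests by default, and anyone with local UTA and seqrepo access could enable them without needing the large cache file in the repository.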