clulab / bioresources

Data resources from the biomedical domain
Apache License 2.0

Update to latest FamPlex #22

Closed MihaiSurdeanu closed 4 years ago

MihaiSurdeanu commented 4 years ago

Hi @bgyori and @johnbachman,

This is about updating bioresources to the latest FamPlex (https://github.com/sorgerlab/famplex). I think I will need some help from you so I don't screw things up.

First, let me explain what we use from this data in bioresources:

  1. We have a list of known complex names in BEcomplexes.tsv (names can be changed). The format is one name per line. Each line has: name \t grounding-id. For example:

    (9-1-1) complex 9_1_1
    9-1-1   9_1_1
    9-1-1 complex   9_1_1
    9_1_1   9_1_1
    A2t     Annexin_II_heterotetramer
    ACC     ACC
    ...
  2. We have a list of protein families in BEfamilies.tsv in the same format:

    14-3-3 proteins p14_3_3
    4EBP    EIF4EBP
    a-adrenoreceptor        ADRA
    a-cathenin      CTNNA
    ABL     ABL_family
    ...
  3. We have an "override" file of grounding ids, which takes precedence over all KBs. This file includes some entries from FamPlex (formerly bioentities). The file is called NER-Grounding-Override.tsv, and its format is: name \t grounding-id \t originating-KB \t type. It contains several entries from you, which were added to override protein names. For example:

    E2F     E2F             be      Family
    ERK     ERK             be      Family
    Fos     FOS_family              be      Family
    GST     GST             be      Family
    INS     INS             be      Family
    p38     p38             be      Family
    ...
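
To make the two layouts above concrete, here is a minimal parsing sketch. It assumes each line uses exactly one tab between fields (two fields for BEcomplexes.tsv and BEfamilies.tsv, four for NER-Grounding-Override.tsv). The `Entry` type and `parse_line` function are hypothetical names for illustration, not part of bioresources.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    """One line of a grounding file (hypothetical helper type)."""
    name: str
    grounding_id: str
    kb: Optional[str] = None           # only present in the override file
    entity_type: Optional[str] = None  # only present in the override file


def parse_line(line: str) -> Optional[Entry]:
    """Parse one tab-separated line; return None for blank lines."""
    line = line.rstrip("\n")
    if not line.strip():
        return None
    # Split on tabs only: names such as "9-1-1 complex" contain spaces.
    fields = [field.strip() for field in line.split("\t")]
    if len(fields) == 2:
        return Entry(fields[0], fields[1])
    if len(fields) == 4:
        return Entry(*fields)
    raise ValueError(f"expected 2 or 4 fields, got {len(fields)}: {line!r}")


family = parse_line("4EBP\tEIF4EBP\n")
override = parse_line("Fos\tFOS_family\tbe\tFamily\n")
print(family)
print(override)
```

Raising on any other field count is a deliberate choice in this sketch, so that a malformed line surfaces early instead of being silently misread.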

MihaiSurdeanu commented 4 years ago

What is the best way to re-generate these files from FamPlex? It seems that grounding_map.csv is the closest match, but I do not see an obvious way to identify the types of these entities. We need to pull out the complexes and families into separate files, as mentioned above.

Also, we should probably update the override file as well. Do you know which of these complex/family names overlap with UniProt? And which should be prioritized over UniProt? Thanks!

johnbachman commented 4 years ago

Hi Mihai, it would be pretty straightforward to generate these files from the grounding map. Basically, we would iterate over the entries in the grounding map and apportion them into the BEcomplexes, BEfamilies, and overrides files, depending on whether the string is grounded to a family, a complex, or something else (protein, chemical, metabolite, biological process, etc.). We can add the generation of these files to our FamPlex export process.
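
The apportioning step described above could be sketched roughly as follows. This is not the actual FamPlex export code. It assumes the grounding map can be reduced to (text, namespace, identifier) rows, and that an entity's type can be inferred from a relations table in which `isa` points at a family and `partof` points at a complex. All rows below are toy data: the member identifiers and the non-FamPlex entry are placeholders.

```python
# Toy stand-in for the grounding map: (text, namespace, identifier).
GROUNDING_MAP = [
    ("ERK", "FPLX", "ERK"),
    ("9-1-1 complex", "FPLX", "9_1_1"),
    ("some chemical", "CHEBI", "PLACEHOLDER"),
]

# Toy stand-in for relations:
# (child_ns, child_id, relation, parent_ns, parent_id).
RELATIONS = [
    ("HGNC", "MEMBER_1", "isa", "FPLX", "ERK"),
    ("HGNC", "MEMBER_2", "partof", "FPLX", "9_1_1"),
]


def classify(relations):
    """Return (families, complexes) as sets of FPLX identifiers."""
    families = {pid for _, _, rel, pns, pid in relations
                if rel == "isa" and pns == "FPLX"}
    complexes = {pid for _, _, rel, pns, pid in relations
                 if rel == "partof" and pns == "FPLX"}
    return families, complexes


def apportion(grounding_map, relations):
    """Split grounding-map rows into families, complexes and other."""
    families, complexes = classify(relations)
    buckets = {"families": [], "complexes": [], "other": []}
    for text, namespace, identifier in grounding_map:
        is_family = namespace == "FPLX" and identifier in families
        is_complex = namespace == "FPLX" and identifier in complexes
        # An entity may be both, so the two checks are independent.
        if is_family:
            buckets["families"].append((text, identifier))
        if is_complex:
            buckets["complexes"].append((text, identifier))
        if not (is_family or is_complex):
            buckets["other"].append((text, namespace, identifier))
    return buckets


result = apportion(GROUNDING_MAP, RELATIONS)
for bucket, rows in result.items():
    print(bucket, rows)
```

Rows in the `other` bucket keep their namespace, since they would feed the overrides file rather than the family or complex lists.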

MihaiSurdeanu commented 4 years ago

That would be awesome. Thanks!

bgyori commented 4 years ago

One thing to consider is that the distinction between families and complexes is somewhat blurred in FamPlex. There are entries in FamPlex that are both families and complexes (e.g., NFkappaB is a family of complexes). If there isn't an important reason to separate families and complexes, we could just use a single flat list. Otherwise, it would be useful to better understand where the family/complex distinction plays a role in REACH to be able to implement the export logic appropriately.

MihaiSurdeanu commented 4 years ago

The distinction is not important to Reach. Event rules operate over parent types of both complexes and families, so they are not affected by the confusion. So, if it doesn't matter to you, it doesn't matter to us. Maybe we can create a new type for both, e.g., ComplexOrFamily?
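
As a rough illustration of the single flat list idea, the sketch below merges family and complex entries and writes them in the override-file layout (name \t grounding-id \t originating-KB \t type). The `to_override_lines` function is hypothetical, the `be` KB label is simply copied from the existing override examples, and `ComplexOrFamily` is only the type name proposed above, not a confirmed Reach type. The NFkappaB grounding id is a placeholder.

```python
def to_override_lines(entries, kb="be", entity_type="ComplexOrFamily"):
    """Format (name, grounding_id) pairs as override-file lines.

    Duplicate pairs are emitted once, so an entry that is both a
    family and a complex does not appear twice in the flat list.
    """
    seen = set()
    lines = []
    for name, grounding_id in entries:
        key = (name, grounding_id)
        if key in seen:
            continue
        seen.add(key)
        lines.append("\t".join((name, grounding_id, kb, entity_type)))
    return lines


# Toy input; NFkappaB is listed in both to mimic a family of complexes.
families = [("ERK", "ERK"), ("NFkappaB", "NFkappaB")]
complexes = [("9-1-1", "9_1_1"), ("NFkappaB", "NFkappaB")]

flat = to_override_lines(families + complexes)
for line in flat:
    print(line)
```

Keeping the type label as a parameter means the same function would still work if separate Family and Complex labels turn out to be needed later.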

bgyori commented 4 years ago

Here is the first iteration of the export:

Some potential issues to consider:

MihaiSurdeanu commented 4 years ago

Thanks @bgyori !

A few comments:

MihaiSurdeanu commented 4 years ago

Let me know if anything in the above is not clear!

bgyori commented 4 years ago

Got it, will fix these issues and some others I found.

bgyori commented 4 years ago

A few more things that come to mind:

MihaiSurdeanu commented 4 years ago

bgyori commented 4 years ago

This is done, though restricted to importing only families/complexes from FamPlex. Separately, some of the non-family/complex grounding overrides could be imported into the overrides list here.