Closed by MihaiSurdeanu 4 years ago
What is the best way to regenerate these files from FamPlex? It seems that grounding_map.csv is the closest match, but I do not see an obvious way to identify the types of these entities. We need to pull out the complexes and families into separate files, as mentioned above.
Also, we should probably update the override file as well. Do you know which of these complex/family names overlap with UniProt, and which should be prioritized over UniProt? Thanks!
Hi Mihai, it would be pretty straightforward to generate these files from the grounding map. Basically, we would iterate over its entries and apportion them into the BEcomplexes, BEfamilies, and overrides files, depending on whether the string is grounded to a family, a complex, or something else (protein, chemical, metabolite, biological process, etc.). We can add the generation of these files to our FamPlex export process.
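A minimal sketch of that apportioning step, in Python. It assumes each grounding map row has the layout text, namespace, id, namespace, id, ... and that FamPlex groundings use the namespace label FPLX; both are assumptions about the file, not something confirmed in this thread. Because the grounding map alone does not say whether a FamPlex entry is a family or a complex, the sketch keeps them in a single list. The function names and the sample rows are hypothetical.

```python
import csv
import io


def split_grounding_rows(rows):
    """Apportion grounding-map rows into FamPlex entries and other overrides.

    Assumed row layout: [text, ns1, id1, ns2, id2, ...].
    """
    fplx_entries = []   # (text, id) for strings grounded to FamPlex
    other_entries = []  # (text, id, namespace) for everything else
    for row in rows:
        cells = [c.strip() for c in row if c.strip()]
        if len(cells) < 3:
            continue  # ungrounded or malformed row, skip it
        text = cells[0]
        # Pair up namespace/id cells: {"HGNC": "391", "UP": "P31749"}
        refs = dict(zip(cells[1::2], cells[2::2]))
        if "FPLX" in refs:
            fplx_entries.append((text, refs["FPLX"]))
        else:
            # Keep the first listed grounding as the override candidate
            ns, db_id = next(iter(refs.items()))
            other_entries.append((text, db_id, ns))
    return fplx_entries, other_entries


def to_tsv(entries):
    """Serialize tuples as tab-separated lines, one entry per line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(entries)
    return buf.getvalue()


# Illustrative rows only, not copied from the real grounding_map.csv
sample = "ER,FPLX,ESR\nAKT1,HGNC,391,UP,P31749\nunknown,,\n"
rows = list(csv.reader(io.StringIO(sample)))
fplx, others = split_grounding_rows(rows)
print(to_tsv(fplx), end="")
```

Splitting the FamPlex list further into families and complexes would need extra information, for example from FamPlex's own relations data, which this sketch does not touch.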
That would be awesome. Thanks!
One thing to consider is that the distinction between families and complexes is somewhat blurred in FamPlex. There are entries in FamPlex that are both families and complexes (e.g., NFkappaB is a family of complexes). If there isn't an important reason to separate families and complexes, we could just use a single flat list. Otherwise, it would be useful to better understand where the family/complex distinction plays a role in REACH to be able to implement the export logic appropriately.
The distinction is not important to Reach. Event rules operate over parent types of both complexes and families, so they are not affected by the confusion. So, if it doesn't matter to you, it doesn't matter to us. Maybe we can create a new type for both, e.g., ComplexOrFamily?
Here is the first iteration of the export:
Some potential issues to consider:
ER	ESR	fplx	FamilyOrComplex
ER	GO:0005783	go	Simple_chemical
ER	P03372	uniprot	Gene_or_gene_product
What should we do with these for the purpose of this export?
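To see how widespread such collisions are before deciding on a policy, one option is to list every name that is grounded in more than one KB. The sketch below is in Python and assumes rows in the four-column layout shown above (name, grounding id, originating KB, type). The helper name find_ambiguous and the extra AKT row are hypothetical.

```python
from collections import defaultdict


def find_ambiguous(rows):
    """Return names that carry groundings from more than one KB."""
    by_name = defaultdict(list)
    for name, grounding_id, kb, entity_type in rows:
        by_name[name].append((grounding_id, kb, entity_type))
    return {
        name: groundings
        for name, groundings in by_name.items()
        if len({kb for _, kb, _ in groundings}) > 1
    }


rows = [
    ("ER", "ESR", "fplx", "FamilyOrComplex"),
    ("ER", "GO:0005783", "go", "Simple_chemical"),
    ("ER", "P03372", "uniprot", "Gene_or_gene_product"),
    ("AKT", "AKT", "fplx", "FamilyOrComplex"),  # hypothetical unambiguous row
]

for name, groundings in find_ambiguous(rows).items():
    print(name, [kb for _, kb, _ in groundings])
```

This only reports the conflicts; which KB should win for a given name is still a curation decision.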
Thanks @bgyori !
A few comments:
Let me know if anything in the above is not clear!
Got it, will fix these issues and some others I found.
A few more things that come to mind:
This is done, though restricted to importing only families/complexes from FamPlex. Separately, some of the non-family/complex grounding overrides could be imported into the overrides list here.
Hi @bgyori and @johnbachman,
This is about updating bioresources to the latest FamPlex (https://github.com/sorgerlab/famplex). I think I will need some help from you so I don't screw things up.
First, let me explain what we use from this data in bioresources:
We have a list of known complex names in BEcomplexes.tsv (names can be changed). The format is one name per line. Each line has: name \t grounding-id. For example:
We have a list of protein families in BEfamilies.tsv, in the same format:
We have an "override" file of grounding ids, which is preferred over all KBs. This file includes some entries from FamPlex (formerly bioentities). The file is called NER-Grounding-Override.tsv, and its format is: name \t grounding-id \t originating-KB \t type. In this file we have several entries from you, which were added to override protein names. For example:
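The two layouts described above (two columns for the complex and family files, four columns for the override file) can be read with one small parser. This is a sketch in Python, not the code Reach itself uses; the Entry type and parse_line function are hypothetical names, and the sample lines reuse the ER rows quoted elsewhere in this thread.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    name: str
    grounding_id: str
    kb: Optional[str] = None           # originating KB, override file only
    entity_type: Optional[str] = None  # entity type, override file only


def parse_line(line: str) -> Entry:
    """Parse one tab-separated line in either the 2- or 4-column layout."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (2, 4):
        raise ValueError(f"expected 2 or 4 tab-separated fields, got {len(fields)}")
    return Entry(*fields)


# Two-column line, as in BEcomplexes.tsv / BEfamilies.tsv
print(parse_line("ER\tESR\n"))
# Four-column line, as in NER-Grounding-Override.tsv
print(parse_line("ER\tP03372\tuniprot\tGene_or_gene_product\n"))
```

Rejecting lines with any other field count makes a malformed export fail early instead of producing silently shifted columns.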