PathwayCommons / factoid

A project to capture biological pathway data from academic papers
https://biofactoid.org
MIT License

PC-2-factoid: Supplementing the Factoid database with curated interactions #836

Closed jvwong closed 3 years ago

jvwong commented 3 years ago

Description

Q: What is the name of the feature?

A: Add documents to the Factoid database that are created from an external (existing) database

Q: What does this feature enable the user to do?

A: Access more (expert-)curated interactions supported by literature via Factoid

Q: What are the applicable constraints, e.g. compatibility or performance?

Q: How does this feature affect each class of user (persona)?

There are a few benefits to introducing external data into the Factoid database:

a) 'Frame-out' the Factoid app with useful, quality, curated data
b) Much of the data is for older articles that wouldn’t otherwise be curated
c) Can serve as pre-made Factoids that we can ask the authors to ‘verify’
d) Begins the idea of unifying PC and Factoid (search/import)

Specification

Details

  1. Fetch data file

My first thought is to use the PhosphoSitePlus (PSP) BioPAX file from PC

  1. Filter Interactions
  1. Group interactions

Factoid documents are one-to-one with articles, and interactions are added to an article/document. So in this case, interactions would have to be grouped under the same PMID. I'm not sure how many PSP interactions fall under the same PMID, and I wouldn't merge genes/interactions in the output.
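
The grouping step could look roughly like the sketch below. It assumes each filtered interaction has already been reduced to a plain object carrying a hypothetical `pmid` field (taken from its PublicationXref); neither that object shape nor the function name `groupByPmid` comes from Factoid.

```javascript
// Hypothetical input shape: { pmid: '111', id: 'a', ... }
// Group interactions by PMID so that each group can become one Factoid document.
function groupByPmid( interactions ) {
  const groups = new Map();

  for ( const interaction of interactions ) {
    const { pmid } = interaction;

    // An interaction without a PMID cannot be tied to an article, so skip it.
    if ( pmid == null ) { continue; }

    if ( !groups.has( pmid ) ) { groups.set( pmid, [] ); }

    // No merging of genes/interactions: every interaction is kept as-is.
    groups.get( pmid ).push( interaction );
  }

  return groups;
}

// Made-up sample data, for illustration only.
const groupedExample = groupByPmid([
  { pmid: '111', id: 'a' },
  { pmid: '222', id: 'b' },
  { pmid: '111', id: 'c' },
  { id: 'd' }
]);
```

Each entry of the returned Map would then correspond to one article, and so to one candidate document.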

  1. Map to document JSON

https://github.com/PathwayCommons/factoid/blob/d145d5735e34e719acd8ed380786e9f6bec79a86/src/server/routes/api/document/index.js#L1239-L1265

  1. Load into Factoid

Notes

metincansiper commented 3 years ago

@jvwong Is the evidence code that you mention like the eco in the following?

<bp:RelationshipXref rdf:ID="RelationshipXref_3595accc-f8ba-485c-abae-5b241ba43b55http___www_humanmetabolism_org__relationshipxref_1089218069">
 <bp:comment rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">REPLACED http://pathwaycommons.org/pc12/RelationshipXref_3595accc-f8ba-485c-abae-5b241ba43b55http___www_humanmetabolism_org__relationshipxref_1089218069</bp:comment>
 <bp:id rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">ECO:0000000</bp:id>
 <bp:db rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">evidence code ontology</bp:db>
</bp:RelationshipXref>
jvwong commented 3 years ago

Not sure what you are asking. The evidence codes for PhosphoSitePlus are from MI ontology

<bp:EvidenceCodeVocabulary rdf:ID="EvidenceCodeVocabulary_8e1e80bfab79984903b7e3804345048e">
 <bp:xref rdf:resource="#UnificationXref_molecular_interactions_ontology_MI_0421" />
 <bp:term rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">identification by antibody</bp:term>
</bp:EvidenceCodeVocabulary>

...

<bp:EvidenceCodeVocabulary rdf:ID="EvidenceCodeVocabulary_71fe7be44e80879bb8505ec68b171f5c">
 <bp:xref rdf:resource="#UnificationXref_molecular_interactions_ontology_MI_0113" />
 <bp:term rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">western blot</bp:term>
</bp:EvidenceCodeVocabulary>
metincansiper commented 3 years ago

Not sure what you are asking. The evidence codes for PhosphoSitePlus are from MI ontology

Okay, actually it was obvious. I was actually looking for something similar but for some reason I got confused later. Now, I see that.

metincansiper commented 3 years ago

As reflected in the BioPAX section of the Factoid binary interactions document, to represent a Factoid interaction we sometimes create a BioPAX interaction plus another one that is the controlled of that interaction. Therefore, I selected the BioPAX interactions that are not the controlled of another BioPAX interaction as my root interactions, and then looked for patterns starting from these roots. As a result, I also apply the evidence/evidence code filtering to these root interactions.

The problem is that in the PhosphoSitePlus BioPAX file some interactions are associated with evidence, but all of them are the controlled of some other interaction. In other words, none of the root interactions have any evidence, so no interaction passes the filter.

Does my approach of starting from root interactions sound right? If so, should I also check the evidence of the controlled interactions when the root interaction does not have any evidence?
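
To make the "root interaction" idea concrete, here is a minimal sketch. It assumes a simplified, hypothetical in-memory model in which each interaction is a plain object with an `id` and an optional `controlled` array of the ids it controls; the real converter works on BioPAX objects, so the names here are illustrative only.

```javascript
// Hypothetical model: { id, controlled: [ ids of interactions this one controls ] }
// A root interaction is one that is not the controlled of any other interaction.
function findRootInteractions( interactions ) {
  const controlledIds = new Set();

  for ( const intn of interactions ) {
    for ( const id of ( intn.controlled || [] ) ) {
      controlledIds.add( id );
    }
  }

  return interactions.filter( intn => !controlledIds.has( intn.id ) );
}

// Made-up sample: a Catalysis that controls a BiochemicalReaction,
// plus a stand-alone MolecularInteraction.
const rootExample = findRootInteractions([
  { id: 'Catalysis_1', controlled: [ 'BiochemicalReaction_1' ] },
  { id: 'BiochemicalReaction_1' },
  { id: 'MolecularInteraction_1' }
]);
```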

maxkfranz commented 3 years ago

A lot of candidates probably are controller-controlled. Have you revisited the 'Factoid binary interaction types' document? That describes the conversion rules. You'd just have to go in the opposite direction.

jvwong commented 3 years ago

PSP puts all the PublicationXref and Evidence attributes on the controlled interaction.

metincansiper commented 3 years ago

A lot of candidates probably are controller-controlled. Have you revisited the 'Factoid binary interaction types' document? That describes the conversion rules. You'd just have to go in the opposite direction.

Yes, I did so.

PSP puts all the PublicationXref and Evidence attributes on the controlled interaction.

Okay, I was thinking that they would be in the controller. Then, I can check the controlled interaction.
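
Since PSP attaches evidence to the controlled interaction, the evidence check could fall back from a root interaction to whatever it controls. The sketch below uses a hypothetical plain-object model (`id`, `controlled`, `evidence`) and a hypothetical helper name; it is not taken from the Factoid code.

```javascript
// Hypothetical model: { id, controlled: [ ids ], evidence: [ evidence codes ] }
// Return the root's own evidence if present; otherwise gather the evidence
// attached to the interactions it controls.
function collectEvidence( root, interactionsById ) {
  const own = root.evidence || [];
  if ( own.length > 0 ) { return own; }

  return ( root.controlled || [] )
    .map( id => interactionsById.get( id ) )
    .filter( Boolean ) // ignore ids that cannot be resolved
    .flatMap( controlled => controlled.evidence || [] );
}

// Made-up sample mirroring the PSP layout: evidence sits on the controlled.
const evidenceById = new Map([
  [ 'Catalysis_1', { id: 'Catalysis_1', controlled: [ 'BiochemicalReaction_1' ] } ],
  [ 'BiochemicalReaction_1', { id: 'BiochemicalReaction_1', evidence: [ 'MI:0113' ] } ]
]);

const evidenceExample = collectEvidence( evidenceById.get( 'Catalysis_1' ), evidenceById );
```

A root with no evidence of its own and nothing controlled (for example a MolecularInteraction in this model) yields an empty array and would be dropped by an evidence filter.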

metincansiper commented 3 years ago

PSP puts all the PublicationXref and Evidence attributes on the controlled interaction.

Considering the Factoid binary interaction types, there is a MolecularInteraction type, which cannot have a controlled. In this case, if I eliminate the interactions that do not have evidence, then I will always be skipping MolecularInteraction. Also, since you mentioned that the PublicationXref is also stored on the controlled interaction, we will be skipping it anyway. Should I delete the case that handles MolecularInteraction since it will never work in practice?

jvwong commented 3 years ago

The important criteria are:

jvwong commented 3 years ago

I've tried to configure the 'master.factoid.baderlab.org' instance to the best of my knowledge but you should double check:

https://github.com/BaderLab/sysadmin/blob/master/websites/factoid.md#master-instance-settings

maxkfranz commented 3 years ago

Looks OK

jvwong commented 3 years ago

@metincansiper I am trying to test this locally. Are there any specific setup requirements that you used to test this?

This is mine:

factoid branch: unstable
factoid-converters branch: master
grounding-search branch: master
URL: https://www.pathwaycommons.org/archives/PC2/v12/PathwayCommons12.psp.BIOPAX.owl.gz

I was getting some weird errors (creating a bunch of 'secret' tables) so just need to know if there are any details.

jvwong commented 3 years ago

There are two errors I'm seeing upon POST to /api/document:

  1. Freshly created database / no documents exist:

In this case the app throws a bunch of Reql errors and seems to try to create multiple secret tables.

  2. If at least one document exists in the database:
info:    Updating document-level related papers for doc 77246bc4-58a0-45b5-9cb7-dc27912ebd15
info:    POST /api/document 200 3208.757 ms - 36332
error:   Error getRelPprsForDoc: HTTPStatusError: Too Many Requests (429)
info:    Updating network-level related papers for doc 77246bc4-58a0-45b5-9cb7-dc27912ebd15
error:   Aggregate get failed
error:    FetchError: invalid json response body at http://localhost:3011/get reason: Unexpected end of JSON input
    at /.../factoid/node_modules/node-fetch/lib/index.js:272:32
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:97:5)
    at async Promise.all (index 0)
....
metincansiper commented 3 years ago

@jvwong I see a similar error. Something may be broken; I am looking into it.

metincansiper commented 3 years ago

@jvwong I found the problem causing the error "FetchError: invalid json response body at http://localhost:3011/get reason: Unexpected end of JSON input". The grounding search get service returns an error for some options. The one I detected is {"id":"P29678","namespace":"uniprot"}. It works well for {"id":"16818","namespace":"ncbi"}. I do not know if the problem is specific to uniprot entities.

Could there be an issue with the grounding search service? I used the following code to try things out independently of the Factoid code:

const fetch = require('node-fetch');

// working:
const opts = {"id":"16818","namespace":"ncbi"};
// error:
// const opts = {"id":"P29678","namespace":"uniprot"};

// local grounding-search 'get' endpoint
const url = 'http://localhost:3002/get';

fetch( url, {
  method: 'POST',
  body: JSON.stringify(opts),
  headers: {
    'Content-Type': 'application/json'
  }
} )
.then( res => res.json() ) // rejects if the body is empty or not valid JSON
.then( console.log )
.catch( console.error ); // report network or JSON parsing failures
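
The "Unexpected end of JSON input" message is what JSON parsing reports for an empty body, so one way to keep a single bad lookup from failing a whole batch is to read the body as text and parse it defensively. This is only a sketch: `parseJsonOrNull` is a hypothetical helper, not part of Factoid or grounding-search.

```javascript
// Parse a response body defensively: an empty or malformed body yields
// null instead of throwing.
function parseJsonOrNull( text ) {
  if ( text == null || text.trim() === '' ) { return null; }

  try {
    return JSON.parse( text );
  } catch ( err ) {
    return null;
  }
}

// With node-fetch this could be wired in as:
//   fetch( url, options ).then( res => res.text() ).then( parseJsonOrNull )
```

A caller can then treat null as "no grounding found" and decide whether to skip the entity or log it.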
jvwong commented 3 years ago

The UniProt accession P29678 points to an organism, 'rabbit' (https://www.uniprot.org/uniprot/P29678), which is not among the supported organisms. The stuff from PSP BioPAX is all human, so it looks like some mapping went wrong?

metincansiper commented 3 years ago

The UniProt accession P29678 points to an organism, 'rabbit' (https://www.uniprot.org/uniprot/P29678), which is not among the supported organisms. The stuff from PSP BioPAX is all human, so it looks like some mapping went wrong?

@jvwong since it has the uniprot namespace, it cannot be the result of the mapping; it must be the input to the mapping. Therefore, I suspect that it actually comes from the PSP file. However, the file cannot be downloaded now. Is it related to the network issue that you mentioned?

jvwong commented 3 years ago

Yeah, all our VMs are totally messed up, including PC. I'll let you know when it's back up, but it might be a few days.


metincansiper commented 3 years ago

The UniProt accession P29678 points to an organism, 'rabbit' (https://www.uniprot.org/uniprot/P29678), which is not among the supported organisms. The stuff from PSP BioPAX is all human, so it looks like some mapping went wrong?

@jvwong I found it in PSP Biopax file:

<bp:UnificationXref rdf:ID="UnificationXref_uniprot_knowledgebase_P29678">
 <bp:id rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">P29678</bp:id>
 <bp:db rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">uniprot knowledgebase</bp:db>
</bp:UnificationXref>
jvwong commented 3 years ago

OK I see an example of this: https://apps.pathwaycommons.org/pathways?uri=http://pathwaycommons.org/pc12/Catalysis_61738de4a156b1f6b48691e258df24bb

metincansiper commented 3 years ago

OK I see an example of this: https://apps.pathwaycommons.org/pathways?uri=http://pathwaycommons.org/pc12/Catalysis_61738de4a156b1f6b48691e258df24bb

@jvwong then should we just skip the non human genes?

jvwong commented 3 years ago

OK I see an example of this: https://apps.pathwaycommons.org/pathways?uri=http://pathwaycommons.org/pc12/Catalysis_61738de4a156b1f6b48691e258df24bb

@jvwong then should we just skip the non human genes?

We do support some non-human species (below).

const SORTED_MAIN_ORGANISMS = [
  new Organism(2697049, 'SARS-CoV-2'),
  new Organism(227984, 'SARS-CoV'),
  new Organism(9606, 'Homo sapiens'),
  new Organism(10090, 'Mus musculus'),
  new Organism(ROOT_STRAINS.SCERVISIAE, 'Saccharomyces cervisiae', SCERVISIAE_STRAIN_IDS),
  new Organism(7227, 'Drosophila melanogaster'),
  new Organism(ROOT_STRAINS.ECOLI, 'Escherichia coli', ECOLI_STRAIN_IDS),
  new Organism(6239, 'Caenorhabditis elegans'),
  new Organism(3702, 'Arabidopsis thaliana'),
  new Organism(10116, 'Rattus norvegicus'),
  new Organism(7955, 'Danio rerio')
];

https://github.com/PathwayCommons/grounding-search/blob/master/src/server/datasource/organisms.js#L44-L56

metincansiper commented 3 years ago

We do support some non-human species (below).

Okay, I will filter for those organisms, assuming that grounding search works fine with all of them.
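
A filter along these lines might look like the sketch below. The taxon IDs are copied from the numeric entries in the organism list quoted above; the yeast and E. coli entries rely on ROOT_STRAINS constants whose values do not appear in this thread, so they are left out. The entity shape (`organism` as an NCBI taxon ID) and the function names are assumptions for illustration.

```javascript
// Numeric taxon IDs from the quoted SORTED_MAIN_ORGANISMS list.
// The ROOT_STRAINS-based entries (yeast, E. coli) are omitted because their
// values are not shown in this thread.
const SUPPORTED_TAXON_IDS = new Set([
  2697049, 227984, 9606, 10090, 7227, 6239, 3702, 10116, 7955
]);

// Hypothetical entity shape: { id, namespace, organism } with an NCBI taxon ID.
function hasSupportedOrganism( entity ) {
  return SUPPORTED_TAXON_IDS.has( Number( entity.organism ) );
}

// Keep an interaction only if every participant belongs to a supported organism.
function filterBySupportedOrganism( interactions ) {
  return interactions.filter( intn => intn.participants.every( hasSupportedOrganism ) );
}

// Made-up sample: one all-human interaction, one with a rabbit (taxon 9986) protein.
const organismExample = filterBySupportedOrganism([
  { id: 'i1', participants: [ { id: 'P06213', organism: 9606 }, { id: '1956', organism: '9606' } ] },
  { id: 'i2', participants: [ { id: 'P29678', organism: 9986 }, { id: '4610', organism: 9606 } ] }
]);
```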

metincansiper commented 3 years ago

@jvwong I did some work to eliminate the unsupported organisms (I have not committed it yet). However, grounding search is still not working for some supported organisms. One example I found is this human gene: https://www.uniprot.org/uniprot/P06213 ({"id":"P06213","namespace":"uniprot"}). The search for ncbi is also failing in some cases. The cases I checked were human genes (https://www.ncbi.nlm.nih.gov/gene/1956 and https://www.ncbi.nlm.nih.gov/gene/4610).

jvwong commented 3 years ago

What are you trying to search for again?

maxkfranz commented 3 years ago

https://grounding.baderlab.org doesn't seem like it's working and https://master.grounding.baderlab.org/ looks like it's down. Maybe there's another issue with the cluster

jvwong commented 3 years ago
metincansiper commented 3 years ago

What are you trying to search for again?

I am getting the ncbi grounding for the genes coming from the PSP file. First I map to an ncbi id, then I call the grounding search for the mapped ncbi id. When the original xref is from uniprot, I do the uniprot-to-ncbi mapping using the grounding search as well. The problem is that grounding search is failing for some parameters like {"id":"P06213","namespace":"uniprot"}, {"id":"1956","namespace":"ncbi"}, {"id":"4610","namespace":"ncbi"}, all of which represent human genes.

metincansiper commented 3 years ago

https://grounding.baderlab.org doesn't seem like it's working and https://master.grounding.baderlab.org/ looks like it's down. Maybe there's another issue with the cluster

I am also seeing that https://master.grounding.baderlab.org/ is frequently down. Therefore, I was running grounding search on my localhost.

jvwong commented 3 years ago

https://grounding.baderlab.org doesn't seem like it's working and https://master.grounding.baderlab.org/ looks like it's down. Maybe there's another issue with the cluster

I am also seeing that https://master.grounding.baderlab.org/ is frequently down. Therefore, I was running grounding search on my localhost.

It's timing out when trying to download ncbi/uniprot/chebi data. I'll put something up ASAP.

metincansiper commented 3 years ago

BTW the code updates I made to filter out the unsupported organisms are in this branch (https://github.com/PathwayCommons/factoid/tree/supplement_db_filter). I did not open a PR since the filtering does not look sufficient for now, as I mentioned:

I did some work to eliminate the unsupported organisms (I have not committed it yet). However, grounding search is still not working for some supported organisms. One example I found is this human gene: https://www.uniprot.org/uniprot/P06213 ({"id":"P06213","namespace":"uniprot"}). The search for ncbi is also failing in some cases. The cases I checked were human genes (https://www.ncbi.nlm.nih.gov/gene/1956 and https://www.ncbi.nlm.nih.gov/gene/4610).

jvwong commented 3 years ago

  1. I tried this PhosphoSitePlus BioPAX conversion with your new branch and a local instance of everything (grounding, converter, index, db). INDRA can't handle all the requests and ends up complaining about too many requests (429), then 'Unavailable', then Internal Server Error.
  2. I can't reproduce your 'grounding search' errors. These all work fine for me.
metincansiper commented 3 years ago

I can't reproduce your 'grounding search' errors. These all work fine for me.

Yes, they worked for me too. I remember them not working in the past, but maybe I just did something wrong then. I do not know.

I tried this PhosphoSitePlus BioPAX conversion with your new branch and a local instance of everything (grounding, converter, index, db). INDRA can't handle all the requests and ends up complaining about too many requests (429), then 'Unavailable', then Internal Server Error.

@jvwong @maxkfranz the speed was not so important for this service. Am I right about that?

jvwong commented 3 years ago

No, speed doesn't matter at all. Maybe there's a way to filter the documents a little more as well.
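
Since speed is not a concern, one option for the 429 responses is to send requests one at a time and back off whenever the server answers Too Many Requests. The sketch below is not from the Factoid code: `withRetryOn429` and `processSequentially` are hypothetical helpers, and `handler` stands for any function that returns a promise resolving to an object with a numeric `status`, as a fetch response does.

```javascript
const sleep = ms => new Promise( resolve => setTimeout( resolve, ms ) );

// Run `task`, retrying with exponential backoff while it reports status 429.
// After `retries` retries the last response is returned as-is.
async function withRetryOn429( task, { retries = 5, baseDelayMs = 1000 } = {} ) {
  for ( let attempt = 0; ; attempt++ ) {
    const res = await task();

    if ( res.status !== 429 || attempt >= retries ) { return res; }

    await sleep( baseDelayMs * 2 ** attempt );
  }
}

// Process items strictly one after another, so that at most one request
// is in flight at any time.
async function processSequentially( items, handler, opts ) {
  const results = [];

  for ( const item of items ) {
    results.push( await withRetryOn429( () => handler( item ), opts ) );
  }

  return results;
}
```

Trading throughput for a steady, low request rate fits the statement above that speed does not matter for this service.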

metincansiper commented 3 years ago

@jvwong I am looking into Too Many Requests (429), but while doing that I again came across some cases where the grounding service is failing. One of them is {"id":"P00517","namespace":"uniprot"}. The cases I reported as not working in the past are working for me now, but I see new cases that are not working.

jvwong commented 3 years ago

P00517

That's an unsupported organism: Bos taurus (Bovine) https://www.uniprot.org/uniprot/P00517

metincansiper commented 3 years ago

That's an unsupported organism: Bos taurus (Bovine) https://www.uniprot.org/uniprot/P00517

Yes, sorry for the confusion. I checked and saw that I had mistakenly undone the code I added for organism filtering, and I did not check the organism since I was relying on that filtering.

metincansiper commented 3 years ago

I created #940 referencing this issue.