NCATS-Gamma / robokop

Master UI for ROBOKOP
MIT License

Cannot replicate prior results #511

Closed: karafecho closed this issue 4 years ago

karafecho commented 4 years ago

I am preparing a manuscript based on an analysis I conducted back in Jan/Feb 2020. I attempted to take a closer look at the queries and answer sets that I plan to highlight in the manuscript, but I receive an error message regarding gene nodes:

[image]

I then attempted to rerun the queries, but I do not receive the same answer sets. (As an aside, the new answer sets seem pruned, i.e., they contain fewer answer subgraphs.)

I am not willing to start from scratch with the analysis (I have spent enough time on this as is), so how do I reference the version of ROBOKOP that I used?

cbizon commented 4 years ago

This is difficult, because it's true that the database has continued to evolve. Now, we do have backups and we could go back to an older version. But often the older versions are not as good.

In particular, in the older version of the database we were normalizing genes to HGNC, but this is not Biolink-compliant. When we moved over to the compliant form of gene identifiers, the old identifiers stopped matching the new ones, and that is why you get the "missing node" error.

That said, rerunning the query should give you the same or at least similar answers, so that failing is potentially an error with the rebuild. Can you tell me what gene(s) you expected in the query above? Then we can go to the db and see why it's not there.

cbizon commented 4 years ago

I tried a rebuild of the query that I think you are running: https://robokop.renci.org/q/3ff92fcb-f8e5-47ef-bab2-d0b9de10a3c5/

I did get answers back, though I don't know what the answers were previously.

karafecho commented 4 years ago

I understand the issue and appreciate the details (makes more sense now), but this is a challenge that really needs to be addressed somehow.

The genes that were originally returned are: TNF (tumor necrosis factor); BDNF (brain-derived neurotrophic factor); IL-10 (interleukin-10); NGF (nerve growth factor); IRF8 (interferon regulatory factor 8); and KCNMA1 (potassium calcium-activated channel subfamily M alpha 1). The genes that were returned from your query above are: IRF8; BDNF; and KCNMA1. This is what I meant by "pruning".

cbizon commented 4 years ago

Yes, I don't disagree that we should figure out a way to address it, though I do wonder how to resource it.

On the pruning front, it sounds like we lost IL-10, TNF, and NGF. Is there any chance that you know what the edge sources for these genes were? If not, we can go back to an older graph and see there...
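
The loss described here is a plain set difference between the two answer sets quoted earlier in this thread. A minimal sketch in Python, using only the gene symbols listed above; the variable names and the `pruned` helper are illustrative assumptions and are not part of ROBOKOP:

```python
# Gene symbols exactly as quoted in this thread (not normalized identifiers).
original_genes = {"TNF", "BDNF", "IL-10", "NGF", "IRF8", "KCNMA1"}
rerun_genes = {"IRF8", "BDNF", "KCNMA1"}


def pruned(before, after):
    """Return symbols present in the earlier answer set but absent from the rerun."""
    return sorted(before - after)


missing = pruned(original_genes, rerun_genes)
print(missing)  # ['IL-10', 'NGF', 'TNF']
```

Comparing by symbol only works because both lists in the thread use symbols; comparing stored answers across database versions would presumably also need the old and new identifiers mapped to each other first.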

karafecho commented 4 years ago

The question re resources is an important one, but it's tough. We need to find the right balance between addressing issues, in an effort to retain users and ensure that journal and grant reviewers are not surprised when testing the app, and investing time in an unfunded project.

karafecho commented 4 years ago

WRT your suggestion re IL-10, TNF, and NGF, I don't think this is worth spending time on right now. We can explore it more if a reviewer raises a question. (See previous post.) Does this seem reasonable?

cbizon commented 4 years ago

For your purposes yes, but I'm trying to root out any missing data issues now, so following up on missing things is important at the moment.

karafecho commented 4 years ago

Oh, okay.

Multiple sclerosis – TNF association was established by both HETIO and Pharos. Carbon monoxide – TNF association was established by CTD, with publication support again indicating a role for heme oxygenase-1.

[image]

Unfortunately, I didn't capture informative screenshots for the other two genes.

cbizon commented 4 years ago

No, that's very helpful, thanks.

cbizon commented 4 years ago

OK, it looks like this is sorted out: [image]

karafecho commented 4 years ago

Whew! Thanks!