MetabolicAtlas / data-generation

Process the raw data files for ingestion into the Neo4j database
MIT License

Do not include DROP INDEX statements by default. #2

Closed: e0 closed this issue 4 years ago

e0 commented 4 years ago

Since a statement like DROP INDEX ON :Metabolite(id); is invalid if the index has not already been created, the rest of the Cypher script would not run.
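For example, on a fresh database (Neo4j 3.x syntax; the paired CREATE statement is assumed here for illustration):

```cypher
// Fails when no such index exists yet, and the error aborts
// everything after it in the generated script.
DROP INDEX ON :Metabolite(id);

// The statement it is meant to guard:
CREATE INDEX ON :Metabolite(id);
```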

As the TODO inside index.js already says, let's add a command-line option to include the DROP INDEX statements dynamically. But I think it is better that they are not included by default.

mihai-sysbio commented 4 years ago

There are multiple routes to deal with this:

a) use CALL db.indexes() YIELD .. to find the indexes to drop
b) use drop database - an enterprise feature, but maybe we can get that for free
c) in the docker build stage, before starting up neo4j, clear the database folder

To me, (c) sounds the cleanest and easiest: it would guarantee that we always get a fresh start when doing an import.
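For reference, a sketch of what (a) could look like (the columns yielded by db.indexes() differ between Neo4j versions; this assumes the 3.x shape, where description reads like INDEX ON :Metabolite(id)):

```cypher
// List existing indexes and derive DROP statements from them,
// so we only ever drop indexes that actually exist.
CALL db.indexes() YIELD description
RETURN 'DROP ' + description AS dropStatement;
```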

e0 commented 4 years ago

In the build stage for production, it always starts from a fresh neo4j image, so the database would always be fresh/reset. The problem here is that DROP INDEX... becomes an invalid statement if no such index exists. I think the output file should prioritise production and not include any DROP INDEX... by default.

For the development environment, I don't think we should even include the db import step in the build. It feels reasonable to copy and paste the Cypher instructions manually into the neo4j browser.

pecholleyc commented 4 years ago

My bad, I forgot about this problem. I would, myself, make it compatible with every situation. The import script is meant to add new data to the neo4j database, whether it is empty or not, meaning indexes should be systematically (re-)created, and therefore systematically dropped first to ensure that. I would opt for option a), or a simpler equivalent if possible (in SQL there is an easy way to solve this problem with the DROP INDEX IF EXISTS statement).

mihai-sysbio commented 4 years ago

Perhaps apoc.schema.assert is a good alternative to (a). I'm really in favour of having the same way of importing in production and in development, even if it would increase the build time for development. From what I see, building is not required that often for development, thanks to hot reloading in both the frontend and the api.
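A rough sketch of what that could look like, assuming the APOC plugin is installed (:Metabolite(id) is just the example index from above):

```cypher
// Make the schema match exactly the given maps: any index or
// constraint not listed is dropped, and listed ones are created
// if missing. This makes the import idempotent without DROP INDEX.
CALL apoc.schema.assert(
  {Metabolite: ['id']},  // indexes: label -> list of indexed properties
  {}                     // constraints: none asserted here
);
```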

e0 commented 4 years ago

I think the build time to add the data import would be pretty negligible. Is there any reason why you would want to have the same setup?

I was thinking that for development we might want to experiment with different data frequently enough that it's good to encourage using the neo4j browser. Rebuilding the stack every time we want to import different data feels unnecessarily time-consuming. Of course one could still use the neo4j browser, but I personally would prefer enforcing a single method of importing the data.

I don't feel too strongly about this though. Just want to provide my reasoning here. Feel free to respond with a decision on this and I can make the changes accordingly.

mihai-sysbio commented 4 years ago

The same setup brings simplicity (no "if" involved) and robustness (it guarantees we test the same method thoroughly by using it routinely in development, so we can pick up any problems that might otherwise surprise us during deployment). And by "same setup" I mean minimizing the differences between development and production as much as possible, pretty much as you say @e0:

> a single method of importing the data

Please take all necessary steps for simplicity and robustness.

pecholleyc commented 4 years ago

Could not get the APOC approach to work, so for the time being I have removed the drop-index instructions by default. They can be added back with the --drop-indexes arg. Commit bbb4cab6361ac51d47d65c0461b183f4badf2a40.
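To illustrate the difference (a sketch; the exact layout of the generated file is an assumption, with :Metabolite(id) as the example index from above):

```cypher
// Default output: safe on a fresh database.
CREATE INDEX ON :Metabolite(id);

// Output with --drop-indexes: only valid when the index already exists.
DROP INDEX ON :Metabolite(id);
CREATE INDEX ON :Metabolite(id);
```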

e0 commented 4 years ago

To clarify, the "single method" I mentioned above didn't imply the same method for development and production. I meant that by having the import in the build step, in addition to the possibility of importing via the neo4j browser, there would be two methods of importing data available in the development environment. I do like your point on the chance to pick up deployment problems, though.

Thank you all for the discussion on this topic. Good that you added the new commit as well, @pecholleyc. With this, I think we can close this issue, and I can make the changes to the deployment build in the other repo.