IHTSDO / snomed-database-loader

Represent SNOMED CT in different types of databases

Issue with loading Neo4J #6

Closed rorydavidson closed 6 years ago

rorydavidson commented 7 years ago

(Provided by another user and copied here.)

I am working with SNOMED CT. I have seen your code and tried to load data into Neo4j, but I ran into problems that maybe you can help me with.

This is the problem.

I ran this:

python snomed_g_graphdb_build_tools.py db_build --action create --rf2 C:/Users/Marcelo/Documents/ReleaseSnomed/SnomedCT_RF2Release_INT_20150731 --release_type full --neopw64 c21xcw== --output_dir C:/Users/marcelo/Documents/smqs

I got this:

SNOMED_G bin directory [C:/Users/Marcelo/Downloads/SNOMED-CT-Database-master/SNOMED-CT-Database-master/NEO4J/]
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\base64.py", line 517, in _input_type_check
    m = memoryview(s)
TypeError: memoryview: a bytes-like object is required, not 'str'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "snomed_g_graphdb_build_tools.py", line 328, in <module>
    parse_and_interpret(sys.argv[1:]) # causes sub-command processing to occur as well
  File "snomed_g_graphdb_build_tools.py", line 325, in parse_and_interpret
    command_interpreters[command_index]1 # call appropriate interpreter
  File "snomed_g_graphdb_build_tools.py", line 198, in db_build
    if opts.mode=='build': neo4j = snomed_g_lib_neo4j.Neo4j_Access(base64.decodestring(opts.neopw64))
  File "C:\ProgramData\Anaconda3\lib\base64.py", line 559, in decodestring
    return decodebytes(s)
  File "C:\ProgramData\Anaconda3\lib\base64.py", line 551, in decodebytes
    _input_type_check(s)
  File "C:\ProgramData\Anaconda3\lib\base64.py", line 520, in _input_type_check
    raise TypeError(msg) from err
TypeError: expected bytes-like object, not str

kaicode commented 7 years ago

Hi Marcelo,

The Neo4J Snomed database loader scripts were contributed by a third party, see attribution here - https://github.com/rorydavidson/SNOMED-CT-Database/tree/master/NEO4J#for-attribution

I am not familiar with this load script or the Python language, but it appears from your stack trace that the script failed on line 198 of snomed_g_graphdb_build_tools.py, which attempts to decode the --neopw64 parameter.

https://github.com/rorydavidson/SNOMED-CT-Database/blob/master/NEO4J/snomed_g_graphdb_build_tools.py#L198

I can see in the README that the examples given do not show how to quote the Neo4J password. As a workaround you could edit your copy of the script and insert your password directly into the script. Replace the line:

if opts.mode=='build': neo4j = snomed_g_lib_neo4j.Neo4j_Access(base64.decodestring(opts.neopw64))

with:

if opts.mode=='build': neo4j = snomed_g_lib_neo4j.Neo4j_Access("YOUR_PASSWORD")

If you have to make this sort of workaround it's best practice to not commit your password into any public git repository.

I will leave this issue open in the hope that someone else can provide a more elegant solution. Best of luck getting the script working.

Kind regards, Kai

wcampbel commented 7 years ago

1. What the error is saying.

The error the command reports comes from decoding the base64 password: the call below fails on the base64 string given on the command line.

base64.decodestring(opts.neopw64) <== generating exception for this user

It is odd in that I looked at the base64 string that was specified and it looked valid to me. Specifically:

--neopw64 c21xcw== <== looks okay at first blush, yet apparently causing problem for the user
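
Since the string itself is well-formed base64, the message in the traceback ("expected bytes-like object, not str") suggests the problem may be the type of the argument rather than its content: on Python 3 a command-line argument is a str, and the base64 functions in the traceback require bytes. A minimal sketch of a decode that accepts either type follows; `decode_neo4j_password` is a hypothetical helper for illustration, not something in the loader scripts.

```python
import base64


def decode_neo4j_password(neopw64):
    """Decode a base64 password given as text or bytes and return text."""
    # On Python 3 a command-line argument is str, so turn it into bytes first.
    # On Python 2.7 a plain str already is a byte string and passes through.
    if not isinstance(neopw64, bytes):
        neopw64 = neopw64.encode("ascii")
    # base64.b64decode exists on both Python 2.7 and 3.x and returns bytes.
    return base64.b64decode(neopw64).decode("utf-8")


if __name__ == "__main__":
    # Round-trip a placeholder value rather than a real password.
    encoded = base64.b64encode(b"example-password").decode("ascii")
    print(decode_neo4j_password(encoded))
```

The sketch uses `base64.b64decode` rather than `base64.decodestring` because the latter was deprecated in Python 3 and removed in Python 3.9.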

2. Can we get environment information from the user?

Can we obtain information about the environment that the user is trying to use?

The windows environment I used in the past to build NEO4J graphs with SNOMED CT:

Windows 10
Anaconda Python, python version 2.7.13
py2neo library, version 3.1.1
DOS command line
NEO4J 3.2.2 running on port 7474, an empty database prior to executing the python command

How to tell the version of py2neo?

From a python interpreter, execute the following python statements:

  import py2neo
  print(py2neo.__version__)
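
If it is easier to report everything in one go, the same information can be gathered by a short script. This is only a sketch; `describe_environment` is a hypothetical name, and it assumes py2neo exposes `__version__` as shown above.

```python
import platform
import sys


def describe_environment():
    """Collect the interpreter, OS and py2neo versions as plain values."""
    info = {
        "python": platform.python_version(),
        "os": platform.platform(),
        "executable": sys.executable,
    }
    try:
        import py2neo  # optional: only needed for the version lookup
        info["py2neo"] = getattr(py2neo, "__version__", "unknown")
    except ImportError:
        info["py2neo"] = None  # py2neo is not installed for this interpreter
    return info


if __name__ == "__main__":
    for name, value in sorted(describe_environment().items()):
        print("{0}: {1}".format(name, value))
```

With Anaconda, the `executable` entry also shows which of the installed interpreters actually ran the script.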

3. I have had a lot more success building the graph from Linux than from Windows.

It may or may not be possible for the user to try this from a Linux machine, but I have had far fewer problems building the graph on Linux. Just FYI.

I will help get this to work on Windows for this user, but it tends to be generally less painful on Linux.

wcampbel commented 7 years ago

I was able to build a NEO4J graph on Windows, using the configuration specified in my previous comment. I include the command I used and the output it generates.

The command varies slightly from what the user ran: it specifies the /Full/ subfolder in the --rf2 parameter and uses c:\temp\Users\Marcelo instead of c:\Users\Marcelo, as I don't have a Marcelo user on my machine.

COMMAND:

python snomed_g_graphdb_build_tools.py db_build --action create --rf2 C:/temp/Users/Marcelo/ReleaseSnomed/SnomedCT_RF2Release_INT_20150731/Full/ --release_type full --neopw64 c21xcw== --output_dir C:/temp/Users/marcelo/Documents/smqs

OUTPUT:

SNOMED_G bin directory [C:/temp/Users/Marcelo/github--rorydavidson/SNOMED-CT-Database-master/NEO4J/]
sequence did not exist, primed
JOB_START
FIND_ROLENAMES
FIND_ROLEGROUPS
MAKE_CONCEPT_CSVS
MAKE_DESCRIPTION_CSVS
MAKE_ISA_REL_CSVS
MAKE_DEFINING_REL_CSVS
TEMPLATE_PROCESSING
CYPHER_EXECUTION
CHECK_RESULT
JOB_END
RESULT: SUCCESS

CHECKING THE RESULT A BIT:

Then investigating the graph:

match (a:ObjectConcept) return count(a);

==> 421,657

So, it shows 421,657 SNOMED CT codes, not all of which are active.

To look for active concepts (from 2015-07-31)

match (a:ObjectConcept) where a.active='1' return count(a)

==> 317,057

So, apparently finding 317,057 active SNOMED CT concepts in the international 2015-07-31 release.

Can find the same information this way:

match (a:ObjectConcept {active:'1'}) return count(a)
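
The same two counts could also be fetched from Python instead of the browser. The sketch below assumes a py2neo 3.x style `Graph` object, whose `run()` returns a cursor with an `evaluate()` method; `count_concepts` is a hypothetical helper name.

```python
def count_concepts(graph, active_only=False):
    """Count ObjectConcept nodes, optionally restricted to active ones."""
    if active_only:
        query = "MATCH (a:ObjectConcept {active: '1'}) RETURN count(a)"
    else:
        query = "MATCH (a:ObjectConcept) RETURN count(a)"
    # Assumption: graph.run() returns a cursor whose evaluate() yields the
    # first value of the first record, as in py2neo 3.x.
    return graph.run(query).evaluate()


# Usage (needs py2neo installed and a running NEO4J instance):
#   import py2neo
#   graph = py2neo.Graph(password="YOUR_PASSWORD")
#   print(count_concepts(graph))
#   print(count_concepts(graph, active_only=True))
```

Note that `active` is compared against the string '1', matching the queries above, because the loader stores the CSV values as text.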

jayped007 commented 7 years ago

I wonder if the issue is the version of python that is being used.

This software has been created and tested with python 2.7. It has not yet been upgraded to support python 3.x. It appears that Anaconda python was used, which allows for the installation of multiple versions of python. I suggest retrying with python 2.7 if possible.
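
A quick way to confirm which interpreter is picked up is a small guard like the one below. It is a sketch only; `is_supported_python` is a hypothetical helper, not part of the loader.

```python
import sys


def is_supported_python(version_info=None):
    """Return True only for Python 2.7, the version the scripts were tested with."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == (2, 7)


if __name__ == "__main__":
    if not is_supported_python():
        print("Warning: running Python {0}.{1}; the loader was tested with 2.7".format(
            *sys.version_info[:2]))
```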

mbonda commented 7 years ago

Hello.

This worked: replacing

if opts.mode=='build': neo4j = snomed_g_lib_neo4j.Neo4j_Access(base64.decodestring(opts.neopw64))

with:

if opts.mode=='build': neo4j = snomed_g_lib_neo4j.Neo4j_Access("YOUR_PASSWORD")

But I have another problem.

C:\Users\Marcelo\Downloads\SNOMED-CT-Database-master\SNOMED-CT-Database-master\NEO4J>python snomed_g_graphdb_build_tools.py db_build --action create --rf2 C:/nuevo/ReleaseSnomed/ --release_type full --neopw64 c21xcw== --output_dir C:/Users/marcelo/Documents/smqs/
SNOMED_G bin directory [C:/Users/Marcelo/Downloads/SNOMED-CT-Database-master/SNOMED-CT-Database-master/NEO4J/]
sequence did not exist, primed
JOB_START
FIND_ROLENAMES
Traceback (most recent call last):
  File "snomed_g_graphdb_build_tools.py", line 330, in <module>
    parse_and_interpret(sys.argv[1:]) # causes sub-command processing to occur as well
  File "snomed_g_graphdb_build_tools.py", line 327, in parse_and_interpret
    command_interpreters[command_index]1 # call appropriate interpreter
  File "snomed_g_graphdb_build_tools.py", line 308, in db_build
    save_and_report_results(DB, seqnum, stepnames, results_d, logfile)
  File "snomed_g_graphdb_build_tools.py", line 119, in __init__
    set_step_variables(stepname)
  File "snomed_g_graphdb_build_tools.py", line 100, in set_step_variables
    self.output = self.results_d[stepname] .get('STDOUT','').decode('utf-8')
AttributeError: 'str' object has no attribute 'decode'

jayped007 commented 7 years ago

It appears to me that you are using python 3.x to execute this code, but this python code is written for python 2.7. It will be updated to work with python 3.x, but that has not yet happened.

In python 3.x, strings do not have a decode method, but they do in python 2.7.
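
To illustrate the difference, a helper along these lines would tolerate both cases. It is a sketch under the assumption that the value is either bytes or already text; `to_text` is a hypothetical name and is not in the loader code.

```python
def to_text(value, encoding="utf-8"):
    """Return text for either bytes or str input."""
    # Python 2.7: a plain string is a byte string, which has .decode().
    # Python 3.x: a str is already text and has no .decode() method.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


if __name__ == "__main__":
    print(to_text(b"JOB_START"))
    print(to_text("JOB_START"))
```

Calling `.decode('utf-8')` directly, as the failing line in the traceback does, only works when the value is a byte string.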

Is it possible for you to try python 2.7?

If you are using Anaconda python, I believe you could do the following.

conda create -n py27 python=2.7
activate py27

You can switch back to your normal python version with:

activate root

It would require installing the necessary libraries like py2neo and sqlitedict into your python 2.7.

mbonda commented 7 years ago

That was exactly the problem, thanks @jayped007. But now, Houston, I have a new problem:

SNOMED_G bin directory [C:/Users/Marcelo/Downloads/SNOMED-CT-Database-master/SNOMED-CT-Database-master/NEO4J/]
sequence did not exist, primed
JOB_START
FIND_ROLENAMES
FIND_ROLEGROUPS
MAKE_CONCEPT_CSVS
MAKE_DESCRIPTION_CSVS
MAKE_ISA_REL_CSVS
MAKE_DEFINING_REL_CSVS
TEMPLATE_PROCESSING
CYPHER_EXECUTION
FAILED (steps: ['CYPHER_EXECUTION'])

mbonda commented 7 years ago

Build.log

step:[CYPHER_EXECUTION],result:[FAILED (STATUS 83)],command:[python C:/Users/Marcelo/Downloads/SNOMED-CT-Database-master/SNOMED-CT-Database-master/NEO4J//snomed_g_neo4j_tools.py run_cypher build.cypher --verbose --neopw64 c21xcw==],status/expected:83/0,duration:0:00:01.665000,output:[],error:[],cmd_start:[2017-07-28 09:16:47.357000],cmd_end:[2017-07-28 09:16:49.022000]

jayped007 commented 7 years ago

What this means is that the procedure has processed the SNOMED CT RF2 files and created a CYPHER script (called build.cypher) to load nodes and edges that represent the SNOMED CT information in NEO4J. It has also built a significant number of CSV files that the build.cypher script depends on (in the directory specified by --output_dir).

The processing of the CYPHER code is what is apparently failing at the moment.

The python code assumes that NEO4J is running on the same machine, on port 7474. It uses the py2neo library to communicate with the NEO4J rest api (at URL localhost:7474). It also assumes that the NEO4J database is basically empty, or at least does not contain any of the types of nodes and edges that will be created by it (ObjectConcept nodes, Description nodes, RoleGroup nodes, ISA relationships, etc).

Is this the case? Do you have NEO4J running on that machine on port 7474? Can you tell me the NEO4J version?
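
One way to check the first assumption is to test whether anything is listening on the port. The sketch below only confirms that a TCP connection to localhost:7474 succeeds, not that the listener is NEO4J; `port_is_open` is a hypothetical helper.

```python
import socket


def port_is_open(host="localhost", port=7474, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        conn = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout):
        return False  # nothing listening, or the connection timed out
    conn.close()
    return True


if __name__ == "__main__":
    print("port 7474 reachable: {0}".format(port_is_open()))
```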

The build.log is meant to have error information, when errors occur. Could you possibly post the whole build.log or examine it for further error information?
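
For a long build.log it may help to pull the fields out programmatically. The sketch below assumes every entry uses the `name:[value]` layout of the single line posted above, which is inferred from that one example; the function names are hypothetical.

```python
import re

# Matches fields written as name:[value], e.g. step:[CYPHER_EXECUTION].
FIELD_PATTERN = re.compile(r"(\w+):\[(.*?)\]")


def parse_build_log_line(line):
    """Return the bracketed fields of one build.log line as a dict."""
    return dict(FIELD_PATTERN.findall(line))


def failed_steps(lines):
    """Yield (step, result) pairs for lines whose result starts with FAILED."""
    for line in lines:
        fields = parse_build_log_line(line)
        if fields.get("result", "").startswith("FAILED"):
            yield fields.get("step"), fields["result"]


# Usage, assuming build.log is in the current directory:
#   with open("build.log") as handle:
#       for step, result in failed_steps(handle):
#           print(step, result)
```

Unbracketed fields such as status/expected and duration are ignored by this pattern.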

jayped007 commented 7 years ago

I don't know if this is the case for you or not, but I have had issues when any of the directories involved contain spaces in their names, like 'Program Files'. For example, trying to use C:\My Files would probably cause issues (versus C:\MyFiles). I would suggest using directories that don't contain spaces when trying to use this software.
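
A trivial check for this, as a sketch (`paths_with_spaces` is a hypothetical helper, and the paths in the usage comment are only examples):

```python
def paths_with_spaces(paths):
    """Return the paths that contain at least one space character."""
    return [path for path in paths if " " in path]


# Usage: pass the script directory, the --rf2 directory and the --output_dir.
#   print(paths_with_spaces(["C:/MyFiles/rf2", "C:/My Files/out"]))
```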

mbonda commented 7 years ago

Hello @jayped007. I don't have spaces in the names:
C:/Users/Marcelo/Downloads/SNOMED-CT-Database-master/SNOMED-CT-Database-master/NEO4J/
CSV: C:\Users\Marcelo\Documents\smqs

jayped007 commented 7 years ago

Something to consider:

1. Enable LOAD CSV to load files from any directory

NEO4J, I believe in version 3, introduced a configuration option in neo4j.conf that controls where the LOAD CSV command can load files from.

This is what you see by default in the configuration

dbms.directories.import=import

This disallows things like importing CSV files from C:/Users/Marcelo

I am guessing that perhaps this is the issue you are running into. And at least temporarily, I would comment out that configuration line ... and retry executing the procedure (which is trying to use LOAD CSV commands to load the CSV files it created).

To comment out the configuration item, prefix it with #:

#dbms.directories.import=import

Then restart NEO4J and retry.

So, if the issue is that the LOAD CSV statement is failing because it can't find the CSV files, then this is very likely the reason.
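
To confirm whether the restriction is still active after editing, the configuration text can be scanned for an uncommented setting. This is a sketch only; `active_import_dir` is a hypothetical helper, and the location of neo4j.conf depends on the installation.

```python
def active_import_dir(conf_text):
    """Return the value of an uncommented dbms.directories.import line, or None."""
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skip comments and lines that are not key=value
        key, value = line.split("=", 1)
        if key.strip() == "dbms.directories.import":
            return value.strip()
    return None


# Usage, with the path to your own neo4j.conf:
#   with open("neo4j.conf") as handle:
#       print(active_import_dir(handle.read()))
```

A return value of None means no active restriction was found in the file.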

jayped007 commented 7 years ago

Here is how you get to the configuration files on Windows.

Select the "Options" button at the bottom of the form that is displayed when you click on NEO4J (the one that allows you to select a NEO4J database).

There is a set of 3 files, in NEO4J 3, that you can configure. The first one is "neo4j.conf"; click the "Edit" button to bring it up in a text editor. That will allow you to modify the

dbms.directories.import=import

item, changing it to be commented out.

mbonda commented 7 years ago

I commented this out but the problem continues. Maybe this is the problem?

n4jpw = base64.decodestring(opts.neopw64)
graph_db = py2neo.Graph(password=n4jpw)

jayped007 commented 7 years ago

Hi Marcelo,

I believe I can help get you across the finish line on this.

Take a look at the "build.cypher" file in a text editor; that file was created by the software you have already run (along with many CSV files).

What you will find in there is a significant number of NEO4J CYPHER statements.

The job of these statements is to create indexes and constraints and then load the SNOMED CT content, which has been placed in CSV files, into nodes and edges in NEO4J.

You could run these, one by one, in your NEO4J browser.

The initial ones which create indexes and constraints should execute with no problem at all.

For example:

CREATE CONSTRAINT ON (c:ObjectConcept) ASSERT c.id IS UNIQUE;
CREATE CONSTRAINT ON (c:ObjectConcept) ASSERT c.sctid IS UNIQUE;

I suspect strongly that the LOAD CSV CYPHER commands are the ones that are failing.

For example, when I tried to recreate your situation, I had the following LOAD CSV command as the first one in my build.cypher:

USING PERIODIC COMMIT 200
LOAD csv with headers from "file:///C:/temp/Users/marcelo/Documents/smqs/concept_new.csv" as line
with line
CREATE (:ObjectConcept { nodetype: 'concept', id: line.id, sctid: line.id, active: line.active,
  effectiveTime: line.effectiveTime, moduleId: line.moduleId, definitionStatusId: line.definitionStatusId,
  FSN: line.FSN, history: line.history} );

This command worked for me.

If you find the corresponding first LOAD CSV command in your build.cypher, and try to execute it in your NEO4J browser, presumably at:

localhost:7474

It should presumably fail, and the error that it generates will be the same error that is occurring when the software tries to perform this same operation now.

If you could let me know what error you are seeing, then I think I can help you move forward.
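
To make copying the statements one at a time a little easier, build.cypher could be split on its terminating semicolons. The sketch below is deliberately naive: it assumes no semicolon appears inside a quoted string, and `split_cypher_statements` is a hypothetical helper.

```python
def split_cypher_statements(script_text):
    """Split a Cypher script into statements on ';' (naive: no quoted semicolons)."""
    statements = []
    for chunk in script_text.split(";"):
        statement = chunk.strip()
        if statement:
            statements.append(statement + ";")
    return statements


# Usage, assuming build.cypher is in the current directory:
#   with open("build.cypher") as handle:
#       for number, statement in enumerate(split_cypher_statements(handle.read()), 1):
#           print("-- statement {0}".format(number))
#           print(statement)
```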

Thanks!

Jay Pedersen

mbonda commented 7 years ago

Hello @jayped007, I ran the scripts one by one in Neo4j and I found this problem:

There is not enough memory to perform the current task. Please try increasing 'dbms.memory.heap.max_size' in the neo4j configuration (normally in 'conf/neo4j.conf' or, if you you are using Neo4j Desktop, found through the user interface) or if you are running an embedded installation increase the heap by using '-Xmx' command line flag, and then restart the database.

jayped007 commented 7 years ago

What this tells me is that your NEO4J server does not have enough configured memory to perform this operation. The LOAD CSV commands from build.cypher are failing because the NEO4J server itself is failing when trying to execute them due to memory issues. So, now we are moving into the arena of Java issues. The following notes give some direction on trying to fix the Java memory issues.

NOTES

There is a NEO4J 2.x/3.x configuration file on Windows, known as neo4j-community.vmoptions, which modifies the Java virtual machine settings. Prominent among those settings is the -Xmx<size> setting, which controls the maximum heap size for Java. The same dialog box for NEO4J on Windows that allowed modifying neo4j.conf has a button for modifying the vmoptions configuration, which I believe is labeled "Java VM Tuning".

My neo4j-community.vmoptions file currently only has comments

# Enter one VM parameter per line, note that some parameters can only be set once.
# For example, to adjust the maximum memory usage to 512 MB, uncomment the following line
# -Xmx512m

In your case, you may want to try changing the line from

# -Xmx512m

To

-Xmx2G

That is, if you have 2 GB of memory available for NEO4J.

Make a similar change, restart NEO4J, and retry -- see if that allows you to move forward.

In the past, on a machine with 16GB of memory, I used the following configuration:

-Xmx5G
-Xms3G
-Xss2G
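
When picking values, keep in mind that the JVM refuses to start if the initial heap (-Xms) is larger than the maximum heap (-Xmx). A small sketch for sanity-checking a pair of options before restarting; the helper names are hypothetical and only the k, m and g suffixes are handled.

```python
import re

_UNIT_BYTES = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def jvm_size_bytes(option):
    """Convert a JVM size option such as '-Xmx2G' into a number of bytes."""
    match = re.match(r"^-X(?:mx|ms|ss)(\d+)([kKmMgG]?)$", option.strip())
    if match is None:
        raise ValueError("not a recognised size option: {0!r}".format(option))
    return int(match.group(1)) * _UNIT_BYTES[match.group(2).lower()]


def heap_is_consistent(initial_option, maximum_option):
    """Return True when the initial heap (-Xms) does not exceed the maximum (-Xmx)."""
    return jvm_size_bytes(initial_option) <= jvm_size_bytes(maximum_option)


if __name__ == "__main__":
    print(heap_is_consistent("-Xms3G", "-Xmx5G"))
```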

Articles that might be of use:

https://stackoverflow.com/questions/43078285/error-importing-my-csv-data-to-neo4j-java-heap-space

http://neo4j.com/docs/operations-manual/current/performance/#heap-sizing

mbonda commented 7 years ago

Finally, I tried with the server version and everything worked fine. I'll try the Spanish version.

Thanks @kaicode, @rorydavidson, @wcampbel, @jayped007, @aflinton.

jayped007 commented 7 years ago

I just love a happy ending. 👍 Let us know how the Spanish version loading works.

adrianalonsoba commented 7 years ago

I'm having problems with the Spanish version. @mbonda, did you get it to work?

mbonda commented 7 years ago

Hi @adrianalonsoba, yes, it did not work for me. I'm going to put one together shortly based on the Spanish version. Where are you from, Adrian?

adrianalonsoba commented 7 years ago

I'm Spanish, and you? I have no problem speaking English if you prefer... It seems there is some problem with parsing the files in the Spanish version... So you didn't manage to get it working? I'm particularly interested in doing this in neo4j.

mbonda commented 7 years ago

This is my email: mbondarenco@gmail.com. I'm starting a project with SNOMED and Neo4j, mainly as a visualization tool; I'll keep you posted on my progress. I'm from Uruguay.

adrianalonsoba commented 7 years ago

Mine is adrian.alonsoba@gmail.com. Thank you; if I make progress I will share it with you too.

wcampbel commented 7 years ago

If you have problems with Neo4j and importing SNOMED CT, we can help. You need the correct versions of python and a few other programs to use our algorithm.

W. Scott Campbell, PhD, MBA Assistant Professor Director of Public Health Laboratory Informatics and Pathology Laboratory Informatics Department of Pathology/Microbiology University of Nebraska Medical Center 985900 Nebraska Medical Center Omaha NE 68198-5900 402-559-9593 (office) 402-350-7851 (mobile)

adrianalonsoba commented 7 years ago

Thanks for your answer @wcampbel. I have uploaded the international version with no errors, but when I try to add the Spanish extension it raises several parsing errors, maybe due to Spanish character conflicts.

adrianalonsoba commented 7 years ago

Solved, simply by renaming the files to match the international file names. It seems that the Spanish extension is uploaded successfully.

wcampbel commented 7 years ago

Fantastic! Let me know if you have any questions or comments.

adrianalonsoba commented 7 years ago

I am currently exploring the relations model, @wcampbel thank you so much for sharing your fantastic work.