ResearchSoftwareInstitute / greendatatranslator

Green Team Data Translator Software Engineering and Development
BSD 3-Clause "New" or "Revised" License

Chemotext3 #111

Open stevencox opened 6 years ago

stevencox commented 6 years ago

@balhoff is annotating PubMed abstracts from Medline with identifiers from a variety of ontologies.

We'd like to parallelize his work using Spark, since this should speed up the pipeline substantially.

This notebook should get us there, but it fails to import classes from JAR files that have been loaded as dependencies. We need to investigate why this is happening.

balhoff commented 6 years ago

Could it be the Scala version? Does Zeppelin require a particular one? This project uses 2.11 because Neo4j has a 2.11 dependency (I use 2.12), but I wonder if Zeppelin is on 2.10.
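One quick way to test the version theory is to look at which Scala binary versions the staged JARs were built for. What follows is only a minimal sketch in Python, run outside the notebook. It assumes the JARs follow the usual `name_<scalaVersion>-<version>.jar` naming convention, and the helper names `scala_binary_version` and `group_by_scala_version` are made up for illustration. The two file names are from this project's staged lib directory.

```python
import re
from collections import defaultdict

# Scala libraries conventionally carry the Scala binary version as a suffix,
# e.g. "ssl-config-core_2.11-0.2.2.jar"; plain Java JARs have no such suffix.
# This is a heuristic on file names only, not an inspection of JAR contents.
_SCALA_SUFFIX = re.compile(r"_(\d+\.\d+)-[^_]*\.jar$")


def scala_binary_version(jar_name):
    """Return the Scala binary version encoded in a JAR file name, or None."""
    match = _SCALA_SUFFIX.search(jar_name)
    return match.group(1) if match else None


def group_by_scala_version(jar_names):
    """Map each detected Scala binary version (or None) to its JAR names."""
    groups = defaultdict(list)
    for name in jar_names:
        groups[scala_binary_version(name)].append(name)
    return dict(groups)


jars = [
    "net.sf.trove4j.trove4j-3.0.3.jar",
    "com.typesafe.ssl-config-core_2.11-0.2.2.jar",
]
print(group_by_scala_version(jars))
```

If more than one Scala version shows up among the keys, or the single version differs from the one the interpreter runs, that would support the mismatch theory. JARs grouped under `None` are presumably plain Java libraries and should not care about the Scala version.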

stevencox commented 6 years ago

I suppose that could also be a problem. But it can't even import a class from the json JAR, i.e. things in straight Java land. I'm thinking the Scala version should not impact that, eh?
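To separate "the class is not in any of the JARs" from "the JARs are not on the interpreter's classpath", it may help to first confirm that the class file really is inside one of the staged JARs. Below is a rough sketch in Python, run outside Zeppelin. A JAR is a zip archive, so the standard `zipfile` module can read it. The function names are made up for illustration, and any class name passed in (for example `org.json.JSONObject`) is an assumption, since the exact class that failed to import is not recorded here.

```python
import zipfile
from pathlib import Path


def class_entry(class_name):
    """Convert a fully qualified class name to its path inside a JAR."""
    return class_name.replace(".", "/") + ".class"


def jars_containing(class_name, lib_dir):
    """Return the names of the JARs in lib_dir that contain class_name."""
    entry = class_entry(class_name)
    hits = []
    for jar in sorted(Path(lib_dir).glob("*.jar")):
        # A JAR is a zip archive, so its entries can be listed directly.
        with zipfile.ZipFile(jar) as archive:
            if entry in archive.namelist():
                hits.append(jar.name)
    return hits
```

Calling `jars_containing("org.json.JSONObject", "/projects/stars/pubmed/pubmed-terms/target/universal/stage/lib")` would list the JARs that hold that class. A non-empty result would suggest the problem is in how the interpreter loads the dependencies rather than in the JARs themselves.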

stevencox commented 6 years ago

All quiet in the interpreter log: /projects/stars/stack/zeppelin/zeppelin-0.7.3-bin-all/logs/

 INFO [2018-02-23 23:01:38,519] ({pool-2-thread-4} Logging.scala[logInfo]:54) - Added JAR /projects/stars/pubmed/pubmed-terms/target/universal/stage/lib/ at spark:// with timestamp 1519444898519 
 INFO [2018-02-23 23:01:38,519] ({pool-2-thread-4}[open]:947) - sc.addJar(/projects/stars/pubmed/pubmed-terms/target/universal/stage/lib/ 
 INFO [2018-02-23 23:01:38,519] ({pool-2-thread-4} Logging.scala[logInfo]:54) - Added JAR /projects/stars/pubmed/pubmed-terms/target/universal/stage/lib/net.sf.trove4j.trove4j-3.0.3.jar at spark:// with timestamp 1519444898519 
 INFO [2018-02-23 23:01:38,520] ({pool-2-thread-4}[open]:947) - sc.addJar(/projects/stars/pubmed/pubmed-terms/target/universal/stage/lib/net.sf.trove4j.trove4j-3.0.3.jar) 
 INFO [2018-02-23 23:01:38,520] ({pool-2-thread-4} Logging.scala[logInfo]:54) - Added JAR /projects/stars/pubmed/pubmed-terms/target/universal/stage/lib/com.typesafe.ssl-config-core_2.11-0.2.2.jar at spark:// with timestamp 1519444898520 
 INFO [2018-02-23 23:01:38,521] ({pool-2-thread-4}[open]:947) - sc.addJar(/projects/stars/pubmed/pubmed-terms/target/universal/stage/lib/com.typesafe.ssl-config-core_2.11-0.2.2.jar)
 INFO [2018-02-23 23:01:38,522] ({pool-2-thread-4}[populateSparkWebUrl]:1013) - Sending metainfos to Zeppelin server: {url=}
 INFO [2018-02-23 23:01:38,523] ({Thread-22} Logging.scala[logInfo]:54) - Mesos task 4 is now TASK_RUNNING 
 INFO [2018-02-23 23:01:38,523] ({Thread-23} Logging.scala[logInfo]:54) - Mesos task 3 is now TASK_RUNNING 
 INFO [2018-02-23 23:01:38,556] ({pool-2-thread-4}[jobFinished]:137) - Job remoteInterpretJob_1519444893072 finished by scheduler org.apache.zeppelin.spark.SparkInterpreter2040526988 
 INFO [2018-02-23 23:01:40,463] ({dispatcher-event-loop-1} Logging.scala[logInfo]:54) - Registered executor NettyRpcEndpointRef(spark-client://Executor) ( with ID 2 

balhoff commented 6 years ago

PubMed-to-RDF code is currently here:

@stevencox should I move it to this org, or somewhere else?

stevencox commented 6 years ago

It could go in Tangerine

@cbizon - I'm guessing as a data source, you're not looking to see it in Gamma. Correct me if I've got that wrong.

balhoff commented 6 years ago

A first-pass triplestore has been created. I am rebuilding it now to add some content missed in the first run (extracting substance terms from the PubMed data, and adding the ChEBI ontology and HGNC gene symbols).

The next step is to create a SmartAPI using grlc.