aphp / UimaOnSpark

A way to run UIMA pipelines on Apache Spark

MimicSectionSegmenter

Run the segmenter

  1. download Spark 2.2 or higher and unpack it into <spark_folder>
  2. clone and compile the uima-aphp project
  3. clone this project
  4. put the uima-aphp/uima-segmenter/target/uima-segmenter-1.0-standalone.jar [1] under the UimaOnSpark/lib/ folder
  5. compile this project with sbt publish-local
  6. copy the target/scala-2.11/uimaonspark_2.11-0.1.0-SNAPSHOT.jar[2]
  7. copy NOTEEVENTS.csv.gz, ref_doc_section.csv, [1] and [2] into a <working_folder>
  8. run the Spark command below
  9. the resulting CSV is written to the output location given to the Spark command (see the inspection sketch after the command)

Spark Command

Standalone

<spark_folder>/sbin/start-master.sh
<spark_folder>/sbin/start-slave.sh  spark://0.0.0.0:7077 -c 4
<spark_folder>/bin/spark-submit \
--class fr.aphp.wind.uima.spark.MimicSectionSegmenter \
--jars uima-segmenter-1.0-SNAPSHOT-standalone.jar,uimaonspark_2.11-0.1.0-SNAPSHOT.jar \
--files ref_doc_section.csv \
--master spark://0.0.0.0:7077 \
--executor-cores 1 \
uimaonspark_2.11-0.1.0-SNAPSHOT.jar \
/tmp/ \
note_nlp.csv \
NOTEEVENTS.csv.gz \
200
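
Once the job has finished, the output can be sanity-checked with a few lines of Spark. The sketch below is only illustrative: it assumes the result ends up under /tmp/note_nlp.csv (a guess based on the arguments passed above) and that it is a CSV with a header row; adjust both if they do not match.

import org.apache.spark.sql.SparkSession

object InspectNoteNlp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("inspect-note_nlp")
      .master("local[*]")          // a local run is enough for a quick look
      .getOrCreate()

    // Path and header option are assumptions; point this at wherever the job
    // actually wrote note_nlp.csv.
    val sections = spark.read
      .option("header", "true")
      .csv("/tmp/note_nlp.csv")

    sections.printSchema()
    sections.show(20, truncate = false)

    spark.stop()
  }
}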

The information below is outdated


NotePhiAnnotator

Run UIMA pipelines over Spark

uimaFIT

Apparently, there is no problem, thanks to simplifying the setup and removing the XML descriptor handling
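
For illustration, here is a minimal uimaFIT pipeline built entirely in code, with no XML descriptor involved; NoOpAnnotator and the sample text are placeholders standing in for the real aphp annotators.

import org.apache.uima.fit.component.JCasAnnotator_ImplBase
import org.apache.uima.fit.factory.AnalysisEngineFactory.createEngine
import org.apache.uima.fit.factory.JCasFactory.createJCas
import org.apache.uima.jcas.JCas

// Placeholder annotator: a real pipeline would plug in the aphp segmenter or
// de-identification annotators here.
class NoOpAnnotator extends JCasAnnotator_ImplBase {
  override def process(jcas: JCas): Unit = {
    // add annotations to the CAS here
  }
}

object UimaFitWithoutXml {
  def main(args: Array[String]): Unit = {
    // The engine is assembled programmatically, so only the jar containing the
    // annotator classes has to be shipped to the Spark executors.
    val engine = createEngine(classOf[NoOpAnnotator])

    val jcas = createJCas()
    jcas.setDocumentText("Motif de consultation : douleur abdominale.")
    engine.process(jcas)

    println(s"annotations in CAS: ${jcas.getAnnotationIndex().size()}")
  }
}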

UIMA

When loading an existing pipeline from an XML descriptor into a uimaFIT pipeline, keep in mind:
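
The descriptor and the resources it references must be reachable wherever the engine is instantiated, which is presumably why the spark-submit commands in this README ship them with --files. A minimal loading sketch, assuming the descriptor file (here DictionaryAnnotator.xml, as in the NoteDeid commands below) sits in the working directory:

import org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineFromPath
import org.apache.uima.fit.factory.JCasFactory.createJCas

object XmlDescriptorPipeline {
  def main(args: Array[String]): Unit = {
    // With spark-submit --files, the descriptor shows up in the executor's
    // working directory under its plain file name.
    val engine = createEngineFromPath("DictionaryAnnotator.xml")

    val jcas = createJCas()
    jcas.setDocumentText("Patient admis pour fièvre.")
    engine.process(jcas)
  }
}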

General Notes

Performance considerations

  1. config 1: classic uimaFIT, 1 core
  2. config 2: classic uimaFIT, 2 cores (two parallel runs, each over half the dataset)
  3. config 3: spark, 1 slave / 2 cores
  4. config 4: spark, 1 slave / 4 cores

Apparently, running separate uimaFIT instances is equivalent in performance to running them inside Spark. However, even though Spark adds an extra layer, it makes it possible to distribute the pipelines over multiple machines, in parallel, from a single command. It is then possible to scale easily from one to thousands of computers.

NoteDeid

How to run (standalone)

  1. Run the master: sbin/start-master.sh
  2. Run the slave: sbin/start-slave.sh spark://nps-HP-ProBook-430-G2:7077
  3. Submit the job: bin/spark-submit --files dictionary.xsd,DictionaryAnnotator.xml,RegExAnnotator.xml,dictionary.xml,dictionary2.xml --master spark://nps-HP-ProBook-430-G2:7077 natus/lib/logquery_2.11-0.1.0-SNAPSHOT.jar

How to run (yarn)

  1. push all the JAR, XML, and TXT files onto one of the cluster machines
  2. push all the TXT files onto HDFS (= input_path)
  3. `/usr/hdp/2.5.0.0-1245/spark2/bin/spark-submit --jars NoteDeid-1.0-SNAPSHOT-standalone.jar,uima-an-dictionary.jar --files DictionaryAnnotator.xml,RegExAnnotator.xml,dictionary.xml,dictionary2.xml --master yarn-client --num-executors 8 --driver-memory 512m --executor-memory 512m --executor-cores 1 logquery_2.11-0.1.0-SNAPSHOT.jar $input_path $output_path`
  4. it is crucial to use only one executor core: otherwise the CAS appears to be shared, which makes the job fail. With 1-core executors the pipelines still seem to run independently across multiple cores (paradoxically); see the per-partition sketch after this list
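
One common way to respect this constraint in Spark code (a sketch under assumptions, not necessarily what this project does) is to create one engine and one CAS per partition with mapPartitions, so nothing UIMA-related is ever shared between concurrent tasks. The descriptor name and the one-document-per-line input format below are placeholders.

import org.apache.spark.sql.SparkSession
import org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineFromPath
import org.apache.uima.fit.factory.JCasFactory.createJCas

object PerPartitionEngine {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("uima-per-partition").getOrCreate()
    val sc = spark.sparkContext

    // Placeholder input: one document text per line.
    val docs = sc.textFile(args(0))

    val results = docs.mapPartitions { texts =>
      // One engine and one JCas per partition: no CAS is shared between
      // concurrently running tasks, whatever the executor-core setting.
      val engine = createEngineFromPath("DictionaryAnnotator.xml")
      val jcas = createJCas()
      texts.map { text =>
        jcas.reset()                       // reuse the same CAS for each document
        jcas.setDocumentText(text)
        engine.process(jcas)
        jcas.getAnnotationIndex().size()   // placeholder result per document
      }
    }

    results.saveAsTextFile(args(1))
    spark.stop()
  }
}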

Run SectionSegmentation

RUN

NEEDS

INPUT

OUTPUT

HOW

TODO
