QualiMaster / qm-issues


Run pipelines on LUH cluster #62

Open · eichelbe opened 7 years ago

eichelbe commented 7 years ago

For testing and backup-plan:

(1) Run RandomPip on LUH cluster (SUH)
(2) Replace Storm with Adaptive Storm* and change the worker configuration for larger pipelines (SUH)
(3) Run RandomPip on LUH cluster again (SUH)
(4) Run TransferPip on LUH cluster (LUH)
(5) Run FocusPip on LUH cluster (LUH)
(6) For TSI: run Time Travel Pipeline for startup time debugging on LUH cluster (SUH + TSI)
(7) Run TransferPip + FocusPip on LUH cluster
(8) Test QM-IConf connection into LUH cluster (requires SSH tunneling)
(9) Test Application connection into LUH cluster (requires SSH tunneling)

Please document the process below! After (3) we will communicate how to work with the cluster on the project Wiki!

*required for adaptation and for identifying timing issues (an extension of Storm; can be further extended if needed)

eichelbe commented 7 years ago

Cui is working on (1)

eichelbe commented 7 years ago

-> RandomPip is running; (1)+(2)+(3) are done.

smutahoang commented 7 years ago

@eichelbe @cuiqin That's great! Thanks for your effort. We look forward to the wiki page on how to run the pipelines on the LUH cluster so that we can proceed with the next steps.

eichelbe commented 7 years ago

;) It's also in our interest....

And here is the extended LUH cluster wiki page (the lower parts are mostly "dangerous"...)

ap0n commented 7 years ago

Great! So, how do we continue? Do we get accounts on the LUH cluster, or do we coordinate with SUH and test together (mainly interested in (6) :angel:)?

eichelbe commented 7 years ago

Not sure, this is LUH's decision. We had to sign some papers saying that we lose all our money if we download their data ;) Just kidding. I think we will take your pipeline, run it there for you, and report on the issues that we see.

cuiqin commented 7 years ago

Adaptive switch over RandomPip is working.

eichelbe commented 7 years ago

As a follow-up - LUH infrastructure is now working with coordination.commandCompletion.onEvent = true. Will become active with next restart.
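(For reference, a minimal sketch of the setting as it could appear in the infrastructure configuration file; the file name qm.infrastructure.cfg follows from later comments in this thread:)

```
# enable event-based command completion in the coordination layer
coordination.commandCompletion.onEvent = true
```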

eichelbe commented 7 years ago

For (4)+(5) the financial data source may need an adjustment to read from the file system rather than HDFS (not available in the LUH cluster). Either there is a way to convince Miroslav, or we have to modify the TSI source. Initial discussions with Christoph regarding the Twitter source (which works with the file system) are ongoing to synchronize the work and the approach...

ap0n commented 7 years ago

But can't the Okeanos HDFS also be used? (It will be slower because of the internet, but it could be an alternative, at least for testing...)

L3SQualimaster commented 7 years ago

Hi Holger,

```
[storm@master02 ~]$ hadoop fs -put ./test.hdfs /data/storm/
[storm@master02 ~]$ hadoop fs -ls /data/storm/
Found 1 items
-rw-r--r--   3 storm users   5 2017-01-19 13:57 /data/storm/test.hdfs
[storm@master02 ~]$ hadoop fs -cat /data/storm/test.hdfs
test
```

Cheers, miroslav

> On Thu, 19 Jan 2017 03:12:32 -0800 Holger Eichelberger notifications@github.com wrote:
>
> For (4)+(5) the financial data source may need an adjustment to read from the file system rather than HDFS (not available in the LUH cluster). Either there is a way to convince Miroslav or we have to modify the TSI source. Initial discussions with Christoph regarding the Twitter source (works with the file system) are ongoing to synchronize the work and the approach...

-- Dr. Miroslav Shaltev

eichelbe commented 7 years ago

Cool. From our meeting I had in mind that we do not have access to HDFS. Then let's keep things simple: put the data under /data/storm in HDFS and configure the infrastructure accordingly. I will tell TSI...

ap0n commented 7 years ago

Just to let you know, today I spotted another attack on our HDFS; this time a data ransom was included! According to this, these attacks are quite widespread.

eichelbe commented 7 years ago

Too bad. As the LUH cluster is behind an external server, the risk there should be lower.

@L3SQualimaster any news on the FocusPip?

eichelbe commented 7 years ago

The infrastructure setup has been changed to use the LUH HDFS with base path /data/storm as suggested by Miroslav (thanks again). The respective setup information is passed to the workers by the infrastructure via the DML configuration class (getHdfsUrl(), getHdfsPath() as postfix to the HDFS URL, getSimulationLocalPath(), useSimulationHdfs(), as well as getDfsPath()). An infrastructure update and a restart of the infrastructure are needed.

It's now up to the sources to take up this information and to copy the required data into HDFS. What is the time plan there?
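A minimal sketch of how a source could pick up these settings (assuming the getters are static on DataManagementConfiguration, matching the configure() snippet later in this thread; the import path of that class and the resolver class itself are hypothetical):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// hypothetical import; the actual package of the DML configuration class may differ
import eu.qualimaster.dataManagement.DataManagementConfiguration;

public class SourceDataResolver {

    /**
     * Opens a data file either from HDFS (base path /data/storm on the LUH cluster)
     * or from the node-local simulation directory, depending on the settings
     * distributed by the infrastructure.
     */
    public static InputStream open(String fileName) throws Exception {
        if (DataManagementConfiguration.useSimulationHdfs()) {
            // getHdfsPath() acts as a postfix to the HDFS URL, e.g. /data/storm
            FileSystem fs = FileSystem.get(
                URI.create(DataManagementConfiguration.getHdfsUrl()), new Configuration());
            return fs.open(new Path(DataManagementConfiguration.getHdfsPath(), fileName));
        }
        // fallback: read from the local simulation directory on the worker
        return new FileInputStream(
            new File(DataManagementConfiguration.getSimulationLocalPath(), fileName));
    }
}
```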

ap0n commented 7 years ago

SpringClientSimulator is modified to use these settings (committed about half an hour ago)

eichelbe commented 7 years ago

Fine :)

eichelbe commented 7 years ago

BTW, the HDFS user/group setup is not done, so the infrastructure may complain about that while extracting the setup files for Stefan. But as far as I know, Tuan is trying to find a way for the stakeholder applications to access the pipeline results; then we can also figure out whether extracting the setup files is still needed in the LUH setup.

eichelbe commented 7 years ago

... the installation needs a further classpath entry for HDFS. Discussing with Miroslav...
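(A hedged sketch of the kind of change meant here, assuming the service start script exports a CLASSPATH variable; `hadoop classpath` prints the locally installed client jars:)

```
# hypothetical: make the Hadoop client classes visible to the infrastructure JVM
export CLASSPATH="$CLASSPATH:$(hadoop classpath)"
```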

ChristophHubeL3S commented 7 years ago

Hi everyone, I got an error while trying to start the infrastructure on hadoop2:

```
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
    at eu.qualimaster.dataManagement.storage.hdfs.HdfsUtils.getFilesystem(HdfsUtils.java:97)
    at eu.qualimaster.dataManagement.storage.hdfs.HdfsUtils.getFilesystem(HdfsUtils.java:82)
    at eu.qualimaster.dataManagement.storage.hdfs.HdfsUtils.clearFolder(HdfsUtils.java:232)
    at eu.qualimaster.coordination.RepositoryConnector.readModels(RepositoryConnector.java:537)
    at eu.qualimaster.coordination.RepositoryConnector.initialize(RepositoryConnector.java:441)
    at eu.qualimaster.coordination.RepositoryConnector.<init>(RepositoryConnector.java:330)
    at eu.qualimaster.coordination.CoordinationManager.start(CoordinationManager.java:417)
    at eu.qualimaster.adaptation.platform.Main.startupPlatform(Main.java:72)
    at eu.qualimaster.adaptation.platform.Main.main(Main.java:117)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 9 more
```

(I tried to run ./main.sh)

eichelbe commented 7 years ago

That's exactly what my last post was about. Therefore, I need a change done by root. And please do not use main.sh on your cluster; there the infrastructure is installed as a service. I'll let you know as soon as the classpath entry is clarified with Miroslav (or you may call him directly ;))

eichelbe commented 7 years ago

Ok, fixed thanks to Miroslav. Updated the Wiki in this respect (main.sh).

Started PriorityPip. It comes up and reports:

```
java.lang.NullPointerException
    at eu.qualimaster.algorithms.imp.correlation.SpringClient.getSpringStream(SpringClient.java:49)
    at eu.qualimaster.PriorityPip.topology.PriorityPip_Source0Source.nextTu
```

Here is the full trace:

```
node14: java.lang.NullPointerException: null
node14:     at eu.qualimaster.algorithms.imp.correlation.SpringClient.getSpringStream(SpringClient.java:49) ~[stormjar.jar:na]
node14:     at eu.qualimaster.PriorityPip.topology.PriorityPip_Source0Source.nextTuple(PriorityPip_Source0Source.java:148) ~[stormjar.jar:na]
node14:     at backtype.storm.daemon.executor$fn__5886$fn__5901$fn__5930.invoke(executor.clj:759) ~[storm-core-0.9.5.jar:0.9.5]
node14:     at backtype.storm.util$async_loop$fn__565.invoke(util.clj:475) ~[storm-core-0.9.5.jar:0.9.5]
node14:     at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
node14:     at java.lang.Thread.run(Thread.java:745) [na:1.8.0_73]
```

hdfs.path and hdfs.url are set up. The simulation settings discussed in #59 are not yet. Are they needed?

ChristophHubeL3S commented 7 years ago

Thanks for fixing! The FocusPip starts now but immediately crashes because of the FinancialSource, which is not a surprise given that the financial data is not yet reachable.

Btw, I would not use the PriorityPip for testing since it is very old and I think nobody is maintaining it.

eichelbe commented 7 years ago

Ok (see trace above). How do we set up the financial simulation source? @ap0n

ChristophHubeL3S commented 7 years ago

Apostolos: we can add the financial dataset here: /local/home/storm/datasets

Then we just have to adapt the Source.

eichelbe commented 7 years ago

Or have them in HDFS, as it should be by now ;)

npav commented 7 years ago

Hi, this error has to do with the live data source, not the simulator. I am guessing that since everything is being executed on the LUH cluster, the source cannot find the Spring API accounts to use for requests.

Did you also move the "accounts.properties" file (located under "/var/nfs/" in the Okeanos cluster)?
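For illustration, a hedged sketch of why a missing accounts file can surface only later as a NullPointerException in the source (the file location comes from the comment above; the class and property handling are hypothetical):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class SpringAccountsLoader {

    /** Loads the Spring API accounts; returns empty Properties if the file is missing. */
    public static Properties load() {
        Properties accounts = new Properties();
        // "/var/nfs/accounts.properties" is the Okeanos location; paths on LUH differ
        try (FileInputStream in = new FileInputStream("/var/nfs/accounts.properties")) {
            accounts.load(in);
        } catch (IOException e) {
            // if the file is absent, getProperty(...) later returns null,
            // which can propagate into a NullPointerException in the source
        }
        return accounts;
    }
}
```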

ChristophHubeL3S commented 7 years ago

I didn't move anything; I created a new pipeline. If additional files are needed, we can move them.

eichelbe commented 7 years ago

I will check, because the paths on LUH are different. Anyway, it would be important to understand how to get the simulator running as well...

eichelbe commented 7 years ago

Ok, I copied the file from Okeanos to LUH and distributed it to the workers. This should not require a restart of the infrastructure, so it's worth giving it another try...

ChristophHubeL3S commented 7 years ago

I tried again. It seems to crash with the same exception.

cuiqin commented 7 years ago

Just for your information... some time ago I tried to run the PriorityPip with SpringClientSimulator on the Okeanos cluster, and I also had the same NullPointerException in the source.

npav commented 7 years ago

Holger's trace has to do with SpringClient, not SpringClientSimulator.

Can I see the logs somewhere? e.g. a link like: http://snf-618454.vm.okeanos.grnet.gr:8000/log?file=worker-6703.log

cuiqin commented 7 years ago

Okay, I see. If it's SpringClient, then please just forget it. I did not save the link. But it was some time ago. If I test again, I will keep an eye on it.

ChristophHubeL3S commented 7 years ago

When I try to access the logs, I get a server not found error: http://node15.ib:8000/log?file=worker-6700.log

eichelbe commented 7 years ago

I think Christoph can replace the node name by its IP (10.10.1.15) and get the logs (or ask Miroslav about the name resolution). We probably cannot define overlapping SSH tunnels for all nodes and have to use the console. And TSI can only rely on fragments of the logs sent by one of us, I fear...

ChristophHubeL3S commented 7 years ago

When calling http://10.10.1.15:8000/log?file=worker-6700.log the connection times out.

npav commented 7 years ago

So, not even Storm UI can be accessed from the "outside world"?

ChristophHubeL3S commented 7 years ago

I can access the UI, just not the logs.

eichelbe commented 7 years ago

No, the cluster is really internal. Not so easy to influence pipelines/the infrastructure... (you know what I mean). You can see the Storm UI only via an SSH tunnel, and then accessing the workers becomes tricky as far as we know, but there is always a console ;)
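(For illustration, a hypothetical tunnel; user@luh-gateway and nimbus-host are placeholders, and the ports follow the Storm UI default and the logviewer URLs above:)

```
# forward the Storm UI (8080) and one worker's logviewer (8000) through the gateway
ssh -L 8080:nimbus-host:8080 -L 8000:10.10.1.15:8000 user@luh-gateway
```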

eichelbe commented 7 years ago

And for the timeout... I think it's best to ask Miroslav for the internal LUH connections...

npav commented 7 years ago

Could you send me the full log file related to the NullPointerException in SpringClient so I can check?

ap0n commented 7 years ago

Hi all! In order to use the spring client simulator, the following settings must be in the /var/nfs/qm/qm.infrastructure.cfg file:

```
simulation.useHdfs = true
# use the appropriate URL for the LUH cluster
hdfs.url = hdfs://snf-618466.vm.okeanos.grnet.gr:8020
# use the appropriate path for the LUH cluster
hdfs.path = /user/storm/
```

Then the dataset files should be under hdfs.path. For the above example: `/user/storm/Symbollist.txt` and `/user/storm/data.txt`.
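(A possible way to stage these files, following Miroslav's hadoop fs example earlier in this thread:)

```
hadoop fs -put ./Symbollist.txt /user/storm/
hadoop fs -put ./data.txt /user/storm/
```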

Edit: The exception that Cui got with the spring client simulator was related to the HDFS hack (no data were available to read).

ap0n commented 7 years ago

Hmm... On second thought, does the LUH cluster have a /var/nfs/qm/ directory? If not, where is the qm.infrastructure.cfg file?

SpringClientSimulator loads the DataManagementConfiguration from that file...

```java
static {
    DataManagementConfiguration.configure(new File("/var/nfs/qm/qm.infrastructure.cfg"));
}
```

eichelbe commented 7 years ago

Hi, does this need any change to the configuration model or can we just apply it to the pipelines as they are?

eichelbe commented 7 years ago

No, the LUH cluster does not have NFS at all, but this is (so far) not a problem. The infrastructure.cfg file is just on Nimbus; the respective settings are distributed by the infrastructure to the workers. The data files are distributed explicitly before running a pipeline.

Hmm, is the clientSimulator a program or a Storm component? Where shall it be executed?

ap0n commented 7 years ago

The change to the simulator was committed yesterday. If you generated the pipelines earlier than that, you have to generate them again.

SpringClientSimulator is the algorithm of the SimulatedFinancialData source

ChristophHubeL3S commented 7 years ago

I generated the FocusPip today, so that should be no problem.

eichelbe commented 7 years ago

So we have to change the configuration to get the simulation working... :|

npav commented 7 years ago

Let's recap, because we are confused as to what exactly the problems are.

@eichelbe You tried to execute the PriorityPip, using SpringClient (which is the live and not the simulated/replay data source), and it threw a NullPointerException, right? Was the pipeline freshly generated, or was it an old build?

@ChristophHubeL3S You generated the FocusPip, but we are still unsure whether you used the live data source (FocusFinancialData) or the simulated/replay data source (SimulatedFocusFinancialData). Could you clear this up? Also, what error are you getting?

Finally, if you could send us more descriptive logs (e.g. worker log files) about the errors that occur, we can check to see what possible problems exist.