QualiMaster / qm-issues


Run pipelines on LUH cluster #62

Open eichelbe opened 7 years ago

eichelbe commented 7 years ago

For testing and backup plan:

(1) Run RandomPip on LUH cluster (SUH)
(2) Replace Storm by Adaptive Storm* and change the worker configuration for larger pipelines (SUH)
(3) Run RandomPip on LUH cluster again (SUH)
(4) Run TransferPip on LUH cluster (LUH)
(5) Run FocusPip on LUH cluster (LUH)
(6) For TSI: Run Time Travel Pipeline for startup time debugging on LUH cluster (SUH + TSI)
(7) Run TransferPip + FocusPip on LUH cluster
(8) Test QM-IConf connection into LUH cluster (requires SSH tunneling)
(9) Test Application connection into LUH cluster (requires SSH tunneling)

Please document the process below! After (3) we will communicate how to work with the cluster on the project Wiki!

*Required for adaptation and for identifying timing issues (extension over Storm, can be further extended if needed)

eichelbe commented 7 years ago

PrioPip: SpringClient, freshly generated today by Jenkins as the symbolic link on the LUH cluster indicates.

ChristophHubeL3S commented 7 years ago

@npav: The configuration uses the default source (FocusFinancialData). I guess I will have to change to the simulated source?
This is the error that the Storm UI shows:

java.lang.NullPointerException
    at eu.qualimaster.algorithms.imp.correlation.SpringClient.getSpringStream(SpringClient.java:49)
    at eu.qualimaster.focus.FocusedSpringClient.getSpringStream(FocusedSpringClient.java:52)
    at eu.qualimaster.FocusPip.topology.PipelineVar_7_Source1Source.nextTuple(PipelineVar_7_Source1Source.java:142)
    at backtype.storm.daemon.executor$fn5886$fn5901$fn__5930.invoke(executor.clj:759)
    at backtype.storm.util$async_loop$fn__565.invoke(util.clj:475)
    at clojure.lang.AFn.run(AFn.java:24)
    at java.lang.Thread.run(Thread.java:745)

As mentioned above, unfortunately I cannot show the full log since I have no access. But I already contacted Miroslav about this.

npav commented 7 years ago

@ChristophHubeL3S Yes, please use the SimulatedFocusFinancialData if you want the simulator functionality.

We just generated the Focus pipeline both with SpringClient and SpringClientSimulator, deployed it on our cluster, and did not get any of the errors mentioned in the previous comments, so these errors do seem to be related to the different structure of the LUH cluster.

eichelbe commented 7 years ago

Ok, the accounts path is now also being transferred to the workers, and the configuration is initialized automatically anyway. If you do not need further settings, you may try removing DataManagementConfiguration.configure(new File("/var/nfs/qm/qm.infrastructure.cfg"));

npav commented 7 years ago

Is the PasswordStore automatically configured elsewhere? If I remember correctly, we only use this command in order for the PasswordStore to be properly configured so it can read the accounts.

Also, will it cause problems if we leave it there? Removing it would break compatibility with other clusters.

eichelbe commented 7 years ago

Yes, by default every starting QM pipeline node is configured via the Storm cfg, which is initialized by the QM coordination layer. This is how we pass your options and now also the accounts file path.

edit: Checked on LUH cluster... it's passed to the workers now.

Let's see. The only "problem" could be a strange log message, as strange as some log messages of storm ;)
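
For illustration, a minimal sketch of how such a value could be picked up on the worker side from the Storm configuration map. The class and the key name qm.accounts.path are hypothetical assumptions, not the actual infrastructure code:

import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

// Hypothetical sink bolt showing how an option transported via the Storm
// configuration could be read in prepare().
public class AccountsAwareSink extends BaseRichBolt {

    // Assumed key name; the real key used by the coordination layer may differ.
    private static final String ACCOUNTS_PATH_KEY = "qm.accounts.path";

    private String accountsPath;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        Object value = stormConf.get(ACCOUNTS_PATH_KEY);
        // Fall back to the NFS default if the option was not transported.
        accountsPath = value != null ? value.toString() : "/var/nfs/qm/accounts";
    }

    @Override
    public void execute(Tuple input) {
        // actual sink logic would go here
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // no output fields for a sink
    }
}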

eichelbe commented 7 years ago

Ok, it seems that we are missing the external service (testing the financial part of the PrioPip)... but as far as I know this goes to Tuan ;) But it seems that we need @npav / @ap0n to get rid of another fixed path ;)

node14: 2017-01-24T21:09:18.782+0100 STDIO [ERROR] java.io.FileNotFoundException: /var/nfs/qm/tsi/external-service.properties (No such file or directory)
node14: 2017-01-24T21:09:18.783+0100 STDIO [ERROR] at java.io.FileInputStream.open0(Native Method)
node14: 2017-01-24T21:09:18.783+0100 STDIO [ERROR] at java.io.FileInputStream.open(FileInputStream.java:195)
node14: 2017-01-24T21:09:18.783+0100 STDIO [ERROR] at java.io.FileInputStream.<init>(FileInputStream.java:138)
node14: 2017-01-24T21:09:18.783+0100 STDIO [ERROR] at java.io.FileInputStream.<init>(FileInputStream.java:93)
node14: 2017-01-24T21:09:18.783+0100 STDIO [ERROR] at eu.qualimaster.algorithms.imp.correlation.PriorityDataSinkForFinancialAndTwitter.readPropertiesFile(PriorityDataSinkForFinancialAndTwitter.java:52)
node14: 2017-01-24T21:09:18.783+0100 STDIO [ERROR] at eu.qualimaster.algorithms.imp.correlation.PriorityDataSinkForFinancialAndTwitter.<init>(PriorityDataSinkForFinancialAndTwitter.java:44)
node14: 2017-01-24T21:09:18.783+0100 STDIO [ERROR] at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
node14: 2017-01-24T21:09:18.783+0100 STDIO [ERROR] at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
node14: 2017-01-24T21:09:18.783+0100 STDIO [ERROR] at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
node14: 2017-01-24T21:09:18.783+0100 STDIO [ERROR] at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
node14: 2017-01-24T21:09:18.784+0100 STDIO [ERROR] at java.lang.Class.newInstance(Class.java:442)
node14: 2017-01-24T21:09:18.784+0100 STDIO [ERROR] at eu.qualimaster.dataManagement.common.AbstractDataManager.createFallback(AbstractDataManager.java:107)
node14: 2017-01-24T21:09:18.784+0100 STDIO [ERROR] at eu.qualimaster.dataManagement.common.AbstractDataManager.create(AbstractDataManager.java:137)
node14: 2017-01-24T21:09:18.784+0100 STDIO [ERROR] at eu.qualimaster.dataManagement.DataManager$DataSinkManager.createDataSink(DataManager.java:218)
node14: 2017-01-24T21:09:18.784+0100 STDIO [ERROR] at eu.qualimaster.PriorityPip.topology.PriorityPip_Sink0Sink.prepare(PriorityPip_Sink0Sink.java:89)
node14: 2017-01-24T21:09:18.784+0100 STDIO [ERROR] at backtype.storm.daemon.executor$fn__5958$fn__5970.invoke(executor.clj:889)
node14: 2017-01-24T21:09:18.784+0100 STDIO [ERROR] at backtype.storm.util$async_loop$fn__565.invoke(util.clj:473)
node14: 2017-01-24T21:09:18.784+0100 STDIO [ERROR] at clojure.lang.AFn.run(AFn.java:24)
node14: 2017-01-24T21:09:18.784+0100 STDIO [ERROR] at java.lang.Thread.run(Thread.java:745)

And this is potentially just a followup....

node14: java.lang.NullPointerException: null
node14: at eu.qualimaster.algorithms.imp.correlation.PriorityDataSinkForFinancialAndTwitter.disconnect(PriorityDataSinkForFinancialAndTwitter.java:155) ~[stormjar.jar:na]
node14: at eu.qualimaster.PriorityPip.topology.PriorityPip_Sink0Sink.cleanup(PriorityPip_Sink0Sink.java:118) ~[stormjar.jar:na]
...

ap0n commented 7 years ago

If you can add a DML configuration option for the external-service.properties path, we can use it at the sinks. However, if the file is not found, the exception is printed but the sink falls back to the default settings (which point to the external service that runs on the Softnet cluster...). The followup exception seems a bit odd...
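
For reference, a minimal sketch of that fallback behaviour; the class, the property names and the default values below are assumptions for illustration, not the actual sink code:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

// Sketch: read external-service.properties if present, keep the defaults
// otherwise, and never let a missing file kill the worker.
public class ExternalServiceSettings {

    private String host = "softnet.example.org"; // assumed default
    private int port = 8888;                     // assumed default

    public ExternalServiceSettings(String path) {
        File file = new File(path);
        if (file.exists()) {
            Properties props = new Properties();
            try (FileInputStream in = new FileInputStream(file)) {
                props.load(in);
                host = props.getProperty("serviceHost", host);
                port = Integer.parseInt(props.getProperty("servicePort", String.valueOf(port)));
            } catch (IOException | NumberFormatException e) {
                // keep the defaults; just report the problem
                e.printStackTrace();
            }
        }
    }

    public String getHost() { return host; }
    public int getPort() { return port; }
}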

antoine-tran commented 7 years ago

Hi, just seen the discussion now. I'm not sure when it was decided that the Spring API connection configuration goes to me :), but I have an idea: We probably need to change the TCP connection in DataConnector to a TCP tunnel (we might add some third-party tool here).

I'm going to investigate this issue a bit now and report my findings soon.

npav commented 7 years ago

Indeed, we need to change the path, but it does not cause any execution problems as it is now. We have set default values, and when the file is not found, the default values stay unchanged. The exception is simply caught and the trace printed (so it does not lead to the worker dying or anything like that); this way we can detect the case where we forgot to include the file when setting up a new cluster.

As for the log file you sent us, the other exception seems quite weird. eu.qualimaster.algorithms.imp.correlation.SpringClientSimulator.getSpringStream(SpringClientSimulator.java:170)

SpringClientSimulator, line 170 is simply a logger statement, and a NullPointerException there means that the logger was not initialized.

Edit: Scratch that. Apostolos added some logging statements in the code, so I was looking at the wrong line. Still, the error seems to be related to the BufferedReader not being initialized (as if the "connect" method had not been called).
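
Just to illustrate the suspicion, a small hypothetical guard (field and method names assumed, not the actual SpringClientSimulator code) that would turn a missing connect() into a clear error instead of a NullPointerException deep inside the source:

import java.io.BufferedReader;
import java.io.IOException;

// Hypothetical excerpt: fail fast with a clear message if the stream is
// requested before the connection has been established.
public class SpringStreamGuardExample {

    private BufferedReader reader; // initialized by connect()

    public String getSpringStream() throws IOException {
        if (reader == null) {
            throw new IllegalStateException("getSpringStream() called before connect()");
        }
        return reader.readLine();
    }
}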

ap0n commented 7 years ago

Hm.. of course, if the LUH cluster's nodes can't communicate with the Softnet cluster, the followup exception is justified. Is that the case?

antoine-tran commented 7 years ago

I saw that we now have two DataConnectors (package eu.qualimaster.data.SpringConnector and eu.qualimaster.algorithms.imp.correlation.spring). Which one should I look at? And are they available in some GitHub/SVN repos or only inside Maven artifacts?

ChristophHubeL3S commented 7 years ago

FYI: I tried running the FocusPip again with SimulatedFocusFinancialData and it failed with the same exception:

java.lang.NullPointerException
    at eu.qualimaster.algorithms.imp.correlation.SpringClientSimulator.getSpringStream(SpringClientSimulator.java:180)
    at eu.qualimaster.focus.FocusedSpringClientSimulator.getSpringStream(FocusedSpringClientSimulator.java:61)
    at eu.qualimaster.FocusPip.topology.PipelineVar_7_Source1Source.nextTuple(PipelineVar_7_Source1Source.java:143)
    at backtype.storm.daemon.executor$fn5886$fn5901$fn__5930.invoke(executor.clj:759)
    at backtype.storm.util$async_loop$fn__565.invoke(util.clj:475)
    at clojure.lang.AFn.run(AFn.java:24)
    at java.lang.Thread.run(Thread.java:745)

ap0n commented 7 years ago

SpringClientSimulator uses the second one (eu.qualimaster.algorithms.imp.correlation.spring). I'm not sure what the other is for...

ap0n commented 7 years ago

@ChristophHubeL3S We do need some logs to see whether the configuration is passed correctly to the component. Did you generate the pipeline today? I added some log entries to the SpringClientSimulator about half an hour ago...

ChristophHubeL3S commented 7 years ago

I generated it around one hour ago, so the changes are probably not included yet. I can try again. Still struggling to access the logs.

eichelbe commented 7 years ago

Hi Tuan, probably an external tool like plink on the client could also do the job without changes to the code (SSH port forwarding). At least it helps us to see your Storm UI through zerberus ;)
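
If a programmatic variant were ever needed, a rough sketch of the same local port forwarding with the JSch library could look like the following; all host names, ports and credentials are placeholders:

import com.jcraft.jsch.JSch;
import com.jcraft.jsch.JSchException;
import com.jcraft.jsch.Session;

// Sketch: local SSH port forwarding, roughly equivalent to
// "ssh -L 18080:storm-ui-host:8080 user@gateway.example.org".
public class SshTunnelExample {

    public static void main(String[] args) throws JSchException {
        JSch jsch = new JSch();
        Session session = jsch.getSession("user", "gateway.example.org", 22);
        session.setPassword("secret");                    // or a key via jsch.addIdentity(...)
        session.setConfig("StrictHostKeyChecking", "no"); // demo only
        session.connect();
        // Forward local port 18080 to the Storm UI behind the gateway.
        session.setPortForwardingL(18080, "storm-ui-host", 8080);
        System.out.println("Tunnel up: http://localhost:18080");
    }
}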

eichelbe commented 7 years ago

Ok, externalService.path is committed and being built by Jenkins. If not given, it defaults to the DFS path. As with the simulator settings, the path to the external service files is transported by the infrastructure to the workers...

antoine-tran commented 7 years ago

@All: I just sent the email with the tested tunnel, to avoid public exposure here :). Please check and test your code!

ChristophHubeL3S commented 7 years ago

@ap0n: I tried with a newly generated pipeline. Got the same error.

ap0n commented 7 years ago

@eichelbe I just committed the changes for the priority sink and moving on to the other sinks... I used

static {
    DataManagementConfiguration.configure(new File("/var/nfs/qm/qm.infrastructure.cfg"));
}

in order to remain compatible with the rest of the clusters...
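
A possible refinement (just a sketch, assuming java.io.File and DataManagementConfiguration are imported as in the snippet above) would be to guard the call so it becomes a no-op on clusters where the file does not exist:

static {
    // Only apply the NFS-based configuration where the file actually exists;
    // on clusters where the infrastructure already configures the workers,
    // this block then does nothing.
    File cfg = new File("/var/nfs/qm/qm.infrastructure.cfg");
    if (cfg.exists()) {
        DataManagementConfiguration.configure(cfg);
    }
}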

ap0n commented 7 years ago

@ChristophHubeL3S The new commit only had more logs :smile:

antoine-tran commented 7 years ago

@ap0n Did you also commit the changes to the DataConnector? I still do not see any requests to the tunneling server yet.

ChristophHubeL3S commented 7 years ago

@ap0n: I should have read your post more carefully :D

Just tell me when I should run another test.

npav commented 7 years ago

@antoine-tran I committed the changes to the DataConnector (also just replied to your e-mail). Yet, if this works, we still need to change it again, since it will not work for other clusters. We need to make these fields configurable as well, but first, let's just see if it works as intended.

eichelbe commented 7 years ago

PriorityPip is running on LUH cluster with binary hotfixes. Discussing with Apostolos, Nick and Christoph how to fix the issues...

eichelbe commented 7 years ago

Thanks to Nick and Apostolos, Jenkins is building the pipelines with the most recent changes.

eichelbe commented 7 years ago

TimeTravelPip is operating on the LUH cluster...

eichelbe commented 7 years ago

.. but not completely. It seems that we have problems with the Zookeeper timeouts (40s) due to the long startup. Will try to increase them this evening.
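
For the record, the cluster-wide values go into storm.yaml; a per-topology override is also possible via Storm's Config (a sketch only, the concrete values are examples and the keys that actually need raising may differ):

import backtype.storm.Config;

// Sketch: raising the Zookeeper session/connection timeouts for a single
// topology submission; values are examples only.
public class TimeoutConfigExample {

    public static Config buildConfig() {
        Config conf = new Config();
        conf.put(Config.STORM_ZOOKEEPER_SESSION_TIMEOUT, 120000);   // ms
        conf.put(Config.STORM_ZOOKEEPER_CONNECTION_TIMEOUT, 60000); // ms
        return conf;
    }
}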

eichelbe commented 7 years ago

Ok, I gave the TimeTravelPip another try after changing the Zookeeper configuration and restarting the cluster. Looks better, but it is not completely working. Found several exceptions, one for each of us. If it is as with the (f)PriorityPip, solving these issues may help - however, they can be followups of dying workers, so let's get rid of all that we can get rid of. Emails with exceptions will follow ;)

eichelbe commented 7 years ago

Current state - please correct/post if wrong:

Additional measure: Catching algorithm exceptions and turning to default mode in generated code (SUH): done

eichelbe commented 7 years ago

Waiting for HDFS fix to give pipeline another try...

antoine-tran commented 7 years ago

I changed the HBase configuration from our side and pushed to GitHub now. Waiting for Jenkins now...

eichelbe commented 7 years ago

i.e., changing constants in the code as far as I have seen? Now it may work on LUH (not tested so far; GitHub complains about too many accesses, so it will take a while), but not on TSI anymore?

eichelbe commented 7 years ago

Waiting for the Jenkins build for a next trial with the TTPip... Let's see whether the QM-Pip build is sufficient for that... or for new exception emails.

eichelbe commented 7 years ago

After the Jenkins build, the DML config will allow for the two basic HBase settings and transport them to the workers...

eichelbe commented 7 years ago

Gave it a trial... two new exceptions (missing constructors), fewer exceptions overall, and one because it tried to use hardware :o

eichelbe commented 7 years ago

Much better. No worker restarts within 6 min of runtime. No data calculated (Sunday?). Maybe running a version with simulated data could show whether it is working...

npav commented 7 years ago

That's true. Sundays have no live data, so you can give it a try with the simulator. Let me know if you need me to provide you with a data set that you can store in your HDFS and have the simulator configured to replay it from there.

eichelbe commented 7 years ago

Would the standard set for the PriorityPip be sufficient? If not, please create a data set for the simulator and place it on Okeanos or, better, at a publicly reachable URL (just for the time of the transfer). Let me know where it is and I will transfer it to LUH. Then we need Cui to change the pipeline configuration for the simulator...

cuiqin commented 7 years ago

Changed the source to the simulated data for the TTPipeline.

eichelbe commented 7 years ago

Thanks... waiting for the deployed pipeline.

npav commented 7 years ago

I am not sure which data set you are referring to as "standard", but since at the moment we do not care what kind of data is being fed to the pipeline, any data whose format agrees with that of the live data is fine to use.

There is a data set available on the Okeanos HDFS. The two files needed are under "/user/storm/" and they are called "data.txt" and "Symbollist.txt" respectively.
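
In case it helps with the transfer, a minimal sketch of copying the two files into another HDFS via the standard Hadoop FileSystem API; the namenode URI and the local paths are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: upload the simulator data set into HDFS.
public class UploadSimulatorData {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.org:8020"); // placeholder namenode
        FileSystem fs = FileSystem.get(conf);
        fs.copyFromLocalFile(new Path("/tmp/data.txt"), new Path("/user/storm/data.txt"));
        fs.copyFromLocalFile(new Path("/tmp/Symbollist.txt"), new Path("/user/storm/Symbollist.txt"));
        fs.close();
    }
}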

eichelbe commented 7 years ago

Just to mention, LUH cluster currently "believes" that there is no hardware...

eichelbe commented 7 years ago

@npav I meant content-wise, so just using that existing data set would be ok (@cuiqin I don't know whether it has specific properties or so...)

cuiqin commented 7 years ago

Just for your information, I changed the source configuration to "SimulatedFinancialData".

ChristophHubeL3S commented 7 years ago

I started the FocusPip on Okeanos. The pipeline starts without problems, but after the start nothing is happening: http://snf-618466.vm.okeanos.grnet.gr:8080/topology.html?id=FocusPip-85-1485786177 Is this because of the configuration changes?

btw: I used the SimulatedFocusFinancialData source.

Edit: It is trying to connect to the LUH cluster:

2017-01-30T16:24:49.845+0200 o.a.h.i.Client [INFO] Retrying connect to server: hadoop2.kbs.uni-hannover.de/130.75.152.117:8020. Already tried 0 time(s); maxRetries=45

What do I have to change?

eichelbe commented 7 years ago

An adjustment made the SimulatedFinancialData source more stable... not sure about the SimulatedFocusFinancialData source (probably still the old code)

npav commented 7 years ago

@eichelbe Actually, the SimulatedFocusFinancialData source uses the SimulatedFinancialData source (SpringClientSimulator) internally, so changes to that one will also be used by the focus one.

@ChristophHubeL3S In the focus pipeline, for the financial source, you need to add market players to the filter, else nothing gets emitted. As for the twitter source, have you re-uploaded your data set on HDFS after the data were deleted due to the hack? I would guess that it does not emit anything due to it not finding data.

npav commented 7 years ago

@ChristophHubeL3S In case you don't remember from the past, you can add market players using the following syntax: ./cli.sh setParam FocusPip SpringDataSource playerList addmarketPlayer/1,2,3,...

where 1,2,3,... are market player IDs and can be as many as you need (e.g. from 1 to 200 if you want to add the first 200 MPs).