epam / OSCI

Open Source Contributor Index
https://opensourceindex.io/
GNU General Public License v3.0

Unable to get basic example to run #120

Open theycallmeswift opened 2 years ago

theycallmeswift commented 2 years ago

Hey, folks --

I'm having trouble getting the provided basic example to run. Specifically, the failure I'm encountering is at the daily-osci-rankings stage. I have confirmed that I have a functioning local Hadoop installation. I'm running on an Ubuntu 20.04 LTS VPS with a fresh install.

I pulled the two most visible errors out of the log below (full log at the bottom of the issue). It's unclear to me whether they are related, though.

Any help pointing me in the right direction would be appreciated!

$ python3 osci-cli.py get-github-daily-push-events -d 2020-01-01
# success
$ python3 osci-cli.py process-github-daily-push-events -d 2020-01-01
# success
$ python3 osci-cli.py daily-osci-rankings -td 2020-01-02
# failure (see full log below)

# ...

[2022-03-22 18:11:11,850] [DEBUG] Answer received: !ysorg.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;\n at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200)\n at scala.Option.getOrElse(Option.scala:189)\n   at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200)\n at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)\n    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)\n   at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)\n    at scala.Option.getOrElse(Option.scala:189)\n   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)\n   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n  at java.lang.reflect.Method.invoke(Method.java:498)\n   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n at py4j.Gateway.invoke(Gateway.java:282)\n  at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n   at py4j.commands.CallCommand.execute(CallCommand.java:79)\n at py4j.GatewayConnection.run(GatewayConnection.java:238)\n at java.lang.Thread.run(Thread.java:748)\n
<osci.datalake.local.landing.LocalLandingArea object at 0x7fa5e8753f40> /data landing
<osci.datalake.local.staging.LocalStagingArea object at 0x7fa5e87609a0> /data staging
<osci.datalake.local.public.LocalPublicArea object at 0x7fa5e8760940> /data public
<osci.datalake.local.web.LocalWebArea object at 0x7fa5e8760a90> /web data

# ...

[2022-03-22 18:11:11,855] [DEBUG] Answer received: !yv
Traceback (most recent call last):
  File "osci-cli.py", line 93, in <module>
    cli(standalone_mode=False)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/ubuntu/OSCI/osci/actions/base.py", line 59, in execute
    return self._execute(**self._process_params(kwargs))
  File "/home/ubuntu/OSCI/osci/actions/process/generate_daily_osci_rankings.py", line 49, in _execute
    commits = osci_ranking_job.extract(to_date=to_day).cache()
  File "/home/ubuntu/OSCI/osci/jobs/base.py", line 44, in extract
    commits=Session().load_dataframe(paths=self._get_dataset_paths(to_date, from_date))
  File "/home/ubuntu/OSCI/osci/jobs/session.py", line 39, in load_dataframe
    return self.spark_session.read.load(paths, **options)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pyspark/sql/readwriter.py", line 182, in load
    return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pyspark/sql/utils.py", line 134, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from

Full Error Log:

```
[2022-03-22 18:11:05,996] [INFO] ENV: None
[2022-03-22 18:11:05,997] [DEBUG] Check config file for env local exists
[2022-03-22 18:11:05,997] [DEBUG] Read config from /home/ubuntu/OSCI/osci/config/files/local.yml
[2022-03-22 18:11:06,000] [INFO] Full config: {'meta': {'config_source': 'yaml'}, 'file_system': {'type': 'local', 'base_path': '/data'}, 'areas': {'landing': {'container': 'landing'}, 'staging': {'container': 'staging'}, 'public': {'container': 'public'}}, 'bq': {'project': '', 'secret': '{}'}, 'web': {'fs': 'local', 'base_path': '/web', 'account_name': '', 'account_key': '', 'container': 'data'}, 'github': {'token': ''}, 'company': {'default': 'EPAM'}}
[2022-03-22 18:11:06,000] [INFO] Configuration loaded for env: local
[2022-03-22 18:11:06,113] [INFO] Execute action `daily-osci-rankings`
[2022-03-22 18:11:06,113] [INFO] Action params `{'to_day': '2020-01-02'}`
[2022-03-22 18:11:06,115] [DEBUG] Loaded paths for (None 2020-01-02 00:00:00) []
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
... (py4j gateway DEBUG chatter elided) ...
22/03/22 18:11:10 WARN DataSource: All paths were ignored:
[2022-03-22 18:11:11,840] [DEBUG] Answer received: !ysorg.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
... (Java-side stack trace and Python traceback identical to the excerpts above) ...
pyspark.sql.utils.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
```
theycallmeswift commented 2 years ago

@cm-howard any thoughts on this? Alternatively, I'd appreciate anything you could do to point me in the right direction.

vlad-isayko commented 2 years ago

@theycallmeswift are there any files in the '/data' dir?

theycallmeswift commented 2 years ago

@vlad-isayko yep!

python3 osci-cli.py get-github-daily-push-events -d YYYY-MM-DD produces YYYY-MM-DD-[0-23].parquet files in /data/landing/github/events/push/YYYY/MM/DD/

and

python3 osci-cli.py process-github-daily-push-events -d YYYY-MM-DD produces COMPANY-YYYY-MM-DD.parquet files in /data/staging/github/raw-events/push/YYYY/MM/DD
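
For anyone following along, a quick way to confirm those outputs is to count the files under the default local base_path of /data (a sketch; paths per the layout above):

```python
# Count the parquet files produced by steps 1 and 2.
from pathlib import Path

for area in ('landing/github/events/push', 'staging/github/raw-events/push'):
    files = sorted(Path('/data', area).rglob('*.parquet'))
    print(f'{area}: {len(files)} parquet files')
```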

jerpelea commented 2 years ago

@theycallmeswift I have a similar error on Ubuntu 20.04. Did you manage to fix the error locally?

theycallmeswift commented 2 years ago

@jerpelea I did not, unfortunately. The docs need a serious overhaul from someone who knows the system better than I do!

vlad-isayko commented 2 years ago

@theycallmeswift @jerpelea Hello — the problem really is outdated and incomplete documentation. We will fix this in the coming days. I'll keep you posted.

jerpelea commented 2 years ago

@vlad-isayko can you share a quick update here before the documentation is updated?

vlad-isayko commented 2 years ago

At the moment, this is the correct sequence to run:

  1. python3 osci-cli.py get-github-daily-push-events -d 2020-01-01
  2. python3 osci-cli.py process-github-daily-push-events -d 2020-01-01
  3. python3 osci-cli.py daily-active-repositories -d 2020-01-01
  4. python3 osci-cli.py load-repositories -d 2020-01-01
  5. python3 osci-cli.py filter-unlicensed -d 2020-01-01
  6. python3 osci-cli.py daily-osci-rankings -td 2020-01-01
  7. python3 osci-cli.py get-change-report -d 2020-01-01

You can write to me if you have any problems.
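
For convenience, here is a small wrapper (just a sketch, not official tooling) that runs the seven commands above for a single date and stops at the first failure:

```python
# Run the seven osci-cli.py steps listed above for one date, in order.
import subprocess
import sys

DATE = '2020-01-01'
STEPS = [
    ['get-github-daily-push-events', '-d', DATE],
    ['process-github-daily-push-events', '-d', DATE],
    ['daily-active-repositories', '-d', DATE],
    ['load-repositories', '-d', DATE],
    ['filter-unlicensed', '-d', DATE],
    ['daily-osci-rankings', '-td', DATE],
    ['get-change-report', '-d', DATE],
]
for step in STEPS:
    print('>>> osci-cli.py', ' '.join(step))
    if subprocess.run([sys.executable, 'osci-cli.py', *step]).returncode != 0:
        sys.exit(f'Step {step[0]} failed')
```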

jerpelea commented 2 years ago

@vlad-isayko

Thanks for your quick answer

Everything behaved normally until step 6 (python3 osci-cli.py daily-osci-rankings -td 2020-01-01).

Attached is the log: log.log

I am running Ubuntu 20.04 with Python 3.8.

vlad-isayko commented 2 years ago

@jerpelea can you also share which versions of PySpark and Spark you have?

jerpelea commented 2 years ago

@vlad-isayko

Packages from .local/lib/python3.8/site-packages, installed by pip install -r requirements.txt:

```
aiohttp-3.8.1.dist-info aiosignal-1.2.0.dist-info async_timeout-4.0.2.dist-info
attrs-21.4.0.dist-info azure_common-1.1.25.dist-info azure_core-1.7.0.dist-info
azure_functions-1.3.0.dist-info azure_functions_durable-1.1.3.dist-info azure_nspkg-3.0.2.dist-info
azure_storage_blob-12.3.2.dist-info azure_storage_common-2.1.0.dist-info azure_storage_nspkg-3.1.0.dist-info
cachetools-4.2.4.dist-info charset_normalizer-2.0.12.dist-info click-7.1.2.dist-info
deepmerge-0.1.1.dist-info frozenlist-1.3.0.dist-info furl-2.1.3.dist-info
google_api_core-1.31.5.dist-info googleapis_common_protos-1.56.1.dist-info google_auth-1.35.0.dist-info
google_cloud_bigquery-1.25.0.dist-info google_cloud_core-1.7.2.dist-info google_resumable_media-0.5.1.dist-info
iniconfig-1.1.1.dist-info isodate-0.6.1.dist-info Jinja2-2.11.3.dist-info
MarkupSafe-2.0.1.dist-info more_itertools-8.13.0.dist-info msrest-0.6.21.dist-info
multidict-6.0.2.dist-info numpy-1.19.5.dist-info orderedmultidict-1.0.1.dist-info
packaging-21.3.dist-info pandas-1.0.3.dist-info pbr-5.9.0.dist-info
pip-22.1.2.dist-info pluggy-0.13.1.dist-info protobuf-4.21.1.dist-info
py-1.11.0.dist-info py4j-0.10.9.dist-info pyarrow-0.17.1.dist-info
pyasn1-0.4.8.dist-info pyasn1_modules-0.2.8.dist-info pypandoc-1.5.dist-info
pyparsing-3.0.9.dist-info pyspark-3.0.1.dist-info pytest-6.0.1.dist-info
python_dateutil-2.8.1.dist-info PyYAML-5.4.dist-info requests_oauthlib-1.3.1.dist-info
rsa-4.8.dist-info six-1.13.0.dist-info testresources-2.0.1.dist-info
toml-0.10.2.dist-info XlsxWriter-1.2.3.dist-info
```
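
For future readers, a quicker way to answer the version question from the same interpreter (a sketch; assumes the packages import cleanly):

```python
# Print the versions asked about above.
import pandas
import py4j
import pyarrow
import pyspark

for mod in (pyspark, py4j, pandas, pyarrow):
    print(mod.__name__, mod.__version__)
```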

vlad-isayko commented 2 years ago

@jerpelea maybe there is a problem with the parquet files. We need to check them.

jerpelea commented 2 years ago

@vlad-isayko what versions are you using? Do you have any suggestions on how to check the files?

vlad-isayko commented 2 years ago

@jerpelea we use the same libraries with the same versions. Can you share some of the files generated in the staging area?

jerpelea commented 2 years ago

@vlad-isayko thanks for your quick answer. Here is the file: repository-2021-01-01.zip

vlad-isayko commented 2 years ago

@jerpelea

Are there any files in /staging/github/events/push/2021/01/01/?

Before step 6 there should be files in that directory.

jerpelea commented 2 years ago

@vlad-isayko I have:

/landing/github/events/push/2021/01/01/
/staging/github/raw-events/push/2021/01/01/
/staging/github/repository/2021/01/

There is no /staging/github/events/push/2021/01/01/.

Thanks

vlad-isayko commented 2 years ago

@jerpelea

Can you rerun step 5 (python3 osci-cli.py filter-unlicensed -d 2020-01-01) and share the logs from this command?

I think there is some problem at this step.

jerpelea commented 2 years ago

@vlad-isayko attached are the log file and some result files:

filter-unlicensed.zip github.zip

Thanks

vlad-isayko commented 2 years ago

@jerpelea

OK, it's strange that the repository file in staging is empty... Does the file /landing/github/repository/2021/01/2021-01-01.csv exist? Can you share it?
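
A quick way to check whether the staged repository files really are empty (a sketch; directory layout assumed from the paths mentioned earlier in this thread):

```python
# Report a row count for every readable parquet file in the staged repository area.
from pathlib import Path

import pandas as pd

for p in Path('/data/staging/github/repository').rglob('*'):
    if p.is_file():
        try:
            print(p, len(pd.read_parquet(p)), 'rows')
        except Exception as exc:
            print(p, 'could not be read as parquet:', exc)
```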

jerpelea commented 2 years ago

2021-01-01.zip @vlad-isayko

vlad-isayko commented 2 years ago

@jerpelea

So the error occurred at step 4, when getting information about the repositories from the GitHub API.

I ran this step myself with your source file, and I will check the output.

In the meantime, could you check your config for a valid GitHub API token?

github:
  token: '394***************************************77'
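
A cheap way to verify the token before rerunning step 4 (a sketch — it reads the same local.yml shown in the log above and hits the rate_limit endpoint, which any valid token can call):

```python
# Expect HTTP 200 for a valid token, 401 for a bad or missing one.
import requests
import yaml

with open('osci/config/files/local.yml') as fh:
    token = yaml.safe_load(fh)['github']['token']

resp = requests.get('https://api.github.com/rate_limit',
                    headers={'Authorization': f'token {token}'})
print(resp.status_code, resp.json().get('rate'))
```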
jerpelea commented 2 years ago

@vlad-isayko thanks for pointing it out. I think the token setup is a missing step in the README. I added the token in local.yml and restarted step 4.

This is how the logs look now:

```
[2022-06-13 09:42:38,265] [INFO] Get repository MinCiencia/Datos-COVID19 information
[2022-06-13 09:42:38,265] [DEBUG] Make request to Github API method=GET, url=https://api.github.com/repos/MinCiencia/Datos-COVID19, kwargs={}
[2022-06-13 09:42:38,485] [DEBUG] https://api.github.com:443 "GET /repos/MinCiencia/Datos-COVID19 HTTP/1.1" 200 None
[2022-06-13 09:42:38,486] [DEBUG] Get response[200] from Github API method=GET, url=https://api.github.com/repos/MinCiencia/Datos-COVID19, kwargs={'headers': {'Authorization': 'token gxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxdo'}}
```

I will keep you updated on the progress. Thanks for the support!

jerpelea commented 2 years ago

@vlad-isayko new errors at step 6: daily-osci-rankings.zip

vlad-isayko commented 2 years ago

@jerpelea

Can you share the parquet files generated at this step?

jerpelea commented 2 years ago

@vlad-isayko

Sure! Here are the parquet files:

files.zip

vlad-isayko commented 2 years ago

@jerpelea

OK, there is a bug in how the pandas dataframe is saved in parquet format: a column whose values are all None is converted to Int32 when stored.

This case is quite rare, which is apparently why we did not catch the bug earlier.

We plan to fix it.

For now, you can re-save these files with the correct types.
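
The failure mode is easy to demonstrate in isolation (generic pandas/pyarrow behaviour, not OSCI code — note the on-disk type here shows up as 'null' rather than the Int32 reported above, depending on library versions):

```python
# A column that is entirely None loses its intended string type when written
# to parquet, so its schema no longer matches files where it holds strings.
import pandas as pd
import pyarrow.parquet as pq

df = pd.DataFrame({'language': [None, None], 'org_name': [None, None]})
df.to_parquet('/tmp/all_none.parquet', index=False)
print(pq.read_schema('/tmp/all_none.parquet'))  # 'language' is not string-typed
```

Spark then cannot merge that schema with the daily files where the same columns actually contain strings.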

jerpelea commented 2 years ago

@vlad-isayko how do I re-save them?

vlad-isayko commented 2 years ago

@jerpelea

You can run this simple script. Or you can share the files from /data/staging/github/events/push/, and I can do it for you:

import pandas as pd
from pathlib import Path

# Rewrite every staged push-events parquet file in place, forcing the two
# affected columns to str so an all-None column keeps a string type on disk.
for path in Path('/data/staging/github/events/push/').rglob('*.parquet'):
    pd.read_parquet(path).astype({'language': str, 'org_name': str}).to_parquet(path, index=False)
jerpelea commented 2 years ago

@vlad-isayko thanks for the fix!

It fixed the issue, and step 6 completed.