Open theycallmeswift opened 2 years ago
@cm-howard any thoughts on this? Alternatively, I would appreciate anything you could do to point me in the right direction.
@theycallmeswift are there any files in the '/data' dir?
@vlad-isayko yep!
python3 osci-cli.py get-github-daily-push-events -d YYYY-MM-DD
produces YYYY-MM-DD-[0-23].parquet files in /data/landing/github/events/push/YYYY/MM/DD/
and
python3 osci-cli.py process-github-daily-push-events -d YYYY-MM-DD
produces COMPANY-YYYY-MM-DD.parquet files in /data/staging/github/raw-events/push/YYYY/MM/DD
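A quick way to confirm that the first command produced all of its hourly files is sketched below. It assumes the layout described above, a base path of /data, and hour suffixes that are not zero-padded (0 to 23); the function names are hypothetical and not part of osci-cli.py.

```python
from datetime import date
from pathlib import Path


def expected_landing_files(day: date, base: str = "/data"):
    """Build the 24 hourly file paths expected for one day (assumed naming)."""
    folder = Path(base) / "landing/github/events/push" / f"{day:%Y/%m/%d}"
    # Assumption: the hour suffix is not zero-padded, i.e. -0.parquet .. -23.parquet
    return [folder / f"{day:%Y-%m-%d}-{hour}.parquet" for hour in range(24)]


def missing_landing_files(day: date, base: str = "/data"):
    """Return the names of expected hourly files that are not on disk."""
    return [p.name for p in expected_landing_files(day, base) if not p.is_file()]
```

Calling `missing_landing_files(date(2020, 1, 1))` returns an empty list when all 24 files are present.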
@theycallmeswift I have a similar error on Ubuntu 20.04. Did you manage to fix the error locally?
@jerpelea I did not unfortunately. The docs need a serious overhaul from someone who knows the system better than me!
@theycallmeswift @jerpelea Hello, the real problem is outdated and incomplete documentation. We will fix this in the coming days. I'll keep you posted.
@vlad-isayko can you share a quick update here before the documentation is updated?
At the moment, this is the current way to get started:
python3 osci-cli.py get-github-daily-push-events -d 2020-01-01
python3 osci-cli.py process-github-daily-push-events -d 2020-01-01
python3 osci-cli.py daily-active-repositories -d 2020-01-01
python3 osci-cli.py load-repositories -d 2020-01-01
python3 osci-cli.py filter-unlicensed -d 2020-01-01
python3 osci-cli.py daily-osci-rankings -td 2020-01-01
python3 osci-cli.py get-change-report -d 2020-01-01
You can write to me if you have any problems
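For anyone scripting the seven commands above, here is a minimal sketch that runs them in order and stops at the first failure. The step names and flags are copied from the list above (note that daily-osci-rankings takes -td rather than -d); the wrapper functions themselves are hypothetical and assume the script is run from the directory containing osci-cli.py.

```python
import subprocess

# (command name, date flag) in the order given in the thread
STEPS = [
    ("get-github-daily-push-events", "-d"),
    ("process-github-daily-push-events", "-d"),
    ("daily-active-repositories", "-d"),
    ("load-repositories", "-d"),
    ("filter-unlicensed", "-d"),
    ("daily-osci-rankings", "-td"),
    ("get-change-report", "-d"),
]


def build_commands(day: str, python: str = "python3", cli: str = "osci-cli.py"):
    """Return one argv list per pipeline step for the given YYYY-MM-DD date."""
    return [[python, cli, name, flag, day] for name, flag in STEPS]


def run_pipeline(day: str) -> int:
    """Run each step in order; return the number of the failing step, or 0."""
    for number, cmd in enumerate(build_commands(day), start=1):
        print(f"step {number}: {' '.join(cmd)}")
        if subprocess.run(cmd).returncode != 0:
            return number
    return 0
```

Knowing which step number failed makes it easier to report problems in the same terms used in this thread ("step 4", "step 6").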
@vlad-isayko
Thanks for your quick answer.
Everything behaved normally until step 6, python3 osci-cli.py daily-osci-rankings -td 2020-01-01
The log is attached: log.log
I am running Ubuntu 20.04 with Python 3.8.
@jerpelea can you also share which versions of pyspark and Spark you have?
@vlad-isayko
packages from .local/lib/python3.8/site-packages installed by pip install -r requirements.txt
aiohttp-3.8.1.dist-info aiosignal-1.2.0.dist-info async_timeout-4.0.2.dist-info attrs-21.4.0.dist-info azure_common-1.1.25.dist-info azure_core-1.7.0.dist-info azure_functions-1.3.0.dist-info azure_functions_durable-1.1.3.dist-info azure_nspkg-3.0.2.dist-info azure_storage_blob-12.3.2.dist-info azure_storage_common-2.1.0.dist-info azure_storage_nspkg-3.1.0.dist-info cachetools-4.2.4.dist-info charset_normalizer-2.0.12.dist-info click-7.1.2.dist-info deepmerge-0.1.1.dist-info frozenlist-1.3.0.dist-info furl-2.1.3.dist-info google_api_core-1.31.5.dist-info googleapis_common_protos-1.56.1.dist-info google_auth-1.35.0.dist-info google_cloud_bigquery-1.25.0.dist-info google_cloud_core-1.7.2.dist-info google_resumable_media-0.5.1.dist-info iniconfig-1.1.1.dist-info isodate-0.6.1.dist-info Jinja2-2.11.3.dist-info MarkupSafe-2.0.1.dist-info more_itertools-8.13.0.dist-info msrest-0.6.21.dist-info multidict-6.0.2.dist-info numpy-1.19.5.dist-info orderedmultidict-1.0.1.dist-info packaging-21.3.dist-info pandas-1.0.3.dist-info pbr-5.9.0.dist-info pip-22.1.2.dist-info pluggy-0.13.1.dist-info protobuf-4.21.1.dist-info py-1.11.0.dist-info py4j-0.10.9.dist-info pyarrow-0.17.1.dist-info pyasn1-0.4.8.dist-info pyasn1_modules-0.2.8.dist-info pypandoc-1.5.dist-info pyparsing-3.0.9.dist-info pyspark-3.0.1.dist-info pytest-6.0.1.dist-info python_dateutil-2.8.1.dist-info PyYAML-5.4.dist-info requests_oauthlib-1.3.1.dist-info rsa-4.8.dist-info six-1.13.0.dist-info testresources-2.0.1.dist-info toml-0.10.2.dist-info XlsxWriter-1.2.3.dist-inf
@jerpelea maybe there is a problem with a parquet file. We need to check it.
@vlad-isayko what version are you using? Do you have any suggestions on how to check it?
@jerpelea we use the same libraries with the same versions. Can you share some of the files that were generated in the staging area?
@vlad-isayko thanks for your quick answer. Here is the file: repository-2021-01-01.zip
@jerpelea
Are there any files in /staging/github/events/push/2021/01/01/?
Before step 6 there should be files in these directories:
/staging/github/raw-events/push/2021/01/01/
/staging/github/repository/2021/01/
/staging/github/events/push/2021/01/01/
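The check above can be sketched as a small script that reports which of the three directories are missing or empty. It assumes the base path is /data (as in the other paths in this thread); the function names are hypothetical.

```python
from pathlib import Path


def step6_input_dirs(year: int, month: int, day: int, base: str = "/data"):
    """Directories that should contain files before daily-osci-rankings runs."""
    ym = f"{year:04d}/{month:02d}"
    ymd = f"{ym}/{day:02d}"
    staging = Path(base) / "staging" / "github"
    return [
        staging / "raw-events" / "push" / ymd,
        staging / "repository" / ym,  # note: year/month only, no day component
        staging / "events" / "push" / ymd,
    ]


def missing_or_empty(dirs):
    """Return the directories that do not exist or contain no entries."""
    return [d for d in dirs if not d.is_dir() or not any(d.iterdir())]
```

An empty result from `missing_or_empty(step6_input_dirs(2021, 1, 1))` means all three inputs are in place. It does not verify that the files themselves are valid.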
@vlad-isayko I have /landing/github/events/push/2021/01/01/, /staging/github/raw-events/push/2021/01/01/ and /staging/github/repository/2021/01/
There is no /staging/github/events/push/2021/01/01/
Thanks
@jerpelea
Can you rerun step 5, python3 osci-cli.py filter-unlicensed -d 2020-01-01
and share the logs from this command?
I think there is some problem at this step.
@vlad-isayko attached are the log file and some result files
filter-unlicensed.zip github.zip
thanks
@jerpelea
Ok, it's strange that the repository file in staging is empty...
Does the file /landing/github/repository/2021/01/2021-01-01.csv exist?
Can you share it?
2021-01-01.zip @vlad-isayko
@jerpelea
So the error occurred at step 4, when getting information about the repositories from the GitHub API.
I ran this step myself with your source file and will check the output.
Could you check that your config contains a valid GitHub API token?
github:
  token: '394***************************************77'
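Since an empty token only showed up later as an empty staging file, a small pre-flight check may help. The sketch below assumes the config is a YAML file with a top-level `github` mapping containing `token`, as in the fragment above; the function names are hypothetical, and PyYAML (listed in requirements.txt) is imported lazily so the token check itself has no dependencies.

```python
def github_token_is_set(config: dict) -> bool:
    """True if config['github']['token'] is a non-empty string."""
    token = (config.get("github") or {}).get("token")
    return isinstance(token, str) and bool(token.strip())


def load_config(path: str) -> dict:
    """Load a YAML config file such as local.yml into a dict."""
    import yaml  # PyYAML; imported here so the check above works without it

    with open(path) as fh:
        return yaml.safe_load(fh) or {}
```

Running `github_token_is_set(load_config(path_to_local_yml))` before step 4 would flag a missing token early. It cannot tell whether the token is actually accepted by GitHub.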
@vlad-isayko thanks for pointing it out. I think that token setup is a missing step in the README. I added the token in local.yml and restarted step 4.
This is how the logs look now:
[2022-06-13 09:42:38,265] [INFO] Get repository MinCiencia/Datos-COVID19 information
[2022-06-13 09:42:38,265] [DEBUG] Make request to Github API method=GET, url=https://api.github.com/repos/MinCiencia/Datos-COVID19, kwargs={}
[2022-06-13 09:42:38,485] [DEBUG] https://api.github.com:443 "GET /repos/MinCiencia/Datos-COVID19 HTTP/1.1" 200 None
[2022-06-13 09:42:38,486] [DEBUG] Get response[200] from Github API method=GET, url=https://api.github.com/repos/MinCiencia/Datos-COVID19, kwargs={'headers': {'Authorization': 'token gxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxdo'}}
I will keep you updated on the progress. Thanks for the support.
@vlad-isayko new errors at step 6: daily-osci-rankings.zip
@jerpelea
Can you share these files:
/data/staging/github/events/push/2020/01/01/unity_technologies-2020-01-01.parquet
/data/staging/github/events/push/2020/01/01/secops_solutions-2020-01-01.parquet
/data/staging/github/events/push/2020/01/01/luxoft-2020-01-01.parquet
/data/staging/github/events/push/2020/01/01/lyft-2020-01-01.parquet
/data/staging/github/events/push/2020/01/01/cloudbees-2020-01-01.parquet
@jerpelea
Ok, there is a bug in saving a pandas dataframe in parquet format. A column in which all values are None is converted to Int32 when stored.
This case is quite rare, which is apparently why we did not catch the bug earlier.
We plan to fix this bug.
For now, you can resave these files with the correct column types.
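To see which files are affected before resaving anything, one option is to inspect each file's parquet schema. This is a sketch only: it assumes the affected columns are `language` and `org_name` and that a healthy file stores them as a string type. The function names are hypothetical; `pyarrow.parquet.read_schema` is the only library call used.

```python
from pathlib import Path

STRING_TYPES = ("string", "large_string")


def suspicious_columns(dtypes: dict, expected_string_cols=("language", "org_name")):
    """Return expected-string columns whose stored type is not a string type."""
    return [
        col for col in expected_string_cols
        if col in dtypes and dtypes[col] not in STRING_TYPES
    ]


def parquet_dtypes(path) -> dict:
    """Map column name -> stored type name, read from the file's schema only."""
    import pyarrow.parquet as pq  # imported lazily; only needed for real files

    schema = pq.read_schema(path)
    return dict(zip(schema.names, (str(t) for t in schema.types)))


def find_affected(root: str):
    """Yield (path, bad_columns) for every parquet file under root with a mismatch."""
    for path in Path(root).rglob("*.parquet"):
        bad = suspicious_columns(parquet_dtypes(path))
        if bad:
            yield path, bad
```

Reading only the schema avoids loading the data, so the scan stays cheap even over many files.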
@vlad-isayko how do I resave them?
@jerpelea
You can run this simple script, or you can share the files from /data/staging/github/events/push/ so I can do it for you.
import pandas as pd
from pathlib import Path

# Rewrite every staged push-event file in place, casting the affected columns to str
for path in Path('/data/staging/github/events/push/').rglob('*.parquet'):
    pd.read_parquet(path).astype({'language': str, 'org_name': str}).to_parquet(path, index=False)
@vlad-isayko thanks for the fix
It fixed the issue and step 6 completed
Hey, folks --
I'm having trouble getting the basic example provided to run. Specifically, the failure I'm encountering is at the daily-osci-rankings stage. I have confirmed that I have a functioning local version of Hadoop installed. I'm running on an Ubuntu 20.04 LTS VPS with a fresh install. I pulled the two most visible errors out of the log below (the full log is expandable at the bottom of the issue). It's unclear to me whether they are related, though.
Any help pointing me in the right direction would be appreciated!
Full Error Log:
``` [2022-03-22 18:11:05,996] [INFO] ENV: None [2022-03-22 18:11:05,997] [DEBUG] Check config file for env local exists [2022-03-22 18:11:05,997] [DEBUG] Read config from /home/ubuntu/OSCI/osci/config/files/local.yml [2022-03-22 18:11:06,000] [DEBUG] Prod yml load: {'meta': {'config_source': 'yaml'}, 'file_system': {'type': 'local', 'base_path': '/data'}, 'areas': {'landing': {'container': 'landing'}, 'staging': {'container': 'staging'}, 'public': {'container': 'public'}}, 'bq': {'project': '', 'secret': '{}'}, 'web': {'fs': 'local', 'base_path': '/web', 'account_name': '', 'account_key': '', 'container': 'data'}, 'github': {'token': ''}, 'company': {'default': 'EPAM'}} [2022-03-22 18:11:06,000] [DEBUG] Prod yml res: {'meta': {'config_source': 'yaml'}, 'file_system': {'type': 'local', 'base_path': '/data'}, 'areas': {'landing': {'container': 'landing'}, 'staging': {'container': 'staging'}, 'public': {'container': 'public'}}, 'bq': {'project': '', 'secret': '{}'}, 'web': {'fs': 'local', 'base_path': '/web', 'account_name': '', 'account_key': '', 'container': 'data'}, 'github': {'token': ''}, 'company': {'default': 'EPAM'}} [2022-03-22 18:11:06,000] [INFO] Full config: {'meta': {'config_source': 'yaml'}, 'file_system': {'type': 'local', 'base_path': '/data'}, 'areas': {'landing': {'container': 'landing'}, 'staging': {'container': 'staging'}, 'public': {'container': 'public'}}, 'bq': {'project': '', 'secret': '{}'}, 'web': {'fs': 'local', 'base_path': '/web', 'account_name': '', 'account_key': '', 'container': 'data'}, 'github': {'token': ''}, 'company': {'default': 'EPAM'}} [2022-03-22 18:11:06,000] [INFO] Configuration loaded for env: local [2022-03-22 18:11:06,000] [DEBUG] Create new