X-lab2017 / open-digger

Open source analysis tools
https://open-digger.cn
Apache License 2.0

[Data Export] shorten the oss uploading time in the beginning of every month #1306

tyn1998 opened this issue 1 year ago

tyn1998 commented 1 year ago

Description

Hi community,

Is it possible to shorten the OSS uploading time at the beginning of every month? Or could you choose a fixed day of the month and announce it as a due date by which all data exporting tasks are completed?

This is very important for downstream apps that consume OpenDigger's valuable data.

frank-zsy commented 1 year ago

@tyn1998 I think we had this discussion before, and the solution was to put the update time into the metadata of each repo, as in the file https://oss.x-lab.info/open_digger/github/X-lab2017/open-digger/meta.json. It contains a field called updatedAt, a timestamp indicating when the data was last updated; you can use that field to find out whether the data has been updated for the current month.
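
For illustration, a minimal TypeScript sketch of such a freshness check (assuming updatedAt is a millisecond Unix timestamp and a global fetch, i.e. Node 18+; adjust if the field turns out to use another format):

```ts
// Check whether OpenDigger data for a repo has been refreshed this month,
// using the updatedAt field of its meta.json as described above.
const META_URL =
  'https://oss.x-lab.info/open_digger/github/X-lab2017/open-digger/meta.json';

async function isUpdatedThisMonth(): Promise<boolean> {
  const res = await fetch(META_URL);
  if (!res.ok) throw new Error(`meta.json fetch failed: ${res.status}`);
  const meta = (await res.json()) as { updatedAt: number };
  // Assumption: updatedAt is a millisecond Unix timestamp.
  const updated = new Date(meta.updatedAt);
  const now = new Date();
  return (
    updated.getUTCFullYear() === now.getUTCFullYear() &&
    updated.getUTCMonth() === now.getUTCMonth()
  );
}

isUpdatedThisMonth().then((fresh) => console.log(fresh ? 'updated' : 'not yet'));
```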

tyn1998 commented 1 year ago

Hi @frank-zsy, thanks for your reply.

I know the meta.json files exist. What I actually mean in this issue is whether methods like parallel computing and uploading could be adopted to speed up the data exporting and uploading processes, so that hopefully all export tasks can be completed within 24 hours, or even within several hours.

I noticed that writeFileSync (the synchronous version of writeFile) is used in the cron tasks to write JSON files to the file system of your machine. Would it be faster if fs.writeFile were used instead, so that subsequent computing tasks don't need to wait for file writing?
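
To illustrate what I mean, a minimal sketch of the difference (file names and payload are made up):

```ts
import { writeFileSync } from 'fs';
import { writeFile } from 'fs/promises';

async function exportExample() {
  const data = JSON.stringify({ metric: 'openrank', values: [1, 2, 3] });

  // Synchronous: blocks the event loop until the write completes,
  // so the next computing task has to wait for the disk.
  writeFileSync('./sync_openrank.json', data);

  // Asynchronous: the write is handed off to libuv's thread pool,
  // so subsequent computation can run while the file is written.
  const pending = writeFile('./async_openrank.json', data);
  // ... subsequent computing tasks can run here ...
  await pending; // make sure the write finished before exiting
}

exportExample().catch(console.error);
```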

I also assume that after the cron tasks are executed, another set of scripts (not included in this repository) is run to upload the exported files to the Aliyun OSS. Could those upload scripts be improved to shorten the uploading time?

What is the bottleneck now? Computing or uploading?

frank-zsy commented 1 year ago

Understood, so I will elaborate on the tasks here; there are several steps in the data update process.

If we start all the tasks at 11 a.m. on the first day of a month: OpenRank data import, calculation, and export may take about 2 hours; then metrics computation and network export may take about 5 hours; and the data upload may take another 5-6 hours to complete.

So if we can make the whole process parallel and automated, it may take about 12-13 hours in total (roughly 2 + 5 + 6 hours), completing around midnight of the first day of the month.
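
As a sketch of the "parallel" part: independent tasks (for example, per-repo metric exports) could be run with bounded concurrency. The helper below is hypothetical, not OpenDigger's actual code:

```ts
// Hypothetical helper: run independent async tasks with bounded concurrency.
async function runWithConcurrency<T>(
  tasks: Array<() => Promise<T>>,
  limit: number,
): Promise<T[]> {
  const results: T[] = [];
  let next = 0;
  const worker = async () => {
    while (next < tasks.length) {
      const i = next++; // safe: JS is single-threaded between awaits
      results[i] = await tasks[i]();
    }
  };
  await Promise.all(
    Array.from({ length: Math.min(limit, tasks.length) }, worker),
  );
  return results;
}

// Usage sketch (repos and exportRepoMetrics are placeholders):
// await runWithConcurrency(repos.map((r) => () => exportRepoMetrics(r)), 8);
```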

But right now the process is not fully automated, so the data may not be updated until about the 2nd day of the month. For 2023.5, for example, the data was updated this morning.

tyn1998 commented 1 year ago

@frank-zsy Thanks for your detailed elaboration! This is the first time I have seen the complete steps for exporting the monthly data, and I am convinced that the tasks are indeed time consuming.

I recommend writing the steps mentioned above into src/cron/README.md so that more interested people can share the knowledge of how data is exported by OpenDigger every month :D

frank-zsy commented 1 year ago

Agreed, I will add the information to the README file. As for improving the performance, I think several things can be done. For the upload step, the current command is:

ossutilmac64 sync ~/github_data/open_digger/github oss://xlab-open-source/open_digger/github --force --job=1000 --meta "Expires:2023-07-01T22:00:00+08:00" --config-file=~/.ossutilconfig-xlab

This command uploads files with 1000 parallel jobs and sets an object meta, which makes the process a little longer than just uploading the files to OSS. So a bigger --job parameter, or deploying the task in the same VPC as the OSS, may reduce the time, but not by much I think, because the network payload is not very high right now; the per-file iteration process is probably what is time consuming.
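
If the file iteration really is the bottleneck, one untested idea is to shard the sync by top-level directory and run several ossutil processes side by side. A rough TypeScript sketch (the paths, shard size, and per-process --job value are all assumptions):

```ts
import { spawn } from 'child_process';
import { readdirSync } from 'fs';
import { homedir } from 'os';
import { join } from 'path';

// Hypothetical sketch: shard the upload by org directory so several
// ossutil sync processes iterate the file tree in parallel.
const ROOT = join(homedir(), 'github_data/open_digger/github');
const CONFIG = join(homedir(), '.ossutilconfig-xlab');

function syncDir(name: string): Promise<void> {
  return new Promise((resolve, reject) => {
    const child = spawn('ossutilmac64', [
      'sync',
      join(ROOT, name),
      `oss://xlab-open-source/open_digger/github/${name}`,
      '--force',
      '--job=200', // fewer jobs per process since several processes run at once
      `--config-file=${CONFIG}`,
    ], { stdio: 'inherit' });
    child.on('exit', (code) =>
      code === 0 ? resolve() : reject(new Error(`${name} exited with ${code}`)),
    );
  });
}

async function main() {
  const orgs = readdirSync(ROOT);
  // Run 4 shards at a time; tune to taste.
  for (let i = 0; i < orgs.length; i += 4) {
    await Promise.all(orgs.slice(i, i + 4).map(syncDir));
  }
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```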