X-lab2017 / open-digger

Open source analysis tools
https://open-digger.cn
Apache License 2.0
299 stars, 87 forks

[Feature] Handle data loss #1040

Closed frank-zsy closed 1 year ago

frank-zsy commented 2 years ago

Because GHArchive was offline for about half a month in 2021.10, all statistical data for 2021.10 are only about 50% of the values for 2021.9 and 2021.11.

So when we upload data to OSS, we can use a small trick to fix the problem. I think we can use something like 0.15 * Value_20218 + 0.35 * Value_20219 + 0.35 * Value_202211 + 0.15 * Value_202212 to estimate the actual value of 2021.10, and we should still keep a field 202110_original to store the real value.
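
A minimal sketch of this weighted estimate, assuming the four months meant are 2021-08, 2021-09, 2021-11 and 2021-12 (the 2022 indices in the formula are corrected as a typo later in this thread). The yyyy-mm key format, the function name and the sample values are hypothetical and only for illustration.

```python
# Weights from the formula above, keyed by month. The month keys are an
# assumption: the 2021 reading of the formula, not the literal 2022 typo.
WEIGHTS = {
    "2021-08": 0.15,
    "2021-09": 0.35,
    "2021-11": 0.35,
    "2021-12": 0.15,
}


def estimate_missing_month(monthly, weights=WEIGHTS):
    """Return the weighted average of the neighbouring months."""
    return sum(weight * monthly[month] for month, weight in weights.items())


# Hypothetical values for one metric; 2021-10 is roughly halved by the outage.
monthly = {"2021-08": 100, "2021-09": 120, "2021-10": 55, "2021-11": 110, "2021-12": 90}
estimate = estimate_missing_month(monthly)
```

The weights sum to 1, so the estimate stays on the same scale as the neighbouring months, and the two adjacent months count more than the two outer ones.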

open-digger-bot[bot] commented 2 years ago

This issue has not received a reply for 24 hours. Please pay attention to it: @gymgym1212 @xiaoya-yaya @xgdyp

frank-zsy commented 1 year ago

I will implement this recently. The new data will contain a 2021-10-raw field to store the original data, and the new 2021-10 value will be generated from 2021-08 to 2021-12 as above.

I think this will affect Hypercrx too. @tyn1998
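
A rough sketch of what that change could look like for one metric, assuming each metric is a flat mapping from yyyy-mm keys to numbers. The function name, the sample values and the choice not to round the estimate are assumptions, not the actual export code.

```python
def patch_month(monthly, target="2021-10"):
    """Return a copy where `target` is estimated and the original value
    is preserved under `<target>-raw`."""
    # Neighbouring months and weights taken from the formula in this issue.
    weights = {"2021-08": 0.15, "2021-09": 0.35, "2021-11": 0.35, "2021-12": 0.15}
    patched = dict(monthly)  # leave the caller's data untouched
    patched[f"{target}-raw"] = monthly[target]
    patched[target] = sum(w * monthly[m] for m, w in weights.items())
    return patched


# Hypothetical values for illustration only.
monthly = {"2021-08": 100, "2021-09": 120, "2021-10": 55, "2021-11": 110, "2021-12": 90}
patched = patch_month(monthly)
```

Any consumer that iterates over all keys of a metric will now see the extra 2021-10-raw entry, which is why downstream projects need to adapt first.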

frank-zsy commented 1 year ago

The data is ready now, I can upload data after Hypercrx fits the format, and may also support other fields like 2022, 2022-Q2 and all too.

tyn1998 commented 1 year ago

"recently" = 20 mins 🤣

I think this will affect Hypercrx too. @tyn1998

Yes.

The data is ready now, I can upload data after Hypercrx fits the format

The procedure is:

  1. code for data parsing will be adapted to process both old and new formats
  2. release a version to the two stores and wait until both approve the new version
  3. upload new data to OSS
  4. remove code that is specific to the old format (not a must)
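
Step 1 could be sketched roughly as follows: classify each key by shape, so that a parser accepts both the old format (plain yyyy-mm keys) and the new one (with yyyy-mm-raw, and possibly year, quarter and all keys). The patterns and function names are assumptions based on the key examples in this thread, not the actual Hypercrx code.

```python
import re

# Key shapes mentioned in this thread; the exact patterns are assumptions.
KEY_PATTERNS = [
    ("month_raw", re.compile(r"\d{4}-\d{2}-raw")),
    ("month", re.compile(r"\d{4}-\d{2}")),
    ("quarter", re.compile(r"\d{4}-Q[1-4]")),
    ("year", re.compile(r"\d{4}")),
    ("all", re.compile(r"all")),
]


def classify_key(key):
    """Return the kind of a metric key, or 'unknown' for unexpected shapes."""
    for kind, pattern in KEY_PATTERNS:
        if pattern.fullmatch(key):
            return kind
    return "unknown"


def monthly_series(metric):
    """Keep only plain yyyy-mm entries, sorted by month, e.g. for charting."""
    return sorted((k, v) for k, v in metric.items() if classify_key(k) == "month")


# Hypothetical metric in the new format.
metric = {"2021-09": 120, "2021-10": 109.0, "2021-10-raw": 55,
          "2021-Q4": 300, "2021": 1000, "all": 5000}
series = monthly_series(metric)
```

Because unknown keys are ignored rather than rejected, old data (which has only yyyy-mm keys) and new data go through the same code path.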

, may also support other fields like 2022, 2022-Q2 and all too.

Will these new data fields appear in the existing metrics, or will they only be added to the upcoming new metrics? In the second case, I think we can do this when we actually present them.

tyn1998 commented 1 year ago

I think @zhicheng-ning should also be informed, since the data service that supports several DataEase screens is also a consumer of OpenDigger data.

frank-zsy commented 1 year ago

Will these new data fields appear in the existing metrics (https://github.com/hypertrons/hypertrons-crx/issues/515#issue-1444862182), or will they only be added to the upcoming new metrics? In the second case, I think we can do this when we actually present them.

I am not quite sure about this one. For statistical metrics, quarterly and yearly data can be calculated from monthly data, so we only need to add new data fields to the metrics that cannot simply be summed. Actually there is a metric that fits the rule, which is participants.

I think @zhicheng-ning should also be informed, since the data service that supports several DataEase screens is also a consumer of OpenDigger data.

Yes, @zhicheng-ning do you think this may affect DataEase dashboards?
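
To illustrate the distinction, here is a small sketch with made-up data. It assumes that participants counts distinct accounts, which would explain why its monthly values cannot be added up; the metric name issue_comments, the account names and the helper function are hypothetical.

```python
from collections import defaultdict


def quarterly_totals(monthly):
    """Sum additive monthly values into yyyy-Qn buckets."""
    totals = defaultdict(int)
    for month, value in monthly.items():
        year, mm = month.split("-")
        quarter = (int(mm) - 1) // 3 + 1
        totals[f"{year}-Q{quarter}"] += value
    return dict(totals)


# Additive metric (hypothetical counts): the quarter is the sum of its months.
issue_comments = {"2022-04": 10, "2022-05": 20, "2022-06": 30}
comment_totals = quarterly_totals(issue_comments)

# A distinct count is not additive: the same account may be active in
# several months, so summing monthly counts overstates the quarter.
participants_by_month = {
    "2022-04": {"alice", "bob"},
    "2022-05": {"bob", "carol"},
    "2022-06": {"alice"},
}
sum_of_monthly_counts = sum(len(p) for p in participants_by_month.values())
distinct_in_quarter = len(set().union(*participants_by_month.values()))
```

Here the monthly counts add up to 5 while only 3 distinct accounts were active in the quarter, so a field like 2022-Q2 has to be computed at export time for such metrics.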

zhicheng-ning commented 1 year ago

0.15 * Value_20218 + 0.35 * Value_20219 + 0.35 * Value_202211 + 0.15 * Value_202212

Hi, I would like to know why 0.35 * Value_202211 + 0.15 * Value_202212 is used here, and not 0.35 * Value_202111 + 0.15 * Value_202112?

frank-zsy commented 1 year ago

Hi, I would like to know why 0.35 * Value_202211 + 0.15 * Value_202212 is used here, and not 0.35 * Value_202111 + 0.15 * Value_202112?

I cannot see what the difference is here.

zhicheng-ning commented 1 year ago

do you think this may affect DataEase dashboards?

Actually I'm working on od-api, which is a data transformation service. I think the change in the upstream data format has little impact on me, but as there will be more and more downstream projects in the future, I suggest keeping the upstream data format as stable as possible.

zhicheng-ning commented 1 year ago

I cannot see what the difference is here.

Value_202211 -> Value_202111

frank-zsy commented 1 year ago

@zhicheng-ning That is my mistake, a typo. I meant 202108 - 202112.

frank-zsy commented 1 year ago

Actually I'm working on od-api, which is a data transformation service. I think the change in the upstream data format has little impact on me, but as there will be more and more downstream projects in the future, I suggest keeping the upstream data format as stable as possible.

Agreed. I think we can make the API format contain as much data as we have, so that we will not need to change it later. As this is still the early stage of the OpenDigger data export process, format changes may be inevitable.

tyn1998 commented 1 year ago

Actually there is a metric that fits the rule, which is participants.

I got it. So the new data fields will appear in some of the existing metrics, and changes in Hypercrx are required.

tyn1998 commented 1 year ago

Hi, @frank-zsy, when was the data with raw fields exported and uploaded to OSS?

Hypercrx is not yet ready for the new yyyy-mm-raw field, so charts with an extra yyyy-mm-raw entry are broken now:

image

I will fix it right now, with https://github.com/hypertrons/hypertrons-crx/issues/577 handled as well.