Status: Open. Stockard opened this issue 4 years ago.
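We have many data sources, including third-party APIs and scattered data tables contributed by team members. We need to consolidate and store this data so that it can be fed directly into models. A few examples:

- Third-party API: the DXY (丁香园) epidemic data (the API is currently down). It outputs JSON, which is awkward for models to consume and has to be tidied into a data frame first, for example the data here and here. This data needs to be stored promptly because the third-party API is not very stable.
- Scattered data tables: team members have contributed data scraped from Baidu Migration (百度迁徙). Many people need this data, and it is also required for building the migration model. People are still contributing such data, so good datasets need to be ingested promptly.
- Economic data: city-level data such as GDP and number of hospitals, plus the same kind of data at the province level. National-level data may follow in the future.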
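To make the "JSON to data frame, then store it" step concrete, here is a minimal sketch using only the Python standard library. The payload shape, the field names (`provinceName`, `cities`, `confirmedCount`, and so on), the numbers, and the table name `city_snapshot` are all made up for illustration; the real DXY response may look different. SQLite stands in for whatever database ends up being chosen.

```python
import json
import sqlite3

# Hypothetical payload with made-up values; the real API response may differ.
SAMPLE = json.loads("""
[
  {"provinceName": "Hubei", "updateTime": 1580000000,
   "cities": [
     {"cityName": "Wuhan", "confirmedCount": 618, "deadCount": 45},
     {"cityName": "Huanggang", "confirmedCount": 122, "deadCount": 2}
   ]},
  {"provinceName": "Guangdong", "updateTime": 1580000000,
   "cities": [
     {"cityName": "Shenzhen", "confirmedCount": 15, "deadCount": 0}
   ]}
]
""")


def flatten(provinces):
    """Turn nested province -> cities JSON into one flat row per city."""
    rows = []
    for prov in provinces:
        for city in prov.get("cities", []):
            rows.append({
                "province": prov["provinceName"],
                "city": city["cityName"],
                "confirmed": city["confirmedCount"],
                "dead": city["deadCount"],
                "update_time": prov["updateTime"],
            })
    return rows


def store(conn, rows):
    """Save rows as a snapshot; the primary key makes repeated runs idempotent."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS city_snapshot ("
        "province TEXT, city TEXT, confirmed INTEGER, dead INTEGER, "
        "update_time INTEGER, PRIMARY KEY (province, city, update_time))"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO city_snapshot "
        "VALUES (:province, :city, :confirmed, :dead, :update_time)",
        rows,
    )
    conn.commit()


rows = flatten(SAMPLE)
conn = sqlite3.connect(":memory:")  # in-memory database, for the sketch only
store(conn, rows)
store(conn, rows)  # a second run with the same snapshot adds no duplicates
```

The flat list of dicts can be handed to a data frame library as-is, and keeping `update_time` in the key means every new snapshot from the unstable API is kept rather than overwritten.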
Given the amount of data we have now and will have in the future, I would go for the following solution.
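1. Let consumers fetch data through GraphQL so that they request exactly what they need; the backend then needs no per-consumer customization.
2. Have the backend clean the data and load it into a single store. I don't know what we currently use for storage: HBase, or PostgreSQL directly?
If building a data model turns out to be too much trouble, we could use PostgreSQL with a V8 extension such as PLV8, which lets us write JS, and retrieve custom data through JS scripts.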
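As a rough illustration of "request exactly what you need", the sketch below builds a GraphQL request in which the consumer names only the fields it wants. Everything specific here is an assumption: the endpoint URL, the `cities` query, and the fields `name`, `gdp`, and `hospitalCount` are hypothetical, since no schema has been defined yet. The request is only constructed, not sent.

```python
import json
import urllib.request

# Hypothetical endpoint; no such server exists in this project yet.
GRAPHQL_URL = "https://example.org/graphql"

# Hypothetical schema: the consumer lists only the fields it needs.
QUERY = """
query CityData($province: String!) {
  cities(province: $province) {
    name
    gdp
    hospitalCount
  }
}
"""


def build_request(query, variables):
    """Build a standard GraphQL-over-HTTP POST request (JSON body)."""
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        GRAPHQL_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request(QUERY, {"province": "Hubei"})
payload = json.loads(request.data)  # decode again just to inspect the body

# To actually send it once a server exists:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

A consumer that needs a different set of fields would change only the query string, which is the point of item 1: the backend exposes one schema and does not grow a new endpoint for each consumer.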
@emptymalei's proposal sounds feasible. In fact, from what I have observed of how other sub-projects of this org have evolved, the frontend and data-sync seem to have gone down the path of self-serving their data without talking to an actual "backend API". That said, I think the api-server is better suited to serving the data science needs.
@Gamehu's comments also sound good. GraphQL might be more suitable for this use case than RESTful APIs, especially since the data model has not been standardized yet and is subject to change every day. Since I'm not that familiar with GraphQL, I cannot estimate how much work it would mean for me.
When it comes to DBs, whichever we decide to use (NoSQL or SQL), it's worth mentioning that we should prefer cloud-managed instances (such as GCP Cloud SQL, Firebase, AWS RDS, DynamoDB, etc.) over maintaining our own.
@Stockard Reading through your top priority, could you clarify what "疫情数据可视化" (epidemic data visualization) means? In particular, what kind of information about the epidemic do we want to visualize here? If that were clearer, I could prioritize building the specific API endpoints.
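For people who work with Hadoop and the like, our current data volume is a drop in the bucket, but we still need someone to help sort out the computing architecture. The purpose of this structure is to make it easy to plug the existing data into online models, or into computations for testing models, and then to output results that can feed directly into visualization or be exported continuously to other projects. We hope to deploy quickly, so we would also like to make use of existing tools. I have basically zero experience in this area, so please feel free to make suggestions. Solutions you have used yourself are especially welcome; let's pool our ideas.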
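The current situation is,
I currently know what needs to be done, but I have little idea how to do it, and I would like to hear everyone's thoughts.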