ZuInnoTe / hadoopcryptoledger

Hadoop Crypto Ledger - Analyzing CryptoLedgers, such as Bitcoin Blockchain, on Big Data platforms, such as Hadoop/Spark/Flink/Hive
Apache License 2.0

Create Web scraper to fetch currency exchange rates #28

Open jornfranke opened 7 years ago

jornfranke commented 7 years ago

Create web scraper to fetch currency exchange rates for currencies based on cryptoledgers. This should support multiple sources in a generic way and the data should be made available together with cryptoledger data provided already by the HadoopCryptoLedger library.

First we need to work on a design, then on the implementation and on unit and integration tests (the latter probably against an embedded Jetty and/or Tomcat).

liorregev commented 7 years ago

A couple of questions regarding this issue, as I am thinking of implementing a similar mechanism:

  1. Should this really be in the library? As of now the library seems to be a data parser, not a data provider. Adding this would make the library fetch actual data rather than just parse existing data, which makes it dependent on an external service and its API.
  2. Would you implement this as a different InputFormat, or rather as an option for the current formats (like enrich)? The datasets for daily average rates seem to be very small (~400K), so it is possible to fetch them, broadcast them to the entire cluster (available at least in Spark, and I imagine in Hadoop as well, but Spark is my specialty) and enrich every transaction with its USD value.
  3. Did you have a specific service in mind? @omervk and I found this and it seems very convenient.
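
The enrichment idea in point 2 can be sketched in plain Python. This is only an illustration under assumptions: the rate table, the field names (`block_time`, `satoshi`, `usd_value`) and all values are made up, and the function is independent of any cluster framework. In Spark, a small dictionary like this could be shipped to the executors as a broadcast variable and the function applied in a map step.

```python
from datetime import datetime, timezone

# Hypothetical daily average rates: ISO date -> USD per BTC (made-up values).
DAILY_USD_PER_BTC = {
    "2017-12-01": 10000.0,
    "2017-12-02": 11000.0,
}

SATOSHI_PER_BTC = 100_000_000


def enrich_with_usd(tx, rates):
    """Return a copy of tx with a usd_value field (None if no rate is known)."""
    # Map the block timestamp (seconds since epoch) to its UTC calendar day.
    day = datetime.fromtimestamp(tx["block_time"], tz=timezone.utc).date().isoformat()
    rate = rates.get(day)
    usd = None if rate is None else tx["satoshi"] / SATOSHI_PER_BTC * rate
    return {**tx, "usd_value": usd}


# Hypothetical transactions; field names are assumptions, not the library's schema.
transactions = [
    {"txid": "a", "block_time": 1512129600, "satoshi": 50_000_000},   # 2017-12-01 12:00 UTC
    {"txid": "b", "block_time": 1512216000, "satoshi": 200_000_000},  # 2017-12-02 12:00 UTC
    {"txid": "c", "block_time": 1512302400, "satoshi": 100_000_000},  # 2017-12-03, no rate
]
enriched = [enrich_with_usd(tx, DAILY_USD_PER_BTC) for tx in transactions]
```

Returning `None` for a missing day keeps the transaction in the result instead of failing the whole run; whether that is the right policy depends on the analysis.
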
jornfranke commented 7 years ago

Honestly, I have not thought about it that much, except that it would NOT be in the InputFormat or the Spark data source, but a subproject that can be used independently (cf. the Flink and Hive support, which are also independent).

Maybe it is just a normal application, such as Sqoop for JDBC. It could also be something that is fetched live, but since the sources are REST APIs this might not be such a good choice, so a daily import with a dedicated Spark job that then caches the data locally could make sense. This would also make error handling easier and reduce the number of errors: if you fetch the rates live, the REST service may fail and then the whole job fails, so it is better to store them locally in a controlled fashion. You might also want to clean the data and cross-validate it with other data (e.g. exchange rates from different exchanges, or a comparison of futures and exchange rates, which might already be interesting without using blockchain data itself).
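
A minimal sketch of the "import daily, store locally" idea, assuming SQLite as the local store and a hypothetical `fetch` callable standing in for a real REST client. The table layout and function names are illustrative only. The point is that a failing source only affects the import step, while analysis jobs read from the local table.

```python
import sqlite3


def import_daily_rates(conn, fetch, day):
    """Fetch rates for one day and store them; return False if the source failed."""
    try:
        rates = fetch(day)  # hypothetical callable: day -> {currency: usd_rate}
    except Exception:
        # Broad catch for the sketch: the import can be retried later,
        # and jobs that read from the local table are unaffected.
        return False
    conn.executemany(
        "INSERT OR REPLACE INTO rates(day, currency, usd_rate) VALUES (?, ?, ?)",
        [(day, cur, rate) for cur, rate in rates.items()],
    )
    conn.commit()
    return True


def lookup_rate(conn, day, currency):
    """Read a stored rate; None means the day was never imported."""
    row = conn.execute(
        "SELECT usd_rate FROM rates WHERE day = ? AND currency = ?", (day, currency)
    ).fetchone()
    return None if row is None else row[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rates(day TEXT, currency TEXT, usd_rate REAL, "
    "PRIMARY KEY (day, currency))"
)


def fake_fetch(day):
    """Stand-in for a REST call; fails for one day to simulate an outage."""
    if day == "2017-12-02":
        raise ConnectionError("service unavailable")
    return {"BTC": 10000.0}  # made-up value


ok = import_daily_rates(conn, fake_fetch, "2017-12-01")
failed = import_daily_rates(conn, fake_fetch, "2017-12-02")
```
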

There are also open legal questions. Some APIs have license restrictions, and the data itself can be subject to license restrictions.

Also, one cannot support all possible data providers. Here one may have a more generic interface that can be reconfigured without programming effort.
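
One possible reading of "reconfigured without programming effort" is a parser driven by configuration, where adding a provider only means adding an entry that says where the rate sits in the response. The provider names, field paths and payloads here are invented for illustration and do not describe any real service.

```python
import json

# Hypothetical provider configs: only the field paths differ, not the code.
PROVIDERS = {
    "provider_a": {"rate_path": ["data", "price_usd"]},
    "provider_b": {"rate_path": ["ticker", "last"]},
}


def extract_rate(provider, payload_text, providers=PROVIDERS):
    """Walk the configured path through the decoded JSON and return the rate."""
    node = json.loads(payload_text)
    for key in providers[provider]["rate_path"]:
        node = node[key]
    return float(node)  # some services deliver numbers as strings


rate_a = extract_rate("provider_a", '{"data": {"price_usd": "10000.5"}}')
rate_b = extract_rate("provider_b", '{"ticker": {"last": 9999.0}}')
```

In practice the `PROVIDERS` mapping could be loaded from a file, so that supporting a new source is a configuration change rather than a code change.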

On the other hand, it is also something that organizations can easily implement on their own, so this probably does not have such a high priority.

jornfranke commented 7 years ago

That being said, I do not want to limit you with my thoughts. You can create such an application in your own repositories and I am happy to link to it from here. E.g. if you think the data should be fetched live from the web service and you use the proposed service, then I will put a link to it in the wiki; just notify me.

Maybe I will also change my mind about what should be in there or not. As you can see, I have not given it that much thought yet, but my gut feeling is that it should be separated from the library to give the user maximum flexibility.

liorregev commented 7 years ago

I would think that it should be a separate repository altogether since, like you said, it is completely independent of actual blockchain data. Also, as for local caching: while it is probably the best solution for high-resolution data, it would make a pluggable library dependent on the user having a storage backend to link to it for caching. I do agree that it is easily implemented on the user's side, and as such, what might be open sourced (imo) at best is various connectors for a variety of services and the conversion of these to a common interface.
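
A sketch of how connectors behind a common interface could keep caching optional, so the library itself does not require a storage backend. All class and method names (`RateSource`, `StaticSource`, `CachedSource`, `usd_rate`) are hypothetical; the cache backend is simply any dict-like object that the user supplies, with an in-memory dict as the default.

```python
from typing import MutableMapping, Optional, Protocol


class RateSource(Protocol):
    """Common interface that every connector would implement (hypothetical)."""

    def usd_rate(self, currency: str, day: str) -> float: ...


class StaticSource:
    """Stand-in connector; a real one would call a service's API."""

    def __init__(self, table):
        self.table = table
        self.calls = 0  # counts how often the underlying source is hit

    def usd_rate(self, currency, day):
        self.calls += 1
        return self.table[(currency, day)]


class CachedSource:
    """Wraps any RateSource; the backend is any dict-like object."""

    def __init__(self, source: RateSource, backend: Optional[MutableMapping] = None):
        self.source = source
        self.backend = {} if backend is None else backend

    def usd_rate(self, currency, day):
        key = (currency, day)
        if key not in self.backend:
            self.backend[key] = self.source.usd_rate(currency, day)
        return self.backend[key]


inner = StaticSource({("BTC", "2017-12-01"): 10000.0})  # made-up value
source = CachedSource(inner)
first = source.usd_rate("BTC", "2017-12-01")
second = source.usd_rate("BTC", "2017-12-01")  # served from the cache
```

Because `CachedSource` satisfies the same interface, callers do not need to know whether a cache is present, and a user with a real storage backend can pass in their own mapping-like adapter.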

jornfranke commented 7 years ago

I am also happy to link to and promote such an endeavor, and of course to support it if I can.

I fear this dedicated storage will be needed in many cases, especially once you have to pay for the data, want to clean it or cross-validate it with other data sources, or have multiple users using it (every user would have to wait for the HTTP requests instead of doing more immediate analysis). The same applies if you want to do a time series analysis, where you might need to match timestamps, handle different time zones, etc.
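
The timestamp matching mentioned here can be illustrated with a small sketch that normalizes everything to UTC before comparing. The helper name `to_utc_day` and the example timestamps are assumptions for illustration; the choice of daily granularity is also just an example.

```python
from datetime import datetime, timedelta, timezone


def to_utc_day(ts):
    """Convert a timezone-aware datetime to its UTC calendar day (ISO string)."""
    if ts.tzinfo is None:
        raise ValueError("naive timestamps are ambiguous; attach a time zone first")
    return ts.astimezone(timezone.utc).date().isoformat()


# An exchange reporting in UTC+9 and a block timestamp in UTC (made-up values).
utc_plus_9 = timezone(timedelta(hours=9))
quote_time = datetime(2017, 12, 2, 3, 0, tzinfo=utc_plus_9)      # 2017-12-01 18:00 UTC
block_time = datetime(2017, 12, 1, 20, 30, tzinfo=timezone.utc)

# The local dates differ (Dec 2 vs Dec 1), but both fall on the same UTC day.
same_day = to_utc_day(quote_time) == to_utc_day(block_time)
```

Rejecting naive timestamps up front avoids silently pairing a quote with the wrong day's transactions.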

These REST APIs are not always reliable: sometimes you have to try several times to get some data, and they might also be down (DoS attacks etc.).
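
The "try several times" behaviour can be sketched as a retry loop with exponential backoff. This is an assumption about how one might handle it, not something prescribed in this thread; `fetch` is a hypothetical callable standing in for the HTTP request, and the `sleep` parameter exists only so the waiting can be replaced in tests.

```python
import time


def fetch_with_retry(fetch, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return fetch()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)


calls = {"n": 0}


def flaky_fetch():
    """Stand-in for a REST call that fails twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return {"BTC": 10000.0}  # made-up value


delays = []  # record the requested delays instead of actually sleeping
result = fetch_with_retry(flaky_fetch, sleep=delays.append)
```

Retries only help with short outages; for longer downtime the locally stored data discussed earlier in the thread remains the fallback.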
