JesseVent / crypto

Cryptocurrency Historical Market Data R Package
https://CRAN.R-project.org/package=crypto

Coin name is mislabeled in dataset for duplicate-symbols #2

Closed by mikelambert 6 years ago

mikelambert commented 6 years ago

Thanks for doing this work, greatly appreciated.

I was going to do some analysis on it myself, and got confused by duplicate rows for some coins.

For example, up to 2017-12-12 there is one data point per date, but from 2017-12-13 onward there are two data points per date:

...
"PRO",2017-12-08,0.350132,0.35717,0.323824,0.353445,70668,4921880,"Propy",385
"PRO",2017-12-09,0.357165,0.391303,0.341962,0.368005,129489,5020740,"Propy",385
"PRO",2017-12-10,0.368921,0.368921,0.336974,0.34417,61321,5186000,"Propy",385
"PRO",2017-12-11,0.343705,0.398727,0.337944,0.398727,65037,4831540,"Propy",385
"PRO",2017-12-12,0.397612,0.529153,0.385998,0.488748,206352,5589310,"Propy",385
"PRO",2017-12-13,0.386384,0.483263,0.382225,0.428319,2150570,0,"Propy",385
"PRO",2017-12-13,0.490256,0.606039,0.489343,0.569292,190018,6891630,"Propy",385
"PRO",2017-12-14,0.428882,0.428882,0.363489,0.411705,2015520,0,"Propy",385
"PRO",2017-12-14,0.568401,0.60429,0.533319,0.575479,110802,7990140,"Propy",385
"PRO",2017-12-15,0.41248,0.427116,0.375678,0.413479,1403610,0,"Propy",385
"PRO",2017-12-15,0.576018,0.582851,0.554633,0.572736,108230,8097210,"Propy",385
"PRO",2017-12-16,0.414505,0.489333,0.401455,0.456695,3102170,0,"Propy",385
"PRO",2017-12-16,0.573556,0.805846,0.573556,0.671252,184437,8062590,"Propy",385
"PRO",2017-12-17,0.462755,0.520551,0.440206,0.454255,1461940,0,"Propy",385
"PRO",2017-12-17,0.672319,0.697868,0.607462,0.689788,145157,9450930,"Propy",385
"PRO",2017-12-18,0.453932,0.477431,0.423593,0.473935,1583290,0,"Propy",385
"PRO",2017-12-18,0.701731,0.701731,0.575447,0.644326,177428,9864380,"Propy",385

Only one row per date has a non-zero market value, so I'm going to go with that one. (I assume market refers to market cap? I thought at first it might be showing data from two different exchanges or something.)
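The check described above can be sketched in base R: count rows per (symbol, date) pair to find the duplicated dates, then keep only rows with a non-zero market value. The column names are an assumption, since the excerpt has no header row; the rows themselves are copied from the excerpt.

```r
# Column names are assumed; the CSV excerpt has no header row.
cols <- c("symbol", "date", "open", "high", "low", "close",
          "volume", "market", "name", "ranknow")

# A few rows copied from the PRO excerpt.
pro_lines <- c(
  '"PRO",2017-12-12,0.397612,0.529153,0.385998,0.488748,206352,5589310,"Propy",385',
  '"PRO",2017-12-13,0.386384,0.483263,0.382225,0.428319,2150570,0,"Propy",385',
  '"PRO",2017-12-13,0.490256,0.606039,0.489343,0.569292,190018,6891630,"Propy",385',
  '"PRO",2017-12-14,0.428882,0.428882,0.363489,0.411705,2015520,0,"Propy",385',
  '"PRO",2017-12-14,0.568401,0.60429,0.533319,0.575479,110802,7990140,"Propy",385'
)
prices <- read.csv(text = pro_lines, header = FALSE, col.names = cols,
                   stringsAsFactors = FALSE)

# Count rows per (symbol, date); more than one row means a duplicated date.
rows_per_key <- aggregate(close ~ symbol + date, data = prices, FUN = length)
names(rows_per_key)[3] <- "n_rows"
duplicated_keys <- rows_per_key[rows_per_key$n_rows > 1, ]
print(duplicated_keys)

# The heuristic from this comment: keep only rows with a non-zero market value.
kept <- prices[prices$market > 0, ]
print(kept[, c("symbol", "date", "close", "market")])
```

On these five rows the filter leaves one row per date, but that only holds when exactly one of the duplicated rows has a non-zero market value.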

mikelambert commented 6 years ago

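Actually, my prioritization logic appears flawed. For BTG, some dates have a zero market value in both rows, and others have a non-zero market value in both: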

...
"BTG",2017-10-20,0.819804,1.2,0.80772,1.19,80,0,"Bitcoin Gold",11
"BTG",2017-10-21,0.873455,1.25,0.862919,0.991196,59,0,"Bitcoin Gold",11
"BTG",2017-10-22,1.01,2.09,0.844422,1.7,1756,0,"Bitcoin Gold",11
"BTG",2017-10-23,479.82,539.72,479.82,500.13,7652060,0,"Bitcoin Gold",11
"BTG",2017-10-23,1.7,13.43,1.11,7.04,41557,0,"Bitcoin Gold",11
...
"BTG",2017-11-23,241.97,299.89,241.97,293.61,154038000,0,"Bitcoin Gold",11
"BTG",2017-11-23,5.84,6.45,5.72,5.9,7800,345650,"Bitcoin Gold",11
"BTG",2017-11-24,295.75,413.74,284.26,394.22,537472000,0,"Bitcoin Gold",11
"BTG",2017-11-24,5.89,7.96,4.1,5.42,10850,348941,"Bitcoin Gold",11
"BTG",2017-11-25,394.04,394.04,339.1,356.04,208662000,0,"Bitcoin Gold",11
"BTG",2017-11-25,5.4,6.64,4.68,5.68,7480,320152,"Bitcoin Gold",11
"BTG",2017-11-26,355.72,366.79,334.74,366.79,141228000,5930460000,"Bitcoin Gold",11
"BTG",2017-11-26,5.68,6.39,4.36,5.31,3204,336402,"Bitcoin Gold",11
"BTG",2017-11-27,370.18,387.88,353.67,359.25,129160000,6172140000,"Bitcoin Gold",11
"BTG",2017-11-27,5.31,5.5,4.39,5.43,8423,314816,"Bitcoin Gold",11
...

So: am I parsing this data wrong (is there a better way to deal with these duplicate time series), or is there some extraneous data creeping in here? Thanks!
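To make the failure concrete, here is a rough base R sketch that counts, per date, how many BTG rows have a non-zero market value. The rule "keep the row with a non-zero market" is only unambiguous when that count is exactly one. Column names are again an assumption (the excerpt has no header), and the rows are copied from the BTG excerpt.

```r
# Column names are assumed; the CSV excerpt has no header row.
cols <- c("symbol", "date", "open", "high", "low", "close",
          "volume", "market", "name", "ranknow")

# Rows copied from the BTG excerpt.
btg_lines <- c(
  '"BTG",2017-10-23,479.82,539.72,479.82,500.13,7652060,0,"Bitcoin Gold",11',
  '"BTG",2017-10-23,1.7,13.43,1.11,7.04,41557,0,"Bitcoin Gold",11',
  '"BTG",2017-11-23,241.97,299.89,241.97,293.61,154038000,0,"Bitcoin Gold",11',
  '"BTG",2017-11-23,5.84,6.45,5.72,5.9,7800,345650,"Bitcoin Gold",11',
  '"BTG",2017-11-26,355.72,366.79,334.74,366.79,141228000,5930460000,"Bitcoin Gold",11',
  '"BTG",2017-11-26,5.68,6.39,4.36,5.31,3204,336402,"Bitcoin Gold",11'
)
btg <- read.csv(text = btg_lines, header = FALSE, col.names = cols,
                stringsAsFactors = FALSE)

# Count rows with a non-zero market value on each date.
btg$has_market <- btg$market > 0
nonzero <- aggregate(has_market ~ date, data = btg, FUN = sum)
names(nonzero)[2] <- "n_nonzero"

# Exactly one non-zero row is the only case the heuristic can resolve.
nonzero$verdict <- ifelse(nonzero$n_nonzero == 1, "unambiguous", "ambiguous")
print(nonzero)
```

Here 2017-10-23 has no non-zero row and 2017-11-26 has two, so the market column cannot serve as a tie-breaker for this symbol.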

mikelambert commented 6 years ago

Ooooh, sorry, I figured out that this is due to coins on CoinMarketCap that share a ticker: PRO, BTG, ACC, etc.

Not sure of a correct way to distinguish them in the dataset, especially since the name column appears to pick one coin arbitrarily instead of naming each coin. For example, every row is labeled Bitcoin Gold (never Bitgem), Propy (never ProChain), etc.

JesseVent commented 6 years ago

Hi Mike, you're spot on: the issue is due to several tokens sharing the same symbol. I didn't know how to go about resolving it at first, but then figured I'd take the slug I'm already using to generate the URLs for scraping and use that as a unique identifier instead.

The change I just committed should resolve the duplication issues, and there are also a couple of extra features included.

Let me know how you go, thanks

> head(pro)
   slug symbol  name       date ranknow     open     high      low    close volume   market close_ratio spread
1 propy    PRO Propy 2017-09-19     295 0.823919 0.858425 0.628423 0.745318  26854 11582000      0.5082   0.23
2 propy    PRO Propy 2017-09-20     295 0.744813 0.933790 0.644857 0.862584 102433 10470000      0.7536   0.29
3 propy    PRO Propy 2017-09-21     295 0.859565 0.982731 0.743939 0.809898  74579 12083100      0.2762   0.24
4 propy    PRO Propy 2017-09-22     295 0.773040 0.792471 0.588509 0.658002 136747 10866800      0.3407   0.20
5 propy    PRO Propy 2017-09-23     295 0.657034 1.470000 0.559158 0.724104 298708  9236070      0.1811   0.91
6 propy    PRO Propy 2017-09-24     295 0.731472 0.734890 0.571775 0.615710 204870 10282500      0.2693   0.16
> tail(pro)
        slug symbol     name       date ranknow     open     high      low    close  volume market close_ratio spread
122 prochain    PRO ProChain 2017-12-28    1096 0.360183 0.365566 0.331167 0.354053  626739      0      0.6653   0.03
123 prochain    PRO ProChain 2017-12-29    1096 0.352030 0.401564 0.345843 0.357103  523329      0      0.2021   0.06
124 prochain    PRO ProChain 2017-12-30    1096 0.355327 0.358116 0.307769 0.326998  661599      0      0.3819   0.05
125 prochain    PRO ProChain 2017-12-31    1096 0.328685 0.378672 0.324813 0.358117  888244      0      0.6184   0.05
126 prochain    PRO ProChain 2018-01-01    1096 0.358136 0.358136 0.331516 0.345254 1424280      0      0.5161   0.03
127 prochain    PRO ProChain 2018-01-02    1096 0.345629 0.446480 0.345629 0.417606 4645990      0      0.7137   0.10
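
As a small illustration of keying on slug rather than symbol, here is a base R sketch. The data frame below is hand-built from four rows of the output above (only a subset of the columns), so it is a stand-in for the package's real output rather than a call into the package itself.

```r
# Hand-built stand-in using values copied from the head()/tail() output.
pro <- data.frame(
  slug   = c("propy", "propy", "prochain", "prochain"),
  symbol = "PRO",
  name   = c("Propy", "Propy", "ProChain", "ProChain"),
  date   = as.Date(c("2017-09-19", "2017-09-20", "2018-01-01", "2018-01-02")),
  close  = c(0.745318, 0.862584, 0.345254, 0.417606),
  stringsAsFactors = FALSE
)

# One symbol maps to two slugs, so symbol alone is not a unique key.
n_slugs_for_pro <- length(unique(pro$slug[pro$symbol == "PRO"]))

# (slug, date) identifies each row; anyDuplicated() returns 0 when no key repeats.
n_dup <- anyDuplicated(pro[, c("slug", "date")])

# Split into one time series per coin.
by_slug <- split(pro, pro$slug)
print(names(by_slug))
print(by_slug$prochain[, c("date", "close")])
```

Grouping or splitting on slug keeps Propy and ProChain apart even though both report the symbol PRO.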

mikelambert commented 6 years ago

Awesome, the slug works great, thank you!