eprbell / dali-rp2

DaLI (Data Loader Interface) is a data loader and input generator for RP2 (https://pypi.org/project/rp2), the privacy-focused, free, open-source cryptocurrency tax calculator: DaLI removes the need to manually prepare RP2 input files. Just like RP2, DaLI is also free, open-source and it prioritizes user privacy.
https://pypi.org/project/dali-rp2/
Apache License 2.0

Implement the CCXT converter plugin reading in CSV files for Historical Data #59

Open macanudo527 opened 2 years ago

macanudo527 commented 2 years ago

Calls to APIs can sometimes be slow due to throttling. For example, Kraken limits requests to roughly 12 per minute, which makes pulling pricing data via the API extremely slow (about 5 seconds per call). Historical data is often available in CSV form for major exchanges. These CSVs could be read in by the plugin and cached to greatly speed up the pricing of trades.

Some of these CSVs are available in repositories that could be pulled from automatically, but the links to them might be brittle. I propose we set up default download links; if they fail, we ask the user to report it or update the link themselves, possibly via a .cfg file?
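The default-link-with-user-override idea could be sketched as follows. This is purely illustrative: the section name, option names, and URLs are hypothetical, not part of DaLI's actual configuration schema.

```python
# Hypothetical sketch: resolve a historical-data CSV URL from a user .cfg
# override, falling back to a built-in default. Section/option names and
# URLs are illustrative only.
import configparser

# Built-in defaults that may go stale when an exchange moves its archive.
DEFAULT_CSV_URLS = {
    "kraken": "https://example.com/kraken/ohlcvt.zip",    # placeholder URL
    "binance": "https://example.com/binance/klines.zip",  # placeholder URL
}

def resolve_csv_url(exchange: str, config_text: str) -> str:
    """Return the user-supplied URL if present, else the built-in default."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    if parser.has_option("historical_csv", exchange):
        return parser.get("historical_csv", exchange)
    return DEFAULT_CSV_URLS[exchange]

user_cfg = """
[historical_csv]
kraken = https://example.com/mirror/kraken.zip
"""

kraken_url = resolve_csv_url("kraken", user_cfg)    # user override wins
binance_url = resolve_csv_url("binance", user_cfg)  # falls back to default
```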

eprbell commented 2 years ago

I'm a bit concerned about introducing a new ad hoc execution path and about brittle links: it may become a maintenance headache. How about we just rely on REST APIs but store the result in a pair-conversion-specific DaLI cache? This gives us the best of both worlds: it's fast (except the first time) and it uses the normal REST-based logic, so it's robust too.
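The cache-backed REST idea amounts to a persistent memo table keyed by pair and timestamp. A minimal sketch, assuming a pickle file under a hypothetical `.dali_cache` path (this is not DaLI's actual cache code):

```python
# Illustrative sketch: memoize REST price lookups in a per-pair cache that
# is persisted between runs, so only the first run pays the throttling cost.
import pickle
from pathlib import Path

CACHE_FILE = Path(".dali_cache/price_cache.pickle")  # hypothetical path

def load_cache() -> dict:
    if CACHE_FILE.exists():
        with CACHE_FILE.open("rb") as f:
            return pickle.load(f)
    return {}

def save_cache(cache: dict) -> None:
    CACHE_FILE.parent.mkdir(parents=True, exist_ok=True)
    with CACHE_FILE.open("wb") as f:
        pickle.dump(cache, f)

def get_price(cache, exchange, pair, timestamp, fetch):
    """Return a cached price, calling fetch() (a REST lookup) only on a miss."""
    key = (exchange, pair, timestamp)
    if key not in cache:
        cache[key] = fetch()
    return cache[key]

cache = {}
price = get_price(cache, "kraken", "BTC/USD", 1640995200, lambda: 46306.45)
cached = get_price(cache, "kraken", "BTC/USD", 1640995200, lambda: 0.0)  # hit
```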

eprbell commented 2 years ago

Also can you share pointers to the CSV repositories? I'd like to take a look at them.

macanudo527 commented 2 years ago

Well, my initial pull takes 9 hours. Sure, after that the cache can cover everything but the newest data, but it would be much smoother to download a CSV and import all that data on the first run. That Kraken bottleneck is a killer.

As for brittleness, the plugin would just attempt to pull from the repositories first and, if the file isn't there, fall back to the API. If the user found an updated link, they could supply it in a .cfg file.

Periodically saving to the cache would also help with this. Currently, the transaction_resolver only saves at the end, correct?

Also, a progress bar would be very handy.

I'm not saying this is a top priority, but it would be handy. FX data in minute candles is also available this way, and it is much more accurate than HistoricalRates.host, which only provides daily rates.

Here is Kraken's historical data. And here is Binance's.

eprbell commented 2 years ago

Thanks for the pointers. OK, let's try the CSV approach, especially considering that a run takes 9 hours... The transaction resolver does save pair-converter caches at the end; however, if you want to force a save elsewhere, you can call save_historical_price_cache() inside the plugin whenever it makes sense (e.g. every 100 new reads or so).

I guess we would have to download the CSV files we need on the fly (and we can perhaps store them in the .dali_cache directory, so that we don't have to download them again next time).
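The periodic-save idea could look like the sketch below, where a callback stands in for the plugin's save_historical_price_cache() method and the flush frequency is a hypothetical tuning knob:

```python
# Sketch: flush the price cache every N fresh reads instead of only at the
# end of the run, so a 9-hour run that dies midway doesn't lose everything.
SAVE_EVERY = 100  # flush frequency; illustrative value

class PeriodicSaver:
    def __init__(self, saver, save_every: int = SAVE_EVERY):
        self._saver = saver          # e.g. plugin.save_historical_price_cache
        self._save_every = save_every
        self._new_reads = 0
        self.save_count = 0

    def record_read(self) -> None:
        """Call after each fresh (non-cached) price lookup."""
        self._new_reads += 1
        if self._new_reads >= self._save_every:
            self._saver()
            self.save_count += 1
            self._new_reads = 0

saves = []
periodic = PeriodicSaver(lambda: saves.append("saved"), save_every=3)
for _ in range(7):
    periodic.record_read()
# 7 reads with save_every=3 triggers 2 intermediate saves
```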

macanudo527 commented 1 year ago

I'm going to start working on this first to speed up my run times a little. I'm wondering about the architecture. Obviously, there should be some abstract class that the CSV readers inherit from, and then each exchange gets its own class, but how should I pass in the CSVs of the historical data?

If I pass a file for each exchange into the CCXT pricing plugin, we are going to have a large __init__(). Should I use the dictionary format I used for the fiat priority? Implemented here, and described here.
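The dictionary-style constructor argument might look like this sketch (class and parameter names are hypothetical stand-ins, not the plugin's actual API):

```python
# Hypothetical sketch: instead of one __init__ parameter per exchange, pass
# a single mapping of exchange name to local CSV/zip path, mirroring the
# fiat-priority dictionary approach mentioned above.
from typing import Dict, Optional

class PairConverterPlugin:  # simplified stand-in for the CCXT plugin
    def __init__(self, historical_csv: Optional[Dict[str, str]] = None):
        # e.g. {"kraken": "/data/USD_OHLCVT.zip", "binance": "/data/binance/"}
        self.historical_csv = historical_csv or {}

    def csv_path_for(self, exchange: str) -> Optional[str]:
        """Look up the user-supplied data file for an exchange, if any."""
        return self.historical_csv.get(exchange.lower())

plugin = PairConverterPlugin({"kraken": "/data/USD_OHLCVT.zip"})
```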

eprbell commented 1 year ago

My initial thoughts about the CSV converter:

Does this hold water?

macanudo527 commented 1 year ago

The CSV might have to be local. In the case of Kraken, it is a 2.6 GB zip file on Google Drive. But it could lazy-load the local file.

Or, I could ask the user whether they want to sit through a massive download in exchange for faster price resolution.

I can design it so that it will just pull out and read all the historical data for one pair from the zip file instead of all pairs. That would save us some time.
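Reading a single market out of the bulk archive is straightforward with the standard library; zipfile only decompresses the member you open. A sketch, using the member naming pattern described above (e.g. USDTUSD_1m) and a tiny in-memory archive standing in for the real multi-gigabyte download:

```python
# Sketch: read only one pair's CSV out of the Kraken bulk zip, without
# extracting the whole archive.
import csv
import io
import zipfile

def read_pair_rows(zip_bytes: bytes, base_asset: str, quote: str, interval: str):
    """Lazily read only <BASE><QUOTE>_<interval>.csv from the archive."""
    member = f"{base_asset}{quote}_{interval}.csv"
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member) as f:
            text = io.TextIOWrapper(f, encoding="utf-8")
            # OHLCVT rows: timestamp, open, high, low, close, volume, trades
            return list(csv.reader(text))

# Build a tiny stand-in archive with two markets; only one gets read.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("USDTUSD_1m.csv", "1640995200,1.0,1.0,1.0,1.0,100,5\n")
    archive.writestr("XBTUSD_1m.csv", "1640995200,46306,46310,46300,46308,2,7\n")

rows = read_pair_rows(buffer.getvalue(), "USDT", "USD", "1m")
```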

Binance will have to use hard-coded URLs; each market is a separate file.

eprbell commented 1 year ago

Can you clarify what you mean by "the CSV might have to be local"? After loading the file and storing it in the cache, it becomes local. Also, I'm not sure I understand the difference in approach between Kraken and Binance: wouldn't we lazy-load files in both cases and store them in the cache?

My main concern is that I want to avoid adding external files to our tree, for two reasons: it creates potential licensing issues and it increases the size of our distro. So lazy loading seems a reasonable option.

macanudo527 commented 1 year ago

Kraken is one big zip file of CSVs, one CSV per market per time interval (e.g. USDTUSD_1m, USDTUSD_5m, etc.), shared via a Google Drive link, which will probably break every quarter when they update it.

Binance.com has separate CSVs that must be downloaded one by one.

I wasn't thinking of adding the files to the distro. I was thinking we would ask users to download the files themselves (i.e. make the data file local) and point the class to the file. However, we could download the file for them when needed. The problem with that, though, is that the Kraken file is huge and the link will break regularly, so for Kraken it's probably better for the user to download it.

For Binance, we should be okay to simply download from their archive because it looks like the links are stable and the files are relatively small. Furthermore, asking the user to download every little CSV file they might need is obviously too cumbersome.
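Per-market lazy downloads for Binance could be sketched as below. The URL pattern follows Binance's public data archive at data.binance.vision as I understand it; treat it as an assumption and verify against the archive before relying on it.

```python
# Sketch: build the URL for one Binance monthly kline file and fetch it into
# a local cache directory only if it isn't already there. URL layout is an
# assumption about the data.binance.vision archive.
from pathlib import Path
from urllib.request import urlretrieve

ARCHIVE_BASE = "https://data.binance.vision/data/spot/monthly/klines"

def binance_csv_url(market: str, interval: str, year: int, month: int) -> str:
    return f"{ARCHIVE_BASE}/{market}/{interval}/{market}-{interval}-{year}-{month:02d}.zip"

def download_if_missing(market: str, interval: str, year: int, month: int,
                        cache_dir: str = ".dali_cache") -> Path:
    """Fetch the monthly kline zip into the cache dir unless already present."""
    target = Path(cache_dir) / f"{market}-{interval}-{year}-{month:02d}.zip"
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(binance_csv_url(market, interval, year, month), target)
    return target

url = binance_csv_url("BTCUSDT", "1m", 2022, 7)
```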

macanudo527 commented 1 year ago

Another option for Kraken is to download the individual zip file from the Google Drive folder by searching the shared folder, but that would require adding the googleapiclient package to the project requirements.

EDIT: We don't need googleapiclient; a lot of basic features can be accessed with plain URLs. The user will still need an API key, though.

So, to summarize: we can do a lazy net call for Binance and download the necessary CSV if needed. There are no issues there. Binance isn't much of a priority anyway, since it is pretty fast.

For Kraken, we have 3 options:

eprbell commented 1 year ago

Thanks, that clarifies the issue. For Kraken, I would rule out option 1. How would the Google API key solution work? In other words, if the path to the file is brittle, how would having Google API key access help and not also break periodically?

macanudo527 commented 1 year ago

I'm assuming the shared directory will remain consistent.

The large bulk file will most likely change, and that would give it a new fileId, which is how Google Drive keeps track of files and what we need in order to download a file. I'm going to assume the directory that contains the individual files for each base asset will stay the same, though, and that when they publish new historical data they will just copy over the old files. I can use the fileId of the directory and get a list of the files contained in it (by making a request to a URL that returns a JSON response with that information). Then I can search through the list to find the file named after the base asset we are looking for and get its fileId.

Of course, if for some reason they create a whole new folder/directory and share that, it will break the link.

This is what the JSON response looks like:

{
    "kind": "drive#fileList",
    "etag": "\"2p0hfXggwf1I-PiKATAmZQf-u4U\"",
    "selfLink": "https://www.googleapis.com/drive/v2/files?q='1aoA6SKgPbS_p3pYStXUXFvmjqShJ2jv9'+in+parents",
    "nextPageToken": "~!!~AI9FV7SZrADC7bXlXGEJ1x73SoZHQTIHjM4OAlVhl0eTOV2EEvmW6P3vfqzNnRcdRRn4JdJQWSdutNjSXTcpGLvlHZNx5xiO8uRasb3T0lkrLnndxDTlMAA7zNBPj_-PU6p9Irfj-mM-xBDEptDyWV5RbHQF4Dx0aeBCnZBa_PNGgEuTPiI6IIrEtxKb0FrccVcmm52Pc0fUiw8eVuq7U80_upyEQDvVt25TRnmb5e5FJTFfiNo6bdZLCmyc7hoCynhEgFUX_O2KgBNqQhqQ-S9pmNcPgG07b-WhjzOKmXB4KW9p_rfRJZxf5u_tRklFw5_-Wyr96BjH",
    "nextLink": "https://www.googleapis.com/drive/v2/files?pageToken=~!!~AI9FV7SZrADC7bXlXGEJ1x73SoZHQTIHjM4OAlVhl0eTOV2EEvmW6P3vfqzNnRcdRRn4JdJQWSdutNjSXTcpGLvlHZNx5xiO8uRasb3T0lkrLnndxDTlMAA7zNBPj_-PU6p9Irfj-mM-xBDEptDyWV5RbHQF4Dx0aeBCnZBa_PNGgEuTPiI6IIrEtxKb0FrccVcmm52Pc0fUiw8eVuq7U80_upyEQDvVt25TRnmb5e5FJTFfiNo6bdZLCmyc7hoCynhEgFUX_O2KgBNqQhqQ-S9pmNcPgG07b-WhjzOKmXB4KW9p_rfRJZxf5u_tRklFw5_-Wyr96BjH&q='1aoA6SKgPbS_p3pYStXUXFvmjqShJ2jv9'+in+parents",
    "incompleteSearch": false,
    "items": [
        {
            "kind": "drive#file",
            "id": "1Cd8aUdWnZIxTUVMETwuYbLiTRLFi_qjl", - id needed to download the file
            "etag": "\"MTY1NzcxMTk3NzAwMA\"",
            "selfLink": "https://www.googleapis.com/drive/v2/files/1Cd8aUdWnZIxTUVMETwuYbLiTRLFi_qjl",
            "webContentLink": "https://drive.google.com/uc?id=1Cd8aUdWnZIxTUVMETwuYbLiTRLFi_qjl&export=download",
            "alternateLink": "https://drive.google.com/file/d/1Cd8aUdWnZIxTUVMETwuYbLiTRLFi_qjl/view?usp=drivesdk",
            "embedLink": "https://drive.google.com/file/d/1Cd8aUdWnZIxTUVMETwuYbLiTRLFi_qjl/preview?usp=drivesdk",
            "openWithLinks": {
                "338347331578": "https://drive.google.com/file/d/1Cd8aUdWnZIxTUVMETwuYbLiTRLFi_qjl/view?usp=drivesdk"
            },
            "defaultOpenWithLink": "https://drive.google.com/file/d/1Cd8aUdWnZIxTUVMETwuYbLiTRLFi_qjl/view?usp=drivesdk",
            "iconLink": "https://drive-thirdparty.googleusercontent.com/16/type/application/zip",
            "title": "USD_OHLCVT.zip", - name of the file
            "mimeType": "application/zip",
            "labels": {
                "starred": false,
                "hidden": false,
                "trashed": false,
                "restricted": false,
                "viewed": false
            },
            "copyRequiresWriterPermission": false,
            "createdDate": "2022-07-13T14:06:03.367Z",
            "modifiedDate": "2022-07-13T11:32:57.000Z",
            "markedViewedByMeDate": "1970-01-01T00:00:00.000Z",
            "version": "31",
            "parents": [
                {
                    "kind": "drive#parentReference",
                    "id": "1aoA6SKgPbS_p3pYStXUXFvmjqShJ2jv9",
                    "selfLink": "https://www.googleapis.com/drive/v2/files/1Cd8aUdWnZIxTUVMETwuYbLiTRLFi_qjl/parents/1aoA6SKgPbS_p3pYStXUXFvmjqShJ2jv9",
                    "parentLink": "https://www.googleapis.com/drive/v2/files/1aoA6SKgPbS_p3pYStXUXFvmjqShJ2jv9",
                    "isRoot": false
                }
            ],
            "downloadUrl": "https://content.googleapis.com/drive/v2/files/1Cd8aUdWnZIxTUVMETwuYbLiTRLFi_qjl?key=AIzaSyAa8yy0GdcGPHdtD083HiGGx_S0vMPScDM&alt=media&source=downloadUrl",
            "userPermission": {
                "kind": "drive#permission",
                "etag": "\"Gyyt2pIk97_SnBBIKTPqA8cfCag\"",
                "id": "me",
                "selfLink": "https://www.googleapis.com/drive/v2/files/1Cd8aUdWnZIxTUVMETwuYbLiTRLFi_qjl/permissions/me",
                "role": "reader",
                "type": "user",
                "pendingOwner": false
            },
            "originalFilename": "USD_OHLCVT.zip",
            "fileExtension": "zip",
            "md5Checksum": "cb8d8ce2f596c2250eb42ccb0aedad7f",
            "fileSize": "162490006",
            "quotaBytesUsed": "162490006",
            "ownerNames": [
                "Trading History"
            ],
            "owners": [
                {
                    "kind": "drive#user",
                    "displayName": "Trading History",
                    "picture": {
                        "url": "https://lh3.googleusercontent.com/a/default-user=s64"
                    },
                    "isAuthenticatedUser": false,
                    "permissionId": "14876986118543214761",
                    "emailAddress": "k3magh6mjzgw24ck2@gmail.com"
                }
            ],
            "lastModifyingUserName": "Trading History",
            "lastModifyingUser": {
                "kind": "drive#user",
                "displayName": "Trading History",
                "picture": {
                    "url": "https://lh3.googleusercontent.com/a/default-user=s64"
                },
            "isAuthenticatedUser": false,
            "permissionId": "14876986118543214761",
            "emailAddress": "k3magh6mjzgw24ck2@gmail.com"
            },
            "capabilities": {
                "canCopy": true,
                "canEdit": false
            },
            "editable": false,
            "copyable": true,
            "writersCanShare": true,
            "shared": true,
            "explicitlyTrashed": false,
            "appDataContents": false,
            "headRevisionId": "0B_fGz8zz5iToVmU0WVJXSldKSjJqNXVnVDRjNmprQXBybldvPQ",
            "spaces": [
                "drive"
            ]
    },
eprbell commented 1 year ago

OK, sounds good: if it breaks (hopefully that won't happen frequently, or at all) we'll fix the plugin. My vote is for 3, but 2 is also a reasonable solution. So you get the last word on this!

macanudo527 commented 1 year ago

Actually, looking over the docs again, I can just send one query string with the parent fileId (the fileId of the directory) and the name of the file, which is easy to form (base_asset + "_OHLCVT.zip").

That returns a much cleaner JSON:

{
 "kind": "drive#fileList",
 "incompleteSearch": false,
 "files": [
  {
   "kind": "drive#file",
   "id": "1AAWkwfxJjOvZQKv3c5XOH1ZjoIQMblQt",
   "name": "USDT_OHLCVT.zip",
   "mimeType": "application/zip"
  }
 ]
}

I'll roll with 3, and if it breaks, 2 should be easy enough to fall back on.
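The single-query lookup described above might look like the sketch below: build a Drive v3 files.list URL that filters by parent folder and exact file name, then pick the fileId out of the JSON. The folder ID and API key are placeholders, and the query syntax is my reading of the Drive v3 API.

```python
# Sketch: one query string with the parent folder's fileId plus the formed
# file name (base_asset + "_OHLCVT.zip"), then extract the fileId needed to
# download the archive.
import json
from urllib.parse import urlencode

DRIVE_FILES_ENDPOINT = "https://www.googleapis.com/drive/v3/files"

def build_query_url(folder_id: str, base_asset: str, api_key: str) -> str:
    query = f"'{folder_id}' in parents and name = '{base_asset}_OHLCVT.zip'"
    return f"{DRIVE_FILES_ENDPOINT}?{urlencode({'q': query, 'key': api_key})}"

def extract_file_id(response_text: str) -> str:
    """Pull the fileId of the first match out of a files.list response."""
    files = json.loads(response_text)["files"]
    if not files:
        raise ValueError("file not found in shared folder")
    return files[0]["id"]

# Using the sample response shown above:
sample = (
    '{"kind": "drive#fileList", "incompleteSearch": false, "files": ['
    '{"kind": "drive#file", "id": "1AAWkwfxJjOvZQKv3c5XOH1ZjoIQMblQt", '
    '"name": "USDT_OHLCVT.zip", "mimeType": "application/zip"}]}'
)
file_id = extract_file_id(sample)
```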

macanudo527 commented 1 year ago

A Kraken CSV reader has been created with #105. If I have time I will try to implement FX and Binance, but I'll be addressing other issues for now.