eprbell / dali-rp2

DaLI (Data Loader Interface) is a data loader and input generator for RP2 (https://pypi.org/project/rp2), the privacy-focused, free, open-source cryptocurrency tax calculator: DaLI removes the need to manually prepare RP2 input files. Just like RP2, DaLI is free, open-source, and it prioritizes user privacy.
https://pypi.org/project/dali-rp2/
Apache License 2.0

Add Optimization to Dijkstra #146

Closed macanudo527 closed 1 year ago

macanudo527 commented 1 year ago

This addresses #145 and #143, and helps clear the way to resolve #140.

This will allow markets (or edges in the Dijkstra search) to be added and deleted when cloning a graph. It also allows a graph to be optimized by altering the weights of edges during cloning.

This is WIP. I submitted it to get feedback on the MappedGraph data class before implementing it in the CCXT pair converter.
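
To make the discussion concrete, here is a minimal sketch of the kind of interface I have in mind (MappedGraph is the class under review, but the methods shown below are simplified placeholders, not the final code):

```python
# Illustrative sketch only: MappedGraph is the real class under review, but
# the method signatures below are simplified placeholders, not the PR's code.
from typing import Dict, Optional


class MappedGraph:
    def __init__(self, edges: Optional[Dict[str, Dict[str, float]]] = None) -> None:
        # Adjacency map: base asset -> {quote asset -> edge weight}
        self.edges: Dict[str, Dict[str, float]] = edges if edges is not None else {}

    def add_market(self, base: str, quote: str, weight: float = 1.0) -> None:
        self.edges.setdefault(base, {})[quote] = weight

    def remove_market(self, base: str, quote: str) -> None:
        self.edges.get(base, {}).pop(quote, None)

    def clone_with_optimization(
        self, new_weights: Dict[str, Dict[str, float]]
    ) -> "MappedGraph":
        # Copy the adjacency map, overriding weights where the caller provides them
        cloned: Dict[str, Dict[str, float]] = {}
        for base, neighbors in self.edges.items():
            cloned[base] = {
                quote: new_weights.get(base, {}).get(quote, weight)
                for quote, weight in neighbors.items()
            }
        return MappedGraph(cloned)
```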

To-do:

macanudo527 commented 1 year ago

@eprbell Here it is, my thesis project.

I tried my best to cleanly cut out each feature into its own commit to make this easier to review. If you find it too unwieldy, let me know and I can try to separate it out more or make another PR, but a lot of this was very interrelated.

eprbell commented 1 year ago

@eprbell Here it is, my thesis project.

I tried my best to cleanly cut out each feature into its own commit to make this easier to review. If you find it too unwieldy, let me know and I can try to separate it out more or make another PR, but a lot of this was very interrelated.

Awesome! I'm traveling internationally at the moment, so it may take a bit longer for me to review it. I'll get to it in the next few days.

eprbell commented 1 year ago

@eprbell Here it is, my thesis project. I tried my best to cleanly cut out each feature into its own commit to make this easier to review. If you find it too unwieldy, let me know and I can try to separate it out more or make another PR, but a lot of this was very interrelated.

Awesome! I'm traveling internationally at the moment, so it may take a bit longer for me to review it. I'll get to it in the next few days.

Sorry for being late in reviewing this. I'm on vacation until the 19th, traveling in southern Europe. I was hoping to get some time to review PRs but it didn't happen. I'll get to it as soon as I can, but it may be slower than usual. Thanks for your patience, Neal!

macanudo527 commented 1 year ago

Don't worry about the delay in reviewing. I'm quite busy with another project and job-hunting. We will chip away at it.

eprbell commented 1 year ago

Don't worry about the delay in reviewing. I'm quite busy with another project and job-hunting. We will chip away at it.

Sounds good! Good luck with the job hunt, and if you need a reference, just let me know on a private channel.

macanudo527 commented 1 year ago

I just realized manifest_builder.py uses multithreading, but I don't think it will actually increase the speed of anything, will it? It needs to be multi-process, right?

macanudo527 commented 1 year ago

This takes an incredibly long time (20-30 minutes) to optimize the Kraken graph, mostly because of all the fiat pairs. For example, Kraken has markets for BTC/USD, BTC/EUR, BTC/JPY, and BTC/GBP, all of which need to be optimized. If we only looked at markets with a quote asset of USD, we would cut the time down significantly. If users want, they can optionally add another quote asset.

Another option is to implement multiprocessing since my CPUs are hardly being used during the download and chunk process.

eprbell commented 1 year ago

I just realized manifest_builder.py uses multithreading, but I don't think it will actually increase the speed of anything, will it? It needs to be multi-process, right?

In Python, CPU-bound algorithms can be sped up using multiprocessing, whereas I/O-bound ones can be sped up with multithreading. This is due to a design constraint in the Python interpreter (the GIL, or global interpreter lock). Is this algorithm mostly CPU-bound or I/O-bound?
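
For example (a generic illustration of the difference, not code from this PR):

```python
# Generic illustration (not code from this PR): processes parallelize
# CPU-bound work despite the GIL; threads only help when waiting on I/O.
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import urllib.request


def cpu_bound(n: int) -> int:
    return sum(i * i for i in range(n))  # pure computation; threads would serialize on the GIL


def io_bound(url: str) -> int:
    with urllib.request.urlopen(url) as response:  # blocking I/O releases the GIL
        return len(response.read())


if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:  # true parallelism for CPU-bound work
        print(list(pool.map(cpu_bound, [10_000_000] * 4)))
    with ThreadPoolExecutor() as pool:  # threads overlap the network waits
        print(list(pool.map(io_bound, ["https://pypi.org/project/dali-rp2/"] * 4)))
```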

eprbell commented 1 year ago

A couple of ideas:

macanudo527 commented 1 year ago

In Python, CPU-bound algorithms can be sped up using multiprocessing, whereas I/O-bound ones can be sped up with multithreading. This is due to a design constraint in the Python interpreter (the GIL, or global interpreter lock). Is this algorithm mostly CPU-bound or I/O-bound?

I think it is CPU-bound, since it is just sorting arrays in memory, not actually reading from or writing to disk or a REST API.

macanudo527 commented 1 year ago
  • Could we do the internal price lookup in USD only and then convert from USD to the native fiat of the country plugin at the end? Not sure what the performance impact would be... just a thought.

This would be best, and, in the future, when fiat-native exchanges are added, we can have them resolve to whatever fiat they are 'native' to. For example, I use bitbank here, and everything is priced in JPY, so it makes sense to price everything in JPY first.
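
Something like this is what I'm picturing (a rough sketch with made-up helpers, not actual dali-rp2 code; the numbers are placeholders):

```python
# Rough sketch with made-up helpers (get_usd_price, get_usd_to_fiat_rate);
# the placeholder numbers are just for illustration, not real prices.
from decimal import Decimal


def get_usd_price(asset: str) -> Decimal:
    # Stand-in for the graph-based asset/USD lookup
    return {"BTC": Decimal("30000")}[asset]


def get_usd_to_fiat_rate(fiat: str) -> Decimal:
    # Stand-in for a single fiat/fiat conversion applied at the very end
    return {"USD": Decimal("1"), "JPY": Decimal("140")}[fiat]


def get_native_fiat_price(asset: str, fiat: str) -> Decimal:
    # Resolve everything through USD first, then convert once
    return get_usd_price(asset) * get_usd_to_fiat_rate(fiat)


print(get_native_fiat_price("BTC", "JPY"))  # 4200000
```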

macanudo527 commented 1 year ago

I'm running into another issue. It looks like Huobi changed something about their API, and it no longer seems to work like it should. They do have CSV files available, though, so we will eventually be able to price things using that exchange, just not now.

Should I just monkey-patch it for now with zero prices for the two assets we need Huobi for (SOLO and NEXO)? In other words, have dali-rp2 insert zero prices with an INFO log message informing users that the prices aren't available? That way people can at least process some taxes.

Otherwise, there is no other exchange that can price those assets (Binance has the most volume, but might be off-limits).

eprbell commented 1 year ago

I'm running into another issue. It looks like Huobi changed something about their API, and it no longer seems to work like it should. They do have CSV files available, though, so we will eventually be able to price things using that exchange, just not now.

Should I just monkey-patch it for now with zero prices for the two assets we need Huobi for (SOLO and NEXO)? In other words, have dali-rp2 insert zero prices with an INFO log message informing users that the prices aren't available? That way people can at least process some taxes.

Otherwise, there is no other exchange that can price those assets (Binance has the most volume, but might be off-limits).

I'd rather not process them than process them with an incorrect (zero) price. People who need those two assets priced can still use the manual plugin until we implement the Huobi CSV plugin. But maybe open an issue so we can track this. Does this make sense?

macanudo527 commented 1 year ago

Well, I can't test with real-world data, since those assets get pulled in when I do a pull from Binance.com.

I could sub in the Binance.com route for now, maybe, and then try to put together a Huobi plugin soonish.

macanudo527 commented 1 year ago

Looking around, it looks like Gate.io will do in a pinch for both assets. The volume is only 1% of the total but the price is fairly close and will work until we get something better.

eprbell commented 1 year ago

Looking around, it looks like Gate.io will do in a pinch for both assets. The volume is only 1% of the total but the price is fairly close and will work until we get something better.

Sounds good! Thanks for finding this.

macanudo527 commented 1 year ago

Looking around, it looks like Gate.io will do in a pinch for both assets. The volume is only 1% of the total but the price is fairly close and will work until we get something better.

Sounds good! Thanks for finding this.

I spoke too soon. The new system doesn't allow the lookup of prices on exchanges where the market doesn't exist yet. The side effect is that Binance, Gate.io, and even Bitget all added NEXO after I started trading it. Huobi and Bitfinex are the only high-profile exchanges that have it. I'm trying with Bitfinex now.

macanudo527 commented 1 year ago

I have another question. I'm wondering how to save the bundle of bars we pull in order to optimize the graph. Normally, we cache individual bars one by one, but these bars will most likely only be pulled as a bundle, so it makes more sense to store them as a bundle in the cache.

Should I create a new caching function that takes a different key to store bundles?

The other option is to push it into the Kraken CSV plugin and have it write the bars out into the CSV. We will always need to pull from the CSV, so we can retrieve all the bars in one call that way. If I store them in the standard cache on disk, we will need two calls.

eprbell commented 1 year ago

Looking around, it looks like Gate.io will do in a pinch for both assets. The volume is only 1% of the total but the price is fairly close and will work until we get something better.

Sounds good! Thanks for finding this.

I spoke too soon. The new system doesn't allow the lookup of prices on exchanges where the market doesn't exist yet. The side effect is that Binance, Gate.io, and even Bitget all added NEXO after I started trading it. Huobi and Bitfinex are the only high-profile exchanges that have it. I'm trying with Bitfinex now.

OK, let me know if Bitfinex works, otherwise we have to think about a different strategy.

eprbell commented 1 year ago

I have another question. I'm wondering how to save the bundle of bars we pull in order to optimize the graph. Normally, we cache individual bars one by one, but these bars will most likely only be pulled as a bundle, so it makes more sense to store them as a bundle in the cache.

Should I create a new caching function that takes a different key to store bundles?

The other option is to push it into the Kraken CSV plugin and have it write the bars out into the CSV. We will always need to pull from the CSV, so we can retrieve all the bars in one call that way. If I store them in the standard cache on disk, we will need two calls.

The cache read/write functions already use Any-typed data, so I don't think we need new variants.
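
For example, a bundle of bars is just one more Any-typed value under a composite key. A minimal sketch (the pickle-based helpers and key scheme here are illustrative, not the actual dali-rp2 cache implementation):

```python
# Minimal sketch: a bundle of bars is one Any-typed value stored under a
# composite key. The pickle-based helpers and key scheme here are
# illustrative, not dali-rp2's actual cache implementation.
import pickle
from pathlib import Path
from typing import Any, List, Optional

CACHE_DIR = Path(".dali_cache")  # assumed location for this sketch


def save_to_cache(key: str, data: Any) -> None:
    CACHE_DIR.mkdir(exist_ok=True)
    (CACHE_DIR / key).write_bytes(pickle.dumps(data))


def load_from_cache(key: str) -> Optional[Any]:
    path = CACHE_DIR / key
    return pickle.loads(path.read_bytes()) if path.exists() else None


# One write and one read for the whole bundle, instead of one call per bar
bars: List[List[float]] = [[1559347200000, 8552.9, 8553.0, 8552.8, 8552.9, 0.3]]
save_to_cache("kraken_BTC_USD_bundle", bars)
assert load_from_cache("kraken_BTC_USD_bundle") == bars
```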

macanudo527 commented 1 year ago

OK, I got the bars caching issue resolved.

About speeding up the optimization process: I've narrowed down the quote assets to those that appear in the transactions, plus the native fiat of the country in the dali configuration. This should be enough so that optimization only takes around 5 minutes for most people, and once cached it will only take about a minute to load up.
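
The narrowing is essentially this (a simplified sketch; the real dali-rp2 transaction types differ):

```python
# Simplified sketch of the narrowing; the real dali-rp2 transaction types differ.
from typing import List, NamedTuple, Set


class Transaction(NamedTuple):
    base_asset: str
    quote_asset: str


def quote_assets_to_optimize(transactions: List[Transaction], native_fiat: str) -> Set[str]:
    assets: Set[str] = {t.quote_asset for t in transactions}
    assets.add(native_fiat)  # always include the country's native fiat
    return assets


markets = ["BTC/USD", "BTC/EUR", "BTC/JPY", "BTC/GBP"]
wanted = quote_assets_to_optimize([Transaction("BTC", "USD")], "JPY")
print([m for m in markets if m.split("/")[1] in wanted])  # ['BTC/USD', 'BTC/JPY']
```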

eprbell commented 1 year ago

OK, I got the bars caching issue resolved.

About speeding up the optimization process: I've narrowed down the quote assets to those that appear in the transactions, plus the native fiat of the country in the dali configuration. This should be enough so that optimization only takes around 5 minutes for most people, and once cached it will only take about a minute to load up.

Fantastic! We can optionally do a profiler run later to identify further bottlenecks, but for now this is great!

macanudo527 commented 1 year ago

All right, one more fringe case that the optimizations have uncovered. Some assets are given to users before they are traded anywhere. Generally, tax law states that the cost basis is the price of the asset when a market does finally open. However, under the current system, the CCXT pair plugin rejects it because it can't find a price.

I think I will add a week to the start of the graph for that asset (the asset I'm having trouble with was given to me 5 days before I could trade it anywhere). Then, when the price is pulled and the timestamps don't match (because the price is from the future), I can use that price but warn the user with an INFO.
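
Roughly like this (a sketch; the helper names are hypothetical, not the actual plugin code):

```python
# Sketch of the padding idea; the helper names are hypothetical.
import logging
from datetime import datetime, timedelta

LOGGER = logging.getLogger(__name__)
PRE_MARKET_PADDING = timedelta(weeks=1)


def effective_window_start(first_market_timestamp: datetime) -> datetime:
    # Start the graph a week early so pre-listing acquisitions still resolve
    return first_market_timestamp - PRE_MARKET_PADDING


def note_future_price(requested: datetime, bar_timestamp: datetime) -> None:
    if bar_timestamp > requested:
        LOGGER.info(
            "No market existed at %s; using first available price from %s",
            requested,
            bar_timestamp,
        )
```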

eprbell commented 1 year ago

All right, one more fringe case that the optimizations have uncovered. Some assets are given to users before they are traded anywhere. Generally, tax law states that the cost basis is the price of the asset when a market does finally open. However, under the current system, the CCXT pair plugin rejects it because it can't find a price.

I think I will add a week to the start of the graph for that asset (the asset I'm having trouble with was given to me 5 days before I could trade it anywhere). Then, when the price is pulled and the timestamps don't match (because the price is from the future), I can use that price but warn the user with an INFO.

Sounds good: you're finding lots of edge cases. Good work!

macanudo527 commented 1 year ago

I just discovered another issue. The Kraken REST API only returns the latest 720 bars. So, the farther we go back in the current quarter, the less accurate the pricing becomes, until new quarterly data becomes available. That's probably confusing, so let me give an example:

I request a price for BTC/USD for April 2nd, which is in Q2 and not available via CSV yet. CCXT will first request a 1-minute candle and not get it, since Kraken only returns the last 720 bars (i.e., the last 720 minutes at that timeframe). It continues to increase the timeframe until it can get the candle that includes April 2nd. This is probably 12-hour, since April 2nd is easily within 360 days from now. It will use that candle if I pull the data now, but later, when CSV data becomes available, it will use the 1-minute candle, increasing the accuracy.

A 12-hour candle isn't too bad, but dali-rp2 has to waste calls for each timeframe. Each call takes 5 seconds, which means it can take 20-25 seconds for each transaction to resolve.
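
The escalation looks roughly like this (an illustrative loop; fetch_ohlcv is the real CCXT method, the rest of the logic is a sketch):

```python
# Illustrative escalation loop; fetch_ohlcv is the real CCXT method, the
# surrounding logic is a sketch of the behavior described above.
TIMEFRAMES = ["1m", "1h", "12h", "1d"]  # finest to coarsest


def first_covering_candle(exchange, market: str, target_ms: int):
    for timeframe in TIMEFRAMES:
        # Kraken only serves the most recent ~720 bars per timeframe, so a
        # fine-grained request for an old timestamp comes back starting too late
        candles = exchange.fetch_ohlcv(market, timeframe=timeframe, since=target_ms)
        if candles and candles[0][0] <= target_ms:
            return timeframe, candles[0]
    return None  # every rate-limited call above was spent before giving up
```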

I hope that all makes sense. Talking to the Kraken team, they told me that I can construct OHLC data from their trades endpoint, which will give me all trades and their execution prices during a time period without restrictions. But that seems like overkill, and maybe not worth it for just one exchange.

eprbell commented 1 year ago

I'm back from vacation and I'll try to catch up with everything over the next few days: thanks for holding down the fort!

Regarding the Kraken 720-bar issue, it seems to me we don't have a lot of choice: we'll have to use less precise prices until the CSV becomes available. Perhaps we can use an info log to let users know that prices from the last 3 months will probably change later. This is not great because it breaks users' golden-file-based tests (if they have any), but I'm not sure what else we can do. Can you elaborate on OHLC construction from their trades endpoint? I'd just like to understand how that would work.

macanudo527 commented 1 year ago

Most exchanges will give you their raw trade log (i.e., all trades performed on the exchange and at what price). Here is an example from Kraken:

{"error":[],"result":{"XXBTZUSD":[["8552.90000","0.03190270",1559347203.7998,"s","m",""],
["8552.90000","0.03155529",1559347203.8086,"s","m",""],["8552.90000","0.00510797",1559347203.9664,"s","m",""],
["8552.90000","0.09047336",1559347203.9789,"s","m",""],["8552.90000","0.00328738",1559347203.9847,"s","m",""],
["8552.90000","0.00492152",1559347203.9897,"s","m",""],["8552.90000","0.00201848",1559347203.9937,"s","m",""],
["8552.90000","0.11422068",1559347203.9993,"s","m",""],["8552.90000","0.00425858",1559347204.071,"s","m",""],
["8552.90000","0.00427679",1559347204.0762,"s","m",""],["8552.90000","0.06381401",1559347204.1662,"s","m",""]...
["8579.50000","0.05379597",1559350785.248,"s","l",""],["8579.50000","0.94620403",1559350785.2936,"s","l",""],
["8578.10000","0.45529068",1559350785.297,"s","l",""]],"last":"1559350785297011117"}}

You can then reconstruct the candle by grabbing the open, close, high, and low values from the trades in a given window and then adding volume. Of course, if you wanted super-accurate prices, you could literally take the exact trade you made and grab the price from that.
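
In code, the reconstruction would be roughly this (a sketch based on the trade format above):

```python
# Sketch based on the trade format above (price, volume, timestamp, side,
# order type, misc), reduced to OHLCV for one window.
from typing import List, Tuple

Trade = Tuple[str, str, float, str, str, str]


def trades_to_ohlcv(trades: List[Trade]) -> Tuple[float, float, float, float, float]:
    prices = [float(t[0]) for t in trades]
    volume = sum(float(t[1]) for t in trades)
    # open = first trade in the window, close = last trade in the window
    return prices[0], max(prices), min(prices), prices[-1], volume


window: List[Trade] = [
    ("8552.90000", "0.03190270", 1559347203.7998, "s", "m", ""),
    ("8578.10000", "0.45529068", 1559350785.297, "s", "l", ""),
]
print(trades_to_ohlcv(window))  # (8552.9, 8578.1, 8552.9, 8578.1, 0.48719338)
```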

Here are more details from Kraken on how to pull trades.

FYI: XBT is what Kraken calls BTC on its API, but not on its exchange or charts, as if things weren't confusing enough.

macanudo527 commented 1 year ago

I implemented the warning. It fills up the console pretty quickly if someone has done a lot of trades, but maybe in an update we can put together some kind of summary.

eprbell commented 1 year ago

Thanks for the explanation. Regarding the warning filling up the console: perhaps we can print it only once so that it's easier to read (without transaction specifics, just a message saying that Kraken prices from the last 3 months are not fully accurate and will improve over the next 3 months). I also suggest adding a FAQ explaining the issue.
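
Something as simple as a module-level flag would do (an illustrative sketch, not a prescription):

```python
# Illustrative only: a module-level flag so the message prints once per run.
import logging

LOGGER = logging.getLogger(__name__)
_kraken_accuracy_warning_emitted = False


def warn_recent_kraken_prices_once() -> None:
    global _kraken_accuracy_warning_emitted
    if not _kraken_accuracy_warning_emitted:
        LOGGER.info(
            "Kraken prices from the last 3 months come from coarse candles and "
            "will become more accurate once quarterly CSV data is published."
        )
        _kraken_accuracy_warning_emitted = True
```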

Let me know when new code is up and I'll do one more round of review.

macanudo527 commented 1 year ago

There seems to be a bug in the macOS build affecting bz2. Should I temporarily pin the Python version to 3.7.16?

macanudo527 commented 1 year ago

Other than the macOS bug, this should be good to go. I can't fully test it because, as it happens, Binance.com now has a launchpad for MAV, which means I'm receiving an asset that doesn't have a market and doesn't exist on the graph.

How should I handle that case? Just warn users and assign a zero price, since that is actually its current value? The problem is that there is no way of knowing whether it is an untradeable asset or we just haven't added an alternative market to price it yet.

macanudo527 commented 1 year ago

@eprbell I think I'll just allow users to declare untradeable assets in the plugin configuration file. Then we can assign ZERO to them until a market exists. I'll warn the user with an INFO.
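
Roughly like this (a sketch; the configuration shape and names are hypothetical):

```python
# Sketch only: the configuration shape and names here are hypothetical.
import logging
from decimal import Decimal
from typing import Optional, Set

LOGGER = logging.getLogger(__name__)
ZERO = Decimal(0)


def lookup_price(asset: str, untradeable_assets: Set[str]) -> Optional[Decimal]:
    if asset in untradeable_assets:
        LOGGER.info("Asset %s is declared untradeable: assigning zero price", asset)
        return ZERO
    return None  # fall through to the normal graph-based lookup


print(lookup_price("MAV", {"MAV"}))  # 0
```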

macanudo527 commented 1 year ago

This works with real-world data and provides much more accurate pricing.

eprbell commented 1 year ago

This works with real-world data and provides much more accurate pricing.

Great! I'll do one more full review.

macanudo527 commented 1 year ago

@eprbell It looks like the macOS bug has been fixed. You just need to add a snippet to the CI.

eprbell commented 1 year ago

I started reviewing your latest changes, but unfortunately GitHub's "Show changes since your last review" doesn't work for me anymore (it used to), so I have to do a full review from scratch. I intend to finish this by tomorrow (Sunday) night PST.

When I try to see changes since last review I now get: "We went looking everywhere, but couldn’t find those commits".

macanudo527 commented 1 year ago

It might be because I force-pushed a few times. I made a lot of changes from the original plan because some things just didn't work out.

eprbell commented 1 year ago

It might be because I force-pushed a few times. I made a lot of changes from the original plan because some things just didn't work out.

Oh, I see. In the future it would be best not to use push --force unless you are in a critical situation (e.g., removing sensitive data that was checked in by mistake): it makes reviews considerably harder (especially big ones).

macanudo527 commented 1 year ago

OK, this is clean.