eprbell / dali-rp2

DaLI (Data Loader Interface) is a data loader and input generator for RP2 (https://pypi.org/project/rp2), the privacy-focused, free, open-source cryptocurrency tax calculator: DaLI removes the need to manually prepare RP2 input files. Just like RP2, DaLI is also free, open-source and it prioritizes user privacy.
https://pypi.org/project/dali-rp2/
Apache License 2.0

Weigh vertexes in Dijkstra based on monthly volume for the CCXT pair converter #145

Closed: macanudo527 closed this issue 12 months ago

macanudo527 commented 1 year ago

Currently the Dijkstra search doesn't take trading volume into account when calculating routes, which could lead to inaccurate pricing on low-volume routes. I propose that weights be based on monthly volume. This should provide enough granularity; I'm afraid that if we used daily volume, the graph could take up too much memory and pulling the data could take too long.

These weights will actually be added to a priority queue of exchanges in MappedGraph, and the current private variable that tracks this, self.__exchange_markets, will be removed.
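A rough sketch of the idea (the helper name and volumes below are hypothetical, not the actual MappedGraph API): each month's exchanges go into a max-priority queue keyed by volume, so the highest-volume exchange is considered first.

import heapq
from typing import List, Tuple

def build_exchange_queue(monthly_volume: dict) -> List[Tuple[float, str]]:
    # heapq is a min-heap, so store negated volumes to pop the
    # highest-volume exchange first.
    queue = [(-volume, exchange) for exchange, volume in monthly_volume.items()]
    heapq.heapify(queue)
    return queue

# Hypothetical volumes for one month, in quote-currency units.
march_2023 = {"Binance.com": 1_250_000.0, "Kraken": 310_000.0, "Gemini": 42_000.0}
queue = build_exchange_queue(march_2023)
negated_volume, exchange = heapq.heappop(queue)
print(exchange)  # Binance.com: highest volume that month, so it is tried first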

eprbell commented 1 year ago

A question on this approach. Suppose I run DaLI on Apr 14 2023 and I get some results with certain pricing information reflecting the current volume. Then I run DaLI again on Apr 14 2024 and at that time volume data has completely changed: will the volume change affect transactions before Apr 14 2023? The reason I ask is that we should strive to preserve historical output as much as possible. In other words one way to test DaLI is to run it periodically and compare the output against golden files: such tests should continue to work.

macanudo527 commented 1 year ago

I think weights for the current month would have to be based on the previous month volume.

For example, if you run DaLI on April 14th and price a transaction from April 3rd, the exchange with the most volume in March of that year would be used.

Alternatively, volume weighting could just be disabled for the current month in order to keep results consistent.

Beyond that, volumes would be computed month by month and would never change: the exchange with the highest volume for March 2023 will still be the exchange with the highest volume for March 2023 when queried in April 2024.

The market priority queue would be sorted by month if that makes sense.

For example, if we price a transaction from April 2022 and exchange A had more volume than exchange B for April 2022, we will use exchange A. Exchange A's April 2022 volume will never change once the month is complete, so the result will always be the same price.

The priority queue for markets will choose which exchange to use for a given vertex.

We have a similar problem with vertexes in general, though. The trading volume for BTC -> EUR is actually less than for BTC -> USDT on a lot of exchanges, and some stablecoins are more popular on different exchanges at different times, so the quote priority will have to be derived from monthly volume as well.
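To illustrate the quote-priority part (made-up volumes, purely illustrative): the quote currencies reachable from a vertex could simply be ranked by that month's volume.

# Hypothetical: rank quote currencies for BTC by April 2022 volume.
quote_volume_2022_04 = {"USDT": 900_000.0, "EUR": 120_000.0, "BUSD": 300_000.0}
quote_priority = sorted(quote_volume_2022_04, key=quote_volume_2022_04.get, reverse=True)
print(quote_priority)  # ['USDT', 'BUSD', 'EUR']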

eprbell commented 1 year ago

Thanks, that makes sense. I guess the only time this would break price continuity is if a new exchange gets added, which may have had the largest volume for certain intervals in the past. But I think that particular case is probably OK and hopefully it doesn't happen very often.

macanudo527 commented 1 year ago

It looks like this will have to be implemented to resolve the issue I'm having with BETH pricing: I need to mark certain markets unavailable for the months before they start trading. What is happening right now is that a BETH->USDT market exists on Binance.com, but I held BETH before March 2023, which is when that market started trading. The search keeps using that market since it is a shorter hop than the alternative route (BETH->ETH->USDT).
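A toy reproduction of the problem (a minimal Dijkstra over a made-up graph, not the actual MappedGraph code): the one-hop BETH->USDT edge always wins unless the months before its listing carry a prohibitive weight.

import heapq
from typing import Dict, List

def dijkstra(graph: Dict[str, Dict[str, float]], start: str, end: str) -> List[str]:
    # Standard Dijkstra returning the cheapest path as a list of vertexes.
    heap = [(0.0, start, [start])]
    visited = set()
    while heap:
        cost, vertex, path = heapq.heappop(heap)
        if vertex == end:
            return path
        if vertex in visited:
            continue
        visited.add(vertex)
        for neighbor, weight in graph[vertex].items():
            if neighbor not in visited:
                heapq.heappush(heap, (cost + weight, neighbor, path + [neighbor]))
    return []

# After March 2023 the direct market exists and wins (shorter hop).
after_listing = {"BETH": {"USDT": 1.0, "ETH": 1.0}, "ETH": {"USDT": 1.0}, "USDT": {}}
print(dijkstra(after_listing, "BETH", "USDT"))  # ['BETH', 'USDT']

# Before March 2023 the direct market gets a prohibitive weight, so the
# search routes around it.
before_listing = {"BETH": {"USDT": 10_000.0, "ETH": 1.0}, "ETH": {"USDT": 1.0}, "USDT": {}}
print(dijkstra(before_listing, "BETH", "USDT"))  # ['BETH', 'ETH', 'USDT']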

macanudo527 commented 1 year ago

I have three ideas for how to achieve these multiple graphs:

  1. Inherit from the Vertex and Graph classes - create a vertex class that allows for different states, where states correspond to year/month periods. To save on memory, these states could even be saved as ranges (e.g. before April 2023 the weight is 10,000, which would cause Dijkstra to route around the edge). The Graph class would also need to be overridden in order to make use of the states, of course. This would be done in abstract_pair_converter.
  2. Have a dictionary with multiple graphs, one for each year/month - this would be the most straightforward, but less efficient in terms of memory, and it would prevent us from making use of ranges and other optimizations.
  3. Have a dictionary that just keeps track of the differences between the years/months - each time a price is requested, the changes for that year/month would be applied to the graph, and the graph would then be reverted before the next price is needed.

I think 1) is the best option: it would be memory-efficient and should not cost much in speed. I'm guessing you don't want this in prezzemolo, so I would be overriding the classes in dali.

2) is straightforward and easy enough, but we could potentially have 10 (years of crypto trading) x 12 = 120 graphs in memory, each with ~2000 vertexes, and I can't imagine a way to save memory with ranges there.

With 3) we could utilize ranges as well, but applying and then reverting the changes seems cost-prohibitive to me.

That's all the ways I can imagine implementing it. Is there something I'm missing or do you have another idea?

eprbell commented 1 year ago

My main observation when it comes to subclasses is that I'd like the Graph and Vertex APIs to stay mostly the same (unless there is a pressing need to add something that can't be achieved without a new method). What would a search look like in this proposal? Perhaps it would be easier to discuss if you could write up a prototype of option 1 in a PR.

Side note: one way to represent objects ordered by time is to use AVL trees (similar to what the accounting engine does). I'm not saying this is the only approach in this case, just a suggestion.

Option 2 is fairly clear to me: why is it not possible to use ranges to optimize it? Perhaps you could have an AVL Tree of Graph objects ordered by date.

macanudo527 commented 1 year ago

For option 1, a vertex would basically have multiple neighbor dicts and weight dicts, one for each month, held in an outer dict keyed by month/year. The Dijkstra search would then be overridden to take a year/month, which it would use to look up the neighbors and weights for that period.

Vertexes would lazily create the per-year/month dicts; the majority of them will never need that functionality.

Basically, in option 1 only the vertexes that change require extra memory and a dict to store the changes, and most of the vertexes will never need to change. In option 2, all of the vertexes are copied in memory for each graph, yet most of them will not be used during pricing and will not need their weights modified, so the copies are redundant.

For example, Binance.com produces around 2000 vertexes, but most investors will only be using 10~20 of these. There are exceptions of course, but I'm guessing most people stick to around 20 crypto assets. Only these vertexes would differ between graphs; the rest of the graph would be identical across all the other months. What I'm trying to do with option 1 is efficiently track just the differences between graphs, not the majority of the graph, which stays the same.
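A rough sketch of that lazy per-month override idea (a hypothetical class, not prezzemolo's actual Vertex API): a vertex keeps one default weight dict and only allocates a per-month dict when an edge actually differs for that month.

from typing import Dict

class TimedVertex:
    def __init__(self, name: str) -> None:
        self.name = name
        self.default_weights: Dict[str, float] = {}  # neighbor -> weight
        self.monthly_weights: Dict[str, Dict[str, float]] = {}  # "YYYY-MM" -> overrides

    def add_neighbor(self, neighbor: str, weight: float) -> None:
        self.default_weights[neighbor] = weight

    def set_monthly_weight(self, month: str, neighbor: str, weight: float) -> None:
        # Lazily allocate the per-month override dict: most vertexes never need one.
        self.monthly_weights.setdefault(month, {})[neighbor] = weight

    def weight(self, month: str, neighbor: str) -> float:
        overrides = self.monthly_weights.get(month, {})
        return overrides.get(neighbor, self.default_weights[neighbor])

beth = TimedVertex("BETH")
beth.add_neighbor("USDT", 1.0)
beth.set_monthly_weight("2023-02", "USDT", 10_000.0)  # market not yet listed
print(beth.weight("2023-02", "USDT"))  # 10000.0 (routed around)
print(beth.weight("2023-04", "USDT"))  # 1.0 (default weight)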

We could use option 2 and have a master graph that we lazily copy and modify when a price is queried for a given month, then order the copies with an AVL tree. That seems pretty optimized. If I run into memory issues I'll try something else.

eprbell commented 1 year ago

Going with option 2 and checking how it goes sounds like a good plan to me. I also wanted to understand the data structures in a little more detail: would it be a time-ordered AVL tree of Graphs, or is there something more (I wasn't sure what you meant by "master graph")?

macanudo527 commented 1 year ago

Maybe a better term than master graph would be original graph.

Basically, the neighbors won't need to be modified, only the weights, so once a graph is made it just needs to be updated with the correct weights for a given year/month. The new graph would then be inserted into the time-ordered AVL tree.

EDIT: I just realized you can't reset or modify weights once they are set; you have to rebuild the graph and re-add neighbors. Is there a way to optimize this by exposing the weights?

eprbell commented 1 year ago

One of the design principles RP2 and DaLI try to adhere to is immutability: data structures are read-only, which removes an entire category of bugs. There are a few exceptions to this rule, when immutability would cause algorithm complexity to increase too much (e.g. AVL tree nodes). So if at all possible, I'd like to avoid adding mutability to Graph.

The way I was thinking about it is (in pseudocode):

# create tree
timeline: AVLTree[datetime, MappedGraph] = AVLTree()
...
# lazily insert new graphs in the timeline, when needed
new_graph = MappedGraph(...)
timeline.insert_node(somedate, new_graph)
...
# get graph for a given date 
graph = timeline.find_max_value_less_than(someotherdate)

The idea here is that there is one AVL tree that contains multiple graphs, each of which is a "snapshot" in time of the graph. Does this work?

macanudo527 commented 1 year ago

I totally understand all that. I'm just curious about how to basically copy the graph. It seems redundant to rebuild the entire thing from scratch.

Whenever we optimize the graph, we will need a whole new graph that is a snapshot, as you say. That graph will basically be the same as before, with only a handful of edges modified.

So do I rebuild it with primary and alternative markets every time? If that is the case, I need to cache the markets from the exchange so that the graph can be rebuilt; otherwise I have to pull them again via REST, which is inefficient.

Ideally, I would like to clone the original unmodified graph and in the process optimize the weights. But it seems like you would like me to rebuild from the markets each time, optimizing as the graph is being built. Is that correct?

I guess what I would really like to have is something like this:

graph.clone(("BTC", "USDT", 1), ("BTC", "USDC", 2), ...)

That would give me an immutable copy of the original graph, but with the weights I specified as the third element of each tuple.

We will be optimizing these graphs over and over: each time we pull a price for a new asset in a new month.
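A minimal sketch of what such a clone() could look like (plain dicts stand in for the real graph classes; the function name is hypothetical):

from typing import Dict, Tuple

def clone_with_weights(
    graph: Dict[str, Dict[str, float]],
    *overrides: Tuple[str, str, float],
) -> Dict[str, Dict[str, float]]:
    # Copy the adjacency structure so the original graph stays untouched,
    # then apply the requested weight overrides to the copy.
    new_graph = {vertex: dict(neighbors) for vertex, neighbors in graph.items()}
    for source, destination, weight in overrides:
        new_graph[source][destination] = weight
    return new_graph

original = {"BTC": {"USDT": 3.0, "USDC": 3.0}, "USDT": {}, "USDC": {}}
optimized = clone_with_weights(original, ("BTC", "USDT", 1), ("BTC", "USDC", 2))
print(optimized["BTC"])  # {'USDT': 1, 'USDC': 2}
print(original["BTC"])   # unchanged: {'USDT': 3.0, 'USDC': 3.0}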

eprbell commented 1 year ago

Will the clone graph have the same topology as the original one with just some weights changed or will the topology itself change?

macanudo527 commented 1 year ago

Yeah sorry, I should be using better terminology. The topology doesn't need to change at this time. However, it would be handy to add neighbors at certain times for assets that were listed and then subsequently delisted. It's not a necessity at this time though. Plus, this market/edge would not be easy to discover dynamically since it may not appear in the exchange information that lists all currently available markets.

Deletion of neighbors will be necessary to account for the span of time that a market didn't exist before it was listed on a given exchange. However, neighbors can effectively be deleted by adding an abnormally high weight (e.g. 10,000), so the topology would not need to be altered.

The idea that I came up with while pondering this today is to cycle through the vertexes of the original graph and declare new vertexes using information from them, copying over their neighbors and weights as well. During the copy process, if one of the edges described in the tuples passed to the clone() function above is found, its weight is replaced with the one from the tuple.

One graph may need to be optimized several times, once for each asset that is being searched. Each time, a new graph will need to be created and inserted into the AVLTree, replacing the previous graph. The new graph will need to carry over the optimizations from the previous iterations.
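Continuing the hypothetical clone_with_weights() sketch above, each optimization pass could start from the previously optimized snapshot, so earlier overrides carry over:

# Each pass starts from the latest snapshot, preserving the weight
# overrides from previous passes.
snapshot = original
for overrides in [(("BTC", "USDT", 1),), (("BTC", "USDC", 2),)]:
    snapshot = clone_with_weights(snapshot, *overrides)
# snapshot now contains both overrides; it would replace the previous
# graph for that month in the timeline.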

eprbell commented 1 year ago

That makes sense. The clone function can receive:

macanudo527 commented 1 year ago

Sorry for jumping back and forth here, but I figured we need to go back to the drawing board on this. I moved this comment over from the PR.

I'm afraid there has been a misunderstanding here, because it doesn't look like I can implement this using AVLTree from prezzemolo: I will need to replace nodes in the tree, and it appears the AVL tree is immutable.

I think it would be possible if all transactions were known beforehand, but the optimizations will have to be done on the fly.

My original idea was to have a dictionary keyed by month/year that holds another dictionary of graphs organized by exchange, which is how it is currently implemented:

{ month/year : { exchange : graph } }

This will need to be updated regularly as the plugin works through the transactions, so it would need to be mutable. If you can think of an immutable solution we can use that, but I don't see how it would be possible.
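A minimal sketch of that mutable cache (names are illustrative; build_graph is a placeholder for however the graph gets built or cloned):

from datetime import datetime
from typing import Any, Dict

# { "YYYY-MM" : { exchange : graph } }
graph_snapshots: Dict[str, Dict[str, Any]] = {}

def graph_for(timestamp: datetime, exchange: str, build_graph) -> Any:
    month_key = timestamp.strftime("%Y-%m")
    exchange_graphs = graph_snapshots.setdefault(month_key, {})
    if exchange not in exchange_graphs:
        # Build (or clone and re-weight) the graph only when first needed.
        exchange_graphs[exchange] = build_graph(month_key, exchange)
    return exchange_graphs[exchange]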

macanudo527 commented 1 year ago

All right, I think I have it now. We can pull weekly candles from the exchanges for every market when we build the graph for the first time, and build all of the time snapshots from that data. Maybe that is what you were thinking from the beginning?

Pulling for Binance and Kraken will probably take 15-20 minutes; the last quarter for Kraken will take the longest. That might end up adding 20 minutes or more to building the graph. Adding more exchanges would add to the time, but the solution would be immutable.
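For reference, pulling weekly candles per market is straightforward with CCXT (a real fetch_ohlcv call; pagination, rate limiting, and error handling are elided here):

import ccxt

exchange = ccxt.binance()
since_ms = exchange.parse8601("2017-01-01T00:00:00Z")
# Each candle is [timestamp, open, high, low, close, volume].
candles = exchange.fetch_ohlcv("BTC/USDT", timeframe="1w", since=since_ms)
weekly_volume = {candle[0]: candle[5] for candle in candles}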

To speed it up, we could ask users to declare, in the plugin settings, the assets they trade and when their first trade was; this would speed up the cache-building process immensely. If they miss an asset or the start date, we can throw an error saying that the graph is not optimized for that time and advise them to update the configuration. If they do not declare anything, we can warn them with an info log and, at the end of the first optimization, tell them what to put in the configuration file.

macanudo527 commented 1 year ago

Is it possible to pass a manifest or summary from dali_main.py that lists all the assets that need to be priced, the start date of transactions, and the total number of transactions? This could either be built by quickly cycling through all the transactions, or by asking each input plugin to provide one. If I had that, optimizing the graph would be a lot faster and easier. It would also be helpful for building some kind of mechanism that estimates how long a full pull will take.

eprbell commented 1 year ago

Immutability is aspirational, but we shouldn't let it become a burden: if the immutable solution is measurably more complex or slower than the mutable one, it's OK to go for the mutable solution. The prezzemolo AVL tree would need to implement the delete_node method to become mutable; however, using AVL trees was just an idea. I haven't done a detailed time/space complexity analysis of this algorithm, but using dictionaries is probably a reasonable first approach (especially if we give up immutability). If we find performance problems we can always reassess and investigate other data structures. How would estimated performance change if we gave up immutability?

eprbell commented 1 year ago

Yes, that should be possible. We'll need to slightly rearrange the code in main: move the input plugin loads a few lines up (just above pair converter allocation), so that the resulting transactions (or maybe just their assets) can be passed to pair converters. Does that make sense?

macanudo527 commented 1 year ago

I think we could just have a method called optimize() on the pair converters that takes the manifest. We can cycle through the plugins, check whether they have the attribute, and hand the manifest off after the transactions have been processed but before prices are pulled.

The graph is lazily loaded, so it won't get built/optimized unless it is needed; it doesn't get built when the plugin is initialized.

eprbell commented 1 year ago

By manifest you mean the list of assets, right? That works, but I would just add the optimize method to the superclass and have empty implementations in subclasses where it's not needed.

macanudo527 commented 1 year ago

Yeah, I'm guessing manifest is more descriptive? This would be a NamedTuple with a list of the assets and when the first transaction occurred. I'll define it in the superclass.
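A minimal sketch of what that could look like (field and class names here are guesses based on this thread, not the final implementation):

from datetime import datetime
from typing import List, NamedTuple

class Manifest(NamedTuple):
    assets: List[str]            # every asset that will need pricing
    first_transaction: datetime  # earliest transaction across all input plugins

class AbstractPairConverterPlugin:
    def optimize(self, manifest: Manifest) -> None:
        # Default is a no-op; converters that pre-build volume-weighted
        # graph snapshots override this.
        pass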

macanudo527 commented 12 months ago

This has been fixed by #146, which uses weekly volume since that is more widely supported.