OdyseeTeam / chainquery

Chainquery parses and syncs the LBRY blockchain data into structured SQL
https://lbry.tech
MIT License

fix sql schema #35

Closed: lyoshenka closed this issue 6 years ago

lyoshenka commented 6 years ago

here's a dump of resources for bitcoin's schema. ours should be similar to the first image, but with the changes I mentioned on Slack.

https://i.stack.imgur.com/rMBfH.png https://gist.github.com/ademar/a54838eded33a86b9281 https://webbtc.com/api/schema

tiger5226 commented 6 years ago

I looked at that the other day. I have additional columns, but what you are looking for is not an exact replication, just making sure the important parts match up, right? I don't think it is far off. However, if you believe it is, I will want to clear that up.

Also, from our Slack conversation: the reorg issue is solved. Chainquery is handling reorgs now. I just needed to tweak the foreign keys (and properly shut down goroutines because of a whole other set of unrelated but dependent issues).

I don't have the following tables:

- block_tx - N:M link between block and transaction. I don't think this is necessary, but it can be added easily if we find a use case.
- block_txin - N:M link between block and vin. Not sure why we would want this. Simplify queries?
- unlinked_txin - Used for reorgs, from what I can tell.
- pubkey - Table for pubkeys. I have these as columns on vout. Not sure why we need a separate table.
- chain_candidate - Used for reorgs, but not needed. Leveraging foreign keys is better.
- block_next - N:M link to next blocks. Used for reorgs, which are handled already.

The other tables not mentioned which we do not have are just fluff.

A lot of the differences come from the fact that a blockchain node has the full chain tree and stores all blocks, transactions, vins, and vouts whether or not they are part of the main chain. We only care about the main chain for Chainquery, so there is no need to store orphaned blocks, transactions from alternative chains, etc. Chainquery uses a rollback mechanism and stays on the longest chain.
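For illustration, here is a minimal sketch of the foreign-key rollback idea mentioned above: orphaned blocks are simply deleted and the cascade removes everything that hung off them. The table names, column names, and types are assumptions made up for this example, not the actual chainquery schema.

```sql
-- Hypothetical tables; names and types are assumed for illustration.
CREATE TABLE block (
    hash   VARCHAR(64) NOT NULL PRIMARY KEY,
    height BIGINT UNSIGNED NOT NULL
);

CREATE TABLE transaction (
    hash       VARCHAR(64) NOT NULL PRIMARY KEY,
    block_hash VARCHAR(64) NOT NULL,
    -- Deleting a block that fell off the main chain removes its transactions too.
    FOREIGN KEY (block_hash) REFERENCES block (hash) ON DELETE CASCADE
);

-- Rolling back past a reorg point: drop the orphaned blocks and let the
-- cascade clean up the dependent rows.
DELETE FROM block WHERE height > 400000;  -- fork height chosen arbitrarily
```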

We are missing some columns as well. I reviewed them and purposefully left some out because I did not find them useful, or because they were already present in the data or could be obtained easily.

tiger5226 commented 6 years ago

I think we do have to deal with the claim version issue. Instead of just updating in place, I can create a new claim row for each name claim or update and get rid of the unique index on claim_id. Then we can always get the latest version.

Also, I can make the claim versions owned by their respective outputs, so if an output is removed the claim version is removed with it, just like how reorgs are handled.
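A rough sketch of what that could look like, assuming an output table with a surrogate id column; the names and types here are illustrative, not the real schema:

```sql
-- One row per claim version, owned by the output that created it.
CREATE TABLE claim (
    id        BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    claim_id  VARCHAR(40) NOT NULL,   -- ordinary index instead of the old UNIQUE index
    output_id BIGINT UNSIGNED NOT NULL,
    KEY idx_claim_claim_id (claim_id),
    -- If the output is removed during a reorg rollback, the claim version goes with it.
    FOREIGN KEY (output_id) REFERENCES output (id) ON DELETE CASCADE
);
```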

lyoshenka commented 6 years ago

re: tables - I agree we don't need most of those. The block_tx data can be stored on the tx itself. I think the fields on the tables should also be renamed to match what's actually there (e.g. we should be consistent about _id and _hash naming). We also need to drop the unused columns and tables in the db.

re: claim versions - I think connecting them to outputs or blocks is the right way to go. Then they are cleaned up automatically.

tiger5226 commented 6 years ago

To SQLBoiler or Not To SQLBoiler

I put some thought into the option of dumping SQLBoiler. I was indifferent previously, but I keep coming back to the benefits of having an ORM. I really like how easy it makes adding a column or a table. Looking at the code generation it does, there is a lot of useful code there, so removing it is non-trivial; I also think removing it would slow future development and increase the chance of bugs. When I look at the benefits of dumping it, I really only see that we get better-named columns, which is fairly minor. Another benefit is that we could stop using BIGINT SERIAL ids where the true primary key spans more than one column, like outputs or inputs. Surrogate ids seem pretty standard, but I don't see why we would use one instead of the true primary key. Is the performance much better, or is the benefit trivial?

It doesn't change my view that this was a huge pain in the beginning, as the deficiencies of SQLBoiler really drove me mad. Some of that was just me not knowing the pro tips ahead of time. However, now that it works, it is pretty useful and makes schema changes easy and seamless. I love the code generation. Also, after using squirrel for internal-apis, the two seem like a natural match considering the gaps in SQLBoiler.

What specifically, besides the weird column names, do you find holds us back or adds technical debt? Maybe I just don't completely understand your view on this.

Claim Versioning

I have a concern here after thinking about this more, and I now recall why I chose not to use versions. When we are performing a query, I don't want to have to sort; also, if I had versions I would have to perform a join, because the versions would live in a separate table. Even then, to make this efficient I would still need a relationship between a claim (claim_id) and its versions.

What is the use case for querying old claim versions? What happens now is that when a claim is updated, its row in the claim table is updated with the new information and the claim is tied to the latest output. The main reason to store versions that I can think of is a reorg, where we would want to remove the latest version. That is a special case though, and we do not want to roll back the update itself. The old claim, once spent, is irrelevant anyway: in order to update a claim you have to spend the original output, so it gets marked as spent regardless. We still have access to the information on the output, which would just need to be processed, and output has a claim_id column so we know which outputs are claim related.

Regarding the proposal, how do we get the latest version without sorting? We could add a reference column pointing at the latest version, but then we would have to post-process any reorg to reset that value. Maybe a trigger could be used to set the column if the referenced claim version is removed. It still requires a join for querying metadata, though.
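To make the cost concrete, here is roughly what the query would look like if versions lived in a separate table. The table and column names are hypothetical, not the existing schema:

```sql
-- Latest version per claim via a join on the max surrogate id.
-- This is the sort/join overhead discussed above.
SELECT cv.*
FROM claim_version cv
JOIN (
    SELECT claim_id, MAX(id) AS latest_id
    FROM claim_version
    GROUP BY claim_id
) latest ON latest.claim_id = cv.claim_id
        AND latest.latest_id = cv.id;

-- Alternative "reference column" approach: claim.latest_version_id points at the
-- newest claim_version row and would need a trigger (or post-reorg processing)
-- to repoint it if that row is deleted.
```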

kauffj commented 6 years ago

I generally think it is quite reasonable to be skeptical of ORMs, but given that this project is specifically intended to ultimately support multiple export formats, it's a use case where an ORM makes a lot of sense.

lyoshenka commented 6 years ago

I actually don't think an ORM makes that much sense here, or at least not sqlboiler. We have maybe a dozen queries. All they do is insert data into the db. There are no complicated relationships to manage, no need to manipulate rows as objects, etc. The only thing sqlboiler does that I think is really useful is generating table/column names from the schema, so it's easy to tell which queries need to be updated when the schema changes. Other than that, I think it adds more limitations than benefits. @tiger5226 pointed out better-named columns and more logical structure as a benefit of dropping sqlboiler, and I actually think that's a big part of it. The column names and sql structure are the main things chainquery provides. We want those to be as close to perfect as possible. Writing queries against columns named is_n_s_f_w (for example) is really weird.

Regarding multiple export formats, I agree that chainquery should be built to support multiple backends. But that's unrelated to ORMs. That's more about structuring the code well to have separate and interchangeable components.

kauffj commented 6 years ago

Got it. I've seen ORMs that can generate the language-specific SQL schema from a more general data schema and thought one might play a similar role here.

I'll likely leave this conversation to the two of you moving forward.

lyoshenka commented 6 years ago

re: healthcheck - that's a cool idea, but out of scope for chainquery. Chainquery turns blockchain data into SQL; healthcheck tracks availability of blobs on the DHT network (unless I misunderstood what it is?). I can see healthcheck using chainquery, but it should be a separate project.

tiger5226 commented 6 years ago

Sure thing. No, you understood it well; one addition is that the data can be used for network support rewards. I had it encapsulated like that for that reason, so it could be reused. I guess it was more of a POC than anything and was never really used except locally. I agree it should be a separate project. I also just adjusted the original post to make it clear that I intend to remove those tables, since that was not clear before.

I would argue that the code generation, especially the data binding, is really helpful. I guess we could write something generic ourselves that reads type fields. There is also the easy access to and eager loading of relationships (which we use heavily), plus the handling of null values and the automatic conversion between Go values and DB values. The snake_case naming is not alien; the underscores just mark capital letters. In the example you provided, is_n_s_f_w, the only reason it is not is_nsfw is that I preferred the field on the claim object to be IsNSFW. We could easily change that. The only other special naming comes from foreign keys, which require the last three characters to be _id to prevent naming collisions between fields and relationships in the generated code. Another example that really irks me is transaction_by_hash_id. I severely regret doing this. I did it to distinguish it from a foreign key to the id column of a transaction. This too could simply be transaction_id.
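As a concrete illustration of that rename, something along these lines could work. The table, constraint name, and column type are assumptions, and the foreign key is shown still referencing transaction(hash) as described above:

```sql
-- Assumed to live on the output table; adjust names to the real schema.
ALTER TABLE output DROP FOREIGN KEY output_transaction_by_hash_id_fk;  -- constraint name assumed
ALTER TABLE output CHANGE COLUMN transaction_by_hash_id transaction_id VARCHAR(70) NOT NULL;  -- type assumed
ALTER TABLE output ADD FOREIGN KEY (transaction_id) REFERENCES transaction (hash);
```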

Outside of the datastore queries, I have many queries in different areas where I did not need all of the information, and I leverage the query models pretty heavily there. Again, not a ton, but enough that it is useful.

Generally speaking (I have not designed many schemas, since the software I work with has DB persistence built into the language itself), I don't see why I should use a BIGINT id column. It seems silly to me; more and more, I feel like that is what you would do if you were creating your own data. This data already exists and already has a natural primary key. For example, I feel like the hash column of a transaction should be the PRIMARY KEY, not the id column. What are your thoughts on this? Is there really a big performance difference between BIGINT and VARCHAR(40) as the primary key? Everything I read says the difference is negligible, yet most schemas I see follow the surrogate-id style. Again, I am not sure why, and I might just have to spend a bit more time reading about it.
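For comparison, the two styles being weighed look roughly like this. Both tables are illustrative only, and the hash length is an assumption:

```sql
-- Surrogate-key style: BIGINT id as the primary key, hash kept unique on the side.
CREATE TABLE transaction_surrogate (
    id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    hash VARCHAR(64) NOT NULL,
    UNIQUE KEY idx_transaction_hash (hash)
);

-- Natural-key style: the hash itself is the primary key, so foreign keys
-- reference it directly and the extra unique index disappears.
CREATE TABLE transaction_natural (
    hash VARCHAR(64) NOT NULL PRIMARY KEY
);
```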

I think this is a big decision, and I just want to make sure we make the right one with regard to SQLBoiler, or ORMs in general. I know you might not have all the details in front of you, so I am just trying to make sure everything is clear. I expect many additions to the schema over time (though not many modifications). As we find new use cases we will most likely want to add structures to support that information gathering. Right now it is very basic.

Lastly, I did not see a response to the claim versioning question I raised.

tiger5226 commented 6 years ago

Modifications

I collected all schema changes below.

Tables to be removed

Columns to be removed

Updates

tiger5226 commented 6 years ago

Initial version https://github.com/lbryio/chainquery/pull/39

I have made all the changes referenced above to the schema. It needs a review, but I did test it against a fresh schema. I did not test it against the leading blockchain database. I would like a review first because it takes a few hours to restore the database, and I want to minimize the number of restores I have to do while testing this. If it looks OK, I will run the upgrade against a local copy of the full DB to make sure it works as expected before merging.

tiger5226 commented 6 years ago

merged and deployed