dotnet / roslyn

The Roslyn .NET compiler provides C# and Visual Basic languages with rich code analysis APIs.
https://docs.microsoft.com/dotnet/csharp/roslyn-sdk/
MIT License

improve Sqlite reading perf #30306

Closed · heejaechang closed this 4 years ago

heejaechang commented 5 years ago

The current implementation of the sqlite persistent storage always reads the full data before returning a stream.

But the way we use this data is almost always to check some header from it and then drop the stream if the data is out of date.

We can either change the persistence interface to have some kind of version concept, making this pattern more official and more efficient by not reading unnecessary blobs, or make sqlite lazily read data and fill the stream as it gets consumed.

CyrusNajmabadi commented 5 years ago

This would be a good idea, but it has serious implications that need to be considered. I avoided this initially because of the semantics of sqlite around streaming and blobs. Namely, the only safe way to read a blob is in a transaction. Otherwise, the read can fail if something else writes to that blob. So, if we returned a stream we'd have two choices:

  1. not be in a transaction, but then the streaming read might fail and clients would have to be ready for that.
  2. do it in a transaction, but release the transaction when the stream was disposed. This has issues where the DB stays in a transaction for potentially a long time, or a client forgets to dispose and the DB is totally stuck.

An alternative approach would be to update the persistence API to understand the concept of writing a stream along with a checksum. These would then be in two different columns in the DB. Clients could then read only the checksum without reading the stream. If the checksum didn't match, they would know they could stop and could regen. If the checksum matched, they would then read the stream. They would still need to check the checksum in the stream since there might be a race condition otherwise.

Personally, i think this would likely be the simplest and easiest thing to do here.

CyrusNajmabadi commented 5 years ago

Effectively, we would bump our sqlite internal format version to '2' from '1'. And we'd change the layout of the tables to:

        ///  -----------------------------------------------------------------
        ///  | DataId (primary key, integer) | Checksum (blob) | Data (blob) |
        ///  -----------------------------------------------------------------

Knowing this is a checksum, we'd ideally be able to read/write it directly in place in an allocation-free manner.
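
As a rough sketch, that layout could be created and queried with SQL along these lines (the table and column names here are purely illustrative, not the actual Roslyn schema):

        // Purely illustrative sketch of the v2 layout described above; the table
        // and column names are hypothetical, not the actual Roslyn schema.
        const string CreateTableV2 =
            @"CREATE TABLE IF NOT EXISTS DocumentData(
                  DataId   INTEGER PRIMARY KEY,
                  Checksum BLOB,
                  Data     BLOB)";

        // Reading just the small checksum blob avoids pulling the potentially large data blob.
        const string SelectChecksum = "SELECT Checksum FROM DocumentData WHERE DataId = ?";
        const string SelectData     = "SELECT Data FROM DocumentData WHERE DataId = ?";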

heejaechang commented 5 years ago

Sure, we can make TryReadStreamAsync(key, predicate, cancellation) so the predicate can say true or false on whether to return the stream, where the predicate is basically handed the version/checksum parts.

So, is creating a SqlStream or something that internally holds onto the transaction and SqlConnection to lazily read the blob too hard to implement? The transaction can go away when the SqlStream is closed or disposed.
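
To make the shape of that idea concrete, here is a minimal sketch of such a wrapper, assuming a hypothetical SqlBlobStream that owns whatever object ends the transaction; none of this is actual Roslyn code, and the lifetime coupling it shows is exactly what the next comment pushes back on:

        using System;
        using System.IO;

        // Hypothetical sketch of the "lazy stream" idea: the stream owns the live
        // transaction (and, transitively, the connection) and only releases it when
        // the consumer disposes the stream. A forgotten Dispose keeps the database
        // inside a transaction indefinitely.
        internal sealed class SqlBlobStream : Stream
        {
            private readonly Stream _blob;             // lazily read blob contents
            private readonly IDisposable _transaction; // whatever object ends the transaction

            public SqlBlobStream(Stream blob, IDisposable transaction)
            {
                _blob = blob;
                _transaction = transaction;
            }

            public override bool CanRead => _blob.CanRead;
            public override bool CanSeek => _blob.CanSeek;
            public override bool CanWrite => false;
            public override long Length => _blob.Length;
            public override long Position
            {
                get => _blob.Position;
                set => _blob.Position = value;
            }

            public override int Read(byte[] buffer, int offset, int count) => _blob.Read(buffer, offset, count);
            public override long Seek(long offset, SeekOrigin origin) => _blob.Seek(offset, origin);
            public override void Flush() { }
            public override void SetLength(long value) => throw new NotSupportedException();
            public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();

            protected override void Dispose(bool disposing)
            {
                if (disposing)
                {
                    _blob.Dispose();
                    _transaction.Dispose(); // the transaction is only released here
                }

                base.Dispose(disposing);
            }
        }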

CyrusNajmabadi commented 5 years ago

So, is creating a SqlStream or something that internally holds onto the transaction and SqlConnection to lazily read the blob too hard to implement? The transaction can go away when the SqlStream is closed or disposed.

It's not hard. It's just very very very dangerous :) You're now basically making it so that some external client of the persistence service can totally muck up the persistence service itself.

For example: remember that issue that Sam had to fix, where we were holding a connection open across an async call? Now you'd have to worry about that sort of thing externally as well as internally to the persistence service. Internal is much easier to audit and maintain. External is super simple for a client to mess up.

I did not want to provide that originally because it would be so simple for a single mistake (i.e. missing dispose, resource held onto too long, etc.) to screw things up :)

CyrusNajmabadi commented 5 years ago

Sure, we can make TryReadStreamAsync(key, predicate, cancellation) so the predicate can say true or false on whether to return the stream, where the predicate is basically handed the version/checksum parts.

Yup. That would be another way to potentially do things. That way this is all still done internally to the persistence service. I would still be worried that the 'predicate' would be called during a transaction. That's not necessarily a great design, as a poorly behaved predicate could have a pretty big effect on things.

--

Effectively, the SqlitePersistenceService is a shared resource. And operating with transactions is basically like working with normal locks. So all the regular 'best practices' still apply. i.e. don't run arbitrary, external code within your locks. Try to keep your locks on shared resources held briefly. etc. etc.

--

Personally, i think the cleanest, simplest, and safest approach would be to explicitly encode the concept of the checksum into the IPersistenceStorage API. i.e. you'd have:

        Task<bool> WriteStreamAsync(Document document, string name, Checksum checksum, Stream stream, CancellationToken cancellationToken = default);

And

        Task<Checksum> ReadChecksumAsync(Document document, string name, CancellationToken cancellationToken = default);
        Task<Stream> TryReadStreamAsync(Document document, string name, Checksum checksum, CancellationToken cancellationToken = default);

So, when writing data, you could specify the checksum you wanted written with it. You could then read the checksum independently (to check it against your stored version). If it didn't match, you'd go regen the data. If it did match, you'd then call TryReadStreamAsync (passing in that same checksum). If they still matched, you'd get the stream back. Otherwise, you'd get null back.
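
A sketch of how a client might consume that shape (only the three storage methods above are taken from the proposal; the ComputeCurrentChecksum and RegenerateDataAsync helpers are hypothetical stand-ins for whatever a feature actually does):

        // Hypothetical consumer of the proposed checksum-aware API.
        async Task<Stream> GetUpToDateDataAsync(
            IPersistentStorage storage, Document document, string name, CancellationToken cancellationToken)
        {
            Checksum currentChecksum = ComputeCurrentChecksum(document);

            // Cheap: reads only the small checksum column, not the data blob.
            Checksum storedChecksum = await storage.ReadChecksumAsync(document, name, cancellationToken);
            if (!Equals(storedChecksum, currentChecksum))
            {
                // Stored data is stale (or missing); regenerate it and write it back with the checksum.
                Stream regenerated = await RegenerateDataAsync(document, cancellationToken);
                await storage.WriteStreamAsync(document, name, currentChecksum, regenerated, cancellationToken);
                regenerated.Position = 0;
                return regenerated;
            }

            // Checksums matched; read the blob, re-validating against the same checksum
            // in case of a race. A null result means it changed underneath us.
            return await storage.TryReadStreamAsync(document, name, currentChecksum, cancellationToken);
        }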

This approach would fit the need of this common pattern we have, without having to worry about anything like holding transactions open too long, or running arbitrary code while effectively holding locks and whatnot. It would codify the idea of the Checksum in the persistent storage subsystem. But I think that's totally ok. Data stored with checksums is an exceedingly common concept across many different systems. So this wouldn't be out of place :)

heejaechang commented 5 years ago

So, basically what you are saying is that an open SqlConnection is quite expensive, so we can't let others hold it for a while?

I think the issue Sam had could be fixed differently by having a cap on the pool, since the issue was also that the pool only ever got bigger and never let go of any connection it created.

We can still have the same worry about misuse with the current implementation, since not letting go of the stream can happen there too, and that will hold some big chunk of memory in the process.

If we are worried about the case where a client lets the stream go without calling Dispose, that can be handled by a SafeHandle or a finalizer, the same way we do for all metadata created from references.

heejaechang commented 5 years ago

Anyway, I am fine with having a special checksum column. I just thought having a lazy stream makes the consumer side simpler, even though it makes the implementation more complex.

CyrusNajmabadi commented 5 years ago

Not exactly. I'm saying: in order to read data safely from sqlite, you need to be in a transaction. And transactions are not cheap and should not be held open for arbitrarily long periods of time. It's very intentional and specific that we hold transactions only as long as necessary in the current impl.

The approach of running an arbitrary predicate while having a transaction open means we may hold transactions open for some indeterminate amount of time. That's risky. Similarly, the approach of returning a stream that keeps the transaction open for an indeterminate amount of time is also risky.

Both are doable. But both open us up to many potential concerns that i would prefer to avoid altogether :)

--

We can still have the same worry about misuse with the current implementation, since not letting go of the stream can happen there too, and that will hold some big chunk of memory in the process.

Right, but that's a totally different level of concern. In the case you just mentioned, we may leak some memory. That's not great, but life goes on. In the case I'm mentioning, it's possible that we simply break the persistence service entirely.

It's the same reason why it's a bad idea for features to have an internal lock/gate that then gets passed out to arbitrary clients. We try to never do that because it simply opens up your risk surface area hugely. These are critical resources, and it's good to make it so that they're safe and auditable independently from all their clients.

heejaechang commented 5 years ago

Task<bool> WriteStreamAsync(Document document, string name, Checksum checksum, Stream stream, CancellationToken cancellationToken = default);

This forces everyone to use a checksum. I told you already several times that not every consumer of the persistence service uses checksums, so it only works for those but nobody else.

That's why I suggested the predicate. If having arbitrary code there is a problem, we can make it a specific type like VersionChecker or something, and have several implementations of it so we can allow only certain supported ones.

CyrusNajmabadi commented 5 years ago

This forces everyone to use a checksum.

No it doesn't. You'd only use checksums if you used those overloads. The existing overloads would still work fine.

I told you already several times that not every consumer of the persistence service uses checksums, so it only works for those but nobody else.

Existing consumers that don't use checksums can still use the existing surface area that doesn't need/require checksums.

That's why I suggested the predicate.

Right. But, like I tried to explain, the predicate approach is concerning because it means running arbitrary code while holding what is effectively a giant lock on a shared resource. In general this is not something you want to do, because a single poorly behaving predicate can now impact other clients of the persistence service. For example, consider someone who doesn't know better and, inside the predicate, does stuff like query the workspace for info, i.e. they actually try to compute the checksum to check against while inside the predicate. That's really really bad. That computation may take dozens of seconds, depending on how much needs to be computed. That's a ton of time where a transaction is held, and all users of the persistence service are affected.

heejaechang commented 5 years ago

I guess where I don't agree is the idea that the db is so expensive that all its data must be copied over as soon as possible and the connection must be closed as soon as possible.

That's usually a no-no for a db, especially for select. You do a select, which gets you a cursor in the db, and you read only the data you need and move to the next row as you process, rather than copying everything over just so you can let the connection go.

I guess it might be a limitation of sqlite, where a connection can't be held open for a long time?

heejaechang commented 5 years ago

Existing consumers that don't use checksums can still use the existing surface area that doesn't need/require checksums.

But I want all of the persistence service consumers to use the new pattern as well, so that they don't need to read the whole data just to check versions.

...

If the predicate is the problem, we can provide a certain number of predefined implementations that we know are safe to run inside.

CyrusNajmabadi commented 5 years ago

But I want all of the persistence service consumers to use the new pattern as well, so that they don't need to read the whole data just to check versions.

If they're just checking versions... they can use the form I just specified. i.e. there can be a small blob of data that is the checksum/version and which can easily be read independently of the main bulk of the data...

CyrusNajmabadi commented 5 years ago

If the predicate is the problem, we can provide a certain number of predefined implementations that we know are safe to run inside.

I'm not sure what this means. Can you clarify?

CyrusNajmabadi commented 5 years ago

Sorry, I think you may have taken 'Checksum' very literally. What I meant was: an auxiliary column that can store a smaller blob of information. This can be read/written very quickly, and can be used to determine if the larger blob of information is needed. That's why I simply wrote it as:

        ///  -----------------------------------------------------------------
        ///  | DataId (primary key, integer) | Checksum (blob) | Data (blob) |
        ///  -----------------------------------------------------------------

By 'checksum' i just meant: it's the way clients can read/write info to check if 'data' is valid to use for themselves. Sorry if that wasn't clear!

I didn't mean that it could only be a 'Checksum' object, and i'm sorry that i made that unclear!

heejaechang commented 5 years ago

I am fine having a checksum/version-specific column; what I am not sure about is the API signature. It can't be the Checksum type, since then no other kinds of version could be used.

...

A VersionChecker which has Check(Stream ...)

and we have predefined implementations such as ChecksumChecker, VersionChecker, etc. VersionChecker's ctor is private so no one outside can create one, and it has

VersionChecker.Create(checksum), VersionChecker.Create(versionStamp), etc.

...

Sure, since we are Roslyn, people can still create a new VersionChecker despite the "private" ctor, but that is true for any code in Roslyn, so there we need to rely on code review to catch someone doing something bad. But at least outside of Roslyn, people can't do it.
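
A minimal sketch of what that shape could look like (the names and members here are hypothetical, only echoing the idea above; the real design would live inside the persistence layer):

        using System;
        using System.IO;

        // Hypothetical sketch of the VersionChecker idea: a closed set of checkers that
        // the persistence service knows are safe to run while it holds the transaction.
        public abstract class VersionChecker
        {
            // Private ctor: only the nested implementations below can derive from this,
            // so external code cannot supply arbitrary logic.
            private VersionChecker() { }

            // Called by the persistence layer with the small version/checksum blob.
            public abstract bool Check(Stream versionStream);

            public static VersionChecker Create(byte[] expectedChecksum) => new ChecksumChecker(expectedChecksum);
            // Other factories (e.g. one taking a VersionStamp) would follow the same pattern.

            private sealed class ChecksumChecker : VersionChecker
            {
                private readonly byte[] _expected;

                public ChecksumChecker(byte[] expected) => _expected = expected;

                public override bool Check(Stream versionStream)
                {
                    // Compare the stored bytes against the expected checksum.
                    using var buffer = new MemoryStream();
                    versionStream.CopyTo(buffer);
                    return buffer.ToArray().AsSpan().SequenceEqual(_expected);
                }
            }
        }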

CyrusNajmabadi commented 5 years ago

That's usually a no-no for a db, especially for select. You do a select, which gets you a cursor in the db, and you read only the data you need and move to the next row as you process, rather than copying everything over just so you can let the connection go.

This doesn't apply here. Sqlite is fine for things like normal selects. Where this gets tricky is specifically blob reading. You can either do the approach where you have fast, select-style reading of blobs, but you run the risk of the blob becoming unreadable underneath you. Or you put the blob reading in the transaction, but keeping that transaction open has significant implications for perf, especially around the journal and other areas. This is specifically due to sqlite's blob API.

sharwell commented 5 years ago

⚠️ We should not proceed with an implementation of this issue prior to defining an objective way to measure the performance under well-defined real-world scenarios. See #23583 and #26778 for examples. All changes should be supported by differential measurements, and [PerformanceSensitive] applied in cases where the performance improvements don't have automated regression tests.

CyrusNajmabadi commented 5 years ago

I am fine having a checksum/version-specific column; what I am not sure about is the API signature. It can't be the Checksum type, since then no other kinds of version could be used.

It would literally just be a stream/byte[]/something-that-represents-a-blob :)

The way you've described VersionChecker seems fine to me. It's another indirection around the blob. But I wouldn't have a problem with that.

heejaechang commented 5 years ago

Okay, let's not dig into whether holding a transaction for a long time is okay or not, since we already know it is an issue with the sqlite setup we have.

And yes, if the Checksum is not THE checksum type, I am okay. Having that will prevent us from reading a bunch of unnecessary bits from sqlite and throwing them away.

CyrusNajmabadi commented 5 years ago

And yes, if the Checksum is not THE checksum type, I am okay. Having that will prevent us from reading a bunch of unnecessary bits from sqlite and throwing them away.

Agreed. That would be fine with me. I'm also in agreement with what @sharwell mentioned. Though, this should be pretty easy to measure with some simple perf markers. An easy test would be:

  1. open roslyn.sln, allow indices to get up to date.
  2. close roslyn.sln.
  3. switch to some fairly different branch.
  4. reopen roslyn. see costs of index checking.

We can do the above both before/after the change. Markers should allow us to see how much less cpu/memory is required now that we're not having to read out the full blobs for x% of the index files.

heejaechang commented 5 years ago

Sure, manual testing is fine, but not doing this until there's an automated test seems excessive.

...

For manual testing, it doesn't even need to switch to another branch. All we need to do is open the solution, make sure everything is written, close the solution, and reopen the exact same solution.

For indexing, since it sees all the data is there, it will simply not do anything and move on.

sharwell commented 5 years ago

not doing this until there's an automated test seems excessive.

Automated tests are not required; often they are either impossible or impractical. However, these cases should still have regression prevention in the form of clearly-articulated impact on a manual test that reproduces the issue.

for manual testing ...

Once you have the test plan, make sure to create a new issue with complete steps including:

If you are working with both allocation and execution time profiling, the scenarios should be documented in separate issues so the relevant items can be directly linked when applying [PerformanceSensitive].

CyrusNajmabadi commented 5 years ago

Aside:

I guess where I don't agree is the idea that the db is so expensive that all its data must be copied over as soon as possible and the connection must be closed as soon as possible.

A few things to address there.

  1. the connections don't need to be closed as soon as possible. Indeed, we try very hard to not close connections and to pool them as much as possible. We do need to be careful to not hold onto connections such that as threads come in, they keep getting a new connection since all the pooled connections are currently in use by a thread.
  2. for most things in sqlite, there is little cost to reading/writing data to/from rows. The 'select' example you mentioned is generally not a concern. Where there is a concern is specifically around blobs. Unlike all other data in sqlite columns, blob data is not safe to read outside a transaction. All other data is. i.e. if i read a row/column using a 'select' statement, the data i get back is totally safe no matter what else is happening. This is not the case for blobs.

Specifically:

If the row that a BLOB handle points to is modified by an UPDATE, DELETE, or by ON CONFLICT side-effects then the BLOB handle is marked as "expired". This is true if any column of the row is changed, even a column other than the one the BLOB handle is open on. Calls to sqlite3_blob_read() and sqlite3_blob_write() for an expired BLOB handle fail with a return code of SQLITE_ABORT.

So, for blobs, we need to ensure that the row cannot be modified. That's only possible by ensuring we're in a transaction. And transactions are not free. There is a lot of bookkeeping that has to go along with them. And, in our highly concurrent system, we can end up having a lot of them happen simultaneously. These affect the bookkeeping of sqlite and the journal, and affect all other in-flight reads/writes while this is going on. It affects vacuuming the data, puts pressure on the sqlite cleanup system, and can cause lots of data bloat (often at the page-table size) because sqlite has to keep around all versions of the data affected by the transaction.

So, we try super-hard to minimize this. We are very good transaction citizens, avoiding taking transactions when unnecessary, and only using the transaction for the minimal time necessary.

CyrusNajmabadi commented 5 years ago

All we need to do is open the solution, make sure everything is written, close the solution, and reopen the exact same solution.

That would def be an important case (probably the most important case) to consider. It's the common case where the data is unchanged, and will likely show the most benefit.

I wanted to also test the less common, but also important case. i.e. a lot of data has changed and now we have to do the reading of the checksums and of the blobs.

heejaechang commented 5 years ago

Sure, I get that sqlite has its own limitations on certain operations.

Anyway, one question: there seem to be Microsoft.Data.Sqlite (from Microsoft) and System.Data.Sqlite (from the sqlite folks themselves). Is there a reason we chose this bare-bones C wrapper over these .NET wrappers?

CyrusNajmabadi commented 5 years ago

Is there a reason we chose this bare-bones C wrapper over these .NET wrappers?

Yes, there were a few reasons at the time. The documentation and communication around this has likely been lost. A major reason was to be able to control lifetimes, pooling and, most critically, allocations. There was also no need or want for rich ORM/entity functionality. Indeed, the goal was to strip out effectively anything unnecessary, since this is all effectively a key/value store.

Basically, sqlite is super simple and easy to use as is. Just using their existing API works great. So, "if it ain't broke... why fix it?"

heejaechang commented 5 years ago

But that also brings in a bunch of code we need to maintain? Whereas System.Data.Sqlite is maintained by the sqlite team itself and updated as new versions of sqlite are released? That seems like it would reduce the amount of code we have to maintain?

Also, System.Data.Sqlite is open source, so if allocations or any other performance aspect is an issue, we could contribute to System.Data.Sqlite to improve those?

Anyway, okay. I was curious why we chose to implement our own stuff rather than just use the one provided by sqlite.

CyrusNajmabadi commented 5 years ago

But that also brings in a bunch of code we need to maintain?

What code are you referring to? The sqlite code we have is exceptionally minimal.

Whereas System.Data.Sqlite is maintained by the sqlite team itself and updated as new versions of sqlite are released?

Sure... but the library we use is also maintained and updated, and we can move to newer versions if appropriate as well.

Also, System.Data.Sqlite is open source, so if allocations or any other performance aspect is an issue, we could contribute to System.Data.Sqlite to improve those?

So are the libraries we're currently using. :)

SqlitePcl is the backbone of tons of data libraries. It's even what backs Microsoft.Data.Sqlite. Furthermore, sqlitepcl is a PCL library. That means it doesn't need the full framework (which system.data.sqlite needs). Also, sqlitepcl is what is already used by other parts of VS, meaning we don't end up pulling in some other library when we've already got that one.

Anyway, okay. I was curious why we chose to implement our own stuff rather than just use the one provided by sqlite.

Basically, because there are fewer issues and it's somewhat simpler and easier to use. It also means a simple mac, xamarin, .net core story, etc. etc.

heejaechang commented 5 years ago

System.Data.Sqlite supports .NET Standard, so I think it is as portable as we are, since we are also .NET Standard portable.

I don't know; I looked at System.Data.Sqlite and its API was consistent with the other SQL APIs .NET has (since I guess it implements the common interfaces) and familiar.

Did using the bare-bones wrapper actually bring a big perf win?

CyrusNajmabadi commented 5 years ago

Also, to put it in perspective, the entire amount of 'glue' code we use here is ~200 lines (not counting things like comments). It's really basic, and just ensures very clear semantics around how we're using the underlying API.

The majority of that is just in SqlStatement, which exposes a very slim wrapper around sqlite3_stmt and makes it easy to pool and use the functionality we need, and in SqlConnection, which is used to easily pool data (both blobs and statements), finely track exactly the error cases that we expect to get, and unify error handling.

it's pretty darn simple, and doesn't really need a larger API surrounding this.

CyrusNajmabadi commented 5 years ago

Did using the bare-bones wrapper actually bring a big perf win?

I would ask it differently. What is the value in using it when VS already uses sqlitepcl, and that library works great for our needs? It's a very simple and easy-to-use library. It's good enough for projects like the AzureSDK: https://www.nuget.org/packages/Microsoft.Azure.Mobile.Client.SQLiteStore. The sqlitepcl bundle has millions more downloads than the system.data.sqlite lib. etc. etc. etc.

Why would we switch away from something that the ecosystem has bought in on and which fits our needs very well? Furthermore... why is this even part of the discussion? Shouldn't we need to have a discussion about why switching away would actually be a good thing? If you think it would be appropriate, it would be good if you could do that investigation and report back.

i.e. figure out what the RPS story is here (i.e. loading different dlls). If there are cross-platform or different .net version concerns. If there are issues we run into because system.data.sqlite might use classes instead of structs somewhere, and that might cause an increase in allocations, etc. etc :)

I've given the reasons above for why we picked this. This was a discussion with other teams in VS, and this was felt to be the suitable choice for all the needs teams had. We've found it works very well, with minimal fuss for Roslyn itself. So re-litigating things now doesn't seem to actually serve any purpose :) If you want to go investigate here, and you get interesting results that would make VS rethink things, by all means go and collect. But right now it doesn't seem fruitful to have to discuss all these decisions that happened years ago :)

heejaechang commented 5 years ago

so, I shouldn't even ask about it? I don't get why asking about the decision is forbidden.

The reason I asked was that I saw System.Data.Sqlite and Microsoft.Data.Sqlite, and their interface was familiar to me from the many APIs I have used before for SQL work in .NET.

I looked at your code; I can't say whether it is right or wrong, since it requires me to read all the sqlite documentation to read the code. It asks me to understand all the details of sqlite.

So I asked the question, which you do every time for everyone's PR. I am not getting why you are asking me why I am asking the question.

heejaechang commented 5 years ago

Using standard APIs brings in many benefits, I believe. First, it removes the learning curve for new people to use it. It lowers the hurdle for people to use SQL in other places directly rather than through IPersistentService.

It lets us replace the SQL engine with another one if we find a better one, or if VS starts to support SQL natively, etc.

I am not saying we need to change; I don't think I ever suggested that. I was just asking questions about the decision so that, when we have a need for SQL, we can gauge whether using such a .NET data abstraction is better or adding more functionality onto what we have is better.

heejaechang commented 5 years ago

it doesn't seem fruitful to have to discuss all these decisions that happened years ago

I guess that is the problem then: since it happened so long ago, not many people know why it happened that way. Then questions naturally come up, since people don't know the history.

CyrusNajmabadi commented 5 years ago

Using standard APIs brings in many benefits, I believe. First, it removes the learning curve for new people to use it.

Sqlite is a standard API. It's one of the most widespread sql APIs in existence. Its api is available on every mainstream platform, as well as mobile, desktop, etc. etc. It's effectively universally supported and well known. That's one of the reasons we (and the rest of VS) picked it :)

It lets us replace the SQL engine with another one if we find a better one, or if VS starts to support SQL natively, etc.

I don't see how that is changed by using sqlitepcl...

That's why this is all abstracted underneath IPersistentStorage. It's an implementation detail that is backed by sqlite using the most popular API out there (by many millions more in downloads), and the api that VS prefers.

If you want to replace the sql engine... you're free to do so. And, if VS started supporting sql natively (not sure what that means...), sure, you could then make this sit on that.

CyrusNajmabadi commented 5 years ago

I guess that is the problem then: since it happened so long ago, not many people know why it happened that way. Then questions naturally come up, since people don't know the history.

It happened this way because sqlitepcl is a suitable and effective library for accessing sqlite :) It's very good at what it does, and is extremely popular (tens of millions of usages across all the versions it offers). It's good enough that other MS teams (like Azure and VS) use it :)

Basically, as i've stated, if you want something else to be used, there would be a need to show why that alternative was a better choice. :)

--

Note: this conversation is confusing me a bit. Is there something wrong with using what is effectively the actual API supplied by sqlite? It's extremely well known. It works super well. It can be used effectively on any platform. It doesn't seem to have any drawbacks. It's been battle tested over millions of downloads by tons of projects. Basically... it's not broken, so what's causing the concern and deliberation here?

CyrusNajmabadi commented 5 years ago

since it happened so long ago

I think this decision is only like 2 years old (maybe 3)... :)

Then questions naturally come up

I feel like I've given a wealth of information on why this was an appropriate pick. However, despite that, it feels like there is pushback happening here on this decision. That's the part I'm not getting. It's unclear to me what information I could provide that would satisfy this line of questioning, as all the information provided so far doesn't seem to have been sufficient.

heejaechang commented 5 years ago

I don't have any pushback. I simply asked about the decision; I didn't suggest changing it. I also had questions on the reasoning behind the decision, so I asked to find out whether it was considered.

Also, I have this desire to not use the persistence service but use a structured db directly for things like diagnostics, todo comments, or designer attributes, since I believe converting data to a stream and converting it back just wastes a lot of allocations.

So I wanted to see whether I should use the sql implementation hidden behind the persistence service (and whether it needs to be brought out), or use the regular .NET data abstraction.

Anyway, I wasn't trying to push back or question what it is now. Sorry if it felt that way. Thank you for letting me know about the history!

CyrusNajmabadi commented 5 years ago

Ok. Sorry for misinterpreting. Thanks for the clarification!

Also, I have this desire to not use the persistence service but use a structured db directly for things like diagnostics, todo comments, or designer attributes, since I believe converting data to a stream and converting it back just wastes a lot of allocations.

You can definitely do that. The sqlite interop code I created is very suitable for using directly. Note though that I wouldn't just jump in and use it before reading up a bit on the sqlite programming model and how things work with it.

The good thing is that sqlite is extremely well documented. So, for any functionality you need to use, you can look at their docs and get a complete sense of how to use it properly, as well as all the things you need to consider (especially around errors and whatnot). It is really important, though, to read and understand that stuff in depth before working with it, as things can go bad (as we've seen with things like shared_cache) if the implications are not well understood.

heejaechang commented 5 years ago

So, for any functionality you need to use, you can look at their docs and get a complete sense of how to use it properly, as well as all the things you need to consider (especially around errors and whatnot). It is really important, though, to read and understand that stuff in depth before working with it, as things can go bad (as we've seen with things like shared_cache) if the implications are not well understood.

So that is the reason I was asking about System.Data.Sqlite: it is an abstraction over sqlite, so it can be used in places where people use any .NET SQL data abstraction.

I don't, and I don't think people who consume SQL want to know the details of sqlite. They just want to consume it as they would consume any .NET SQL provider.

CyrusNajmabadi commented 5 years ago

I don't, and I don't think people who consume SQL want to know the details of sqlite. They just want to consume it as they would consume any .NET SQL provider.

System.Data.Sqlite does not give you that. You still need to understand all this stuff. For example, if what you're doing would cause a 'LOCKED' or 'BUSY' error. Or how you need to properly open a connection to the DB. Or how you need to manage threading/lifetimes. Look at the chm docs for system.data.sqlite. They are quite involved and they require that you understand deeply all the implications of it.

heejaechang commented 5 years ago

If it requires such detail about the db implementation, then it might not be that good an option to use...

People use SQL so that they can do simple stuff simply.

It sounds like, for actual usage, it is even worse than esent. Esent required an unfamiliar/non-standard way to use it; that's why I wanted to move to SQL. But once it was set up, usage-wise people didn't need to care much, since esent took care of all those things.

With sqlite it seems that, to use it, you must be an expert on sqlite and how it works.

I have used some, not all, but at least oracle, mssql, and mysql, and I never needed to go too deeply into the db itself. I didn't need to read their docs deeply to use them, as long as I sat on top of the SQL abstraction.

They all basically behave similarly, except for the advanced features they provide.

Here, I am not even thinking of using any advanced feature of sqlite, just plain SQL, and needing to read and understand how sqlite works for that seems too much.

CyrusNajmabadi commented 5 years ago

People use SQL so that they can do simple stuff simply.

I"m not sure what that means. As far as i know, there is no sql api out there that divorces you from having to understand and carefully consider all the ramifications of the APIs. To use sql effectively, you need to understand this stuff, and you need to do things judiciously. The parameters used to create the DB are important. How you partition your tables, indices and queries is important. How you work with DB resources (pooling, threading, disposing) is important. :)

Sqlite has the virtue of being staggeringly popular. It has widespread use among millions of devs, and is supported on every mainstream platform on the planet. It's so popular that OSs just build this in and include it for everyone to use. If you learn it, you can use it effectively basically anywhere :)

heejaechang commented 5 years ago

What I meant was: when I used SQL before with any of the other dbs I listed, it was enough as long as I understood SQL.

When I did create/select/update/delete through SQL, I didn't need to care about things like: ah, will the db be in lock mode so I can't issue this command now? Is holding a connection for more than 5 seconds too expensive? Should the transaction be avoided in this situation since it can bring down the whole db?

I never did that. Those are the responsibility of the db, not the API consumers.

Sure, you can nitpick certain no-no cases and say, see, you need to care. Sure, those kinds of corner cases always exist in any area, but it is like 1 time out of 100 or less that you need to care.

So far, what I am hearing is that it is like when we ask people to write an analyzer: they would need to know all the implementation details of Roslyn to write one. Sure, there will be cases where they need to, but most of the time I hope they shouldn't need to; if that were required, the analyzer API would be considered very hard to use.

...

Anyway, it is starting to go in circles again, so I will stop here.

CyrusNajmabadi commented 5 years ago

When I did create/select/update/delete through SQL, I didn't need to care about things like: ah, will the db be in lock mode so I can't issue this command now? Is holding a connection for more than 5 seconds too expensive? Should the transaction be avoided in this situation since it can bring down the whole db?

You do need to care about this. Or, you were in an environment where problems here were not an issue for your business needs. Understanding the implications of your DB choices is something pretty much anyone needs to understand in real production scenarios. Different DBs have wildly different characteristics, and what may work well in one may be a significant issue in another. There are entire certification levels that come along with DBs because these issues can and do impact people and companies enormously.

For example, if you use postgresql you better not keep a transaction open the entire time. Otherwise the DB will never vacuum (even if autovacuum is on). You have to understand your individual DB and how their different index choices will affect your queries. If you use mysql, you need to know that it does case-insensitive string operations by default. In some sql systems, trailing whitespace is removed for varchar columns, and that's something you need to know and be prepared for. Date-handling has lots of subtle issues across DBs. Different DBs have entirely different approaches to locking, and there are many things that can cause deadlocks in one DB and not another. And the list goes on and on and on.

To use a DB effectively, you need to understand the subtleties of it. You are free to ignore all this, but you can and will likely run into issues when using things in production settings. This is true of sqlite. You could ignore this stuff and run in a very simplified mode, but then you'll likely have other issues you don't like (like poorer perf).

So, the best thing is to actually just read the docs and understand the DB system so you can use it well and you can understand the implications around things. If the DB is any sort of important part of your system, this is really important, lest you have to go redo things there to deal with problems down the line.

CyrusNajmabadi commented 4 years ago

This work was done. We now store checksums in sqlite so we don't need to read the whole row unless necessary.