dotnet / roslyn

The Roslyn .NET compiler provides C# and Visual Basic languages with rich code analysis APIs.
https://docs.microsoft.com/dotnet/csharp/roslyn-sdk/
MIT License

Possible multi-threading issue in Roslyn's sqlite usage #29599

Closed Therzok closed 6 years ago

Therzok commented 6 years ago

Version Used: 2.9.0-beta4-63001-02

Steps to Reproduce:

  1. Not sure about the steps to reproduce; it happens seemingly at random in VSMac

Expected Behavior:

Thread-safety is well defined for SQLite usage.

Actual Behavior: Thread-safety is not well defined in some cases.

Looking at the value of SQLitePCL.raw.sqlite3_threadsafe(), as advised, we get 1 back, which means that we get full thread safety via the Serialized mode.

So far so good, that means sqlite will handle synchronization.

The SqlConnection to the database is opened using SQLITE_OPEN_NOMUTEX, which selects not the default mode, Serialized, but Multi-thread, so SQLite does not take mutexes around the prepared statements.

A locked database is not the same as a busy database, as per the docs. This means that we're hitting case 3 from that link, which is running 2 selects in parallel with no mutexes.
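The compile-time check mentioned above can be sketched in any SQLite binding. A minimal illustration, using Python's stdlib sqlite3 as a stand-in for SQLitePCL (note that check_same_thread=False is only a loose analogue of SQLITE_OPEN_NOMUTEX: it disables the binding's own cross-thread guard, leaving coordination to the caller):

```python
import sqlite3

# sqlite3_threadsafe() reports SQLite's compile-time threading support;
# Python's sqlite3 module surfaces it (mapped onto DB-API levels) as
# sqlite3.threadsafety: 0 = single-thread, higher = safe to share.
print("DB-API threadsafety level:", sqlite3.threadsafety)

# Even when the library is compiled thread-safe, an individual connection
# can still opt out of the binding's per-thread protection.  The caller
# then takes over cross-thread coordination itself.
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE t (x INTEGER)")
conn.close()
```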

ERROR [2018-08-30 08:32:59Z]: Roslyn error: StorageDatabase_Exceptions Kind=Microsoft.CodeAnalysis.SQLite.Interop.SqlException|Reason=Microsoft.CodeAnalysis.SQLite.Interop.SqlException: database table is locked: SolutionData1
database table is locked
  at Microsoft.CodeAnalysis.SQLite.Interop.SqlConnection.Throw (SQLitePCL.sqlite3 handle, Microsoft.CodeAnalysis.SQLite.Interop.Result result) [0x00000] in /_/src/Workspaces/Core/Desktop/Workspace/SQLite/Interop/SqlConnection.cs:284 
  at Microsoft.CodeAnalysis.SQLite.Interop.SqlConnection.ThrowIfNotOk (SQLitePCL.sqlite3 handle, Microsoft.CodeAnalysis.SQLite.Interop.Result result) [0x00003] in /_/src/Workspaces/Core/Desktop/Workspace/SQLite/Interop/SqlConnection.cs:276 
  at Microsoft.CodeAnalysis.SQLite.Interop.SqlConnection.ThrowIfNotOk (Microsoft.CodeAnalysis.SQLite.Interop.Result result) [0x00000] in /_/src/Workspaces/Core/Desktop/Workspace/SQLite/Interop/SqlConnection.cs:270 
  at Microsoft.CodeAnalysis.SQLite.Interop.SqlConnection.ThrowIfNotOk (System.Int32 result) [0x00000] in /_/src/Workspaces/Core/Desktop/Workspace/SQLite/Interop/SqlConnection.cs:267 
  at Microsoft.CodeAnalysis.SQLite.Interop.SqlStatement.Reset () [0x00000] in /_/src/Workspaces/Core/Desktop/Workspace/SQLite/Interop/SqlStatement.cs:43 
  at Microsoft.CodeAnalysis.SQLite.Interop.ResettableSqlStatement.Dispose () [0x00000] in /_/src/Workspaces/Core/Desktop/Workspace/SQLite/Interop/ResettableSqlStatement.cs:34 
  at Microsoft.CodeAnalysis.SQLite.SQLitePersistentStorage+Accessor`3[TKey,TWriteQueueKey,TDatabaseId].TryGetRowIdWorker (Microsoft.CodeAnalysis.SQLite.Interop.SqlConnection connection, TDatabaseId dataId, System.Int64& rowId) [0x00047] in /_/src/Workspaces/Core/Desktop/Workspace/SQLite/SQLitePersistentStorage.Accessor.cs:209 
  at Microsoft.CodeAnalysis.SQLite.SQLitePersistentStorage+Accessor`3[TKey,TWriteQueueKey,TDatabaseId].TryGetRowId (Microsoft.CodeAnalysis.SQLite.Interop.SqlConnection connection, TDatabaseId dataId, System.Int64& rowId) [0x00000] in /_/src/Workspaces/Core/Desktop/Workspace/SQLite/SQLitePersistentStorage.Accessor.cs:183 
  at Microsoft.CodeAnalysis.SQLite.SQLitePersistentStorage+Accessor`3[TKey,TWriteQueueKey,TDatabaseId].ReadBlob (Microsoft.CodeAnalysis.SQLite.Interop.SqlConnection connection, TDatabaseId dataId) [0x00000] in /_/src/Workspaces/Core/Desktop/Workspace/SQLite/SQLitePersistentStorage.Accessor.cs:142 
  at Microsoft.CodeAnalysis.SQLite.SQLitePersistentStorage+Accessor`3+<ReadStreamAsync>d__11[TKey,TWriteQueueKey,TDatabaseId].MoveNext () [0x000f6] in /_/src/Workspaces/Core/Desktop/Workspace/SQLite/SQLitePersistentStorage.Accessor.cs:82 
CyrusNajmabadi commented 6 years ago

From: https://sqlite.org/threadsafe.html

The threading mode for an individual database connection is determined by flags given as the third argument to sqlite3_open_v2(). The SQLITE_OPEN_NOMUTEX flag causes the database connection to be in the multi-thread mode

Multi-thread mode says:

Multi-thread. In this mode, SQLite can be safely used by multiple threads provided that no single database connection is used simultaneously in two or more threads.

We only give out connections to be used by a single thread at a time. That's what PooledConnection is for. People can only make requests when holding a PooledConnection. And a real connection is only created and given out to one pooled connection at a time.
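The pooling discipline described above can be sketched roughly as follows (illustrative Python; ConnectionPool and its method names are made up for this sketch and are not Roslyn's actual PooledConnection API):

```python
import queue
import sqlite3

class ConnectionPool:
    """Hand out each underlying connection to one borrower at a time.
    A connection is never held by two threads simultaneously, which is
    the invariant Multi-thread mode requires."""

    def __init__(self, path, size=4):
        self._free = queue.Queue()
        for _ in range(size):
            # check_same_thread=False: connections may move between
            # threads, but the pool ensures only one user at a time.
            self._free.put(sqlite3.connect(path, check_same_thread=False))

    def acquire(self):
        # Blocks until a connection is free.
        return self._free.get()

    def release(self, conn):
        self._free.put(conn)

pool = ConnectionPool(":memory:")
conn = pool.acquire()
conn.execute("SELECT 1")
pool.release(conn)
```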

CyrusNajmabadi commented 6 years ago

Note: one thing you could try, to see if we have a connection shared between threads, would be to add an 'ownerThreadId' field to SqlConnection saying which thread it is owned by. When the connection is handed out, that is set to the thread-id of the current thread. When it is returned, it is cleared.

Then:

  1. whenever a connection is handed out, its owner-thread-id had better be in the 'cleared' state before it is assigned to the current thread's id.
  2. any operations on the SqlConnection should check that the current thread-id is the same as the 'owner thread id'.

If any of these checks fail, we know we have some sort of race whereby multiple threads have access to the same connection. I did a manual audit of the code just now, and this part looks simple and correct. But this check would help narrow things down.
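The suggested diagnostic could look roughly like this (illustrative Python sketch; the wrapper class and its names are hypothetical, not actual Roslyn code):

```python
import sqlite3
import threading

class OwnershipTrackedConnection:
    """Debugging wrapper: record which thread currently owns the
    connection and assert on every hand-out and operation."""

    def __init__(self, raw_connection):
        self._raw = raw_connection
        self._owner_thread_id = None   # None == the 'cleared' state

    def hand_out(self):
        # Check 1: must be cleared before being assigned to a new thread.
        assert self._owner_thread_id is None, "connection given to two threads!"
        self._owner_thread_id = threading.get_ident()

    def execute(self, sql):
        # Check 2: every operation must come from the owning thread.
        assert self._owner_thread_id == threading.get_ident(), \
            "connection used from a non-owner thread!"
        return self._raw.execute(sql)

    def give_back(self):
        self._owner_thread_id = None

tracked = OwnershipTrackedConnection(sqlite3.connect(":memory:"))
tracked.hand_out()
tracked.execute("SELECT 1")
tracked.give_back()
```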

CyrusNajmabadi commented 6 years ago

Note: the docs on SQLITE_LOCKED say this:

The SQLITE_LOCKED result code indicates that a write operation could not continue because of a conflict within the same database connection or a conflict with a different database connection that uses a shared cache.

For example, a DROP TABLE statement cannot be run while another thread is reading from that table on the same database connection because dropping the table would delete the table out from under the concurrent reader.

The SQLITE_LOCKED result code differs from SQLITE_BUSY in that SQLITE_LOCKED indicates a conflict on the same database connection (or on a connection with a shared cache) whereas SQLITE_BUSY indicates a conflict with a different database connection, probably in a different process.

Note this part: "or a conflict with a different database connection that uses a shared cache"

So, is it possible the shared-cache part is what's biting us here? That code was added by @heejaechang in:

https://github.com/dotnet/roslyn/pull/21532

@heejaechang Can you advise: is shared_cache safe to use with open_nomutex?

CyrusNajmabadi commented 6 years ago

Yes. I think that's the problem. The docs on sqlite about shared_cache say:

When two or more connections use a shared-cache, locks are used to serialize concurrent access attempts on a per-table basis. Tables support two types of locks, "read-locks" and "write-locks". Locks are granted to connections - at any one time, each database connection has either a read-lock, write-lock or no lock on each database table.

At any one time, a single table may have any number of active read-locks or a single active write-lock. To read data from a table, a connection must first obtain a read-lock. To write to a table, a connection must obtain a write-lock on that table. If a required table lock cannot be obtained, the query fails and SQLITE_LOCKED is returned to the caller.

Effectively, once you have shared_cache on, you have to ensure that you do not have a thread trying to write to a table that any other thread is reading from. You need to do this even though each thread has its own independent connection object.
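The shared-cache constraint can be reproduced in a few lines (Python's stdlib sqlite3 as an illustrative stand-in; exact behavior may vary slightly across SQLite builds):

```python
import sqlite3

# Two connections to the same shared-cache database (in-memory here so
# the demo is self-contained).
a = sqlite3.connect("file::memory:?cache=shared", uri=True)
b = sqlite3.connect("file::memory:?cache=shared", uri=True)

a.execute("CREATE TABLE t (x INTEGER)")
a.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100)])
a.commit()

# Hold a read-lock on t: the SELECT statement stays active until the
# cursor is exhausted or closed.
reader = a.execute("SELECT x FROM t")
reader.fetchone()

# A write from the other shared-cache connection now needs a write-lock
# on t and fails with SQLITE_LOCKED ("database table is locked") --
# the same error shown in the Roslyn log above.
locked = False
try:
    b.execute("INSERT INTO t VALUES (999)")
except sqlite3.OperationalError as e:
    locked = "locked" in str(e)
print("writer hit SQLITE_LOCKED:", locked)

reader.close()
a.close()
b.close()
```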

--

Based on this, we have a couple of approaches we can take:

  1. roll back this change.
  2. ensure our application level code does not violate this constraint.

I think we can safely do '2'. Effectively, we could use a ReaderWriterLock in our code to ensure that different threads are not stomping on each other.

--

Tagging @sharwell @jinujoseph as well as i think @heejaechang may be currently unavailable.

sharwell commented 6 years ago

I think we can safely do '2'. Effectively, we could use a ReaderWriterLock in our code to ensure that different threads are not stomping on each other.

Agree that this should not be too difficult.

CyrusNajmabadi commented 6 years ago

One thing I don't understand is why this would be seen on Mac, but not on Windows. Something definitely seems fishy here.

Therzok commented 6 years ago

One curious thing while reviewing the code is that the _writeQueueGate is initialized with only an initial count of 1, not a maximum count as well.

If anything goes wrong, and WaitAsync returns false here, we could end up with more than 1 writer at once.

The return value of WaitAsync should be checked.

https://gist.github.com/Therzok/984678d16bf9d764d8cff4a9bc93e03d

Therzok commented 6 years ago

Actually, it still shouldn't hurt; each writer would still use a different SqlConnection.

CyrusNajmabadi commented 6 years ago

'initialCount' is the right thing to set. From: https://docs.microsoft.com/en-us/dotnet/api/system.threading.semaphoreslim.-ctor?view=netframework-4.7.2#System_Threading_SemaphoreSlim__ctor_System_Int32_

SemaphoreSlim(Int32)

Initializes a new instance of the SemaphoreSlim class, specifying the initial number of requests that can be granted concurrently.

Each caller that calls WaitAsync will decrement that, but only to the zero point, and not past that. So we can only ever have 1 thread accessing the resource. That thread always calls Release (which is the point of the using), which will then only ever raise the count back up to 1.

We could set Maximum as well to help ensure no bugs happen. Specifically that no double-Release ever happens. But we don't technically need it for correctness.
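The initial-vs-maximum distinction SemaphoreSlim makes can be demonstrated with the analogous primitives in Python's threading module (an illustrative stand-in: Semaphore takes only an initial count, while BoundedSemaphore also enforces a maximum):

```python
import threading

# A semaphore created with initial count 1 acts as a mutex-like gate:
# acquire takes it to 0, release takes it back to 1.
gate = threading.Semaphore(1)
assert gate.acquire(blocking=False) is True    # count 1 -> 0
assert gate.acquire(blocking=False) is False   # already 0: must wait
gate.release()                                 # count 0 -> 1

# Without a maximum, an extra release pushes the count to 2, quietly
# letting two callers in at once -- the bug a maximum would surface:
gate.release()                                 # count 1 -> 2
assert gate.acquire(blocking=False) is True
assert gate.acquire(blocking=False) is True    # two holders at once!

# BoundedSemaphore is the 'set Maximum too' variant: a double release
# raises instead of silently widening the gate.
bounded = threading.BoundedSemaphore(1)
try:
    bounded.release()
    over_released = False
except ValueError:
    over_released = True
print("double release detected:", over_released)
```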

CyrusNajmabadi commented 6 years ago

The return value of WaitAsync should be checked.

WaitAsync just returns a Task. There is no other return value. We await it with a cancellation token. The contract is that either cancellation occurred (in which case you get an OperationCanceledException thrown), or you acquire the gate asynchronously, in which case code proceeds and the SemaphoreSlim count is now 0 (and thus all other requests to grab it will wait until it goes back up to 1).

CyrusNajmabadi commented 6 years ago

Your code has an issue. Specifically:

sem.Release();

await sem.WaitAsync();
await sem.WaitAsync();

You are Releasing the semaphore, thus causing its count to go to '2'. So, of course, it can be waited on successfully twice. We only Release the semaphore in the sqlite codebase after we have successfully WaitAsynced on it. So the count will only flip-flop between 1 and 0.

Therzok commented 6 years ago

Ah, I had read the return value of the timeout overload, which would return false; the code would then still release in the Dispose, thus going over 1.

CyrusNajmabadi commented 6 years ago

No worries :) It's good to have someone checking my work and making sure i'm not totally off base :)

heejaechang commented 6 years ago

What is the way to make SQLite let people modify the DB concurrently under a transaction? Ultimately, that was what I was trying to get to.

CyrusNajmabadi commented 6 years ago

What is the way to make SQLite let people modify the DB concurrently under a transaction? Ultimately, that was what I was trying to get to.

https://www.sqlite.org/wal.html

The WAL allows reads to run concurrently with a write. However, even with WAL there cannot be multiple concurrent writes. I do not believe that is something sqlite allows at all (at any level).

"Writers merely append new content to the end of the WAL file. Because writers do nothing that would interfere with the actions of readers, writers and readers can run at the same time. However, since there is only one WAL file, there can only be one writer at a time."

That said, is there a reason you need to be able to allow multiple concurrent DB writes under a transaction? The Roslyn layer batches requests together and then issues a full batch under a single transaction. That's the purpose of SQLitePersistentStorage_WriteBatching.cs and the method "FlushAllPendingWritesAsync".

In general, this layer should be very fast. Writes that come in from the rest of Roslyn are simply written into a rotating pool of buffers**. This should be very fast as we're just reading from the stream into pooled byte[]s. Then, every 500ms, if there is work to do, we wake up, take all the pending writes, and merge them into the DB in a single transaction.

--

** The pool is 1024 outstanding requests, of up to 4k writes each. So, effectively, we keep a 4MB rotating buffer to store pending writes before flushing them to sqlite twice a second.
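The batching scheme described above can be sketched roughly as follows (illustrative Python; BatchingWriter and its schema are made up for this sketch, not Roslyn's actual code):

```python
import sqlite3
import threading

class BatchingWriter:
    """Callers enqueue small writes cheaply; a periodic flush merges
    all pending writes into the DB in one transaction."""

    def __init__(self, conn):
        self._conn = conn
        self._pending = []
        self._lock = threading.Lock()

    def enqueue(self, key, blob):
        # Fast path: just buffer the bytes; no DB work on the caller.
        with self._lock:
            self._pending.append((key, blob))

    def flush(self):
        # Called periodically (every 500ms in the scheme above): drain
        # the buffer and apply everything in a single transaction.
        with self._lock:
            batch, self._pending = self._pending, []
        if batch:
            with self._conn:  # one transaction for the whole batch
                self._conn.executemany(
                    "INSERT OR REPLACE INTO data VALUES (?, ?)", batch)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (key TEXT PRIMARY KEY, blob BLOB)")
w = BatchingWriter(conn)
for i in range(10):
    w.enqueue(f"k{i}", b"\x00" * 16)
w.flush()
count = conn.execute("SELECT COUNT(*) FROM data").fetchone()[0]
print("rows after flush:", count)
```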

heejaechang commented 6 years ago

My perspective is a bit skewed by esent. Esent supported concurrent usage under transactions, and that kept things very simple, since esent takes care of it, not the consumer. It also made memory more efficient, since it shares memory, etc.

Without a shared connection, sqlite seems to basically create a completely isolated connection each time.

CyrusNajmabadi commented 6 years ago

My perspective is a bit skewed by esent. Esent supported concurrent usage under transactions, and that kept things very simple, since esent takes care of it, not the consumer.

Right. But esent doesn't work cross-platform; it's Windows-specific. It also can't be shared amongst processes (like the OOP process and the VS process). Sqlite, on the other hand, is universally supported, high performance, trivially shared across processes, etc.

Being shareable was a core part of the OOP feature design, as we had indices that would be used by some in-proc features as well as indices used by some out-of-proc features. Having the DB be shareable means one side can benefit from the work the other side does.

Without a shared connection, sqlite seems to basically create a completely isolated connection each time.

Yes. That's part of the design of sqlite. You cannot share connections across threads simultaneously. Or, if you do, you need to configure it with the appropriate locking strategy. For us, the approach that works nicely is that we buffer writes and do them in batches. We are careful to manage our connections and thread-safety (prior to this recent change) to ensure everything is safe and reliable.

Because we are write-heavy, we also use the WriteAheadLog mode for it as that tends to perform better.

--

Note: it is unclear to me why we moved to SHARED_CACHE mode. Did it result in a measurable perf boost? If so, I'm ok with it. But we do need to respect the constraints of shared_cache, specifically in how we create and use SqlConnection instances.

CyrusNajmabadi commented 6 years ago

My perspective is a bit skewed by esent. Esent supported concurrent usage under transactions.

Note: I should be more crisp about what I said before. Nothing is wrong with opening transactions for writing across multiple threads using different SqlConnection instances with the standard defaults. That's all legal. It's just that sqlite will serialize the writes out one at a time. This is totally fine, though, as these writes are super quick, since all we're doing is batching up a large set of writes to write every .5 seconds.

However, the above is no longer true with SHARED_CACHE. By enabling that, you are effectively saying: I want shared memory between all my connections, and I will manage the locking strategy myself to make sure they are not conflicting. Now the responsibility is on you to determine an appropriate locking strategy over that shared memory. If you don't want to do that, the solution is easy: don't use SHARED_CACHE.

CyrusNajmabadi commented 6 years ago

As an example: Some FindRef APIs will never be remotable. That's because they don't have enough information to properly remote the args/results to/from the OOP server. For example:

SymbolFinder_FindReferences_Legacy.cs

    // This file contains the legacy FindReferences APIs.  The APIs are legacy because they
    // do not contain enough information for us to effectively remote them over to the OOP
    // process to do the work.  Specifically, they lack the "current project context" necessary
    // to be able to effectively serialize symbols to/from the remote process.

    public static partial class SymbolFinder
    {

If anyone calls into these APIs, they will always run in-proc. As such, we want to share the indices across both processes so that both sides don't have to compute the same information.

heejaechang commented 6 years ago

The reason I started looking at sqlite was that I saw SqlConnection busy exceptions in NFW, and I moved to shared_cache.

Anyway, didn't the connection have a mutex before? I can't remember exactly.

...

In-proc and out-of-proc use 2 different DBs, so I'm not sure how they can share.

CyrusNajmabadi commented 6 years ago

Anyway, didn't the connection have a mutex before? I can't remember exactly.

I don't believe so. We should have been opening the DB with OpenFlags.SQLITE_OPEN_CREATE | OpenFlags.SQLITE_OPEN_READWRITE, which would ensure that the DB would serialize writes and ensure they were not running simultaneously against reads. SHARED_CACHE undoes that. My recommendation would be to use a ReaderWriterLockSlim here if you continue wanting to use SHARED_CACHE.

CyrusNajmabadi commented 6 years ago

In-proc and out-of-proc use 2 different DBs, so I'm not sure how they can share.

Note: if this is the case, and there is no intent to make it use a single DB again, then a lot of the sqlite code can be simplified. The code is currently written with the assumption that another process may be editing the same DB. So, for example:

https://github.com/dotnet/roslyn/blob/3c69b4bb837c4998028711291f63068c9bbf57f7/src/Workspaces/Core/Desktop/Workspace/SQLite/SQLitePersistentStorage_BulkPopulateIds.cs#L46-L62

Note: this is only safe if there is something higher up that prevents multiple instances of VS from being able to use the same DB. Right now, we have no problem sharing the same DB among several instances (which i think was an issue for esent as well).

heejaechang commented 6 years ago

We no longer share the same DB between 2 processes, because it was so slow (the whole DB got locked up). It got worse because more than one VS could share the same DB, messing up the data inside and locking up whole DBs.

Unless we can find a way to make the DB fast enough while being shared between multiple processes, I don't think we can share the DB.

heejaechang commented 6 years ago

Right now, we have no problem sharing the same DB among several instances (which I think was an issue for esent as well).

With esent, the first process that opens the DB owns it. Current sqlite lets everyone read and write to the DB. So with esent, once a process owns the DB, it can trust that the data in it was written by it; but with sqlite, it doesn't know which process wrote it.

Now, we make sure one process owns the DB and no other can access it, through a file lock (but another program such as a sqlite explorer can still open the DB while VS is accessing it, since sqlite itself lets anyone open it).

CyrusNajmabadi commented 6 years ago

Current sqlite lets everyone read and write to the DB. So with esent, once a process owns the DB, it can trust that the data in it was written by it; but with sqlite, it doesn't know which process wrote it.

There is no need to know or trust which process wrote it. Data written is idempotent and is stored with the checksum of the data it is stored against. So if you read data that was written by another process with the same checksum, you can use it. If it has a different checksum, you can't use it and should recompute. However, the vast majority of the time, if you have multiple VSs open against the same DB, they'll have the same solutions and will then be able to share data.
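The checksum discipline described here can be sketched in a few lines (illustrative only; the function names and the dict-backed store are made up for this sketch, not Roslyn's actual schema):

```python
import hashlib

# Store the checksum of the inputs alongside the blob, and only trust
# a read whose checksum matches what we would compute now.

def store(db, key, source_text, blob):
    checksum = hashlib.sha256(source_text.encode()).hexdigest()
    db[key] = (checksum, blob)

def load(db, key, source_text):
    entry = db.get(key)
    if entry is None:
        return None
    checksum, blob = entry
    if checksum != hashlib.sha256(source_text.encode()).hexdigest():
        # Written against different contents (e.g. by another process
        # with a different version of the file): recompute instead.
        return None
    return blob

db = {}
store(db, "a.cs", "class C {}", b"index-for-C")
assert load(db, "a.cs", "class C {}") == b"index-for-C"   # checksum matches
assert load(db, "a.cs", "class D {}") is None             # stale: recompute
```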

CyrusNajmabadi commented 6 years ago

Now, we make sure one process owns the DB and no other can access it, through a file lock

Can you show me where this is done?

heejaechang commented 6 years ago

First, 2 VS instances with the same solution doesn't mean they have the same code in it. So one can overwrite the other, making one always recalculate data rather than reuse the cache.

Also, persistent storage was never only for caching; diagnostics, todo, and designer attributes also save data to it so that they don't need to hold onto the data in memory.

it basically breaks both cases.

Basically, sqlite supported only 1 of the cases the previous persistent service provided, with any process able to overwrite its data, losing data integrity.

Not sharing the DB brings that integrity back. We could make VS and OOP share, but it won't be straightforward to do that correctly. And locking the whole DB is very slow, since no read can even happen.

CyrusNajmabadi commented 6 years ago

We no longer share the same DB between 2 processes, because it was so slow (the whole DB got locked up)

I'm really not understanding the cases where this would happen. I even did work to test the DB on spinning rust systems. With the writes coming in only twice a second, even a slow system had no trouble writing out the data in these transactions. What testing was done to see what was going on here?

It got worse because more than one VS could share the same DB, messing up the data inside

I don't see how any messing up of data is possible. The implementation abides exactly by the API definition of the IPersistentStorage interface. You can write streams of data keyed by a combination of name/document/project/solution, and you can read out streams of data keyed by the exact same thing. This contract was always abided by. All work was also done in transactions, so it should have been impossible to mess up any of the data inside.

and locking up whole DBs.

I'm trying to understand what this means. What sort of locking were you seeing that was problematic? I tested these systems out using several processes to check perf, and having them hammer away with both reads/writes. My numbers came out to be something like easily handling 50k writes at a time. Of course, during that time, it could support effectively an unlimited amount of reading since this was all simple indexed-row lookups.

heejaechang commented 6 years ago

http://source.roslyn.io/#Microsoft.CodeAnalysis.Workspaces.Desktop/Workspace/SQLite/SQLitePersistentStorageService.cs,136

CyrusNajmabadi commented 6 years ago

First, 2 VS instances with the same solution doesn't mean they have the same code in it.

It almost always does, for nearly the entire solution.

So one can overwrite the other, making one always recalculate data rather than reuse the cache.

Yes. For the files that are different between the solutions, the data will be recomputed. However, as the data for our indices is stored on a per-file basis, this will be limited to only the differing files. The vast majority of the rest of the solution can be reused.

Note that if you do not reuse the DB, then you'll have to recompute the indices for the entire solution.

heejaechang commented 6 years ago

Well, I think I explained enough. I need to do other stuff, so I am going to move on.

heejaechang commented 6 years ago

Again, todo, diagnostics, and designer scans save data to persistent storage and do not hold the data in memory. If another process with the same solution but different code in it overwrites the data, all the data gets messed up.

This issue was discovered because people complained that the error list was broken. It happened because another process wrote errors the current VS doesn't have, and they showed up in the error list when the user happened to scroll the error list, which read that data from the DB.

CyrusNajmabadi commented 6 years ago

Also, persistent storage was never only for caching; diagnostics, todo, and designer attributes also save data to it so that they don't need to hold onto the data in memory.

Why would this be an issue for any of those systems?

Basically, sqlite supported only 1 of the cases the previous persistent service provided, with any process able to overwrite its data, losing data integrity.

It is cross platform, supported by literally every environment under the sun, performance tuned over decades, doesn't lock you to a single process, very configurable, etc. etc.

Not sharing the DB brings that integrity back. We could make VS and OOP share, but it won't be straightforward to do that correctly.

That is not true. It is very straightforward how to share a sqlite DB. It's a core facility that was designed for and has been supported for ages (possibly since day one).

and locking the whole DB is very slow, since no read can even happen.

I have no idea what this means. There should be no locking occurring. We use WAL mode, which means readers do not block writers and writers do not block readers. Writers will block other writers. However, that's not an issue within a single process, since we don't do concurrent writes; we just have a single batch of writes issued at a time. With multiple processes, it is true that one process writing would hold up the other process writing. But that should be very inconsequential, given that it just needs to write the data into the WAL and then release. We've designed our write batching to be super simple as well: it is just open transaction, perform all writes sequentially, close transaction. So this is just the cost of dumping that information to the write log.

--

Note: the WAL does have the 'checkpointing' phase, where the log is merged back into the DB. It's possible we might see a pause there. However, this is something we could address with a targeted tweak if the data showed this was affecting us.

CyrusNajmabadi commented 6 years ago

Well, I think I explained enough. I need to do other stuff, so I am going to move on.

I feel like there is a tremendous amount of misinformation and lack of understanding of how sqlite actually works and how it should be properly used.

I'm also bummed because I no longer have the perf scripts I used to validate and stress the sqlite code. It would be very useful to go back and reinvest there, because I'm seriously worried now about things happening in this space that could dramatically affect perf (or stability, or data safety), and we will not catch it unless someone (like VS4Mac) notices an issue and brings attention to it.

heejaechang commented 6 years ago

We don't need to argue; just write a simple app where multiple processes overwrite data and try to maintain state in the DB.

Write 1 to the DB in process A; in process B, write 2 to it. Then read the data from process A, which expects 1, but it now gets 2.

CyrusNajmabadi commented 6 years ago

We don't need to argue; just write a simple app where multiple processes overwrite data and try to maintain state in the DB.

Write 1 to the DB in process A; in process B, write 2 to it. Then read the data from process A, which expects 1, but it now gets 2.

There is no contract in roslyn that says that when you read from IPersistentStorage you will get back the data you just wrote into it. Indeed, you don't need separate processes to show that. Just do the following:

write 1 to the DB in thread A; in thread B, write '2' to it. Read the data back from thread A, which expects 1, but it now gets 2

The API for IPersistentStorage does not say that within a process (or even a thread) a read of the data will give you back the last thing you wrote to it. Indeed, that's why all the layers that read/write from it that I've audited use checksums to ensure they can actually use the data.

heejaechang commented 6 years ago

All existing code relied on that behavior, and you argued to me many, many times before that existing behavior takes precedence.

For example, when I changed the behavior of "}" in formatting, you were furious, saying existing behavior can't be changed, not just in the main case which I showed you had the same behavior. You tried to find all the corner cases and said I must prove all those behaviors would be kept the same. There were no contracts, and furthermore nothing broke because of it. It still didn't matter.

Now I have shown you existing code that relies on the behavior and told you it actually causes a functional break, and you are saying it doesn't matter since it wasn't written down?

I am pretty sure you will have some reason this is again different, so I am not sure why I am writing this.

If it were me, I would just have said "oh, a feature relied on that?" and done some work to support it.

heejaechang commented 6 years ago

The existing code, before we moved to sqlite, didn't use checksums in the persistent storage, since it guaranteed data in is what data comes out. It was never "I save something but have no idea what data I will get back". That's new with sqlite.

CyrusNajmabadi commented 6 years ago

Now I have shown you existing code that relies on the behavior

As i just pointed out, that behavior wasn't specified, tested for, guaranteed, or possible to provide even with the existing system. You're complaining about a new system having the same potential problem for these subsystems that the existing system had. That's the point I'm making.

and you are saying it doesn't matter since it wasn't written down?

No. It definitely matters. But the right thing to do then is to make the requirements and functionality actually specified. If we actually depend on something like IPersistentStorage providing certain guarantees, then we have to document that. Otherwise, no matter what, things will break.

For example, when I changed the behavior of "}" in formatting, you were furious, saying existing behavior can't be changed

You are comparing two entirely different issues. One issue relates to how our product works for customers. i.e. how a feature works that they are directly interacting with. We should not break that behavior without ensuring the new behavior is what we want for customers. The other is about how our internal systems work that are not publicly exposed at all. For internal systems we can make conscientious decisions about the design of things and how that may change over time, as long as that change stays hidden under the surface.

These are not comparable situations, as I'm not talking about changing the way an actual user-facing feature works.

CyrusNajmabadi commented 6 years ago

since it guaranteed data in is what data comes out.

Please show me that guarantee. I think you are using 'guarantee' to mean: it behaved this way in an undocumented fashion.

Note: changing such behavior is not good. Even if undocumented, we still don't want to break our existing components. I should have checked more thoroughly to realize that this need was baked in at the higher layers.

heejaechang commented 6 years ago

As i just pointed out, that behavior wasn't specified, tested for, guaranteed, or possible to provide even with the existing system. You're complaining about a new system having the same potential problem for these subsystems that the existing system had. That's the point I'm making.

We had tests that test the persistent storage: they write data, read the data back, and check that it's the same; and they have multiple threads read and write and check that the data is the same.

I will really stop responding now.

CyrusNajmabadi commented 6 years ago

have multiple thread read and write and see data same.

I legitimately don't know how you could have supported that. I've also looked at your tests. They do not do what you are saying. Those tests simply have a bunch of writers writing, and then they do a read, and they assert that the value read is one of the potential values written.

You also have tests with multiple reads. However, in the case of simultaneous reads, all you do is write data once, then show that all reads see that data.

Note: I'm looking at the tests in AbstractPersistentStorageTests. If you have actual tests that show that this situation does not happen, please point to them.

None of these verify this scenario works:

write 1 to the DB in thread A; in thread B, write '2' to it. Read the data back from thread A, which expects 1, but it now gets 2

The reason for that is that scenario did not work. If you had two threads writing the data, then thread A might see '1' or it might see '2'. Indeed, that's exactly how your test is written. That's why the assertions are:

                Assert.True(value >= 0);
                Assert.True(value < NumThreads);

The only 'guarantee' the older system had was (maybe) that if you had successfully awaited IPersistentStorage.WriteAsync, a subsequent IPersistentStorage.ReadAsync might get back the same data you wrote, as long as nothing else wrote to that data in between those calls.

The previous system could not guarantee you would get back the same value you wrote. It would only guarantee you'd get back a value that was written prior to your read.
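The range-only assertion being described can be reproduced in a few lines. Below is a minimal sketch in Python (standing in for the C# test; the names are illustrative, not Roslyn's) showing why, with concurrent writers, the only safe assertion is that the observed value is *one of* the values written:

```python
import threading

NUM_THREADS = 8
store = {"key": None}  # stands in for the shared persistent store

def writer(value: int) -> None:
    # Each thread writes its own value; any of them may "win".
    store["key"] = value

threads = [threading.Thread(target=writer, args=(i,)) for i in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

value = store["key"]
# Like the C# test quoted above, we cannot assert a specific value,
# only that it falls in the range of values some writer produced.
assert value is not None
assert 0 <= value < NUM_THREADS
```

No individual thread here could assert that the value it wrote is the value it reads back, which is exactly why the test's assertions are range checks rather than equality checks.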

heejaechang commented 6 years ago

write 1 to the db in process A; in process B, write 2 to it.

I wrote process, not thread.

before, the data a process wrote is the data the process got back.

CyrusNajmabadi commented 6 years ago

Yes. And the point I made explicitly was that your 'guarantee' was not a guarantee. Even in the same process you weren't guaranteed to get the same data back. That's why I said:

There is no contract in roslyn that says that when you read from IPersistentStorage you will get back the data you just wrote into it. Indeed, you don't need separate processes to show that. Just do the following:

write 1 to the DB in thread A; in thread B, write '2' to it. Read data back from thread A, which expects 1, but now it gets 2

Your own tests even show that this is the case.

You then wrote:

we had tests that test persistent storage ... have multiple threads read and write and see the data is the same.

But that's not what the tests show. The tests show that if you have multiple threads reading and writing, you can see different data. That's why the test can't give a crisp assert and has to accept a range of values: any given read cannot be certain that it will get back any specific write. The only way to guarantee that is to serialize your reads and writes. And nothing about the IPersistentStorage API (or any of the implementations we have) does that.
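For contrast, serializing the reads and writes is what a caller would have to do itself to get a read-your-write guarantee on a shared store. A minimal sketch in Python (hypothetical names, not the IPersistentStorage API), where a single lock is held across both the write and the subsequent read:

```python
import threading

class SerializedStore:
    """A shared store whose callers serialize write+read under one lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._data: dict[str, int] = {}

    def write_then_read(self, key: str, value: int) -> int:
        # Holding the lock across both operations guarantees the read
        # observes this write, with no interleaved writer in between.
        with self._lock:
            self._data[key] = value
            return self._data[key]

store = SerializedStore()

def worker(i: int) -> None:
    # Every thread now reads back exactly what it wrote.
    assert store.write_then_read("k", i) == i

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The point of the sketch is that the guarantee comes from the caller's serialization, not from the store itself; with separate write and read calls and no lock, the race from the earlier example returns.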

heejaechang commented 6 years ago

We can change the argument to threads.

If thread B wrote 2, then it will expect 2 rather than 1, since the process knows that it wrote 2. When it is written by another process, it has no idea that has happened.

I have no idea what you are arguing for and what you are trying to accomplish.

I told you what behavior the existing code depends on, and without it the feature gets broken. I'm not sure what more I need to tell you.

CyrusNajmabadi commented 6 years ago

no idea what you are arguing for

I was pointing out that your claim that this was a 'guarantee' was incorrect. There was no guarantee of this ever, which is why there is nothing that can be pointed to specifying that guarantee.

I was also pointing out that the idea that there was some sort of guarantee 'within a process' was still bogus, as you could trivially break that expectation just by using threads. Which is what your test even shows.

and what you are trying to accomplish.

I've simply been responding to your posts, correcting misconceptions I've seen in them, as well as trying to figure out what state we're in now. Clearly the sqlite work has changed since I've been gone, and it's hard for me to make any clear statements about it because so many core pieces of it have been affected.

From over here it looks like many decisions were made without a full enough understanding of sqlite, and it also looks like those decisions will have ongoing implications for the designs of the workspace and OOP. I want to understand the decisions as well as possible because they affect my ability to make decisions and suggestions for these areas.

For example, designs around indices were made with the expectation that those indices could be shared. Now that I've learned that they are not shared, it affects decision making in other areas (i.e. the decisions about OOP and options and whatnot).

--

To me, the most important thing here is for all parties who work in this area to have as complete an understanding as possible of all these systems, how they work, and how they interact with each other. I also want to keep misinformation from spreading, to avoid anyone making decisions off of it that aren't the best given the substantive facts of the situation.

heejaechang commented 6 years ago

Well, I'm not sure why this is not guaranteed.

http://source.roslyn.io/#Microsoft.VisualStudio.LanguageServices.CSharp.UnitTests/PersistentStorage/AbstractPersistentStorageTests.cs,119

it writes data, reads the data back, and sees it is the same.

It doesn't write the data, read it back, and say 'I don't care, others might have changed the data.'

CyrusNajmabadi commented 6 years ago

Well, I'm not sure why this is not guaranteed.

You have tests that show it is not guaranteed. Like PersistentService_Solution_SimultaneousWrites

http://source.roslyn.io/#Microsoft.VisualStudio.LanguageServices.CSharp.UnitTests/PersistentStorage/AbstractPersistentStorageTests.cs,119 it writes data, reads the data back, and sees it is the same.

That works because that test has serialized access to that data by writing it sequentially on the same thread, and it does not run any other threads in the meantime.

It would be like me saying that there is no multi-process issue because I wrote a test that only used a single process, wrote sequentially, and didn't do anything between the write and the read.

heejaechang commented 6 years ago

Since multiple processes writing data is new behavior, according to your earlier argument (which you made on multiple PRs), it should keep the behavior as before, with a design that doesn't let multiple processes update the same db.

Also, multiple threads writing is not the same as multiple processes writing. The test is multiple threads writing it, not multiple processes writing it. Saying it is the same doesn't make it the same.

CyrusNajmabadi commented 6 years ago

it writes data, reads the data back, and sees it is the same.

This is like me saying that C# can't suffer from threading data races because I wrote a test that writes to a field, then reads back the field and sees the data is there.

That test may pass. But that's not giving you a 'guarantee'. All it is doing is demonstrating that behavior absent other factors (i.e. another thread writing to the field).

The same is true with IPersistentStorage. You even have tests for this with PersistentService_Solution_SimultaneousWrites. It shows that even if you have one part of your system writing, and you try to read, you may get some other result back entirely.

--

Note: such issues can be addressed by producing APIs that are safe against this sort of thing. For example, exposing a way to get an IPersistentStorage service instance that will give you guarantees if you're the only one with access to that instance. Right now (with sqlite) and previously (with esent) no such guarantees were made. All IPersistentStorage instances pointed at the same backing store (ignoring the no-op one that occasionally cropped up). So you could not be certain that your reads would necessarily see your writes, as this was all effectively shared memory.
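One shape such an API could take is sketched below in Python (purely hypothetical; these names are not anything Roslyn exposes): a factory that hands each client a storage handle scoped to a private keyspace over the shared backing store, so one client's writes can never be observed as another client's reads.

```python
import threading

class ExclusiveStorageFactory:
    """Hands out storage instances with private, per-instance keyspaces.

    Because no two instances share keys, an instance that has completed
    its own write can rely on reading that value back. This is a
    hypothetical design, not the actual IPersistentStorage contract.
    """

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._backing: dict[str, object] = {}  # one shared backing store
        self._next_id = 0

    def open_exclusive(self) -> "_ExclusiveStorage":
        with self._lock:
            self._next_id += 1
            return _ExclusiveStorage(self._backing, self._next_id)

class _ExclusiveStorage:
    def __init__(self, backing: dict, owner_id: int) -> None:
        self._backing = backing
        # The owner-specific prefix partitions the shared store.
        self._prefix = f"owner-{owner_id}:"

    def write(self, key: str, value: object) -> None:
        self._backing[self._prefix + key] = value

    def read(self, key: str) -> object:
        return self._backing.get(self._prefix + key)

factory = ExclusiveStorageFactory()
a = factory.open_exclusive()
b = factory.open_exclusive()
a.write("k", 1)
b.write("k", 2)  # does not disturb a's view of "k"
assert a.read("k") == 1
assert b.read("k") == 2
```

The design choice here is isolation by partitioning rather than by locking: each client gets read-your-write behavior for free because nothing else can touch its keys, at the cost of giving up the sharing that a single common store provides.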