keltia opened this issue 12 years ago
I've got similar functionality implemented as a patch to gmvault, so I don't have to duplicate all the gmvault repository functionality: https://github.com/gaubert/gmvault/pull/80 . Your work has some integrated tests though, good job :)
Yeah, I don't grok Python enough to integrate it as you did, I'm a Ruby guy :) I do try to use the "test first then code" concept though.
I'm a Ruby guy too, but I felt Python was the best tool for the job given all the work gaubert's already done :)
In your 'test' directory I see a little maildir 'Perso-Foo', but I don't see it being used in the tests at all, am I missing something? I'm interested because I'm not sure which is the best way to test gmvault -> maildir conversion. One could test 'does this gmvault repo yield exactly this maildir?' or 'does a round-trip conversion yield the original data?'.
I haven't implemented that test yet. The "key" used by Maildir is the partial path to the message, formed by concatenating several parts, so implementing the test is a bit more complicated.
Yeah, it's even worse with mbox since it does some potentially non-reversible transformations :(
@keltia @vasi Good job to both of you. I intend to integrate Vasi's work in the Gmvault distribution for the next version as it is in Python. I think that the web site also needs a contribution area where I could point to external tools using Gmvault. Now @keltia, what do you want exactly? You are going through all the .meta files so you can collect all the available tags. Would you like to have a kind of index with all the gmail ids for each tag?
(Sorry for mixing up threads a bit, I'm moving this discussion back to the applicable issue.)
@keltia What I want to store for each tag is the list of gmail ids that have the tag; that way you can check whether a mail has multiple tags or not and hard link them if you want.
For the purpose of hard linking, it should be ok without a cache/index, as long as you process message-by-message instead of label-by-label. For each message, create a new maildir message for the first label of the message, and then make hard links for the other applicable labels.
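Roughly, a sketch of that message-by-message approach (the function names, the flat label-to-directory mapping, and the 'archive' fallback are all just illustrative assumptions, not gmvault's actual layout):

```python
import os
import socket
import time

def maildir_for(root, label):
    """Create a minimal maildir for a label (layout simplified for this sketch)."""
    path = os.path.join(root, label.replace('/', '.'))
    for sub in ('tmp', 'new', 'cur'):
        os.makedirs(os.path.join(path, sub), exist_ok=True)
    return path

def export_message(root, gm_id, eml_bytes, labels):
    """Write the message once under its first label, then hard link it for the others."""
    labels = labels or ['archive']                      # fallback for unlabelled mail
    name = "%d.%s.%s" % (time.time(), gm_id, socket.gethostname())
    first = os.path.join(maildir_for(root, labels[0]), 'new', name)
    with open(first, 'wb') as f:
        f.write(eml_bytes)
    for label in labels[1:]:
        target = os.path.join(maildir_for(root, label), 'new', name)
        if not os.path.exists(target):
            os.link(first, target)                      # same inode, no data duplicated
```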
If for some reason you need to go label-by-label, some things do become very inefficient. I think it's reasonable for gmvault to at least keep a short list of all the labels present in the gmvault db, so you can do things like prompt the user to select a label for export. I'll look into implementing that, perhaps in the next couple of weeks. Ideally the .meta files would eventually be replaced by a real database of some sort, but that would be a bit more work :(
Ideally, getting the list of tags is gmvault's job as it will get all the mails anyway. It could be just a list to present to the user when one wants to do the export, or a more sophisticated index, pointing to gmail_ids, that could speed up the export. The main thing is, I'm trying to find a use for all that stored metadata :) My own program uses a TokyoCabinet hash to enable incremental export. As a side note, it is also fun for me to play with it.
@keltia, Gmvault should continue to work on all platforms so TokyoCabinet is a no-go, as it needs to be adapted to work on Windows. Now, if I wanted to create an index, I would need to use a SQL or NoSQL database, but I am not settled on a definite choice right now. I would like a schema-free db because SQL is not good at schema changes, even if SQLite is the easiest choice (embedded by default in the Python dist). I would like an embeddable NoSQL option as I do not want to run a server, but so far I could not find any relevant option. I need to investigate and I will see. In the meantime I could create a simple JSON-based index. Let's see
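For what it's worth, a simple JSON index like that could look something like this sketch, assuming the .meta files are JSON carrying 'gm_id' and 'labels' fields and live under <gmvault-db>/db/<month>/ (the exact layout and field names may differ):

```python
import glob
import json
import os
from collections import defaultdict

def build_label_index(db_root):
    """Map each label to the gmail ids carrying it by scanning the .meta files."""
    index = defaultdict(list)
    pattern = os.path.join(db_root, 'db', '*', '*.meta')   # assumed <db>/db/<month>/<gm_id>.meta layout
    for meta_path in glob.iglob(pattern):
        with open(meta_path) as f:
            meta = json.load(f)
        for label in meta.get('labels', []):
            index[label].append(meta['gm_id'])
    return dict(index)

if __name__ == '__main__':
    index = build_label_index(os.path.expanduser('~/gmvault-db'))
    with open('label_index.json', 'w') as out:
        json.dump(index, out, indent=2)
```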
I threw together a branch that uses sqlite for metadata storage: https://github.com/vasi/gmvault/tree/storer_db . Syncing my ~5000 message test account works fine, with speed approximately the same as master. After syncing, it's easy for third-party software to find the list of all messages with a given label, or anything else interesting really. Some caveats:
@vasi great, but implementing a GmvaultStorer on a SQL DB is not really my issue here. The data and metadata storage is encapsulated in GmvaultStorer so it is quite easy to change the persistence. It is more philosophical, because using a database engine (especially a schema-based one) has some issues: evolution is limited or needs tooling to migrate schemas, database files get corrupted, ordinary users cannot send you their faulty meta content because they do not know how to write a SQL query... It also has some advantages, but mainly for geeks like you that want to take advantage of the data. This needs to be thought out. Maybe an option in the conf file to activate the DB storage would be an option. Let's think about it
Yes, I agree that you shouldn't just pull this! It's a proof of concept to see whether it could be done easily without sacrificing performance. You might want to cherry-pick the first commit though, which factors out the actual storage from the app logic: https://github.com/vasi/gmvault/commit/511e0ed31d9157f71b235c79bdb0f6a0542a2a21
For a complete solution, I'm not so worried about corruption or faulty data, though we should perhaps discuss that further. The really serious issue is the schema and migrations, as you mentioned. Unfortunately there aren't many good schema-less options. Pure key-value stores like shelve or KyotoCabinet make it hard to maintain collections, like "all the messages in label Foo", without ugly, unsafe hacks. A document store which supports indexing would be ideal, but I am not aware of any one that is embeddable and supports Python.
An interesting idea is to treat SQLite as a NoSQL db, like this: https://github.com/stochastic-technologies/goatfish . I'm not sure it quite fits our situation, but we could do something similar, basically storing most of our metadata in a BLOB field.
@vasi yeah, evolution is my main concern here. Because if you start to dip your toe into a schema-based SQL DB, you're in for that issue. A key-value store seems to be too limiting for what we want. In fact I would love to have a MongoDB-like embedded persistence engine (schema-free but with support for indexes and queries). That would be fantastic.
Another issue is really how users can report problems. Once you store the data in a binary format you need some tooling to help the user extract what you need in order to understand where the bug is. I have no answer for that for the moment
Is the use case like: "User X backs up her email, finds message Y doesn't have the correct subject in the backup"? We could add a pretty simple "dump" or "debug" command to output the metadata in this case.
@vasi yes this could be an option.
I tried prototyping the "NoSQL-ish SQLite" idea, it went pretty well! Code is in my storer_db branch.
I'm basically treating SQLite as a document store. The main table 'messages' just contains (gm_id, json) tuples, where the json is whatever gmvault wants it to be. The table 'fields' is a list of the fields on which we want fast lookup, and the contents of those fields are in the table 'indexed'.
At any point gmvault can change what it decides to put in the json blobs, no ALTER TABLE is necessary. We can also change which fields provide fast lookup, without CREATE INDEX. For example, keltia's program could just do "INSERT INTO fields VALUES ('labels', 1)". Then, the next time gmvault is run, it will make sure it's possible to do lookups based on labels.
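To make that concrete, here is a hypothetical schema mirroring the description above; the actual tables and column names in the storer_db branch may differ:

```python
import json
import sqlite3

conn = sqlite3.connect('gmvault-meta.db')
conn.executescript("""
    CREATE TABLE IF NOT EXISTS messages (gm_id INTEGER PRIMARY KEY, json TEXT);
    CREATE TABLE IF NOT EXISTS fields   (name TEXT PRIMARY KEY, enabled INTEGER);
    CREATE TABLE IF NOT EXISTS indexed  (field TEXT, value TEXT, gm_id INTEGER);
    CREATE INDEX IF NOT EXISTS idx_lookup ON indexed (field, value);
""")

def store(meta):
    """Keep the whole metadata dict as a JSON blob, index only the declared fields."""
    gm_id = meta['gm_id']
    conn.execute("INSERT OR REPLACE INTO messages VALUES (?, ?)", (gm_id, json.dumps(meta)))
    conn.execute("DELETE FROM indexed WHERE gm_id = ?", (gm_id,))
    wanted = [row[0] for row in conn.execute("SELECT name FROM fields WHERE enabled = 1")]
    for name in wanted:
        values = meta.get(name, [])
        for value in (values if isinstance(values, list) else [values]):
            conn.execute("INSERT INTO indexed VALUES (?, ?, ?)", (name, value, gm_id))
    conn.commit()

def lookup(field, value):
    """e.g. lookup('labels', 'Foo') -> gmail ids of every message carrying label Foo."""
    return [row[0] for row in conn.execute(
        "SELECT gm_id FROM indexed WHERE field = ? AND value = ?", (field, value))]
```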
The only difficulty in the implementation was with the whole local_dir concept. Right now gmvault splits emails into directories for three entirely separate purposes:
This overloading is a bit fragile, and makes it hard to change the Storer backend. I did my best, but we might want to think about that further.
@vasi ok thanks, I need to look at it. As for me, I am playing with threads and processes to see if it speeds up the restore. So far in threaded mode it doesn't, as a multi-threaded restore with email append and labelling in two different threads is slower (11 min instead of 9 min for a batch of 527 emails). I think that it is because of the GIL and will try a multiprocess mode.
Is this using ImapClient with multiple connections? I still think pipelining with asynchronous commands would be ideal, but it doesn't look like ImapClient supports that :(
Yes, it is multiple connections (2 actually) with a queue between them. One thread is doing all the appends it can and the other is applying the labels for each received job. It doesn't seem to be faster. I think there is also something done on the Gmail side, because even with the GIL we cannot have a 3 min difference between these 2 implementations.
So there is some kind of pipelining, but Gmvault is IO-bound with blocking IMAPClients, so unless we implement our own IMAP lib with select or libev I fear it will not be faster.
I am going to do a multi-process implementation to see if we get something out of it. If not, it means that it is CPU/traffic bound on the Gmail side and there is nothing we can do about that.
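For reference, the two-connection layout being described looks roughly like this; `append_email` and `apply_labels` are placeholders for the real IMAP APPEND and STORE calls, so this only illustrates the queue hand-off, not Gmvault's actual code:

```python
import queue
import threading

label_jobs = queue.Queue()   # hand-off from the append thread to the label thread
STOP = object()              # sentinel telling the label thread to stop

def append_worker(append_conn, batches):
    """Thread 1: APPEND each batch, then queue a labelling job for it."""
    for batch in batches:
        job = [(append_conn.append_email(msg), msg['labels']) for msg in batch]
        label_jobs.put(job)
    label_jobs.put(STOP)

def label_worker(label_conn):
    """Thread 2: wait for jobs and apply labels over its own connection."""
    while True:
        job = label_jobs.get()
        if job is STOP:
            return
        for uid, labels in job:
            label_conn.apply_labels(uid, labels)

def restore(append_conn, label_conn, batches):
    labeller = threading.Thread(target=label_worker, args=(label_conn,))
    labeller.start()
    append_worker(append_conn, batches)
    labeller.join()
```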
It looks like it's possible to do asynchronous/non-blocking IMAP with a different library: http://janeelix.com/piers/python/imaplib.html . Unfortunately that would be quite a big revamp of gmvault, so let's ignore that for now.
It might be worth trying thread pools? Instead of having one thread per task (append/labels), just have N threads that pop tasks from a queue, and perform callbacks on completion. The callbacks may then push new tasks to the queue; so when an append task finishes we'd push a label task. I think Python's multiprocessing library has a built-in thread pool implementation.
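Something along these lines, as a sketch; `imap_append`, `imap_store_labels` and `messages_to_restore` are placeholders, and ThreadPool is the (lightly documented) thread-based pool shipped inside the multiprocessing package:

```python
from multiprocessing.pool import ThreadPool   # thread-based pool shipped with multiprocessing

def restore_with_pool(messages_to_restore, workers=5):
    pool = ThreadPool(workers)                 # N worker threads popping tasks off a shared queue
    label_results = []                         # AsyncResults of the follow-up label tasks

    def do_append(msg):
        uid = imap_append(msg)                 # placeholder for the real APPEND
        return uid, msg['labels']

    def do_label(uid, labels):
        imap_store_labels(uid, labels)         # placeholder for the real STORE

    def on_appended(result):
        # completion callback: an append finished, so queue the matching label task
        uid, labels = result
        label_results.append(pool.apply_async(do_label, (uid, labels)))

    append_results = [pool.apply_async(do_append, (msg,), callback=on_appended)
                      for msg in messages_to_restore]
    for res in append_results:
        res.wait()                             # once every append is done, all label tasks are queued
    for res in label_results:
        res.wait()
    pool.close()
    pool.join()
```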
@vasi I know this lib, but we would have to change the complete code because I use IMAPClient, which is based on imaplib (in the standard lib). And for what gain in the end? To be seen, according to my findings below.
Now, the pool of threads is not going to improve the speed. What you describe is what I do, and the 2 threads I have should work asynchronously. Every time I finish appending a batch of email content, I push a labelling job for the current batch into the queue. So the second thread, which is waiting on the queue, should take the job and apply the labels in parallel. It turns out that this is not the case. I see that when the labels are applied by the second thread with a second connection, the first thread is almost blocked (the email content pushes are almost slowed down to 0). This is not normal; something is happening. Either it is the GIL, which is why I want to go multi-process, or it is Gmail IMAP that is doing something.
Hmm, that's weird. OfflineIMAP uses threads, not processes, and while it doesn't get a huge speedup, there definitely is some parallelism, even with Gmail. Specifically, I've seen SEARCH and FETCH happen simultaneously. Maybe it's using a different IMAP library, I haven't really looked at the OfflineIMAP code.
I wrote a little test of using parallelism to speed up fetching of messages: https://gist.github.com/3978039
The test tries both threads and processes, and both seem to work! I'm getting speedups between 2x and 4x, about the same with either technique. I'm using a worker pool of size 5; maybe other values are better, but I haven't tried any tuning.
I am not talking about fetching here, but about appending and storing labels on the associated uids. When you do that in parallel, either in processes or threads, as soon as you start the storing it almost blocks the appending. I do not know why, but I don't think it comes from the Gmvault side. It is on the Gmail side (maybe it triggers the indexing process or whatever, but they are doing something).
@vasi Yep, there is definitely a lock somewhere on Gmail when being in ALL MAIL and storing information. What I will do is write an independent test (useful to discuss with the Gmail guys maybe) and we will stop the perf improvement work for the moment. There was a substantial gain for the non-threaded version anyway. Going multi-threaded for the fetch might be a solution, but a later one.
Hmm, I tried doing parallel APPEND and STORE, and I get weird results. I do see a speedup, though smaller than with fetching. However, it doesn't look like concurrent APPEND is safe: sometimes Gmail decides to give two APPENDs the same sequence number and just throws out the second one. It only seems to happen when APPENDing two messages in the same thread at the same time, I think. Ugh!
Hmm, I don't quite see this locked-while-storing thing. I'm able to store labels, in multiple threads, at the same time as a different thread is doing appends. However, the storing sometimes fails mysteriously, so I have to check that it actually succeeded and retry if not, so the overall speed ends up about the same. Man, Gmail's IMAP implementation is weeeeird.
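The check-and-retry part is roughly this, assuming IMAPClient's Gmail label helpers (`add_gmail_labels` / `get_gmail_labels`) and a connection already selected on '[Gmail]/All Mail'; the actual test code may do it differently:

```python
def store_labels_with_retry(client, uid, labels, attempts=3):
    """Apply Gmail labels and verify they really stuck, retrying on silent failures."""
    wanted = set(labels)
    for _ in range(attempts):
        client.add_gmail_labels([uid], labels)
        current = set(client.get_gmail_labels([uid]).get(uid, []))
        if wanted <= current:              # every requested label is actually present
            return True
    return False
```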
Yes, the connections are concurrent, but I think that on the backend the other jobs are blocked when the store is happening in ALL MAIL. There are probably some indexing jobs running and blocking things. You can clearly see that as soon as the STORE command has been sent, the appends are almost blocked. So there is no benefit in having a multi-threaded or multi-process restore mode at the moment, and we are not going to bother with it as the concurrency management and handling is much more complex.
I have a script that takes a full gmvault-db and creates a Maildir mailbox for a given tag. Having GMVault pre-cache a list of possible tags in its directory would be nice (JSON for example). I could use it myself.
Anyone interested can look at http://bitbucket.org/keltia/gmail-utils