Open asfimport opened 8 years ago
Renaud Delbru (migrated from JIRA)
Discussion copied from the following dev thread
I would strongly recommend against "inventing your own mode", and instead using standardized schemes/modes (e.g. XTS).
Separate from that, I don't understand the reasoning to do it at the codec level. It seems quite a bit more messy and complicated than the alternatives, such as the block device level (e.g. dm-crypt) or the filesystem level (e.g. ext4 filesystem encryption), which have the advantage of the filesystem cache actually working.
@rmuir,
Yes, you are right. This approach is more complex than plain fs-level encryption, but it enables more fine-grained control over what is encrypted. With fs-level encryption, for example, it would not be possible to choose which fields to encrypt, and all the data is encrypted regardless of whether it is sensitive or not. In such a scenario the full posting lists will be encrypted, which is unnecessary, and you'll pay the cost of encrypting them. It is true that if the filesystem caches unencrypted pages, then with a warm cache you will likely get better performance. However, this also means that most of the index data will reside in memory in an unencrypted form. If the server is compromised, then this will make life easier for the attacker. There is also the (small) issue of the swap, which can end up holding a large portion of the index unencrypted. This can be solved by using an encrypted swap, but then that data is encrypted with a single key rather than a per-user key, and it adds complexity to the management of the system. Highly sensitive installations can make the trade-off between performance and security. There are some applications of Solr that are not served by the other approaches.
This codec was developed in the context of a large multi-tenant architecture, where each user has their own index / collection. Each user has their own key and can update it at any time. While it seems it would be possible with ext4 to handle a per-user key (e.g., one key per directory), that makes key and index management more complex (especially in SolrCloud), which is not adequate for some environments. It also does not allow managing multiple key versions in one index: if a user changes their key, we have to re-encrypt the full directory, which is not acceptable performance-wise in some environments.
The codec-level encryption approach is therefore better suited to some environments than the fs-level encryption approach. Note also that this codec does not affect the rest of Lucene/Solr. Users will be able to choose whichever approach fits their environment, which gives Lucene/Solr users more options.
Robert Muir (@rmuir) (migrated from JIRA)
It is true that if the filesystem caches unencrypted pages, then with a warm cache you will likely get better performance. However, this also means that most of the index data will reside in memory in an unencrypted form. If the server is compromised, then this will make life easier for the attacker.
These are the correct tradeoffs to make though. It is fast and makes it work "at rest".
On the other hand, from your description: reusing IVs per segment and so on, that is no CBC mode; sorry, it's essentially ECB mode, and that is just not secure.
If we are going to add encryption to lucene, it should actually be secure!
Renaud Delbru (migrated from JIRA)
I agree with you that if we add encryption to Lucene, it should always be secure. That's why I opened up the discussion with the community, in order to review and agree on which approach to adopt. With respect to IV reuse with CBC mode, a potential leak of information occurs when two messages share a common prefix, as it will reveal the presence and length of that prefix. If we look at each format separately and at what type of message is encrypted in each one, we can assess the risk:
If the risk of reusing IVs in Stored Fields / Term Vectors is not acceptable, one solution is to add a randomly generated header to each compressed doc chunk that will serve as a unique IV. What do you think?
Robert Muir (@rmuir) (migrated from JIRA)
I am not sure where some of these ideas like "postings lists don't need to be encrypted" came from, but most of the design presented on this issue is completely insecure. Please, if you want to do this stuff in Lucene, it needs to be a standardized scheme (like XTS or ESSIV) with all the known tradeoffs already computed. You can be 100% sure that if "crypto is invented here", I'm gonna make comments on the issue, because it is the right thing to do.
The many justifications for doing it in a complicated way at the codec level seem to revolve around limitations in SolrCloud, rather than good design. You really can put different indexes in different directories and let the operating system do it for "multitenancy". Lucene has stuff like ParallelReader, and different fields can be in different indexes if you really need that, etc. There are alternatives everywhere that would allow you to still "let the OS do it", be secure, and have a working filesystem cache (be fast).
Gennadiy Geyfman (migrated from JIRA)
Thanks for your feedback, Robert.
Just to give you some background, this feature is being developed for Salesforce, where we have customers that not only require robust encryption but also have strong compliance requirements for key management. We are not currently users of SolrCloud, which I mention just to assure you that none of the differences between Solr and SolrCloud have had any impact on this design.
As far as the solution itself is concerned, you are correct that ECB mode is not secure if blocks are not unique. However, in our use case we can ensure that all blocks are unique, and our security team would argue that this makes it equivalent to using CTR or CBC mode, making it secure. Nonetheless, our case would be stronger if the uniqueness property were actually guaranteed rather than suggested, so we will seek to refine the design in this way.
Also, with regard to your other concerns, we evaluated encryption at the filesystem level, but our conclusion was that this will bring even more complexity than index-level encryption, especially when one considers typical compliance requirements for key management. Encryption at the filesystem level would also require thoughtful planning for our backup / restore operations to ensure that backups are encrypted as well. More importantly, with file-level encryption, data would reside in an unencrypted form in memory which is not acceptable to our security team and, therefore, a non-starter for us.
Hopefully this gives you a better idea of the thinking that went into this proposed design. We agree that for other users of Solr, encryption by the OS could easily make more sense. Moreover, nothing in our proposal would stop anyone from pursuing that path. But for our use case and for others needing enterprise search functionality, this solution gives more granular control of the keys as well as the end-to-end encryption process.
If you have any additional questions or concerns, please let me know and I'll try to answer them as best I can. Thank you, again, for taking the time to work with us on this much-needed feature.
Thanks,
Salesforce Search Team
Gennadiy Geyfman, Sr. Director of Engineering
Robert Muir (@rmuir) (migrated from JIRA)
More importantly, with file-level encryption, data would reside in an unencrypted form in memory which is not acceptable to our security team and, therefore, a non-starter for us.
This speaks volumes. You should fire your security team! You are wasting your time worrying about this: if you are using Lucene, your data will be in memory, in plaintext, in ways you cannot control, and there is nothing you can do about that!
Trying to guarantee anything better than "at rest" is serious business; it sounds like your team is in over their heads.
Gennadiy Geyfman (migrated from JIRA)
To provide a little more background, there are several reasons why we thought a filesystem-based encryption scheme would not work for our case. One of the core challenges we are facing is the management of a large number of user keys across our infrastructure. We have a complex security environment, with key management procedures driven by customers with different security requirements. As a result, different indices have to be encrypted with different keys and with different crypto providers, all of which we need to manage.
Second, we also need to control the index life cycle, which includes applying new encryption keys, transferring indexes over the network, and managing index backups stored in the cloud. For this reason, it will be much less complicated to have control over the key providers and data encryption in one central point (Lucene).
Given these requirements, our evaluation showed that filesystem encryption would lead to a larger effort and investment in order to keep secure control across our infrastructure.
Adam Williams (migrated from JIRA)
Iron Mountain is also interested in this solution. We have been following this ticket and hoping that it will become a reality. We have over 90,000 customers with varying security rules ranging from banks to healthcare and government.
Ultimately, our responsibility is to ensure that we meet the needs of our customers. We are SolrCloud-based, with 4.8 billion indexed documents on over 120 virtual machines and 26 clouds. Many of our large customers have encryption requirements, while some of our other customers have none. For us, disk-based encryption on shared storage is not ideal: the cost for TBs of data in tier-1 storage is high. This approach also allows us to enable encryption for those who need it and turn it off for those who do not. By allowing flexibility in the key generation and crypto provider, we can provide a solution that meets the security needs of many customers.
Joel Bernstein (@joel-bernstein) (migrated from JIRA)
Alfresco is also interested in this ticket. I'd like to see if there is a way to reach consensus on an approach for moving this forward. This will likely mean making changes in the patch to address security concerns.
Thomas Mueller (migrated from JIRA)
Your proposal doesn't sound very secure. I would recommend renaming this feature "scrambling" rather than using the term "encryption".
> you are correct that ECB mode is not secure if blocks are not unique. However, in our use case we can ensure that all blocks are unique, and our security team would argue that this makes it equivalent to using CTR or CBC mode, making it secure.
Even though I'm not an expert, I'm almost 100% sure that no, this is not secure.
> it would not be possible to choose which field to encrypt or not
How important is this? Why can't you use two indexes, one encrypted (properly, with XTS) and the other not?
As for XTS, it is fairly simple to implement. If you like, I can contribute the XTS code I have written for the H2 database.
Karl Sanders (migrated from JIRA)
I would like to let you know that I made the H2 and HSQLDB communities aware of this interest in providing a secure backend for Lucene. H2's author, Thomas Mueller, has already been so kind as to provide his feedback. Both these databases provide encryption at the file level. In both cases I mentioned the possibility of using the database as an encrypted storage for Lucene data, or even going as far as adding (or improving, in the case of H2) Lucene integration.
Maybe some of the companies interested in adding this capability to Lucene might want to reach out to them too, showing that there's a concrete possibility for a partnership or simply some contract work.
I sincerely hope that Lucene file-level encryption can become a reality.
Thomas Mueller (migrated from JIRA)
> More importantly, with file-level encryption, data would reside in an unencrypted form in memory which is not acceptable to our security team
You could use homomorphic encryption (just joking). The best you can realistically do is overwrite the plaintext encryption password in memory as soon as you have hashed it, but even that is a challenge because the JVM garbage collector can copy the password around.
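For illustration, a minimal sketch of that mitigation (not from any patch here; the class name and KDF parameters are assumptions): keep the password in a char[], derive the key, then overwrite the array. As noted above, the JVM may already have copied the characters, so this only narrows the window.

```java
import java.util.Arrays;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.SecretKeySpec;

public final class KeyFromPassword {
  /** Derive an AES key from a password, then wipe the plaintext password. */
  public static SecretKeySpec derive(char[] password, byte[] salt) throws Exception {
    try {
      PBEKeySpec spec = new PBEKeySpec(password, salt, 100_000, 128);
      byte[] keyBytes = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
          .generateSecret(spec).getEncoded();
      spec.clearPassword(); // wipe the spec's internal copy of the password
      return new SecretKeySpec(keyBytes, "AES");
    } finally {
      Arrays.fill(password, '\0'); // overwrite the caller's plaintext password
    }
  }
}
```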
I agree and understand it would be good to have encryption / decryption done in Lucene itself. With filesystem-level encryption you wouldn't need any changes in Lucene, but it is a bit challenging for other reasons (backup and restore, for example). There are still challenges with key management, of course. One option is to read the encryption password from a config file at startup and delete that file after use. To start the application, use a script that copies the config file (with the password) as user "x", but run the application as user "y" (with lower privileges). That way, the (plaintext) password is not there during normal operation. Or ask the operator to type in the password at startup, or ask for the password on a web page... But it's more flexible than with filesystem-level encryption.
Thomas Mueller (migrated from JIRA)
The approach taken in #3304 sounds sensible to me: "AESDirectory extends FSDirectory". Even though the patch would need to be improved: nowadays XTS should be used.
Karl Sanders (migrated from JIRA)
There's an apparently abandoned project that might be of interest: https://code.google.com/archive/p/lucenetransform/
It appears to implement compression and encryption for Lucene indexes. I also found a couple of related links.
Some considerations about how it's being used in another project: https://github.com/muzima/documentation/wiki/Security-regarding-stored-data-by-using-Lucene
A discussion about ensuring that indexes aren't tampered with: http://permalink.gmane.org/gmane.comp.jakarta.lucene.user/50495
Renaud Delbru (migrated from JIRA)
Thanks for all of the feedback. Based on everyone's comments, it seems that different encryption algorithms might be better depending on the situation. Rather than implement a one-size-fits-all solution, then, perhaps it would be better not to enforce any one cipher and instead leave some flexibility for users to choose the cipher they find most appropriate.
If everyone is okay with this approach, I will update the code appropriately.
Karl Sanders (migrated from JIRA)
Rather than implement a one-size-fits-all solution, then, perhaps it would be better not to enforce any one cipher and instead leave some flexibility for users to choose the cipher they find most appropriate.
I think this is extremely reasonable.
I would like to ask if this patch will also provide "FSDirectory-level encryption" like #3304.
Renaud Delbru (migrated from JIRA)
Karl, the patch will not include a ready-to-use FSDirectory implementation, but the doc values format is based on an encrypted index input/output implementation which could easily be reused in an FSDirectory implementation.
Renaud Delbru (migrated from JIRA)
This patch contains the current state of the codec for index-level encryption. It is up to date with the latest version of the lucene-solr master branch. This patch does not yet include the ability for users to choose which cipher to use; I'll submit a new patch tackling this in the coming week. The full Lucene test suite has been executed against this codec using the command:
ant -Dtests.codec=EncryptedLucene60 test
Only one test fails, TestSizeBoundedForceMerge#testByteSizeLimit, which is expected: this test is incompatible with the codec.
The doc values format (a prototype based on an encrypted index output) is not included in this patch and will be submitted as a separate patch in the coming days.
Renaud Delbru (migrated from JIRA)
This patch includes changes so that every encrypted data block uses a new IV. The IV is encoded in the header of the data block. The CipherFactory has been extended so that people can decide how to instantiate a cipher and how to generate new IVs.
The performance impact of storing and using a unique IV per block is minimal. The results of the benchmark below (performed on the full Wikipedia dataset) show that there is no significant difference in QPS:
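The extended API itself is not reproduced in this thread, but based on the description it plausibly has a shape like the following sketch (all names illustrative, not the patch's actual code): the codec asks the factory for a fresh IV for each data block, records it in the block header, and asks for a cipher initialized with that IV when encrypting or decrypting the block.

```java
import javax.crypto.Cipher;

// Illustrative only; the real CipherFactory in the patch may differ.
public interface CipherFactory {

  /** Produce a new IV for the next encrypted data block; the codec
      stores it in the block's header. */
  byte[] generateIv();

  /** Create a cipher for one data block, initialized with the IV read
      from (or about to be written to) the block header. */
  Cipher create(int opMode, byte[] iv) throws Exception;
}
```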
Task  QPS 6966-before  StdDev  QPS 6966-after  StdDev  Pct diff
Respell 20.56 (11.2%) 19.18 (7.9%) -6.7% ( -23% - 13%)
Fuzzy2 33.98 (11.7%) 32.76 (11.0%) -3.6% ( -23% - 21%)
Fuzzy1 31.13 (11.2%) 30.05 (8.2%) -3.5% ( -20% - 17%)
PKLookup 125.62 (13.0%) 121.38 (8.8%) -3.4% ( -22% - 21%)
Wildcard 35.10 (11.7%) 34.36 (8.2%) -2.1% ( -19% - 20%)
OrNotHighMed 25.90 (11.4%) 25.86 (10.5%) -0.2% ( -19% - 24%)
OrNotHighHigh 15.26 (12.1%) 15.28 (10.8%) 0.2% ( -20% - 26%)
OrHighNotHigh 9.80 (12.4%) 9.82 (12.0%) 0.2% ( -21% - 28%)
OrHighNotMed 13.01 (13.4%) 13.06 (13.0%) 0.4% ( -22% - 30%)
LowTerm 252.64 (12.5%) 253.90 (8.7%) 0.5% ( -18% - 24%)
OrHighNotLow 35.63 (13.5%) 35.83 (13.4%) 0.6% ( -23% - 31%)
Prefix3 21.70 (13.3%) 21.86 (9.7%) 0.7% ( -19% - 27%)
MedTerm 83.04 (11.7%) 83.73 (8.0%) 0.8% ( -16% - 23%)
AndHighHigh 15.41 (10.6%) 15.61 (7.9%) 1.3% ( -15% - 22%)
LowSloppyPhrase 68.89 (12.5%) 69.90 (9.0%) 1.5% ( -17% - 26%)
AndHighLow 294.02 (11.6%) 299.04 (8.3%) 1.7% ( -16% - 24%)
OrHighMed 10.92 (14.4%) 11.13 (10.8%) 1.9% ( -20% - 31%)
OrHighHigh 9.45 (14.6%) 9.63 (10.9%) 1.9% ( -20% - 32%)
MedSpanNear 69.01 (11.9%) 70.39 (8.4%) 2.0% ( -16% - 25%)
AndHighMed 45.16 (12.4%) 46.17 (9.1%) 2.2% ( -17% - 27%)
HighTerm 16.61 (13.3%) 16.99 (9.5%) 2.3% ( -18% - 28%)
LowPhrase 3.03 (11.1%) 3.10 (9.2%) 2.3% ( -16% - 25%)
HighPhrase 11.82 (13.0%) 12.10 (9.6%) 2.4% ( -17% - 28%)
MedPhrase 7.49 (12.1%) 7.67 (9.1%) 2.4% ( -16% - 26%)
OrNotHighLow 424.80 (11.1%) 434.97 (8.2%) 2.4% ( -15% - 24%)
OrHighLow 25.08 (12.0%) 25.70 (11.7%) 2.5% ( -18% - 29%)
HighSloppyPhrase 4.01 (13.7%) 4.11 (9.7%) 2.5% ( -18% - 30%)
MedSloppyPhrase 6.61 (12.9%) 6.78 (9.2%) 2.5% ( -17% - 28%)
LowSpanNear 15.52 (11.8%) 15.91 (8.6%) 2.5% ( -16% - 26%)
IntNRQ 3.76 (16.4%) 3.86 (13.1%) 2.7% ( -23% - 38%)
HighSpanNear 4.40 (12.8%) 4.52 (9.1%) 2.8% ( -16% - 28%)
I took the opportunity to run another benchmark comparing this patch against Lucene's master. We can see that queries on low-frequency terms (probably because the dictionary lookup becomes more costly than reading the posting list) and queries that need to scan a large portion of the dictionary are the most impacted.
Task QPS master StdDev QPS 6966 StdDev Pct diff
Fuzzy1 55.08 (15.5%) 35.89 (8.2%) -34.8% ( -50% - -13%)
Respell 39.31 (16.9%) 28.47 (8.2%) -27.6% ( -45% - -3%)
Fuzzy2 35.33 (16.8%) 28.21 (8.8%) -20.1% ( -39% - 6%)
Wildcard 11.13 (18.9%) 9.95 (7.9%) -10.6% ( -31% - 19%)
AndHighLow 304.79 (17.7%) 277.30 (10.4%) -9.0% ( -31% - 23%)
OrNotHighLow 240.56 (16.8%) 226.64 (10.2%) -5.8% ( -28% - 25%)
PKLookup 129.54 (20.1%) 122.47 (8.3%) -5.5% ( -28% - 28%)
LowTerm 272.31 (19.0%) 269.78 (11.0%) -0.9% ( -25% - 35%)
MedPhrase 136.59 (19.3%) 137.73 (11.4%) 0.8% ( -25% - 39%)
AndHighMed 76.93 (19.3%) 79.00 (10.7%) 2.7% ( -22% - 40%)
MedSpanNear 61.75 (20.7%) 63.99 (13.4%) 3.6% ( -25% - 47%)
AndHighHigh 22.96 (15.8%) 23.84 (10.9%) 3.9% ( -19% - 36%)
Prefix3 41.31 (16.6%) 42.99 (12.5%) 4.1% ( -21% - 39%)
LowSpanNear 61.36 (19.6%) 64.05 (9.7%) 4.4% ( -20% - 41%)
OrNotHighMed 48.58 (17.4%) 51.01 (8.6%) 5.0% ( -17% - 37%)
MedSloppyPhrase 36.04 (19.4%) 38.03 (9.5%) 5.5% ( -19% - 42%)
LowPhrase 34.04 (19.6%) 35.95 (10.6%) 5.6% ( -20% - 44%)
MedTerm 47.60 (20.1%) 50.45 (12.9%) 6.0% ( -22% - 48%)
HighSpanNear 3.95 (21.7%) 4.21 (13.6%) 6.4% ( -23% - 53%)
HighPhrase 5.35 (18.5%) 5.71 (7.6%) 6.6% ( -16% - 39%)
LowSloppyPhrase 17.02 (20.1%) 18.21 (10.1%) 7.0% ( -19% - 46%)
OrHighNotHigh 14.65 (19.5%) 15.78 (9.6%) 7.7% ( -17% - 45%)
OrHighMed 17.78 (20.6%) 19.26 (9.3%) 8.3% ( -17% - 48%)
HighTerm 29.81 (20.7%) 32.32 (10.8%) 8.4% ( -19% - 50%)
OrHighHigh 7.17 (21.4%) 7.77 (11.3%) 8.5% ( -19% - 52%)
OrNotHighHigh 9.66 (19.7%) 10.51 (7.5%) 8.8% ( -15% - 44%)
HighSloppyPhrase 2.50 (20.6%) 2.72 (10.7%) 8.8% ( -18% - 50%)
OrHighNotMed 17.24 (22.6%) 18.94 (10.3%) 9.9% ( -18% - 55%)
IntNRQ 2.97 (20.8%) 3.28 (16.4%) 10.2% ( -22% - 59%)
OrHighNotLow 12.74 (22.9%) 14.13 (12.8%) 10.9% ( -20% - 60%)
OrHighLow 21.04 (21.9%) 23.41 (8.6%) 11.3% ( -15% - 53%)
Joel Bernstein (@joel-bernstein) (migrated from JIRA)
Hi Renaud Delbru,
Thanks for your work on this. I've read through the patch and it's quite a large piece of work. I mentioned earlier in the ticket that Alfresco is interested in this, so I wanted to ask some questions and see if I could understand it better.
1) With the latest patch, do you feel the major concerns have been addressed? I'll copy a few of them below:
On the other hand, from your description: reusing IVs per segment and so on, that is no CBC mode; sorry, it's essentially ECB mode, and that is just not secure.
I am not sure where some of these ideas like "postings lists don't need to be encrypted" came from, but most of the design presented on this issue is completely insecure. Please, if you want to do this stuff in Lucene, it needs to be a standardized scheme (like XTS or ESSIV) with all the known tradeoffs already computed. You can be 100% sure that if "crypto is invented here", I'm gonna make comments on the issue, because it is the right thing to do.
2) From my initial reading of the patch it seemed like everything in the patch was pluggable. Does this need to be committed to be usable? Or can it be hosted on another project?
3) Because it's such a large patch, and codecs change over time, does it present a maintenance burden for the core Lucene project? Along these lines, is it more appropriate, from a maintenance standpoint, for it to be maintained by people who are really motivated to have this feature? Alfresco engineers would likely participate in an outside project if one existed.
Renaud Delbru (migrated from JIRA)
Hi @joel-bernstein,
1) With the latest patch, do you feel the major concerns have been addressed?
Yes, the latest patch does not reuse IVs anymore but instead uses a different IV for each data block. It also introduces an API so that one can control how IVs are generated and how the cipher is instantiated.
2) From my initial reading of the patch it seemed like everything in the patch was pluggable. Does this need to be committed to be usable? Or can it be hosted on another project?
3) Because it's such a large patch, and codecs change over time, does it present a maintenance burden for the core Lucene project? Along these lines, is it more appropriate, from a maintenance standpoint, for it to be maintained by people who are really motivated to have this feature? Alfresco engineers would likely participate in an outside project if one existed.
The patch follows the standard rules of Lucene codecs, so yes, it is fully pluggable. As with other codecs, the burden of maintaining it will be low: it is a set of Lucene *Format classes that are loosely coupled with other parts of the Lucene code. It will likely require maintenance only when Lucene's high-level Codec and Format APIs change.
The patch is large because we had to copy some of the original Lucene *Format classes, as those classes were final and not extensible. If one wants to update them with the latest improvements made in the original classes, this might require a bit more effort, but in my experience it has so far been straightforward.
Renaud Delbru (migrated from JIRA)
Here is a separate patch (to apply on top of LUCENE-6966-2) for the doc values format. It is a prototype based on an encrypted index input/output. The encrypted index output writes encrypted data blocks of fixed size, and each data block has its own initialization vector.
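To make the block layout concrete, here is a simplified sketch (not the patch code; the names and the 4 KB block size are assumptions) of how such an output could append blocks, each prefixed by its own IV, so a reader can locate block k by offset, read the IV, and decrypt that block independently:

```java
import java.io.ByteArrayOutputStream;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public final class BlockWriter {
  static final int BLOCK_SIZE = 4096; // plaintext bytes per block (assumed)
  private final SecureRandom random = new SecureRandom();
  private final SecretKeySpec key;
  private final ByteArrayOutputStream out = new ByteArrayOutputStream();

  BlockWriter(SecretKeySpec key) { this.key = key; }

  /** Encrypt one block (full, or the final partial one) and append it
      to the stream as [16-byte IV][ciphertext]. */
  void writeBlock(byte[] plain, int len) throws Exception {
    byte[] iv = new byte[16];
    random.nextBytes(iv); // fresh IV per block, stored in the block header
    Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
    cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
    out.write(iv);
    out.write(cipher.doFinal(plain, 0, len));
  }
}
```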
Renaud Delbru (migrated from JIRA)
I think the latest patch is ready to commit; any objections?
Joel Bernstein (@joel-bernstein) (migrated from JIRA)
I have a couple of issues with the design from a security standpoint:
1) The security tradeoffs of leaving the posting lists in the clear are unknown.
2) Encryption at the codec level makes encryption part of the schema design. This leaves open opportunities for users to design insecure schemas that they believe are secure. For example, data can leak from encrypted fields to unencrypted fields as it's copied around to support sorting, faceting, suggestions, multi-language search, etc.
Renaud Delbru (migrated from JIRA)
Is there still interest from the community in considering this patch as a contribution? Even if there are limitations, and it therefore will not cover all possible scenarios, we think it provides an initial set of core features and a good starting point for future work. We have received multiple personal requests for this patch, which shows there is real interest in such a feature. I am also attaching an initial technical document that explains how to use the codec and clarifies its current known limitations.
Renaud Delbru (migrated from JIRA)
An initial technical document.
Otis Gospodnetic (@otisg) (migrated from JIRA)
Uh, silence. :( I have not looked into the implementation and have only skimmed the comments here in the past. My general feeling, though, is that until/unless this gets committed, most people won't bother looking (I think we saw similar behaviour with Solr CDCR, which was WIP in JIRA, and labeled as such, for a long time... but now that it's in, I hear of more and more people using it: http://search-lucene.com/?q=cdcr). Once it's in, it may get worked on by more interested parties.
Jan Høydahl (@janhoy) (migrated from JIRA)
Renaud Delbru, I believe the reason you are not seeing more traction on this is not that it is not quality work or useful, but rather that 1) only a tiny percentage of Lucene users need this level of security, and 2) the patch is huge and complex, so most committers won't have the bandwidth (or expertise) to QA it.
There is obviously also a concern about future maintenance load, if this needs to be touched for each version and for each new index feature, with the risk of introducing a bug that breaks security. I'm sure that if a couple of developers with in-depth knowledge of the feature and security expertise were willing to contribute long-term to this, you would probably be nominated as committers and the feature would have a safer future.
Have you considered starting by maintaining the project on GitHub, producing releases (and Maven artifacts) along with Lucene and Solr usage instructions? This would bring more focus and attract PRs, and I would expect it to become a popular project very soon. Of course, if there are lucene-core changes needed for the plugins to work, those would need to be committed first.
Shane (migrated from JIRA)
This may be of higher interest now due to GDPR regulations?
Has anyone considered this ticket recently?
David Smiley (@dsmiley) (migrated from JIRA)
My peers and I at Salesforce have had this on our minds a bit, as we maintain this in an internal fork of Lucene/Solr. I believe someone at Microsoft said they have this requirement and implemented it with only the Lucene Directory abstraction. So yeah, a few big companies :) I think there would be traction here if an open-source contribution could be scoped to a Lucene Directory. It would be an encryptable Lucene Directory wrapper, likely using FilterDirectory. Such a contribution would not include modifications to Codec-related APIs (no PostingsFormats etc.) or to any existing APIs. This is minimally sufficient and probably good enough. The test side might provide a ROT13 impl for testing, but otherwise it'd be up to the user to plug something in. There would furthermore be nothing else to review or disagree about. This would massively reduce the scope of the contribution here and, speaking for myself, is a very viable contribution that I would be happy to review and likely commit. It's so scoped down from the original contribution here that another linked issue would be more appropriate than this one.
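As a rough illustration of this scoping (a hypothetical sketch, not anyone's actual contribution), the wrapper could look like the following, with the actual cipher left entirely to the user:

```java
import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FilterDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

// Hypothetical sketch of a Directory-level approach. Only the Lucene
// types (FilterDirectory, IndexInput, IndexOutput, IOContext) are real;
// the class and abstract methods here are illustrative.
public abstract class EncryptingDirectory extends FilterDirectory {

  protected EncryptingDirectory(Directory delegate) {
    super(delegate);
  }

  /** Wrap a raw output so bytes are encrypted before hitting the delegate. */
  protected abstract IndexOutput encrypt(IndexOutput raw, String name) throws IOException;

  /** Wrap a raw input so bytes are decrypted on read. */
  protected abstract IndexInput decrypt(IndexInput raw, String name) throws IOException;

  @Override
  public IndexOutput createOutput(String name, IOContext context) throws IOException {
    return encrypt(in.createOutput(name, context), name);
  }

  @Override
  public IndexInput openInput(String name, IOContext context) throws IOException {
    return decrypt(in.openInput(name, context), name);
  }
}
```

The hard part such a sketch glosses over is that IndexInput must support random access and slices, so the chosen encryption scheme has to allow decrypting arbitrary offsets (e.g., per-block encryption rather than one big stream).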
Shane (migrated from JIRA)
The Directory approach is very similar to what we have seen DataStax implement, so it sounds like a viable strategy: they are using an EncryptedFSDirectoryFactory solution that works pretty well in our testing.
Jan Høydahl (@janhoy) (migrated from JIRA)
+1 for a simple Directory-based approach. Is there anyone who can lobby for a contribution? I have clients asking for this as well.
Juraj Jurčo (migrated from JIRA)
+1, and I also hope this is not dead. We would appreciate it as well.
Bruno Roustant (@bruno-roustant) (migrated from JIRA)
+1
I'm going to work soon on this simple Directory-based approach. I've created #10419 to follow it in a separate issue.
I'll try to draw inspiration from the previous work (see the related links), and I'll share my plan first to start the discussion.
With the new requirement in PCI DSS 4.0 that disk encryption cannot be the only protection for data at rest, this contribution becomes crucial. Is there any progress on this? https://www.vikingcloud.com/blog/pci-dss-v4-are-you-using-disk-encryption
We would like to contribute a codec, developed as part of an engagement with a customer, that enables the encryption of sensitive data in the index. We think this could be of interest to the community.
Below is a description of the project.
Introduction
In comparison with approaches where all data is encrypted (e.g., filesystem encryption, index output / directory encryption), encryption at the codec level enables more fine-grained control over which blocks of data are encrypted. This is more efficient, since less data has to be encrypted, and it gives more flexibility, such as the ability to select which fields to encrypt.
Some of the requirements for this project were:
What is supported?
How is it implemented?
Key Management
One index segment is encrypted with a single key version. An index can have multiple segments, each one encrypted using a different key version. The key version for a segment is stored in the segment info.
The provided codec is abstract, and a subclass is responsible for providing an implementation of the cipher factory. The cipher factory is responsible for creating a cipher instance for a given key version.
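A minimal sketch of that contract (illustrative names, not the patch's actual API) might look as follows; the key store lookup is the part a subclass would supply:

```java
import java.security.spec.AlgorithmParameterSpec;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;

// Illustrative sketch of the key-versioning contract described above.
public abstract class KeyVersionCipherFactory {

  /** Resolve a key version (read from the segment info) to the actual key,
      e.g. by querying an external key store. */
  protected abstract SecretKey keyForVersion(int keyVersion) throws Exception;

  /** Called when reading or writing a segment encrypted with keyVersion. */
  public Cipher cipherFor(int keyVersion, int opMode, AlgorithmParameterSpec iv)
      throws Exception {
    Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
    cipher.init(opMode, keyForVersion(keyVersion), iv);
    return cipher;
  }
}
```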
Encryption Model
The encryption model is based on AES/CBC with padding. The initialisation vector (IV) is reused for performance reasons, but only on a per-format and per-segment basis.
While IV reuse is usually considered bad practice, CBC mode is somewhat resilient to it: the only "leak" of information it can lead to is revealing that two encrypted blocks of data start with the same prefix (a short demo of this follows the list below). However, it is unlikely that two data blocks in an index segment will start with the same data:
Stored Fields Format: each encrypted data block is a compressed block (~4 kb) of one or more documents. It is unlikely that two compressed blocks start with the same data prefix.
Term Vectors: each encrypted data block is a compressed block (~4 kb) of terms and payloads from one or more documents. It is unlikely that two compressed blocks start with the same data prefix.
Term Dictionary Index: the term dictionary index is encoded and encrypted in a single data block.
Term Dictionary Data: each data block of the term dictionary encodes a set of suffixes. It is unlikely that two dictionary data blocks share the same prefix within the same segment.
DocValues: a DocValues file is composed of multiple encrypted data blocks. It is unlikely that two data blocks share the same prefix within the same segment (each one encodes a list of values associated with a field).
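A short, self-contained demo of the prefix leak mentioned above (not part of the codec; the all-zero key and IV are for demonstration only): with CBC, a fixed key, and a reused IV, two plaintexts that share their first 16-byte block produce identical first ciphertext blocks, which is exactly what an observer learns.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class CbcIvReuseDemo {
  public static void main(String[] args) throws Exception {
    SecretKeySpec key = new SecretKeySpec(new byte[16], "AES"); // demo key only
    IvParameterSpec iv = new IvParameterSpec(new byte[16]);     // reused IV

    Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
    cipher.init(Cipher.ENCRYPT_MODE, key, iv);
    byte[] c1 = cipher.doFinal("common prefix, document A".getBytes(StandardCharsets.UTF_8));
    cipher.init(Cipher.ENCRYPT_MODE, key, iv); // same key, same IV
    byte[] c2 = cipher.doFinal("common prefix, document B".getBytes(StandardCharsets.UTF_8));

    // The first ciphertext blocks match because the first plaintext blocks do.
    System.out.println(Arrays.equals(Arrays.copyOf(c1, 16), Arrays.copyOf(c2, 16))); // true
  }
}
```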
To the best of our knowledge, this model should be safe, but it would be good if someone in the community with security expertise could review and validate it.
Performance
We report here a performance benchmark run on an early prototype based on Lucene 4.x. The benchmark was performed on the Wikipedia dataset, with all fields (id, title, body, date) encrypted. Only the block tree terms and compressed stored fields formats were tested at that time.
Indexing
The indexing throughput decreased slightly, to roughly 15% less than base Lucene.
The merge time increased by roughly 35%.
There was no significant difference in terms of index size.
Query Throughput
With respect to query throughput, we observed no significant impact on the following queries: term query, boolean query, phrase query, numeric range query.
We observed the following performance impact for queries that need to scan a larger portion of the term dictionary:
We can see that the decrease in performance is proportional to the size of the dictionary scan.
Document Retrieval
We observed a decrease in performance proportional to the size of the set of documents to be retrieved:
Known Limitations
Compressed stored fields do not preserve the order of fields, since non-encrypted and encrypted fields are stored in separate blocks.
The current implementation of the cipher factory does not enforce the use of AES/CBC. We are planning to add this to the final version of the patch.
The current implementation does not change the IV per segment. We are planning to add this to the final version of the patch.
The current implementation of compressed stored fields decrypts a full compressed block even if only a small portion of it is decompressed (high impact when storing very small documents). We are planning to add this optimisation to the final version of the patch; the overall document retrieval performance might improve with it.
The codec has been implemented as a contrib. Given that most of the classes were final, we had to copy most of the original code from the extended formats. At a later stage, we could consider opening up some of these classes so they can be extended properly, in order to reduce code duplication and simplify maintenance.
Migrated from LUCENE-6966 by Renaud Delbru, 12 votes, updated May 22 2020.
Attachments: Encryption Codec Documentation.pdf, LUCENE-6966-1.patch, LUCENE-6966-2.patch, LUCENE-6966-2-docvalues.patch.
Linked issues:
#10419
#3304