Richterrettich / rpm-rs

A pure Rust library for building and parsing RPMs

Allow Parsing of RPM Metadata without loading cpio archive into memory #23

Open Richterrettich opened 3 years ago

Richterrettich commented 3 years ago

Currently, you have to read the complete RPM into memory in order to access its metadata. I've encountered some workflows where the package content is completely irrelevant for further processing. Something like this:

let pkg = RPMPackage::parse(reader)?;
save_in_database(&pkg.metadata)?;
pkg.write(out_file)?;

In this case, reading the complete body into a byte buffer is a waste of memory. Moreover, since RPMs can be multiple gigabytes in size, this also limits the usage of this library in memory-sensitive deployments.

I currently see two solutions for this problem:

  1. Make RPMPackageMetadata::parse and RPMPackageMetadata::write public methods and be done with it. This leaves the opportunity to read the Reader until the metadata object is created and decide for yourself what you want to do with the actual CPIO archive. The downside is that we are widening the API surface, which might or might not bite us later. Coming back to the original example, the code would look like this:

    let metadata = RPMPackageMetadata::parse(reader)?;
    save_in_database(&metadata)?;
    metadata.write(out_file)?;
    io::copy(reader,out_file)?;
  2. Change the data type of RPMPackage::content from Vec<u8> to a generic reader. Something like this:

    struct RPMPackage<T: io::BufRead> {
        //...
        content: T,
    }

    This could be a simple cursor when creating the RPM. The issue I see with this is the confusing semantics in combination with RPMPackage::sign and RPMPackage::verify, since both of them would need to consume the reader without any obvious indication. Since most network streams are not seekable, it would not be good to narrow T to Read + Seek in general. One way around this could be to create a special impl for Read + Seek that features both the sign and verify methods, and leave them out otherwise.
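The Read + Seek carve-out described in option 2 could be sketched like this. This is a hypothetical illustration, not the actual rpm-rs API; the struct and method bodies are stand-ins:

```rust
use std::io::{self, BufRead, Seek, SeekFrom};

// Sketch of option 2: the payload is held as a generic reader, and
// verify is only offered when the reader can be rewound.
pub struct RPMPackage<T: BufRead> {
    // metadata fields elided
    content: T,
}

impl<T: BufRead + Seek> RPMPackage<T> {
    // Only seekable packages get a verify method, because the reader
    // must be rewound after the signature pass has consumed it.
    pub fn verify(&mut self) -> io::Result<()> {
        // ... feed self.content to the signature check here ...
        self.content.seek(SeekFrom::Start(0))?;
        Ok(())
    }
}

fn main() -> io::Result<()> {
    // A Cursor is both BufRead and Seek, so verify is available here;
    // for a plain network stream it simply would not compile.
    let mut pkg = RPMPackage { content: io::Cursor::new(vec![1u8, 2, 3]) };
    pkg.verify()?;
    Ok(())
}
```

With this layout, calling `verify` on a non-seekable `T` is a compile error rather than a runtime surprise, which addresses the "consumed reader" concern at the type level.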

@drahnr , do you have any thoughts on this?

drahnr commented 3 years ago
1. Make `RPMPackageMetadata::parse` and `RPMPackageMetadata::write`  public methods and be done with it.
   This leaves the opportunity to read the `Reader` until the metadata object is created and decide for yourself what you want to do with the actual CPIO archive. The downside is that we are widening the API surface, which might or might not bite us later.
   Coming back to the original example, the code would look like this:
let metadata = RPMPackageMetadata::parse(reader)?;
save_in_database(&metadata)?;
metadata.write(out_file)?;
io::copy(reader,out_file)?;

Hugely in favour of this. The only extension I would add is introducing an RPMPackageArchive which can parse the remaining data and verify (not uncompress) that it is actually an archive (not sure if that makes sense though).

Those two could then be combined (if desired) into RPMPackage::parse() { .. } which internally is nothing more than the sequence of RPMPackageHeader + RPMArchive.

Not sure how signing fits into this though; at what point is the signature checked? And are the two signatures going to be checked separately?

A great advantage would be that this approach could easily be modeled as Future + Stream impls, where the header-parsing future impls Future<Item = (RPMPackageHeader, ArchiveStream)> and the archive stream impls Stream<Item = FileEntry>.

But this also leaves the question when to check the signature.

Not sure if we need random access to individual files, or how doable that would be.
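As a synchronous sketch of that streaming idea (substituting a plain Iterator for an async Stream; all names are hypothetical, and the wire format here is invented for illustration, a 1-byte name length followed by the name, not real cpio):

```rust
use std::io::Read;

// Entries are surfaced lazily, one at a time, instead of buffering the
// whole archive. A real implementation would decode cpio headers here.
#[derive(Debug, PartialEq)]
struct FileEntry {
    name: String,
}

struct ArchiveStream<R: Read> {
    input: R,
}

impl<R: Read> Iterator for ArchiveStream<R> {
    type Item = std::io::Result<FileEntry>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut len = [0u8; 1];
        match self.input.read(&mut len) {
            Ok(0) => return None, // end of archive
            Ok(_) => {}
            Err(e) => return Some(Err(e)),
        }
        let mut name = vec![0u8; len[0] as usize];
        if let Err(e) = self.input.read_exact(&mut name) {
            return Some(Err(e));
        }
        Some(Ok(FileEntry {
            name: String::from_utf8_lossy(&name).into_owned(),
        }))
    }
}

fn main() {
    let data: &[u8] = &[3, b'e', b't', b'c', 2, b'h', b'i'];
    let names: Vec<String> = ArchiveStream { input: data }
        .map(|entry| entry.unwrap().name)
        .collect();
    assert_eq!(names, vec!["etc", "hi"]);
}
```

The async variant would have the same shape, with `next` replaced by the Stream polling machinery.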

2. Change the data type of `RPMPackage::content` from `Vec<u8>` to a generic reader. Something like this:

struct RPMPackage<T: io::BufRead> {
  //...
  content: T,
}

This could be a simple cursor when creating the RPM. The issue I see with this is the confusing semantics in combination with RPMPackage::sign and RPMPackage::verify, since both of them would need to consume the reader without any obvious indication. Since most network streams are not seekable, it would not be good to narrow T to Read + Seek in general. One way around this could be to create a special impl for Read + Seek that features both the sign and verify methods, and leave them out otherwise.

I am not so favourable towards this; it hides quite a bit, and it becomes non-obvious when signatures are verified, which is key to the trust and processing chain and should be very explicit. To make this workable, one would also require a `Clone` bound on T to be able to handle signatures.

drahnr commented 3 years ago

To make sure we have the same discussion basis (and also for general documentation purposes of RPM), it might make sense to have a common (TikZ?) graphic to include in the documentation, showing which data depends on what and the implicit ordering constraints. What do you think?

Richterrettich commented 3 years ago

I am also in favour of the first solution since it is the simplest. Creating an extra type for the archive itself could be beneficial.

The signature can only be checked once all bytes have been processed, since the signature always spans both the metadata and the archive. To make this process efficient, we would need to ensure that the reader is only processed once and buffers are kept only for a processing window. Something like this (just a sketch):

let metadata = RPMPackageMetadata::parse(input)?;
//input has now advanced to the archive
metadata.write(out)?;
// the tee ensures that processed bytes are written immediately
let tee = tee(out,input);
let verifier = Verifier::new(key)?;
if let Err(e) = verifier.verify_reader(&metadata,tee) {
  // undo out here
}

This way, all bytes processed during the verification process are directly written to the final destination. If the verification fails, this has to be reverted, but this is a classic case of

it is easier to ask forgiveness than it is to get permission

Maybe we could provide a function to remove some of the boilerplate involved:

let metadata = rpm::RPMPackageMetadata::parse(input)?;
let out = create_out_based_on_metadata(&metadata)?; // not part of rpm-rs
let verifier = rpm::signature::PGPVerifier::new(key)?;
if let Err(e) = rpm::process_and_verify(metadata, input, out, verifier) {
  // undo out here
}

This leaves only the problem that the metadata itself might be large enough to cause OOM problems, but IMO this is more of a theoretical problem. Also, it is good practice to check/limit the size of untrusted or unknown streams.
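The tee mentioned above can be sketched as a small std-only adapter. Names are hypothetical, not part of rpm-rs; the point is only that every byte the verifier pulls is mirrored to the destination, so the input is consumed exactly once:

```rust
use std::io::{self, Read, Write};

// Every byte pulled out of `input` is written to `out` as a side
// effect, so a verifier can drain the stream while the final
// destination is populated in the same single pass.
struct Tee<R: Read, W: Write> {
    input: R,
    out: W,
}

impl<R: Read, W: Write> Read for Tee<R, W> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.input.read(buf)?;
        // Mirror the bytes we just handed out to the destination.
        self.out.write_all(&buf[..n])?;
        Ok(n)
    }
}

fn main() -> io::Result<()> {
    let input: &[u8] = b"cpio archive bytes";
    let mut copy = Vec::new();
    let mut tee = Tee { input, out: &mut copy };

    // Stand-in for a verifier draining the stream during its signature pass.
    let mut drained = Vec::new();
    tee.read_to_end(&mut drained)?;

    assert_eq!(drained, b"cpio archive bytes");
    assert_eq!(copy, b"cpio archive bytes");
    Ok(())
}
```

If verification fails, `out` has already been written, which is exactly the "ask forgiveness" trade-off described above: the caller must undo `out` afterwards.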

drahnr commented 3 years ago

Not sure there is really a point in resetting the out part; if it's not an RPM, we would dump it anyway?

let (metadata, input) = rpm::RPMPackageMetadata::parse(input)?;

let verifier = rpm::signature::PGPVerifier::new(key)?;

let archive = rpm::RPMArchive::parse_and_verify(&metadata, input, verifier)?;

// or
let archive = rpm::RPMArchive::from_metadata(&metadata, input)?;
archive.verify(&metadata,verifier)?;

// or
let package = rpm::RPMPackage::from_metadata(metadata, input)?;
package.verify(verifier)?;

// or all in one
let pkg = rpm::RPMPackage::parse(input)?;
pkg.verify(verifier)?;

// combine parsed
let pkg = rpm::RPMPackage::combine(metadata, archive);
pkg.verify(verifier)?;

I would add a limiter configuration struct to prevent oversized RPMs.

Richterrettich commented 3 years ago

Not sure there is really a point in resetting the out part; if it's not an RPM, we would dump it anyway?

Creating and cleaning up out is the responsibility of library users. In case a tee reader gets used, out is completely written by the time we are finished with the verification. For example, if out is a file, it will be persisted regardless of the outcome of verify. Once again, this is not our problem directly, since it is the duty of library users to manage out, but the API has to make clear that this is happening and that it is indeed the user's responsibility.

I am sure that library users will often need some information found in the metadata object to create out in the first place. For example:

let header = metadata.header;
let rpm_path = format!("{}-{}-{}.rpm", header.get_name()?, header.get_version()?, header.get_revision()?);
let out = fs::File::create(rpm_path)?;

Sadly, I do not see another solution than using a tee reader, since we would otherwise need to go through the input multiple times without any option to reset it. The only option is to do all operations (verifying and writing) during the first pass over input.

This is in essence what makes the API design a little awkward: RPM itself is awkward in this regard. We need to process input to extract the header in order to get the necessary information about signatures, but to verify signatures efficiently we need an unprocessed input. This was clearly designed with files in mind, where it is easy to reset input without any problems. That is not possible with network streams though. So a function like this:

rpm::process_and_verify(metadata,input,out,verifier)?;
// or 
rpm::write_and_verify(metadata,input,out,verifier)?;
// or whatever name may fit better for this.

is our best bet to make this happen without unnecessary overhead.

I have some problems with the proposed API:

let archive = rpm::RPMArchive::parse_and_verify(&metadata, input, verifier)?;

and

archive.verify(&metadata,verifier)?;

give the impression that the archive gets verified individually and that, to do so, it needs a metadata object and a verifier. But this is not how the RPM signing and verification process works: an RPM is always verified as a whole, meaning metadata and CPIO archive together. So it would make more sense to go through the indirection of RPMPackage for the verification API in case all content is read into memory:

let (metadata, input) = rpm::RPMPackageMetadata::parse(input)?;
let archive = rpm::RPMArchive::parse(metadata, input)?; // at this point, the content is read into memory anyway
let verifier = rpm::signature::PGPVerifier::new(key)?;
let pkg = rpm::RPMPackage::new(metadata,archive);
pkg.verify(verifier)?;

which is basically what we have at the moment, but with more checks regarding the archive itself and more control over the individual processing steps.

Richterrettich commented 3 years ago

Okay, I just thought about how to make this more bearable. We could introduce a type called RPMProcessor that is responsible for processing the RPM. This type can have multiple destinations and verifiers, making it completely configurable what happens while input is processed.

pub trait ProcessVerifier: io::Write {
  // gets called when input is completely consumed
  fn verify(&self, metadata: &RPMPackageMetadata) -> Result<(), RPMError>;
}

impl ProcessVerifier for PGPVerifier {
 //...
}

let pgp_verifier = rpm::signature::PGPVerifier::new(key)?;
let custom_verifier = my_custom_verifier();
let processor = RPMProcessor::new(metadata,input)
                             .add_verifier(pgp_verifier)
                             .add_verifier(custom_verifier)
                             .add_destination(out)
                             .add_destination(another_location);

if let Err(e) = processor.process() {
   // handle error accordingly
}

Creating a custom ProcessVerifier would be easy, since you only need to wrap an io::Write.
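The multi-destination part of this sketch boils down to a fan-out writer. A minimal std-only version could look like the following; all names are hypothetical and not part of rpm-rs:

```rust
use std::io::{self, Write};

// One Write implementation that forwards each chunk to every
// registered destination, as the sketched RPMProcessor would for its
// add_destination targets.
struct MultiWriter<W: Write> {
    destinations: Vec<W>,
}

impl<W: Write> Write for MultiWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        for dst in &mut self.destinations {
            dst.write_all(buf)?;
        }
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        for dst in &mut self.destinations {
            dst.flush()?;
        }
        Ok(())
    }
}

fn main() -> io::Result<()> {
    // Two in-memory destinations stand in for out and another_location.
    let mut fanout = MultiWriter { destinations: vec![Vec::new(), Vec::new()] };
    fanout.write_all(b"rpm payload")?;
    fanout.flush()?;

    assert_eq!(fanout.destinations[0], b"rpm payload");
    assert_eq!(fanout.destinations[1], b"rpm payload");
    Ok(())
}
```

Since each ProcessVerifier in the sketch is itself an io::Write, the processor could push every chunk through the same fan-out loop for verifiers and destinations alike.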

drahnr commented 3 years ago

I think that's the most ergonomic API so far, let's go for it! The only thing I do not quite like is that a lot of details are hidden, and now we require another trait for Signer implementations. Also note that iirc rpgp could not handle this, since it does not have a stream-processing API, which would make the whole effort moot, so it might make sense to file a PR to rpgp, which should not be too much work.

Richterrettich commented 3 years ago

Okay, I'll implement this.

Also note that iirc rpgp could not handle this, since it does not have a stream-processing API, which would make the whole effort moot, so it might make sense to file a PR to rpgp, which should not be too much work.

Yeah, this stinks but you are right. Do you have time to create a PR for rpgp? I am not quite sure why this is not higher on their priority list, since this is IMHO a core feature to make rpgp viable for server deployments.

drahnr commented 3 years ago

Okay, I'll implement this.

Also note that iirc rpgp could not handle this, since it does not have a stream-processing API, which would make the whole effort moot, so it might make sense to file a PR to rpgp, which should not be too much work.

Yeah, this stinks but you are right. Do you have time to create a PR for rpgp? I am not quite sure why this is not higher on their priority list, since this is IMHO a core feature to make rpgp viable for server deployments.

I will, but it'll probably be a week until I get around to it.

drahnr commented 3 years ago

https://github.com/rpgp/rpgp/pull/106 is a first attempt to introduce the trait-based io::Read API.