WeTransfer / format_parser

file metadata parsing, done cheap
https://rubygems.org/gems/format_parser

Use FormatParser with ActiveStorage #158

Closed benoittgt closed 4 years ago

benoittgt commented 4 years ago

This is heavily inspired by the good work done in #126.

I have a few questions:

1) Do we want to wrap ActiveStorage errors? Like ActiveStorage::FileNotFoundError?
2) Did I properly handle BlobIO#size and @pos.bytesize?
3) Different behavior with io.read(NEGATIVE_INT): local storage returns "", S3 returns the whole file. I am wondering in which case we could have a negative n_bytes to read?
4) Inspired by https://github.com/shrinerb/shrine/commit/8d92a9fc2ab6d742f1a022af2dc65fb11dbb0408, I chose Minio for S3 interaction. This adds a lot of code but also simplifies local development. See https://shrinerb.com/docs/testing#minio

As mentioned in a previous comment by @julik

There are also cases with "reading past the end" of the remote resource that need to be covered etc.

Is this what you are referring to?

2.7.1 :001 > io = BlobIO.new(@blob)
2.7.1 :002 > io.size
 => 46130
2.7.1 :003 > io.seek(100_000)
 => 0
2.7.1 :004 > io.read(100)
Traceback (most recent call last):
       16: from /Users/bti/.rvm/gems/ruby-2.7.1/gems/activestorage-6.0.3/lib/active_storage/service.rb:126:in `instrument'
       15: from /Users/bti/.rvm/gems/ruby-2.7.1/gems/activesupport-6.0.3/lib/active_support/notifications.rb:182:in `instrument'
       14: from /Users/bti/.rvm/gems/ruby-2.7.1/gems/activestorage-6.0.3/lib/active_storage/service/s3_service.rb:43:in `block in download_chunk'
       13: from /Users/bti/.rvm/gems/ruby-2.7.1/gems/aws-sdk-s3-1.78.0/lib/aws-sdk-s3/object.rb:808:in `get'
       12: from /Users/bti/.rvm/gems/ruby-2.7.1/gems/aws-sdk-s3-1.78.0/lib/aws-sdk-s3/client.rb:4666:in `get_object'
       11: from /Users/bti/.rvm/gems/ruby-2.7.1/gems/aws-sdk-core-3.104.4/lib/seahorse/client/request.rb:72:in `send_request'
       10: from /Users/bti/.rvm/gems/ruby-2.7.1/gems/aws-sdk-core-3.104.4/lib/seahorse/client/plugins/response_target.rb:24:in `call'
        9: from /Users/bti/.rvm/gems/ruby-2.7.1/gems/aws-sdk-core-3.104.4/lib/aws-sdk-core/plugins/response_paging.rb:12:in `call'
        8: from /Users/bti/.rvm/gems/ruby-2.7.1/gems/aws-sdk-core-3.104.4/lib/seahorse/client/plugins/request_callback.rb:71:in `call'
        7: from /Users/bti/.rvm/gems/ruby-2.7.1/gems/aws-sdk-core-3.104.4/lib/aws-sdk-core/plugins/param_converter.rb:26:in `call'
        6: from /Users/bti/.rvm/gems/ruby-2.7.1/gems/aws-sdk-core-3.104.4/lib/aws-sdk-core/plugins/idempotency_token.rb:19:in `call'
        5: from /Users/bti/.rvm/gems/ruby-2.7.1/gems/aws-sdk-core-3.104.4/lib/aws-sdk-core/plugins/jsonvalue_converter.rb:22:in `call'
        4: from /Users/bti/.rvm/gems/ruby-2.7.1/gems/aws-sdk-s3-1.78.0/lib/aws-sdk-s3/plugins/accelerate.rb:47:in `call'
        3: from /Users/bti/.rvm/gems/ruby-2.7.1/gems/aws-sdk-s3-1.78.0/lib/aws-sdk-s3/plugins/dualstack.rb:30:in `call'
        2: from /Users/bti/.rvm/gems/ruby-2.7.1/gems/aws-sdk-s3-1.78.0/lib/aws-sdk-s3/plugins/sse_cpk.rb:24:in `call'
        1: from /Users/bti/.rvm/gems/ruby-2.7.1/gems/aws-sdk-core-3.104.4/lib/seahorse/client/plugins/raise_response_errors.rb:17:in `call'
Aws::S3::Errors::InvalidRange (The requested range is not satisfiable)

I am wondering what we want to manage here? Wrapping the error?
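One option besides wrapping, as a minimal sketch: clamp the requested range to the blob size before asking the service for it, and return nil at or past the end, the way IO#read(n) does at EOF. `clamped_range` is a hypothetical helper made up for illustration, not code from this PR, and it assumes n_bytes is not negative.

```ruby
# Hypothetical helper: compute the byte range to request from the service,
# or nil when the position is at or past the end of the blob.
# Assumes n_bytes >= 0; a zero-length read yields an empty range, which a
# caller would answer with "" without sending a request.
def clamped_range(pos, n_bytes, size)
  return nil if pos >= size

  last_byte = [pos + n_bytes, size].min - 1
  pos..last_byte
end

clamped_range(0, 100, 46_130)        # => 0..99
clamped_range(46_100, 100, 46_130)   # => 46100..46129 (cut at the end)
clamped_range(100_000, 100, 46_130)  # => nil, so no request is sent
```

With this, the seek(100_000) followed by read(100) from the session above would never reach S3.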

Close: #93

benoittgt commented 4 years ago

PR in draft mode. I will fix the CI.

benoittgt commented 4 years ago

The PR looks good for a review. :)

julik commented 4 years ago

I am wondering what we want to manage here? Wrapping the error?

Can we treat Aws::S3::Errors::InvalidRange and friends as read errors, which would make the parser stop reading? I believe we can rescue those (using a regexp so that you don't need the class definition) and re-raise them as InvalidRead, which is normally captured during parsing.
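A minimal sketch of that idea, under stated assumptions: Ruby's `rescue` calls `===` on each module it lists, so a module can match exceptions by class name without the AWS constants being loaded. `RangeErrorMatcher` and `wrap_range_errors` are hypothetical names, and `InvalidRead` is defined locally as a stand-in for the library's own error class.

```ruby
# Local stand-in for the read error mentioned above (illustration only).
class InvalidRead < StandardError; end

# Hypothetical matcher: `rescue` calls `===` on it with the raised exception,
# so the match is done on the class name and needs no AWS class definition.
module RangeErrorMatcher
  PATTERN = /InvalidRange/

  def self.===(exception)
    PATTERN.match?(exception.class.name.to_s)
  end
end

# Hypothetical wrapper around a read that may hit a remote range error.
# Anything that does not match the pattern is left to bubble up unchanged.
def wrap_range_errors
  yield
rescue RangeErrorMatcher => e
  raise InvalidRead, "#{e.class}: #{e.message}"
end
```

A call such as `wrap_range_errors { io.read(100) }` would then surface a range error as InvalidRead.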

julik commented 4 years ago

Sorry, I didn't cover all of your questions.

Do we want to wrap ActiveStorage errors? Like ActiveStorage::FileNotFoundError?

I am inclined not to - it is not the responsibility of the library

Did I properly handle BlobIO#size and @pos.bytesize?

I doubt @pos can have a bytesize. But blob sizes are always in bytes and all encodings are assumed to be binary, so what I see is perfect 👍

Different behavior with a io.read(NEGATIVE_INT). Local storage returns "", S3 returns all the file. I am wondering in which case we could have a negative n_bytes to read?

Only if we use a library which performs a read with negative bytes, and even then I would consider it a bug in that library. We can forbid it at the level of the IO constraint, as I don't think it is a valid operation. For example, a File object (the "ur-IO") does this:

[1] pry(main)> f = File.open('ttt', 'wb')
=> #<File:ttt>
[2] pry(main)> f.read(-3)
ArgumentError: negative length -3 given
from (pry):2:in `read'
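Forbidding it at the wrapper level could look roughly like this. It is a sketch only: `StrictReadIO` is a hypothetical class, not the library's actual IO constraint, and it mirrors the error File raises so that every storage backend behaves the same way.

```ruby
require "stringio"

# Hypothetical wrapper that rejects negative read lengths before they
# reach the underlying IO (local disk, S3, or anything else).
class StrictReadIO
  def initialize(io)
    @io = io
  end

  def read(n_bytes)
    raise ArgumentError, "negative length #{n_bytes} given" if n_bytes.negative?

    @io.read(n_bytes)
  end
end

io = StrictReadIO.new(StringIO.new("abcdef"))
io.read(3) # => "abc"
# io.read(-3) would raise ArgumentError: negative length -3 given
```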

benoittgt commented 4 years ago

Looks great! What I have questions about is the use of min.io. Given that we use a published (and public) API in ActiveStorage which has defined semantics, do we really need to test with an S3-like service? We are pulling quite a dependency in (and need to integrate it and ensure it keeps working, and it apparently has Ruby version limitations). Are we certain what we are testing cannot be reliably tested with DiskService alone?

I thought about this a lot. I was a little worried about S3 issues going undetected by the test suite because we were only using the disk storage service (ranges, for example, or remote server status). At the same time, having local S3 storage with min.io adds a lot of code and also a dependency that could slow down development and raise the cost of maintaining the gem. Also, I think we should not re-test S3 or other remote services. download_chunk should be the only entry point; the details behind it should not be tested by us.

Testing in an S3 scenario was interesting for development purposes, but I am not sure it will serve much purpose except as a non-regression test.
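To illustrate the "single entry point" argument, here is a minimal sketch of a reader that depends only on `download_chunk(key, range)`. Both `FakeService` and `ChunkReader` are hypothetical stand-ins (this is not the PR's BlobIO), and end-of-file handling is left out to keep it short.

```ruby
# In-memory stand-in for a storage service. It implements only
# download_chunk(key, range), the one call the reader relies on.
class FakeService
  def initialize(files)
    @files = files
  end

  def download_chunk(key, range)
    @files.fetch(key).byteslice(range) || ""
  end
end

# Hypothetical reader: tracks a byte position and fetches ranges on demand.
class ChunkReader
  attr_reader :pos

  def initialize(service, key)
    @service = service
    @key = key
    @pos = 0
  end

  def seek(offset)
    @pos = offset
    0
  end

  def read(n_bytes)
    chunk = @service.download_chunk(@key, @pos...(@pos + n_bytes))
    @pos += chunk.bytesize
    chunk
  end
end

service = FakeService.new("blob-key" => "hello world".b)
reader = ChunkReader.new(service, "blob-key")
reader.read(5) # => "hello"
reader.seek(6)
reader.read(5) # => "world"
```

Any service honouring the same call can be swapped in, which is why exercising one backend covers the reader's own logic.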

It is up to you @julik :)

julik commented 4 years ago

It is up to you @julik :)

Right - in that case let's ditch S3 and minio :-) S3 has errors on about 0.1% - 0.4% of requests, but if this needs to be tackled it will be the storage service module in ActiveStorage where it needs to happen.

julik commented 4 years ago

Not sure I understand what "friends" are. Do you want to wrap the errors at a different namespace level, for example /Aws::S3::Errors/?

Ideally I thought we could somehow convert those errors into InvalidRead so that if a parser reads too much - and causes an error - FormatParser could proceed to activate the next parser in the set. But I think we might postpone this for later, and for now if there is an ActiveStorage-related error we let it bubble up through FormatParser and to the caller. If the feature proves useful and used we will narrow down on better rescuing strategies later? 💡
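The fall-through behaviour described here could be sketched as follows. This is a hypothetical driver loop, not FormatParser's actual implementation: `first_result` is an invented name, parsers are modelled as plain callables, and `InvalidRead` is a local stand-in class.

```ruby
# Local stand-in for the library's read error (illustration only).
class InvalidRead < StandardError; end

# Hypothetical driver: try each parser in turn. A parser that reads too far
# raises InvalidRead and is skipped; any other error (for example one coming
# from the storage service) bubbles up to the caller.
def first_result(parsers, io)
  parsers.each do |parser|
    result = parser.call(io)
    return result if result
  rescue InvalidRead
    next
  end
  nil
end

too_greedy = ->(_io) { raise InvalidRead, "read past the end" }
png_like   = ->(_io) { { format: :png } }

first_result([too_greedy, png_like], nil) # returns the hash from png_like
```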

benoittgt commented 4 years ago

Hello @julik

Thanks a lot for your very clear feedback and explanation. I am completely fine with letting this error bubble up, and improving the error handling later if necessary. 🙏🏼

I removed the two commits with error rescuing.