apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

[Bug]: Python GCSFileSystem.delete does not recursively delete #27605

Open timblakely opened 1 year ago

timblakely commented 1 year ago

What happened?

In the Python SDK, GCSFileSystem.delete suggests directories will be deleted recursively, but that doesn't appear to be the case...?

E.g. I have bucket blakely_dev and the following paths:

gs://blakely_dev/_staging/iteration/1/result
gs://blakely_dev/_staging/iteration/1/output-00000-of-00002
gs://blakely_dev/_staging/iteration/1/output-00001-of-00002

If I pass gs://blakely_dev/_staging/ to .delete(), the subsequent .match() call within .delete() matches neither the subdirectories nor the result or output-0000.* files, even though the path is a directory and a wildcard is appended when the path ends with a /.
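One possible explanation, sketched here as an assumption rather than taken from the Beam source: if a single `*` in the pattern does not match across `/`, then `gs://blakely_dev/_staging/*` can only match objects directly under `_staging/`, and none of the paths above qualify. The `glob_to_regex` helper is hypothetical and only illustrates that behavior.

```python
import re

def glob_to_regex(pattern: str) -> str:
    """Hypothetical glob translation: '*' stays inside one path segment, '**' crosses '/'."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")      # may span any number of '/' separators
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")   # stops at the next '/'
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return "".join(out) + r"\Z"

paths = [
    "gs://blakely_dev/_staging/iteration/1/result",
    "gs://blakely_dev/_staging/iteration/1/output-00000-of-00002",
    "gs://blakely_dev/_staging/iteration/1/output-00001-of-00002",
]

def matching(pattern: str) -> list:
    rx = re.compile(glob_to_regex(pattern))
    return [p for p in paths if rx.match(p)]

single = matching("gs://blakely_dev/_staging/*")   # wildcard appended to the "directory"
double = matching("gs://blakely_dev/_staging/**")  # recursive wildcard
print(single)
print(double)
```

Under that assumption, the single-star pattern matches nothing, while the double-star pattern matches all three objects.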

Issue Priority

Priority: 3 (minor)

Issue Components

tvalentyn commented 1 year ago

Thanks for reporting! What happens if you delete gs://blakely_dev/_staging/iteration/1/ ? Note that in GCS there is no concept of directories. There are buckets and objects. / is just a symbol in the object name.

tvalentyn commented 1 year ago

https://stackoverflow.com/questions/52789714/google-cloud-storage-how-to-delete-a-folder-recursively-in-python has some examples of how to fetch objects starting with a particular prefix. It might be easier once https://github.com/apache/beam/issues/25676 is fixed.
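A minimal sketch of the prefix-based approach, not the Beam implementation: list every object whose name starts with the prefix and delete each one. `delete_prefix` is a hypothetical helper; it expects an object that behaves like `google.cloud.storage.Bucket` (`list_blobs(prefix=...)` returning blobs with `.name` and `.delete()`). The `FakeBucket` and `FakeBlob` classes and the `other/keep-me` object are in-memory stand-ins so the example runs without credentials.

```python
def delete_prefix(bucket, prefix):
    """Delete every object whose name starts with `prefix`; return the deleted names."""
    deleted = []
    # No delimiter is passed, so the listing is flat and includes "nested" objects.
    for blob in bucket.list_blobs(prefix=prefix):
        blob.delete()
        deleted.append(blob.name)
    return deleted

class FakeBlob:
    """In-memory stand-in for a GCS blob (illustration only)."""
    def __init__(self, name, store):
        self.name = name
        self._store = store

    def delete(self):
        self._store.discard(self.name)

class FakeBucket:
    """In-memory stand-in for a GCS bucket (illustration only)."""
    def __init__(self, names):
        self.names = set(names)

    def list_blobs(self, prefix=None):
        # Build the list up front so deleting while iterating is safe.
        return [FakeBlob(n, self.names)
                for n in sorted(self.names) if n.startswith(prefix or "")]

bucket = FakeBucket([
    "_staging/iteration/1/result",
    "_staging/iteration/1/output-00000-of-00002",
    "other/keep-me",
])
deleted = delete_prefix(bucket, "_staging/")
print(deleted)
print(sorted(bucket.names))

# Against real GCS (requires credentials and the google-cloud-storage package):
#   from google.cloud import storage
#   delete_prefix(storage.Client().bucket("blakely_dev"), "_staging/")
```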

tvalentyn commented 1 year ago

cc: @BjornPrime

BjornPrime commented 1 year ago

.take-issue

timblakely commented 1 year ago

> Thanks for reporting! What happens if you delete gs://blakely_dev/_staging/iteration/1/ ? Note that in GCS there is no concept of directories. There are buckets and objects. / is just a symbol in the object name.

Yup, I'm aware :) That does remove all the objects, but "recursive" deletion doesn't work.

FYI, the match() function seems to behave slightly differently from the GCS Python client's bucket.list_blobs(). That method takes a prefix and a delimiter: if the prefix ends with the delimiter, it returns both the delimiter-separated "directories" and the files with that prefix. If no delimiter is passed, it matches all files with the prefix, which seems to be what match() intends to do (at least according to the docstring :).
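The prefix/delimiter behavior described above can be simulated over a flat list of object names. This is a sketch of the listing semantics only, not the client library's code; `list_objects` is a hypothetical helper.

```python
def list_objects(names, prefix, delimiter=None):
    """Simulate prefix/delimiter listing over flat object names.

    Returns (objects, prefixes): objects listed directly, plus the rolled-up
    "directory" prefixes when a delimiter is given.
    """
    objects, prefixes = [], set()
    for name in sorted(names):
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if delimiter and delimiter in rest:
            # Everything below the next delimiter collapses into one "directory".
            prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(name)
    return objects, sorted(prefixes)

names = [
    "_staging/iteration/1/result",
    "_staging/iteration/1/output-00000-of-00002",
    "_staging/iteration/1/output-00001-of-00002",
]

flat = list_objects(names, "_staging/")          # no delimiter: every object under the prefix
shallow = list_objects(names, "_staging/", "/")  # delimiter: only the next level
print(flat)
print(shallow)
```

Without a delimiter, all three objects are returned. With `/` as the delimiter, no objects are returned directly and the only entry is the `_staging/iteration/` "directory".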

tsafacjo commented 1 month ago

@AnandInguva Is there any update on this issue?

liferoad commented 1 month ago

cc @shunping

tsafacjo commented 1 month ago

Can I pick it up?

liferoad commented 1 month ago

@tsafacjo Please do that. Some guides are here: https://github.com/apache/beam/blob/master/contributor-docs/code-change-guide.md and https://beam.apache.org/contribute/#ways-you-can-contribute. Thanks!

tsafacjo commented 1 month ago

thanks

tsafacjo commented 1 month ago

@liferoad it looks like this PR https://github.com/apache/beam/pull/29477/files#diff-c12c6d027caa8ddf49ae5488f38ebbdf798e8ae85d7d0d716c0ebd8cce9477fe already solved the problem.

liferoad commented 1 month ago

@AnandInguva what is the problem for your PR?