galv / lingvo-copy

Escape data IDs with wildcards #18

Open galv opened 3 years ago

galv commented 3 years ago

Spark cannot load Google Cloud Storage files whose paths contain glob wildcard characters (e.g., gs://the-peoples-speech-west-europe/archive_org/Nov_6_2020/ALL_CAPTIONED_DATA/1961DoctorBloodsCoffinWKieronMoore/[1961]Doctor Blood's Coffin w Kieron Moore.mp3). Characters like "[" and "]" trigger the problem, and other wildcard characters may as well.

Reproducer:

spark.read.format("binaryFile").load("gs://the-peoples-speech-west-europe/archive_org/Nov_6_2020/ALL_CAPTIONED_DATA/1961DoctorBloodsCoffinWKieronMoore/[1961]Doctor Blood's Coffin w Kieron Moore.mp3")

Running this should produce an error.

Related issue (although it doesn't talk about Spark itself): https://stackoverflow.com/questions/42087510/gsutil-ls-returns-error-contains-wildcard/42146769
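
One possible direction, sketched here and not verified against this bucket: Spark resolves load paths through Hadoop's glob matching, which treats a backslash as an escape character, so prefixing each glob metacharacter with a backslash might let the path be read literally. The helper name `escape_glob` is hypothetical, and whether the GCS connector honors the escaped form is an assumption that would need testing.

```python
import re

# Characters that Hadoop-style glob patterns treat specially.
# The backslash is included so existing backslashes are escaped too.
_GLOB_SPECIAL = re.compile(r"([\\*?\[\]{}])")


def escape_glob(path: str) -> str:
    """Prefix every glob metacharacter in `path` with a backslash.

    Hypothetical helper: assumes the underlying filesystem treats
    backslash-escaped characters as literals.
    """
    return _GLOB_SPECIAL.sub(r"\\\1", path)


# Intended usage (not executed here; requires a SparkSession named `spark`):
#   df = spark.read.format("binaryFile").load(escape_glob(raw_path))
```

Characters outside the set above, such as spaces and apostrophes, are left untouched, so only the "[1961]" part of the example path would change.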