Skipping partition as no new files detected

RomeLeader commented 4 years ago

Hi,

My log bucket is fairly large; however, we have moved anything older than three months to Glacier. When I run the job, it completes in a minute or two and I get the following:

19/09/25 13:26:02 WARN HadoopDataSource: Skipping Partition {} as no new files detected @ s3://<BUCKET>/ / or path does not exist

where <BUCKET> is the name of my S3 access log storage bucket.

My logs are being saved at the top level of the S3 bucket, i.e. all log files are at s3://<BUCKET>/

What could be happening here? I know there are logs in the bucket that are not partitioned, and the converted DB/tables are empty when I preview them. I have given the classification of the raw data table as CSV, but I am not sure what is correct.

Any pointers would be appreciated!
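
Since part of the bucket has been moved to Glacier, one thing that may be worth checking is how many objects at the top level are still in a directly readable storage class. The sketch below is only a diagnostic idea, not a confirmed cause of the warning. The function name `summarize_storage_classes` is hypothetical, and `s3_client` is assumed to be a boto3 S3 client (for example `boto3.client("s3")`).

```python
from collections import Counter


def summarize_storage_classes(s3_client, bucket, prefix=""):
    """Count the objects under a prefix, grouped by S3 storage class.

    Hypothetical helper: s3_client is assumed to be a boto3 S3 client.
    """
    counts = Counter()
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page with no matching objects has no "Contents" entry.
        for obj in page.get("Contents", []):
            counts[obj.get("StorageClass", "STANDARD")] += 1
    return counts
```

If most objects come back as GLACIER, the job may simply have very little it can read, but that is an assumption to verify rather than a diagnosis.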

MarcusElwin commented 2 years ago

We get a similar issue: when a file is not in S3, an empty DataFrame is still created. Shouldn't this raise an exception?

22/06/30 08:52:18 WARN HadoopDataSource: Skipping Partition {} as no new files detected @ s3://sample-bucket/test/dict_most_common_names_old.csv or path does not exist
Empty DataFrame
Columns: []
Index: []
<class 'pandas.core.frame.DataFrame'>
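
As a possible workaround, the path can be checked before the read so that a missing file fails loudly instead of producing an empty DataFrame. This is a minimal sketch, not something Glue does itself. The function name `require_s3_objects` is hypothetical, and `s3_client` is assumed to be a boto3 S3 client.

```python
def require_s3_objects(s3_client, bucket, prefix):
    """Raise FileNotFoundError if nothing exists under s3://bucket/prefix.

    Hypothetical guard: s3_client is assumed to be a boto3 S3 client.
    """
    # MaxKeys=1 keeps the check cheap; we only need to know if anything matches.
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    if response.get("KeyCount", 0) == 0:
        raise FileNotFoundError(f"No objects found at s3://{bucket}/{prefix}")


# Example call before reading the data (requires boto3 and AWS credentials):
# require_s3_objects(boto3.client("s3"), "sample-bucket",
#                    "test/dict_most_common_names_old.csv")
```

Note that the listing call itself needs permission to list the bucket, so a permissions problem would surface as an error here rather than as a missing file.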

MyJBMe commented 1 year ago

I experienced the same error. It turned out my Glue job did not have sufficient permissions, so you may want to check the IAM role assigned to the job.
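
The thread does not say which permissions were missing. One rough way to narrow it down is to try a list call and a read-metadata call against a known object while running under the same role as the job. The sketch below assumes `s3_client` is a boto3 S3 client; the function name `probe_s3_access` is hypothetical.

```python
def probe_s3_access(s3_client, bucket, key):
    """Try a list call and a head call, and report which ones succeed.

    Hypothetical probe: s3_client is assumed to be a boto3 S3 client
    created with the same credentials (role) the job runs under.
    """
    checks = {
        "list": lambda: s3_client.list_objects_v2(Bucket=bucket, Prefix=key, MaxKeys=1),
        "read": lambda: s3_client.head_object(Bucket=bucket, Key=key),
    }
    results = {}
    for name, call in checks.items():
        try:
            call()
            results[name] = "ok"
        except Exception as exc:  # boto3 raises ClientError subclasses here
            results[name] = f"failed: {exc}"
    return results
```

A "failed" entry containing an access-denied message would point at the role, while a not-found message would point at the path.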

TLazarevic commented 1 year ago

What permissions were you missing, @MyJBMe?

awslabs / athena-glue-service-logs

Skipping partition as no new files detected #18