Missing data if you have multiple report_keys

Nr18 commented 6 years ago

The data should be appended not overwrite the previous data, if you have multiple report_keys only the last report will be present in the file that is being read by Athena.

concurrencylabs commented 6 years ago

Every new report file that is generated by AWS contains all the data from the previous key plus new data. That's why the tool overwrites the previous key, so Athena has the latest view.

Nr18 commented 6 years ago

As far as i saw it that is true for each dump that they send to your S3 bucket not for the reports that are in a single hash.

If what you are saying is true, then you could only read the last report but your code downloads them all and then write it to a file and the next file will overwrite the previous content.

I'm converting my data as i'm typing this so if i converted my reports i will query for doubles (and see if my missing is now accounted for)

Nr18 commented 6 years ago

I can confirm that after i ran the processor with the append mode the total of the lineitem_unblendedcost matched the actual invoice of that month. Without the append mode i'm 60% off.

concurrencylabs commented 6 years ago

You're right, thanks for clarifying. The aggregate of all files AWS places inside a new hash folder contain data from the previous hash folder plus new data. So, it makes sense to append all data contained in a single hash folder.

Also, the reason files are downloaded is because Athena breaks when there's a header row, therefore the code removes the first row in each report file and it generates the table based on the manifest AWS places in S3. Athena doesn't seem to recognize the setting TBLPROPERTIES ("skip.header.line.count"="1"); in org.apache.hadoop.hive.serde2.OpenCSVSerde

Nr18 commented 6 years ago

So @concurrencylabs have you seen this: https://aws.amazon.com/about-aws/whats-new/2018/01/amazon-athena-engine-upgrades--support-for-lambda-expressions--and-ability-to-skip-headers/

So that means that you don't need to aggregate the files into 1 file and re-upload it, just updating the Athena table to the latest hash would be enough as far as i can see.

The reason why i'm looking into this is that you now have a single lambda function processing the objects on s3 with my current data set that takes almost 3 a 4 minutes with 3 a 4 report_keys. So i was already looking at an option to do this in parallel in separate lambda functions as Lambda has a timeout limit of 5 minutes.

concurrencylabs commented 6 years ago

I wasn't aware of this new feature where Athena can now ignore headers, so thanks for bringing it up! You're right, this should solve that annoying problem that meant files had to be dowloaded, the header removed and then a new file uploaded to S3. For large accounts, it could mean the function would time out.

There's one thing to keep in mind, though. Athena doesn't like it when files other than data are present in a particular path. This means either 1)CUR data files are copied to a new location, 2) the CUR xxx-Manifest.json file that is placed under the hash folder has to be moved somewhere else (or removed, since AWS also places a Manifest.json one level up from the hash location. That's the manifest that is used for creating the Athena tables.

I like option #2 better, if it makes sense to move the hash-level manifest, then moving it to a folder inside the same bucket, that only keeps manifests, and rename it in a way that keeps the hash name in it, to keep track of history and not lose data (i.e. manifests/-Manifest.json).

Nr18 commented 6 years ago

I also would go for your second option why move (potential large) data around if you don't have too. In terms of cost optimization you probably want to remove the previously dumped hashes and only keep the last x dumps.

Another thing you could do is when the previous month is "closed" you could convert the dumps to a columnar storage format on a different Athena table containing all months so that you are able to query across multiple months while being cost efficient.

concurrencylabs / aws-cost-analysis

Missing data if you have multiple report_keys #10