broadinstitute / pooled-cell-painting-profiling-recipe

:woman_cook: Recipe repository for image-based profiling of Pooled Cell Painting experiments
BSD 3-Clause "New" or "Revised" License

handle pandas parse error #83

Closed ErinWeisbart closed 2 years ago

ErinWeisbart commented 2 years ago

closes #79

@bethac07 I'm tagging you because this affects the read_csvs_with_chunksize you contributed. A pandas ParserError has popped up a few times now (and stops the whole workflow when it happens), so we need to handle it explicitly. I added a try-again to read_csvs_with_chunksize because the error seems to happen randomly, and I thought a second attempt might get past transient data corruption from pulling data from S3 and let us drop fewer sites overall. Happy to have your feedback if there's a better way to handle this.
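For reference, here's a minimal sketch of the try-again idea (the function name, signature, and defaults are illustrative; the recipe's actual read_csvs_with_chunksize differs):

```python
import pandas as pd
from pandas.errors import ParserError

def read_csv_with_retry(path, chunksize=10000, retries=1):
    # Sketch of the retry logic: one extra attempt in case the
    # ParserError came from a transient, corrupted pull of the file.
    for attempt in range(retries + 1):
        try:
            chunks = pd.read_csv(path, chunksize=chunksize)
            return pd.concat(chunks, ignore_index=True)
        except ParserError:
            if attempt == retries:
                # Still failing after the retry; re-raise so the caller
                # can decide whether to skip this site.
                raise
```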

bethac07 commented 2 years ago

So you think it's just a random stochastic thing? Otherwise I'm not sure why running the same code again (rather than with larger or smaller chunks, or without chunks at all) would necessarily help, though it won't hurt.

I think otherwise catching it sounds like a good idea!

ErinWeisbart commented 2 years ago

I'm guessing it's stochastic: it happened to Greg on a site where I couldn't replicate the error after manually downloading the files, and when I ran the weld myself it passed the site that errored for Greg but errored on a different site. I don't know for sure, though.

ErinWeisbart commented 2 years ago

If you have suggestions for what to change inside the second attempt, I'm happy to take them!

ErinWeisbart commented 2 years ago

The previous commit creates a configurable counter that sets the maximum number of sites we allow to be skipped after a pandas ParserError. As I mentioned here and in #79, I'm guessing the error happens stochastically; regardless, it prevents a site from being included in our data processing, so it seems important to limit how many sites we skip. Dropping a few sites is likely fine, but we don't want the whole weld to run on a limited subset of our data.

Skip counter variable added to Template here.
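Roughly, the idea looks like this sketch (the function, parameter, and helper names here are hypothetical, not the actual Template variable or recipe code):

```python
from pandas.errors import ParserError

def weld_sites(site_paths, reader, max_skips=5):
    # Sketch only: `max_skips` stands in for the configurable Template
    # variable, and `reader` for the recipe's chunked CSV reader.
    frames, skipped = [], []
    for path in site_paths:
        try:
            frames.append(reader(path))
        except ParserError:
            skipped.append(path)
            if len(skipped) > max_skips:
                # Too many sites lost; abort rather than silently weld
                # a limited subset of the data.
                raise RuntimeError(f"Skipped {len(skipped)} sites: {skipped}")
    return frames, skipped
```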

@MerajRamezani Would be nice to have your eyes on these changes too.

MerajRamezani commented 2 years ago

> @MerajRamezani Would be nice to have your eyes on these changes too.

@ErinWeisbart this makes sense and sounds like a good solution to me.

This sounds like an issue with downloading from S3, and I wonder if it could be addressed directly. I found this on GitHub; it's probably not relevant, but it wouldn't hurt to share here: https://github.com/aws/aws-cli/issues/2011
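If the root cause really is a truncated or corrupted S3 transfer, one cheap sanity check would be to compare the downloaded byte count against the object size S3 reports. A sketch with illustrative names (this is not something currently in the recipe):

```python
import os
import boto3

def download_and_verify(bucket, key, local_path):
    # Download a file from S3 and verify the byte count matches the
    # object's reported size. A mismatch would suggest the kind of
    # truncated transfer that could trigger a ParserError downstream.
    s3 = boto3.client("s3")
    s3.download_file(bucket, key, local_path)
    expected = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    actual = os.path.getsize(local_path)
    if actual != expected:
        raise IOError(f"{key}: got {actual} bytes, expected {expected}")
```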