@shntnu please post if you have already thought through any strategy on how to accomplish this! Thanks!
Thanks for penning it all down, @gwaygenomics!
We are pretty close to wrapping up with https://github.com/broadinstitute/cellpainting-gallery/issues/1 and https://github.com/awslabs/open-data-registry/pull/1003#issuecomment-957363361 will follow soon (no ETA right now but thankfully this step is not blocking Step 2 and Step 3)
Step 2 can happen right away but I'd need to request an author on https://www.biorxiv.org/content/10.1101/2021.10.21.465335v1 who has access to s3://imaging-platform to help with this. However, if there's anyone on your team who is able and willing, we can provide them AWS credentials to do this task.
Step 3 depends on Step 2. This is easy and I will gladly do it myself!
A Step 4 is to announce the availability of this data publicly in some fashion; that should wait for Step 1 because we want to credit RODA
Got it! Yes, this is helpful to document here.
> However, if there's anyone on your team who is able and willing, we can provide them AWS credentials to do this task.
Unfortunately, there is no such person at the moment. The rotating students are wrapping up their rotation, and hiring has still been tough. If you're not able to find an author on the paper, LMK and I can try myself.
> If you're not able to find an author on the paper, LMK and I can try myself.
Will do
Do you have any specific timeline in mind?
Ideally sometime before the paper is published. I don't know how long that will take.
I'm also proposing to use the single cell data in a separate proposal, which I'd love to have by mid-September.
So let's say 4 months to a live AWS link?
> So let's say 4 months to a live AWS link?
A month has passed and unfortunately I am no closer to having someone available to take this on (it's not an easy lift unless one has been doing it routinely)
> If you're not able to find an author on the paper, LMK and I can try myself.
I am happy to tag team with you on it, LMK
The good news is that we ARE getting closer to wrapping up with Step 1, thanks to @ErinWeisbart's efforts
> The good news is that we ARE getting closer to wrapping up with Step 1, thanks to @ErinWeisbart's efforts
We are done with Step 1 (there will be incremental edits of course)
Tagging @echterobert because this is relevant to his project
Something happened to my AWS account, so I've asked BITS to help me out with that. As soon as that is fixed I can start working on step 2.
I'm relatively new to AWS so just to make sure I understand correctly, I simply need to:
Move the data from glacier here:
https://imaging-platform-cold.s3.us-east-1.amazonaws.com/imaging_analysis/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/plates/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad_2017_12_05_Batch2_${BATCH_ID}_backend.tar.gz
to an accessible bucket here:
s3://imaging-platform/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/backend/${BATCH_ID}/
using an EC2 instance as described here:
https://github.com/broadinstitute/imaging-backup-scripts/blob/master/glacier_restore.md
And in step 3 we then copy the data from the imaging-platform bucket to the cellpainting gallery.
@EchteRobert That's exactly right! I'd recommend using #ip-it Slack so you can get quick help from others in the group in case you are stuck
I can do Step 3
> Move the data from glacier here:
> https://imaging-platform-cold.s3.us-east-1.amazonaws.com/imaging_analysis/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/plates/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad_2017_12_05_Batch2_${BATCH_ID}_backend.tar.gz
> to an accessible bucket here: s3://imaging-platform/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/backend/${BATCH_ID}/
> using an EC2 instance as described here: https://github.com/broadinstitute/imaging-backup-scripts/blob/master/glacier_restore.md
Oh, I should clarify:
First, you need to unarchive Batch1 (for now). I had pointed to an example where we did something similar for Batch2, but here, we do Batch1
Second, you only need to unarchive the data into a local EBS volume, and then your job is done (i.e. we are ready for Step 3, which I will do). That is, you should stop when you reach "Sync to S3 bucket (if you want to restore to the original location on s3://imaging-platform)." (and not do that step).
This means that you will need to create a very large EBS volume that can accommodate all the files.
136 files need to be unarchived
aws s3 ls s3://imaging-platform-cold/imaging_analysis/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/plates/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad_2016_04_01_a549_48hr_batch1_SQ|grep backend.tar.gz|wc -l
136
They amount to 1.3 TB zipped
aws s3 ls s3://imaging-platform-cold/imaging_analysis/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/plates/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad_2016_04_01_a549_48hr_batch1_SQ|grep backend.tar.gz|cut -d" " -f3|paste -sd+ - | bc
1344036686565
I don't know the compression ratio, but to be on the safe side, let's assume 5x and go with a 6TB EBS volume. Note that these are expensive ($50/TB/month), but we will be done in no more than a week, so that's fine.
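For anyone following along later, here is a rough sketch of what this unarchiving step amounts to. The availability zone, instance ID, device names, retrieval tier, and example plate key below are all illustrative assumptions; the linked glacier_restore.md is the authoritative procedure.

```bash
# Illustrative sketch only -- see glacier_restore.md for the real procedure.

# 1. Provision a 6 TB scratch volume, attach it, and mount it on the EC2 instance
#    (availability zone, instance ID, and device names are placeholders)
VOL_ID=$(aws ec2 create-volume --availability-zone us-east-1a \
  --size 6144 --volume-type gp3 --query VolumeId --output text)
aws ec2 attach-volume --volume-id "$VOL_ID" --instance-id i-0123456789abcdef0 --device /dev/sdf
sudo mkfs -t ext4 /dev/nvme1n1          # device name depends on the instance type
sudo mkdir -p /mnt/ebs && sudo mount /dev/nvme1n1 /mnt/ebs

# 2. Ask S3 to thaw one archived tarball (repeat for all 136 plates, e.g. via parallel)
BUCKET=imaging-platform-cold
KEY=imaging_analysis/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/plates/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad_2016_04_01_a549_48hr_batch1_SQ00014812_backend.tar.gz
aws s3api restore-object --bucket "$BUCKET" --key "$KEY" \
  --restore-request '{"Days": 7, "GlacierJobParameters": {"Tier": "Bulk"}}'

# 3. Poll until the Restore header reports ongoing-request="false", then copy and unpack
aws s3api head-object --bucket "$BUCKET" --key "$KEY" --query Restore
aws s3 cp "s3://${BUCKET}/${KEY}" /mnt/ebs/
tar -xzf "/mnt/ebs/$(basename "$KEY")" -C /mnt/ebs/
```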
@EchteRobert sent me his bash notes
I'm now uploading all the contents of backend (after moving folders around so that all backends are in backend) to s3://cellpainting-gallery/cpg0004-lincs/broad/workspace/backend/
e.g.:
.
└── 2016_04_01_a549_48hr_batch1
    ├── SQ00014812
    │   ├── SQ00014812_augmented.csv
    │   ├── SQ00014812.csv
    │   ├── SQ00014812_normalized.csv
    │   ├── SQ00014812_normalized_variable_selected.csv
    │   ├── SQ00014812_normalized_variable_selected.gct
    │   └── SQ00014812.sqlite
...
After uploading, I will delete everything except
Update:
The instance was too small, so I upgraded it to r6a.16xlarge
and ran this command
parallel -a ~/plates.txt -j 17 \
aws s3 sync --include "*" --exclude "*_augmented.csv" --exclude "*.gct" --exclude "*_normalized_variable_selected.csv" --exclude "*normalized.csv" \
backend/2016_04_01_a549_48hr_batch1/{1}/ \
s3://cellpainting-gallery/cpg0004-lincs/broad/workspace/backend/2016_04_01_a549_48hr_batch1/{1}/
Because I am uploading only the files we need, we will be all set once this is done.
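Worth noting for anyone reusing this: aws s3 sync applies --include/--exclude filters in order, with later filters taking precedence, so the trailing excludes override the initial --include "*". A --dryrun pass on a single plate (SQ00014812 is used here only as an example) is a cheap way to confirm that just the per-plate .csv and .sqlite files will be transferred:

```bash
# Preview what would be transferred for one plate before the real run (no data is moved)
aws s3 sync --dryrun \
  --include "*" --exclude "*_augmented.csv" --exclude "*.gct" \
  --exclude "*_normalized_variable_selected.csv" --exclude "*normalized.csv" \
  backend/2016_04_01_a549_48hr_batch1/SQ00014812/ \
  s3://cellpainting-gallery/cpg0004-lincs/broad/workspace/backend/2016_04_01_a549_48hr_batch1/SQ00014812/
```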
ETag calculations
Remote:
parallel -a ~/plates.txt "echo -n {1},;aws s3api get-object-attributes --bucket cellpainting-gallery --key cpg0004-lincs/broad/workspace/backend/2016_04_01_a549_48hr_batch1/{1}/{1}.csv --object-attributes ETag|jq '.ETag' -|tr -d '\\\"'|tr -d '\\\' " > etag_remote.csv
Local:
#https://gist.github.com/rajivnarayan/1a8e5f2b6783701e0b3717dbcfd324ba
parallel -j 1 ./compute_etag.sh {} 8 ::: `find 2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/ -name "*.sqlite"` > etag_local.txt
Compare ETags like this
diff <(cat etag_remote.csv |sort) <(cat etag_local.txt|cut -d"/" -f6|sed s,.sqlite,,g|tr "\t" ","|sort)
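For context on the comparison: S3 reports a plain MD5 as the ETag for single-part uploads, but for multipart uploads the ETag is the MD5 of the concatenated binary per-part MD5s with a -<part count> suffix, so the local value only matches if it is computed with the same part size as the upload. A rough bash equivalent of what the linked compute_etag.sh gist does (assuming its second argument is the part size in MiB):

```bash
#!/bin/bash
# Rough sketch of a multipart-ETag computation (the linked gist is the real script)
# Usage: ./etag_sketch.sh <file> <part-size-MiB>
file=$1
part_bytes=$(( $2 * 1024 * 1024 ))
size=$(stat -c%s "$file")              # GNU stat

if (( size <= part_bytes )); then
  # Single-part upload: ETag is just the MD5 of the file
  md5sum "$file" | cut -d' ' -f1
else
  nparts=$(( (size + part_bytes - 1) / part_bytes ))
  # Multipart upload: MD5 of the concatenated binary MD5s of each part, plus "-<parts>"
  for ((i = 0; i < nparts; i++)); do
    dd if="$file" bs="$part_bytes" skip="$i" count=1 2>/dev/null | md5sum | cut -d' ' -f1
  done | xxd -r -p | md5sum | awk -v n="$nparts" '{print $1 "-" n}'
fi
```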
Done!
Yes!!! Thanks so much for the work on this @shntnu and @EchteRobert
Awesome!
We'd ideally like to make all single cell SQLite files publicly available. As @shntnu noted to me in a separate email, the lab has a process in place to accomplish this, which is great!
To summarize the plan that @shntnu outlined:
Step 1: Make SQLite files available via RODA
However, this step has two blocking tasks:
Step 2: Unarchive SQLite files
The big lift here is unarchiving the data
Unarchiving notes:
Step 3: Copy SQLite files
All we need to do here is copy the unarchived SQLite files to s3://cellpainting-gallery/cpg0004-lincs/broad/workspace/backend/${batch}.
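A minimal sketch of that copy, assuming the batch name used in this thread and that the unarchived backends sit in a local backend/ directory:

```bash
# Illustrative only: sync the unarchived backend folders for one batch to the gallery bucket
batch=2016_04_01_a549_48hr_batch1
aws s3 sync backend/${batch}/ \
  s3://cellpainting-gallery/cpg0004-lincs/broad/workspace/backend/${batch}/
```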