broadinstitute / lincs-cell-painting

Processed Cell Painting Data for the LINCS Drug Repurposing Project
BSD 3-Clause "New" or "Revised" License

Make single cell .SQLite files publicly available #84

Closed gwaybio closed 2 years ago

gwaybio commented 2 years ago

We'd ideally like to make all single cell SQLite files publicly available. As @shntnu noted to me in a separate email, the lab has a process in place to accomplish this, which is great!

To summarize the plan that @shntnu outlined:

Step 1: Make SQLite files available via RODA

However, this step has two blocking tasks:

Step 2: Unarchive SQLite files

The big lift here is unarchiving the data

Unarchiving notes:

Step 3: Copy SQLite files

All we need to do here is copy the unarchived SQLite to s3://cellpainting-gallery/cpg0004-lincs/broad/workspace/backend/${batch}.

gwaybio commented 2 years ago

@shntnu please post if you have already thought through any strategy on how to accomplish this! Thanks!

shntnu commented 2 years ago

Thanks for penning it all down, @gwaygenomics!

We are pretty close to wrapping up with https://github.com/broadinstitute/cellpainting-gallery/issues/1 and https://github.com/awslabs/open-data-registry/pull/1003#issuecomment-957363361 will follow soon (no ETA right now but thankfully this step is not blocking Step 2 and Step 3)

Step 2 can happen right away but I'd need to request an author on https://www.biorxiv.org/content/10.1101/2021.10.21.465335v1 who has access to s3://imaging-platform to help with this. However, if there's anyone on your team who is able and willing, we can provide them AWS credentials to do this task.

Step 3 depends on Step 2. This is easy and I will gladly do it myself!

A Step 4 is to announce the availability of this data publicly in some fashion; that should wait for Step 1 because we want to credit RODA.

gwaybio commented 2 years ago

Got it! Yes, this is helpful to document here.

However, if there's anyone on your team who is able and willing, we can provide them AWS credentials to do this task.

Unfortunately, there is no such person at the moment. The rotating students are wrapping up their rotation, and hiring has still been tough. If you're not able to find an author on the paper, LMK and I can try myself.

shntnu commented 2 years ago

If you're not able to find an author on the paper, LMK and I can try myself.

Will do

Do you have any specific timeline in mind?

gwaybio commented 2 years ago

Ideally sometime before the paper is published. I don't know how long that will take.

I'm also planning to use the single cell data in a separate proposal, so I'd love to have access to it by mid-September.

So let's say 4 months to a live AWS link?

shntnu commented 2 years ago

So let's say 4 months to a live AWS link?

A month has passed and unfortunately I am no closer to having someone available to take this on (it's not an easy lift unless one has been doing it routinely)

If you're not able to find an author on the paper, LMK and I can try myself.

I am happy to tag team with you on it, LMK

The good news is that we ARE getting closer to wrapping up with Step 1, thanks to @ErinWeisbart's efforts

shntnu commented 2 years ago

The good news is that we ARE getting closer to wrapping up with Step 1, thanks to @ErinWeisbart's efforts

We are done with Step 1 🎉 (there will be incremental edits of course)

shntnu commented 2 years ago

Tagging @echterobert because this is relevant to his project

EchteRobert commented 2 years ago

Something happened to my AWS account, so I've asked BITS to help me out with that. As soon as that is fixed I can start working on step 2.

I'm relatively new to AWS so just to make sure I understand correctly, I simply need to:

Move the data from glacier here: https://imaging-platform-cold.s3.us-east-1.amazonaws.com/imaging_analysis/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/plates/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad_2017_12_05_Batch2_${BATCH_ID}_backend.tar.gz to an accessible bucket here: s3://imaging-platform/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/backend/${BATCH_ID}/ using an EC2 instance as described here: https://github.com/broadinstitute/imaging-backup-scripts/blob/master/glacier_restore.md

And in step 3 we then copy the data from the imaging-platform bucket to the cellpainting gallery.

shntnu commented 2 years ago

@EchteRobert That's exactly right! I'd recommend using #ip-it Slack so you can get quick help from others in the group in case you are stuck

I can do Step 3

shntnu commented 2 years ago

Move the data from glacier here: https://imaging-platform-cold.s3.us-east-1.amazonaws.com/imaging_analysis/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/plates/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad_2017_12_05_Batch2_${BATCH_ID}_backend.tar.gz to an accessible bucket here: s3://imaging-platform/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/backend/${BATCH_ID}/ using an EC2 instance as described here: https://github.com/broadinstitute/imaging-backup-scripts/blob/master/glacier_restore.md

Oh, I should clarify:

First, you need to unarchive Batch1 (for now). I had pointed to an example where we did something similar for Batch2, but here, we do Batch1

Second, you only need to unarchive the data into a local EBS volume, and then your job is done (i.e. we are ready for Step 3, which I will do). That is, you should stop when you reach "Sync to S3 bucket (if you want to restore to the original location on s3://imaging-platform)." (and not do that step).

This means that you will need to create a very large EBS volume that can accommodate all the files.

136 files need to be unarchived

aws s3 ls s3://imaging-platform-cold/imaging_analysis/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/plates/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad_2016_04_01_a549_48hr_batch1_SQ|grep backend.tar.gz|wc -l
     136

They amount to about 1.3 TB compressed (the total below is in bytes)

aws s3 ls s3://imaging-platform-cold/imaging_analysis/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/plates/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad_2016_04_01_a549_48hr_batch1_SQ|grep backend.tar.gz|cut -d" " -f3|paste -sd+ - | bc
1344036686565
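The count and the total size can also be obtained in one pass. This is a minimal sketch, assuming the usual `aws s3 ls` layout of date, time, size in bytes, and key name; the function name `summarize_backend_archives` is hypothetical.

```shell
# Summarize `aws s3 ls` output read from stdin: count the backend archives
# and add up their sizes (third whitespace-separated field, in bytes).
# Splitting on runs of whitespace avoids depending on column padding.
summarize_backend_archives() {
  awk '/backend\.tar\.gz$/ { n++; bytes += $3 }
       END { printf "%d files, %.0f bytes\n", n, bytes }'
}

# Usage (requires AWS credentials; prefix as in the commands above):
#   aws s3 ls s3://imaging-platform-cold/imaging_analysis/<project>/plates/<prefix> \
#     | summarize_backend_archives
```
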

I don't know the compression ratio, but to be on the safe side, let's assume 5x and go with a 6TB EBS volume. Note that these are expensive ($50/TB/month), but we will be done in no more than a week, so that's fine.
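The sizing arithmetic can be checked with a small helper. This is only a sketch: the function name `estimate_volume_tb` is hypothetical, the ratio is the guess stated above rather than a measured value, and sizes are in decimal TB (10^12 bytes).

```shell
# Estimate the unpacked size in TB from a compressed size in bytes and an
# assumed compression ratio (both passed as arguments).
estimate_volume_tb() {
  awk -v bytes="$1" -v ratio="$2" 'BEGIN { printf "%.1f\n", bytes * ratio / 1e12 }'
}

estimate_volume_tb 1344036686565 1   # compressed total, prints 1.3
estimate_volume_tb 1344036686565 5   # with the assumed 5x ratio, prints 6.7
```

At a full 5x the estimate lands slightly above 6 TB, so the 6 TB volume corresponds to a ratio of roughly 4.5x; it may be worth watching free space with `df -h` while extracting.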

shntnu commented 2 years ago

@EchteRobert sent me his bash notes

```sh
sudo su
mkdir ~/ebs_tmp/
cd ~/ebs_tmp
sudo yum install git -y
sudo amazon-linux-extras install epel
sudo yum install nload sysstat parallel -y
git clone https://github.com/broadinstitute/imaging-backup-scripts.git
aws configure  # enter credentials
cd imaging-backup-scripts

PROJECT_NAME=2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad
BATCH_ID=2016_04_01_a549_48hr_batch1

parallel \
  --results restore \
  -a ../list_of_plates.txt \
  ./glacier_restore.sh \
  --project_name ${PROJECT_NAME} \
  --batch_id ${BATCH_ID} \
  --plate_id {1} \
  --get_backend

parallel \
  --results restore \
  -a ../list_of_plates.txt \
  ./glacier_restore.sh \
  --project_name ${PROJECT_NAME} \
  --batch_id ${BATCH_ID} \
  --plate_id {1} \
  --get_backend \
  --check_status

cd ~/ebs_tmp

parallel -a list_of_plates.txt "grep ^Download imaging-backup-scripts/restore/1/{1}/stdout|sed s,Download:,,1" > url_list.txt

# done
parallel -a list_of_plates.txt "grep MD5Download imaging-backup-scripts/restore/1/{1}/stdout|sed s,MD5Download:,,1" > md5_url_list.txt

# done
# done
parallel -a url_list.txt aws s3 cp {1} .

# done
parallel -a md5_url_list.txt aws s3 cp {1} .

# done
PROJECT_NAME=2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad
BATCH_ID=2016_04_01_a549_48hr_batch1
TARSET=backend

# done
parallel -a list_of_plates.txt tar -xvzf ${PROJECT_NAME}_${BATCH_ID}_{1}_${TARSET}.tar.gz

# done
parallel -a list_of_plates.txt \
  "md5sum ${PROJECT_NAME}_${BATCH_ID}_{1}_${TARSET}.tar.gz > ${PROJECT_NAME}_${BATCH_ID}_{1}_${TARSET}.md5.local"

# done (and no diffs)
parallel -a list_of_plates.txt \
  diff \
  ${PROJECT_NAME}_${BATCH_ID}_{1}_${TARSET}.md5.local \
  ${PROJECT_NAME}_${BATCH_ID}_{1}_${TARSET}.md5 > md5_diffs.txt
```

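The MD5 check at the end of these notes relies on the local and downloaded `.md5` files matching byte for byte. A slightly more tolerant variant compares only the hash field. This is a sketch, assuming the downloaded `.md5` file starts with the hex digest in `md5sum` format; the function name `verify_md5` is hypothetical.

```shell
# Compare the MD5 of a local archive with the hash stored in its .md5 file.
# Only the first field (the hash) is compared, so a different file path in
# the stored md5sum line does not register as a mismatch.
verify_md5() {
  archive=$1
  expected=$(cut -d' ' -f1 < "$2")
  actual=$(md5sum "$archive" | cut -d' ' -f1)
  if [ "$expected" = "$actual" ]; then
    echo "OK $archive"
  else
    echo "MISMATCH $archive"
    return 1
  fi
}

# Usage, per plate:
#   verify_md5 ${PROJECT_NAME}_${BATCH_ID}_<plate>_backend.tar.gz \
#              ${PROJECT_NAME}_${BATCH_ID}_<plate>_backend.md5
```
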
shntnu commented 2 years ago

I'm now uploading all the contents of backend (after moving folders around so that all backends are in backend) to s3://cellpainting-gallery/cpg0004-lincs/broad/workspace/backend/

e.g.:

.
└── 2016_04_01_a549_48hr_batch1
    ├── SQ00014812
    │   ├── SQ00014812_augmented.csv
    │   ├── SQ00014812.csv
    │   ├── SQ00014812_normalized.csv
    │   ├── SQ00014812_normalized_variable_selected.csv
    │   ├── SQ00014812_normalized_variable_selected.gct
    │   └── SQ00014812.sqlite
...

After uploading, I will delete everything except .csv and .sqlite from S3

Update:

The instance was too small, so I upgraded it to r6a.16xlarge and ran this command

parallel -a ~/plates.txt -j 17 \
  aws s3 sync --include "*" --exclude "*_augmented.csv" --exclude "*.gct" --exclude "*_normalized_variable_selected.csv" --exclude "*normalized.csv"  \
  backend/2016_04_01_a549_48hr_batch1/{1}/ \
  s3://cellpainting-gallery/cpg0004-lincs/broad/workspace/backend/2016_04_01_a549_48hr_batch1/{1}/

Because I am uploading only the files we need, we will be all set once this is done.
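To see which files survive the filters in the sync command, the same patterns can be applied to file names locally. This is a sketch using shell glob matching as a stand-in for the CLI's filter logic, not a reproduction of it; the function name `is_uploaded` is hypothetical.

```shell
# Return success if a file name would pass the filters used in the sync
# command: everything is included first, then four patterns are excluded.
is_uploaded() {
  case "$1" in
    *_augmented.csv|*.gct|*_normalized_variable_selected.csv|*normalized.csv)
      return 1 ;;
    *)
      return 0 ;;
  esac
}

# Print the files of one plate that would be kept.
for f in SQ00014812_augmented.csv SQ00014812.csv SQ00014812_normalized.csv \
         SQ00014812_normalized_variable_selected.csv \
         SQ00014812_normalized_variable_selected.gct SQ00014812.sqlite; do
  if is_uploaded "$f"; then echo "$f"; fi
done
```

For the plate shown earlier, only `SQ00014812.csv` and `SQ00014812.sqlite` remain. Adding `--dryrun` to the `aws s3 sync` command prints the planned transfers without performing them, which gives the same check against the real filter implementation.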

ETag calculations

Remote:

parallel -a ~/plates.txt "echo -n {1},;aws s3api get-object-attributes  --bucket cellpainting-gallery --key cpg0004-lincs/broad/workspace/backend/2016_04_01_a549_48hr_batch1/{1}/{1}.csv --object-attributes ETag|jq '.ETag' -|tr -d '\\\"'|tr -d '\\\' "  > etag_remote.csv

Local:

#https://gist.github.com/rajivnarayan/1a8e5f2b6783701e0b3717dbcfd324ba
parallel -j 1 ./compute_etag.sh {} 8 ::: `find 2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/ -name "*.sqlite"` > etag_local.txt
```
SQ00014812,8a77728ebc5b1eaf377ec8ba3dff023a-2
SQ00014813,f3703e3ba7cacb0cc69245fdaf434636-2
SQ00014814,a5e91949958ac2a99abb79aa3042b02e-2
SQ00014815,fe3a16cec6fee6801b380a00af643a91-2
SQ00014816,d0d70f9ef9202ad246354bb09eb51ee7-2
SQ00014817,5193b38f895db1caaed01624674f1743-2
SQ00014818,b6450b54746a116bc448ed82923fdb7c-2
SQ00014819,f9b6dbdb749b1767d81a837a72ebe0ff-2
SQ00014820,59e8070028ae168518e1deecec770524-2
SQ00015041,6a7bb945540c5d9b863ccc868a6d9ab1-2
SQ00015042,d2afc86338becfc8494273d62044e22d-2
SQ00015043,f3ce6cba4da6f30c0f0622195a036bd0-2
SQ00015044,b7944c4bd25e5ec33bf8c85940767a02-2
SQ00015045,028d23b5eba59c7b2913785bdd70eec5-2
SQ00015046,8e511b7ac4290af70b33fbd884dac79a-2
SQ00015047,ceac4bf68c16d6e5724772fccf1e7233-2
SQ00015048,b4d85ddb0000c9d9520440234f42619f-2
SQ00015049,f102fa2ee559db5b255aa4f20e558ca7-2
SQ00015050,124d6f8b4b9e69cdeddf0a9b9e96fcc8-2
SQ00015051,e053874dbf3ec8001b1f13561e92626c-2
SQ00015052,c9b066bd6351114a0ae30112114ed46e-2
SQ00015053,5d83a493c5fac072a7af783ae6bb7e34-2
SQ00015054,980af9d3a8b0acd8d62ff94403617850-2
SQ00015055,136625578b5e63d809db69be3c4b7081-2
SQ00015056,87c11bdd45c1e625df1a8de79dec4ccb-2
SQ00015057,e820431f157feed51b5b19fdeeb8832e-2
SQ00015058,d56800282ee17fd84f12a615436c5a0c-2
SQ00015059,f4e5f71245429d81a96aab6117dbf08d-2
SQ00015096,902c2b4083a429e671a7a7eb62c3be60-2
SQ00015097,c2db7ac22b1f3dfc24cd0ef7fceec89d-2
SQ00015098,f0495bfea897785993a313ec8d87538d-2
SQ00015099,1bc205205ab6bbba5cbb6146a5cbbc04-2
SQ00015100,96d38770075e2d8f3f6fa7fb8ced436a-2
SQ00015101,1f3ae8c83e0146b23a2fc5f9378873d8-2
SQ00015102,5c71b2f401f6c8813a94e3da5ce2bcd0-2
SQ00015103,a518a86219763c475a214d85ee33600b-2
SQ00015105,2c69e7acbe47e70d7e9fd08a91bb44ef-2
SQ00015106,d8bd84f0239227ccaa8e5a0c184f7255-2
SQ00015107,25db6a193ea1d4b7988da29578b5a37d-2
SQ00015108,f24c81c759914f22cfe51897ef7d2e79-2
SQ00015109,d8e5602cf8c404bc0db4ea8953a16229-2
SQ00015110,04babf426f7d0548f12d48a0409302ff-2
SQ00015111,58461abaaecc2feefd8a6508eaab1cca-2
SQ00015112,c3f9d928fdcdb1bbd18a010177ae128c-2
SQ00015116,99a280d4fb934a352f074c8f089213a5-2
SQ00015117,7912d1763e709dfa4f433d26bd8bcb09-2
SQ00015118,bcc3dcf256076872beaa3c7798d37e28-2
SQ00015119,a37bd6bfdb14b1c62d81456becc07037-2
SQ00015120,9bef8c34b61bd3767ff3ec4dd6aed416-2
SQ00015121,313c65d71ad492b1cd57a13cd498c924-2
SQ00015122,f15e855724d5e8ad6b6f973f92b4550f-2
SQ00015123,450c092a757e08e42361a2f3d89a5d51-2
SQ00015124,305ac66e0eed754077caeca9028f4a48-2
SQ00015125,65013077c88c04f695eacb48ce364db2-2
SQ00015126,1e1068f1da437c15d92646cf5a134945-2
SQ00015127,60d87884fc5ca86858c745f190b021d8-2
SQ00015128,28c242b775e885767f6496d4b8d72881-2
SQ00015129,e1f1bcf96f488f2ccde3119141985d2e-2
SQ00015130,a5821b04cb5ef2fe740a85f8c14fc63f-2
SQ00015131,20ac6f81c7006b170643dd87557e3805-2
SQ00015132,ff42477ee9cf6daded6e587fbf97bf7d-2
SQ00015133,d7b0d48eb8bb6e48af73a2127831c170-2
SQ00015134,19814167ae120ee2639437e3ac1bc01e-2
SQ00015135,f94bde6e3ae067bbf7d2ff4d112d7d39-2
SQ00015136,7c3559ac8324b54779675d0cabc0082c-2
SQ00015137,389a36bde5c74c8e1374329f997989f2-2
SQ00015138,7339d6dce5663bb7166097506d8c86f8-2
SQ00015139,3b752102db6f6f2fb1ef40356f44edf7-2
SQ00015140,0a00fee374e2fd906cf688f0e502e9be-2
SQ00015141,cceb03d2f077f7ca61adfa2313581824-2
SQ00015142,2510401f20637d4cb7aa34b03a90aefd-2
SQ00015143,77224406bfff58e3f50fd35e2c0030f6-2
SQ00015144,6c9706cb44943bbbc69dd6ddfd71cce7-2
SQ00015145,8fca8acfd8734c8093f15a295719e7af-2
SQ00015146,b76018c790a3d3a3417a2a1fd4e8ee0b-2
SQ00015147,385e7a694072eb71aa45e4ea49108b5d-2
SQ00015148,1ee7627b785652c0356ec4d0030ea9a0-2
SQ00015149,add49050f95efe7f6064c01ac004cd4a-2
SQ00015150,d4d87a238e04b5f40c9135c6db63042a-2
SQ00015151,200d11bbdf151be1f0418cb0cda819c3-2
SQ00015152,0b230b03a63487fb4b6264f83e7aa5e7-2
SQ00015153,ae2f22304f58c669e23f9a21c31d9e66-2
SQ00015154,8d6898264db54b7319c1e8b05e8f1c41-2
SQ00015155,ee108348d4c87f158d882c52ba56892a-2
SQ00015156,8ec9f0fe1b0014f54023a5a74f0e85f1-2
SQ00015157,78a3bdb5d03cdc609fe6ace9b7e79cdb-2
SQ00015158,237cd3effb1227f8b0bd18880a105019-2
SQ00015159,5ce9a5a5b90038314eaf9e2702176ff5-2
SQ00015160,7f5af2297b6c5b4490637ff75b8b0b53-2
SQ00015162,c98ca86004be326035bd4e17d95f388c-2
SQ00015163,a4192e5106f44cf8f6e184a0f518f5a3-2
SQ00015164,0a9a71e68d729b848f5a9f2b0c2e2547-2
SQ00015165,7424d91fe172dcfc55fff993e00d56ec-2
SQ00015166,d9e556ab9b6ed1485fee697294ac139f-2
SQ00015167,62a1d0f5985d465fcc1ed41b59d81634-2
SQ00015168,0a044f7747eb1856debd9fd451039dc2-2
SQ00015169,658974bdef83a0f845191303f4786c6d-2
SQ00015170,22cb66c75d464c6a610cd2aacf078882-2
SQ00015171,29565e7e3b2da94c3c24252ba3d37fc7-2
SQ00015172,35d8a04093483e841d86d65c2e6efac2-2
SQ00015173,8a435c4aeff7461ff80c43a494e12002-2
SQ00015194,95769dfe6d4a2dc023cb30b38922cf3c-2
SQ00015195,9f3efb987b35b71e6d90d558b1d88189-2
SQ00015196,094c94be1ce8b82e399c309d5641a494-2
SQ00015197,542ec57702c6f74c8ed7ec42b60f7ba4-2
SQ00015198,4cbb033d995cc48d40b653d88d938412-2
SQ00015199,762c09034b0fad28ce674e17cb984657-2
SQ00015200,22311e760fa1cfa0b24dc8bb2bd61519-2
SQ00015201,822463f15d3002fbdfff9a0bddbf4d5b-2
SQ00015202,4375b153d8969b7c832e26bea596ccb4-2
SQ00015203,30ed569f9b76b611c01b7012aca57037-2
SQ00015204,c84970caea6488fdf0f2a03c5e2ab50a-2
SQ00015205,97eda8759416fa0a3017c505a25181c1-2
SQ00015206,374b57931f90c03e6c2e521446a852af-2
SQ00015207,e6c2c73e0e2eb7ae027c3328b7684241-2
SQ00015208,7db18debb1ddd1d3721e27924189df77-2
SQ00015209,b643f06fb64ea0bc5e00eed305a707f2-2
SQ00015210,f250d0e26cfeab0edeed02496d4e5065-2
SQ00015211,81829c5664a338ccdf760d1b67d297f7-2
SQ00015212,ecb5e61c5de2b68e62917cdb6b14191c-2
SQ00015214,2b4d7d334d83b5bf4d665cf01eff79de-2
SQ00015215,256a5e8b79e104186b4e0e7489a238ae-2
SQ00015216,b2d6c486ec0cb8975fdfc72b8a8bad46-2
SQ00015217,c95f492d3707546a9f5a2188f29502b4-2
SQ00015218,b7df7e20a335dbbbae15776ee19e0841-2
SQ00015219,a62e401496b4e6bdb2f6e017ce8dfba5-2
SQ00015220,3acdc59bcb9936de722d99d0b2671925-2
SQ00015221,43bb1705936834d1c20885064cb5562d-2
SQ00015222,c342f21fb0023e164dd137a05cb6191d-2
SQ00015223,f299e7b88f1bc12448ee5913b906abc3-2
SQ00015224,484df3313e427ce16db20f3b39cab36b-2
SQ00015229,c1926e4a030c56e9fe964b52e87ae76b-2
SQ00015230,9b58e865a87c793c1c4affdc3eb24417-2
SQ00015231,12238708bde327d459f38b3a897b9885-2
SQ00015232,2df8c332bb5de1197646f71c78b240c7-2
SQ00015233,991dc511d9d71fc01fc3ae48c0235647-2
```

Compare ETags like this

diff <(cat etag_remote.csv |sort) <(cat etag_local.txt|cut -d"/" -f6|sed s,.sqlite,,g|tr "\t" ","|sort)
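
For readers without the linked gist, the multipart ETag convention can be sketched as follows. This is not the gist's code: it assumes the commonly observed (but not formally guaranteed) scheme in which the ETag is the MD5 of the concatenated binary MD5 digests of each part, followed by `-` and the part count, and it assumes the second argument is the part size in MiB. The function name `multipart_etag` is hypothetical.

```shell
# Compute a multipart-style ETag for a file: MD5 over the concatenated raw
# MD5 digests of each part, then "-<number of parts>".
multipart_etag() {
  file=$1
  chunk_bytes=$(( $2 * 1024 * 1024 ))   # part size in MiB (assumption)
  size=$(wc -c < "$file")
  parts=$(( (size + chunk_bytes - 1) / chunk_bytes ))
  i=0
  digest=$(
    while [ "$i" -lt "$parts" ]; do
      # Hex MD5 of part i ...
      hex=$(dd if="$file" bs="$chunk_bytes" skip="$i" count=1 2>/dev/null \
              | md5sum | cut -d' ' -f1)
      # ... re-emitted as 16 raw bytes via octal escapes.
      for pair in $(printf '%s' "$hex" | sed 's/../& /g'); do
        printf "\\$(printf '%03o' "0x$pair")"
      done
      i=$((i + 1))
    done | md5sum | cut -d' ' -f1
  )
  printf '%s-%s\n' "$digest" "$parts"
}

# Usage: multipart_etag <file> 8
```

Under this convention, the `-2` suffix on every ETag listed above means each object was uploaded in two parts. The result only matches the remote value when the part size equals the one used at upload time, and some server-side encryption modes yield ETags that do not follow the scheme at all.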
shntnu commented 2 years ago

Done!

gwaybio commented 2 years ago

Yes!!! Thanks so much for the work on this @shntnu and @EchteRobert 🎉🎉

EchteRobert commented 2 years ago

Awesome!