broadinstitute / neural-profiling


Cleaning up the data folder and migrating to cellpainting-gallery #10

Closed: shntnu closed this issue 2 years ago

shntnu commented 2 years ago

@michaelbornholdt

https://github.com/broadinstitute/neural-profiling#experimental-data-on-s3 is super helpful!

Do you happen to know why locations are stored in 3 different places?

workspace/deep_learning/inputs/locations/
workspace/deep_learning/locations/
workspace/locations/

workspace/deep_learning/inputs/locations/ appears to be the most recent and has 136 folders, so maybe that's the one we should keep, and we can delete the others?

aws s3 ls s3://jump-cellpainting/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/deep_learning/inputs/locations/ | grep SQ | wc -l
     136

while the others have fewer:

aws s3 ls s3://jump-cellpainting/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/deep_learning/locations/ | grep SQ | wc -l
     118
aws s3 ls s3://jump-cellpainting/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/locations/ | grep SQ | wc -l
      19
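Before deleting the two smaller folders, it may be worth confirming that their contents are a subset of workspace/deep_learning/inputs/locations/. A minimal sketch follows, assuming every folder name contains a plate ID of the form `SQ` followed by digits (as in the listings here). The helper names `plate_ids` and `missing_from`, and the file names `old.txt` and `new.txt`, are hypothetical.

```shell
# Extract unique plate IDs (assumed pattern: SQ followed by digits)
# from an `aws s3 ls` listing read on stdin.
plate_ids() {
  grep -oE 'SQ[0-9]+' | sort -u
}

# Print plate IDs found in listing file $1 but not in listing file $2.
missing_from() {
  a=$(mktemp)
  b=$(mktemp)
  plate_ids < "$1" > "$a"
  plate_ids < "$2" > "$b"
  comm -23 "$a" "$b"   # lines unique to the first (sorted) file
  rm -f "$a" "$b"
}

# Example usage (needs AWS credentials, so it is left commented out):
# BASE=s3://jump-cellpainting/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace
# aws s3 ls "$BASE/deep_learning/locations/" > old.txt
# aws s3 ls "$BASE/deep_learning/inputs/locations/" > new.txt
# missing_from old.txt new.txt
```

Empty output from `missing_from old.txt new.txt` would suggest that every plate in the older folder is also present in the newer one. It compares folder names only, not the files inside them.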

I also note that workspace/location_backups/ has larger CSV files, 136 in total; you refer to them in https://github.com/broadinstitute/neural-profiling#experimental-data-on-s3. Were they meant to be collated versions of the per-plate location files?

aws s3 ls --human-readable s3://jump-cellpainting/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/location_backups/ | head
2021-09-08 16:34:47   38.5 MiB SQ00014812_loc.csv
2021-09-08 16:34:47   38.0 MiB SQ00014813_loc.csv
2021-09-08 16:34:47   39.7 MiB SQ00014814_loc.csv
2021-09-08 16:34:47   40.0 MiB SQ00014815_loc.csv
2021-09-08 16:34:47   44.2 MiB SQ00014816_loc.csv
2021-09-08 16:34:47   46.8 MiB SQ00014817_loc.csv
2021-09-08 16:34:47   43.2 MiB SQ00014818_loc.csv
2021-09-08 16:34:47   44.7 MiB SQ00014819_loc.csv
2021-09-08 16:34:47   45.1 MiB SQ00014820_loc.csv
2021-09-08 16:34:47   43.9 MiB SQ00015041_loc.csv
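One rough way to probe the collation question is to compare the total size of the files under a plate folder with the size of the matching `_loc.csv` backup. The sketch below assumes that each plate folder under inputs/locations/ holds the per-plate location files; that layout is not confirmed in this thread. It relies on `aws s3 ls` printing the size in bytes as the third column when `--human-readable` is omitted, and the helper name `total_bytes` is hypothetical.

```shell
# Sum the size column (bytes) of an `aws s3 ls` listing read on stdin.
# Lines without a numeric third field (e.g. "PRE folder/") are skipped.
total_bytes() {
  awk 'NF >= 4 && $3 ~ /^[0-9]+$/ { s += $3 } END { printf "%.0f\n", s }'
}

# Example usage (needs AWS credentials, so it is left commented out):
# BASE=s3://jump-cellpainting/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace
# aws s3 ls --recursive "$BASE/deep_learning/inputs/locations/SQ00014812/" | total_bytes
# aws s3 ls "$BASE/location_backups/SQ00014812_loc.csv" | total_bytes
```

Similar totals would only be a hint: a collated file could still differ in size because of repeated headers or extra columns, so comparing row counts on a downloaded sample would be a stronger check.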

cc @jccaicedo

shntnu commented 2 years ago

For my notes: we are moving this data to s3://cellpainting-gallery. I will update the URLs in the README once that is done.

Tracked here: https://github.com/jump-cellpainting/aws/issues/63#issuecomment-1033059100

shntnu commented 2 years ago

This is now all being tracked in https://github.com/jump-cellpainting/aws/issues/63#issue-972973940