broadinstitute / pe2loaddata

Script to parse a Phenix metadata XML file and generate a .CSV for CellProfiler's loaddata module
BSD 2-Clause "Simplified" License

PathName_*s incorrect #24

ErinWeisbart closed 2 years ago

ErinWeisbart commented 2 years ago

I'm triggering the packaged pe2loaddata with search_subdirectories=True and an index_file that contains "s3".

PathName_* columns are not being made correctly. They seem to use the last plate in the folder, not the plate that is being passed in.

(Metadata_Plate, FileName_Illum and PathName_Illum are created correctly with the plate that is passed in.)

ErinWeisbart commented 2 years ago

The problem is here:

                # Create paths dictionary
                index_directory_key = index_directory.split(f"s3://{bucket}/")[1]
                paginator = s3.get_paginator("list_objects_v2")
                pages = paginator.paginate(Bucket=bucket, Prefix=index_directory_key)
                try:
                    for page in pages:
                        for x in page["Contents"]:
                            fullpath = x["Key"]
                            path, filename = fullpath.rsplit("/", 1)
                            if filename.endswith(".tiff"):
                                paths[filename] = path

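Paginating over an index directory that has all the batch folders in it doesn't play well with the dictionary. The dictionary is keyed by file name alone, and each batch/folder has the same list of file names, so every later folder overwrites the entries from the earlier ones and only the last one listed survives.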
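The overwrite can be reproduced without S3. This is a minimal sketch, not the pe2loaddata code itself: the plate folders `PlateA` and `PlateB`, the project path, and the file names are all made up, and the loop runs over a plain list of keys instead of paginator pages.

```python
# Hypothetical object keys: two plates in one batch, same file names in each.
keys = [
    "projects/demo/batch1/images/PlateA/Images/r01c01f01p01-ch1sk1fk1fl1.tiff",
    "projects/demo/batch1/images/PlateA/Images/Index.idx.xml",
    "projects/demo/batch1/images/PlateB/Images/r01c01f01p01-ch1sk1fk1fl1.tiff",
    "projects/demo/batch1/images/PlateB/Images/Index.idx.xml",
]

paths = {}
for fullpath in keys:
    path, filename = fullpath.rsplit("/", 1)
    if filename.endswith(".tiff"):
        # Keyed by file name only, so a later plate replaces an earlier one.
        paths[filename] = path

print(paths)
```

Only one entry remains, and it points at the `PlateB` folder regardless of which plate was requested.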

ErinWeisbart commented 2 years ago

I believe this is fixed by changing

    index_directory_key = index_directory.split(f"s3://{bucket}/")[1]

to

    index_directory_key = index_directory.split(f"s3://{bucket}/")[1] + plate_id

bethac07 commented 2 years ago

I don't think that fix will work, because you're just adding the plate onto whatever it is.

What are you passing in as the index_directory? I would assume it would be the plate folder, not the batch folder. Can you confirm?
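A quick string-only sketch of the concern, assuming a hypothetical bucket, project path, and plate name: appending plate_id unconditionally gives a sensible prefix only when index_directory is the batch-level folder with a trailing slash. If the caller already passes the plate folder, the plate name gets glued onto the end of it.

```python
bucket = "my-bucket"   # hypothetical
plate_id = "PlateA"    # hypothetical


def proposed_prefix(index_directory):
    # The change proposed above: always append plate_id to the key.
    return index_directory.split(f"s3://{bucket}/")[1] + plate_id


batch_level = f"s3://{bucket}/projects/demo/batch1/images/"
plate_level = f"s3://{bucket}/projects/demo/batch1/images/{plate_id}/Images"

batch_prefix = proposed_prefix(batch_level)
plate_prefix = proposed_prefix(plate_level)

print(batch_prefix)  # projects/demo/batch1/images/PlateA
print(plate_prefix)  # projects/demo/batch1/images/PlateA/ImagesPlateA
```

The second prefix would match no objects, so the result depends entirely on what the caller passes in.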

ErinWeisbart commented 2 years ago

Oh drat. I had set it to index_directory = f"s3://{bucket}/projects/{project_name}/{batch}/images/" to most closely match what we usually pass, but I got muddled in mapping EFS to S3. What we usually pass ends at the second 'images': ~/efs/${PROJECT_NAME}/workspace/images/${BATCH_ID}/${PLATE_ID}/Images.

So never mind all this... I just have to change what I'm passing to index_directory = f"s3://{bucket}/projects/{project_name}/{batch}/images/{plate}"
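For anyone hitting the same thing, here is a rough sketch of why the plate-level value works. It assumes hypothetical project, batch, and plate names, and it simulates the S3 Prefix filter with `str.startswith` rather than calling boto3; `build_paths` is an illustrative helper, not a pe2loaddata function.

```python
# Hypothetical object keys for two plates in one batch.
KEYS = [
    "projects/demo/batch1/images/PlateA/Images/r01c01f01p01-ch1sk1fk1fl1.tiff",
    "projects/demo/batch1/images/PlateB/Images/r01c01f01p01-ch1sk1fk1fl1.tiff",
]


def build_paths(keys, prefix):
    """Map file name -> folder for .tiff keys that start with prefix."""
    paths = {}
    for fullpath in keys:
        if not fullpath.startswith(prefix):  # stand-in for the S3 Prefix filter
            continue
        path, filename = fullpath.rsplit("/", 1)
        if filename.endswith(".tiff"):
            paths[filename] = path
    return paths


name = "r01c01f01p01-ch1sk1fk1fl1.tiff"

# Batch-level prefix: both plates are listed and the last one wins.
batch_paths = build_paths(KEYS, "projects/demo/batch1/images/")
# Plate-level prefix: only the requested plate is listed.
plate_paths = build_paths(KEYS, "projects/demo/batch1/images/PlateA")

print(batch_paths[name])
print(plate_paths[name])
```

Because the filter is a plain string prefix, a value ending in `PlateA` would also match a sibling folder such as `PlateA_rerun` if one existed; ending the prefix with a trailing slash would avoid that.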