IDR / idr-metadata

Curated metadata for all studies published in the Image Data Resource
https://idr.openmicroscopy.org

Document NGFF Fileset replacement workflow #656

Open will-moore opened 1 year ago

will-moore commented 1 year ago

NGFF generation

Generation takes place on pilot-zarr1-dev or pilot-zarr2-dev machines.

We need to generate NGFF data with https://github.com/IDR/bioformats2raw/releases/tag/v0.6.0-24 which has ZarrReader fixes, including those required for .pattern file data.

Install bioformats2raw via conda:

conda create -n bioformats2raw python=3.9
conda activate bioformats2raw
conda install -c ome bioformats2raw

This only installs the dependencies. Download the bioformats2raw release from the link above and unzip it into your home directory.

We need to generate NGFF Filesets under the /data volume. Create directories for the IDR study and for memo files (if they are not already there), and change into the study directory. For example, for idr0051:

cd /data
sudo mkdir idr0051
sudo chown yourname idr0051
sudo mkdir memo
sudo chown yourname memo
cd idr0051

Find out where the pattern, screen or companion files are. For example: /nfs/bioimage/drop/idr0051-fulton-tailbudlightsheet/patterns/

Then run the conversion (using the bioformats2raw from above) in a screen session, since it is long-running:

NB: it may be useful to convert a single Fileset to zarr first, to determine its size on disk and to tell whether you have enough space to convert all the others at once. If not, you might have to convert a smaller number, then zip and upload them to BioStudies before deleting them to free up space.

NB: please make sure that the --memo-directory specified here is writable by you.
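The disk-space estimate suggested in the first note above could be sketched roughly as follows. This is only an illustration: the function name fits_on_disk is hypothetical, and it assumes every Fileset converts to about the same size as the first one, which may not hold for a given study.

```shell
# Sketch only: estimate whether N converted Filesets fit on the target volume.
# Arguments are plain numbers so the arithmetic can be checked without touching disk.
#   $1 = size of one converted Fileset in KiB
#   $2 = number of Filesets still to convert
#   $3 = free space on the volume in KiB
fits_on_disk() {
  needed=$(( $1 * $2 ))
  [ "$needed" -le "$3" ]
}

# Hypothetical usage after converting a single Fileset into the current directory
# (first.ome.zarr and the count of 24 are placeholders):
#   one_kb=$(du -sk first.ome.zarr | cut -f1)
#   free_kb=$(df -Pk . | awk 'NR==2 {print $4}')
#   fits_on_disk "$one_kb" 24 "$free_kb" && echo "enough space" || echo "convert in batches"
```

If the check fails, converting in batches as described in the note is the fallback.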

screen -S idr0051ngff

for f in /nfs/bioimage/drop/idr0051-fulton-tailbudlightsheet/patterns/*; do i=$(basename "$f"); echo "$i"; ~/bioformats2raw-0.6.0-24/bin/bioformats2raw --memo-directory ../memo "$f" "${i%.*}.ome.zarr"; done

($i is the pattern file, ${i%.*}.ome.zarr strips the .pattern file extension and adds .ome.zarr; this should work for pattern, screen and also companion file extensions)
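To see what the ${i%.*} expansion does before starting a long conversion, a small helper like the one below can be run on its own. The function name zarr_name and the example file names are made up for illustration; only the parameter expansion itself comes from the command above.

```shell
# Sketch only: derive the output name used by the conversion loop.
# Strips the directory and the last file extension, then appends .ome.zarr.
zarr_name() {
  src=$(basename -- "$1")
  printf '%s\n' "${src%.*}.ome.zarr"
}

# Hypothetical usage:
#   zarr_name "/some/dir/Tonsil 1.pattern"
#   zarr_name "plate1.screen"
```

Only the final extension is removed, so it is worth checking the resulting names for files that carry more than one extension.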

Upload to EBI s3 for testing

Upload 1 or 2 Plates or Images to EBI's s3, so we can validate that the data can be viewed and imported on s3.

Create a bucket using a local aws install. Once aws is installed, run aws configure and enter the Access key and Secret key, using the defaults for the other options.

$ aws --endpoint-url https://uk1s3.embassy.ebi.ac.uk s3 mb s3://idr0010
make_bucket: idr0010

And update policy and CORS config as at https://github.com/IDR/deployment/blob/master/docs/object-store.md#policy (NB: replace idr0000 with e.g. idr0010 in the sample config etc)
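Replacing idr0000 in the sample config can be done by hand, or with a small filter such as the sketch below. The function name retarget_bucket is hypothetical and the file names in the usage comment are placeholders; the actual policy and CORS contents are whatever the linked deployment docs provide.

```shell
# Sketch only: rewrite the placeholder bucket name idr0000 in text read from stdin.
#   $1 = real bucket name, e.g. idr0010
retarget_bucket() {
  case $1 in
    ''|*[!a-z0-9.-]*) echo "unexpected bucket name: $1" >&2; return 1 ;;
  esac
  sed "s/idr0000/$1/g"
}

# Hypothetical usage, with sample-policy.json copied from the deployment docs:
#   retarget_bucket idr0010 < sample-policy.json > idr0010-policy.json
```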

Upload the data using mc, installed on dev servers where data is generated:

$ ssh pilot-zarr1-dev
$ wget https://dl.min.io/client/mc/release/linux-amd64/mc
$ chmod +x mc
$ ./mc config host add uk1s3 https://uk1s3.embassy.ebi.ac.uk
Enter Access Key: X8GE11ZK************
Enter Secret Key: 
Added `uk1s3` successfully.

$ /home/wmoore/mc cp -r idr0010/ uk1s3/idr0010/zarr

You should now be able to view and do some validation of the data with ome-ngff-validator and vizarr. E.g. https://ome.github.io/ome-ngff-validator/?source=https://uk1s3.embassy.ebi.ac.uk/idr0025/zarr/10x+images+plate+3.ome.zarr

https://hms-dbmi.github.io/vizarr/?source=https://uk1s3.embassy.ebi.ac.uk/idr0025/zarr/10x+images+plate+3.ome.zarr
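When there are several Plates or Images to check, the viewer links can be built from the bucket and zarr name. The helper names below are hypothetical, and the sketch only replaces spaces with + as in the example URLs above; other special characters in names are not handled.

```shell
# Sketch only: build validator and vizarr links for a zarr uploaded to <bucket>/zarr/.
S3_ENDPOINT="https://uk1s3.embassy.ebi.ac.uk"

# $1 = bucket, $2 = zarr name (spaces become + as in the example links)
zarr_url() {
  name=$(printf '%s' "$2" | tr ' ' '+')
  printf '%s/%s/zarr/%s\n' "$S3_ENDPOINT" "$1" "$name"
}

validator_url() {
  printf 'https://ome.github.io/ome-ngff-validator/?source=%s\n' "$(zarr_url "$1" "$2")"
}

vizarr_url() {
  printf 'https://hms-dbmi.github.io/vizarr/?source=%s\n' "$(zarr_url "$1" "$2")"
}

# Hypothetical usage:
#   validator_url idr0025 "10x images plate 3.ome.zarr"
```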

Submission to BioStudies

Once the NGFF data has been validated to your satisfaction, we can upload to BioStudies.

We need to create a .zip file for each .ome.zarr Fileset. It can be useful where space is short to use -m to move files into the zip and delete the original.

For a single zarr, this looks like $ zip -mr image.ome.zarr.zip image.ome.zarr.

To zip all the zarr Filesets for a study, e.g.:

screen -S idr0010_zip
cd idr0010
for i in */; do zip -mr "${i%/}.zip" "$i"; done

This will create zips in the same dir as the zarrs, but we want a directory that contains just the zips for upload...

mkdir idr0010
mv *.zip idr0010/

Upload via Aspera, using the "secret directory". Log in to BioStudies with the IDR account, then click on the FTP/Aspera button at https://www.ebi.ac.uk/biostudies/submissions/files

# install...
$ wget https://ak-delivery04-mul.dhe.ibm.com/sar/CMA/OSA/08q6g/0/ibm-aspera-cli-3.9.6.1467.159c5b1-linux-64-release.sh
$ chmod +x ibm-aspera-cli-3.9.6.1467.159c5b1-linux-64-release.sh 
$ bash ibm-aspera-cli-3.9.6.1467.159c5b1-linux-64-release.sh 
$ cd .aspera/cli/bin

$ ./ascp -P33001 -i ../etc/asperaweb_id_dsa.openssh -d /path/to/idr00xx bsaspera_w@hx-fasp-1.ebi.ac.uk:xx/xxxxxxxxxxxxxxxxxxxxxxx

Some JavaScript you can run in browser console to get the file names in the submission table:

let names = [];
[].forEach.call(document.querySelectorAll("div [role='row'] .ag-cell[col-id='name']"), function(div) {
  names.push(div.innerHTML.trim());
});
console.log(names.join("\n"));
console.log(names.length);

Create a tsv file that lists all the filesets for the submission with the first column named Files. See https://www.ebi.ac.uk/bioimage-archive/help-file-list/

E.g. idr0054_files.tsv:

Files
idr0054/Tonsil 1.ome.zarr.zip
idr0054/Tonsil 2.ome.zarr.zip
idr0054/Tonsil 3.ome.zarr.zip

Upload this to the same location as above (via FTP or using the web UI). This is used to specify which files to be used in the submission. You should be able to see all the uploaded files at https://www.ebi.ac.uk/biostudies/submissions/files
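Rather than typing the file list by hand, it could be generated from the directory of zips. This is a sketch under the assumption that the zips sit in a directory named after the study (as created above) and that the same directory name is used as the path prefix on the BioStudies side; make_file_list is a hypothetical name.

```shell
# Sketch only: print a file list with a "Files" header and one line per zip.
#   $1 = directory containing the zips, e.g. idr0054
make_file_list() {
  dir=${1%/}
  prefix=$(basename -- "$dir")
  printf 'Files\n'
  for z in "$dir"/*.zip; do
    [ -e "$z" ] || continue   # skip the unexpanded glob when there are no zips
    printf '%s/%s\n' "$prefix" "$(basename -- "$z")"
  done
}

# Hypothetical usage:
#   make_file_list idr0054 > idr0054_files.tsv
```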

Create a new submission at https://www.ebi.ac.uk/biostudies/submissions/

Once submitted, we need to ask EBI to process the submission: they unzip each zarr and upload the data to s3. BioStudies will assign a uuid to each zarr.

Spreadsheet for keeping track of the submission status: https://docs.google.com/spreadsheets/d/1P3dn-uL9KzE9O7XAKhpL8fUMTG3LWedMgjzSdnfAjQ4/edit#gid=0

EBI will provide a mapping from each zip file to uuid.zarr as csv:

Tonsil 2.ome.zarr.zip, https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD704/36cb5355-5134-4bdc-bde6-4e693055a8f9/36cb5355-5134-4bdc-bde6-4e693055a8f9.zarr/0
Tonsil 1.ome.zarr.zip, https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD704/5583fe0a-bbe6-4408-ab96-756e8e96af55/5583fe0a-bbe6-4408-ab96-756e8e96af55.zarr/0
Tonsil 3.ome.zarr.zip, https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD704/3b4a8721-1a28-4bc4-8443-9b6e145efbe9/3b4a8721-1a28-4bc4-8443-9b6e145efbe9.zarr/0

This needs to be used to create the necessary symlinks below.
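As a starting point for that, the csv can be turned into pairs of original zarr name and mounted path. The sketch below assumes the bucket is mounted at /bia-integrator-data (as in the goofys command that follows), that the trailing /0 in each URL is not part of the zarr directory itself, and that zip names contain no commas. mapping_to_paths is a hypothetical helper, not part of any IDR tooling.

```shell
# Sketch only: read "zip name, URL" lines on stdin and print
# "<zarr name><TAB><path under the mounted bucket>" for each.
S3_ENDPOINT="https://uk1s3.embassy.ebi.ac.uk"

mapping_to_paths() {
  while IFS= read -r line || [ -n "$line" ]; do
    [ -n "$line" ] || continue
    zip_name=${line%%,*}          # text before the first comma
    url=${line#*,}                # text after the first comma
    url=${url# }                  # drop the single space after the comma
    path=${url#"$S3_ENDPOINT"}    # leaves /bia-integrator-data/...
    path=${path%/0}               # assumed: /0 is not part of the directory
    printf '%s\t%s\n' "${zip_name%.zip}" "$path"
  done
}

# Hypothetical usage:
#   mapping_to_paths < mapping.csv
```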

If not already done, mount the bia-integrator-data bucket on the server machine and check to see if files are available:

$ sudo mkdir /bia-integrator-data && sudo /opt/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other bia-integrator-data /bia-integrator-data

$ ls /bia-integrator-data/S-BIAD704
36cb5355-5134-4bdc-bde6-4e693055a8f9  3b4a8721-1a28-4bc4-8443-9b6e145efbe9  5583fe0a-bbe6-4408-ab96-756e8e96af55

Make NGFF Filesets

Work in progress: use https://github.com/joshmoore/omero-mkngff to create Filesets based on the mounted s3 NGFF Filesets.

See https://github.com/IDR/idr-utils/pull/56 as a script for generating inputs required for omero-mkngff.

conda create -n mkngff -c conda-forge -c ome omero-py bioformats2raw
conda activate mkngff
pip install 'omero-mkngff @ git+https://github.com/joshmoore/omero-mkngff@main'
omero login demo@localhost

omero mkngff setup > setup.sql
omero mkngff sql --secret=$SECRET 5287125 a.ome.zarr/ > my.sql
sudo -u postgres psql idr < setup.sql
sudo -u postgres psql idr < my.sql

sudo -u omero-server mkdir /data/OMERO/ManagedRepository/demo_2/Blitz-0-Ice.ThreadPool.Server-2/2023-06/22/12-46-39.975_converted/
mv a.ome.zarr /tmp
ln -s /tmp/a.ome.zarr /data/OMERO/ManagedRepository/demo_2/Blitz-0-Ice.ThreadPool.Server-2/2023-06/22/12-46-39.975_converted/a.ome.zarr
omero render test Image:14834721 # Failing here

Validation

See https://github.com/IDR/idr-utils/pull/55. Check out that branch of idr-utils (if it is not merged yet).

The script there allows us to check the pixel data for the lowest resolution of each image in a study, validating that each plane is identical to the corresponding one in IDR.

This could take a while, so let's run it in a screen...

sudo -u omero-server -s
screen -S idr0012_check_pixels
source /opt/omero/server/venv3/bin/activate
omero login demo@localhost
cd /uod/idr/metadata/idr-utils/scripts
python check_pixels.py Plate:4299 /tmp/check_pixels_idr0012.log

Archived workflow below

The sections below were using a previous workflow (prior to the omero-mkngff approach)

Make a metadata-only copy of the data

Since we want to import NGFF data without chunks, we need to create a copy of the data without chunks for import. The easiest way to do this is to use aws to sync the data, ignoring chunks.

We want these to be owned by omero-server user in a location they can access, so they can be imported. Location at import time isn't too important.

$ screen -S idr0010_aws_sync      # can take a while if lots of data    
$ mkdir idr0010
$ cd idr0010
$ aws s3 sync --no-sign-request --exclude '*' --include "*/.z*" --include "*.xml" --endpoint-url https://uk1s3.embassy.ebi.ac.uk s3://idr0010/zarr .

$ sudo mv -f ./* /ngff/idr0010/
$ cd /ngff/
$ sudo chown -R omero-server idr0010/
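To preview which files the --include filters above would keep, roughly the same selection can be applied to a local zarr with find. This is a local approximation of the aws filter rather than a replacement for it, and list_zarr_metadata is a hypothetical name.

```shell
# Sketch only: list the metadata files (.zattrs, .zgroup, .zarray, *.xml)
# under a local directory, leaving out the chunk files.
#   $1 = directory to search, e.g. a local copy of a plate
list_zarr_metadata() {
  find "$1" -type f \( -name '.z*' -o -name '*.xml' \)
}

# Hypothetical usage:
#   list_zarr_metadata 10-34.ome.zarr | wc -l
```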

Import metadata-only data

We can now perform a regular import. Rather than creating a bulk import config, use a for loop to iterate through each plate in the directory. Set the name explicitly (removing .ome.zarr, or .zarr for e.g. idr0036) so that the data isn't named METADATA.ome.xml and Plate names match the original data. You could also add a target Screen or Dataset (not shown), or move the data into a container with the webclient UI after import:

sudo -u omero-server -s
screen -S idr0010_ngff
source /opt/omero/server/venv3/bin/activate
export OMERODIR=/opt/omero/server/OMERO.server
omero login demo@localhost

cd /ngff/idr0010
for dir in *; do
  omero import --transfer=ln_s --depth=100 --name="${dir/.ome.zarr/}" --skip=all "$dir" --file "/tmp/$dir.log" --errs "/tmp/$dir.err";
done
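The loop above only strips .ome.zarr from the name. For studies where the directories end in plain .zarr, a helper along these lines would cover both cases; plate_name is a hypothetical name and the sketch is not part of the import tooling.

```shell
# Sketch only: derive the Plate name from a zarr directory name by removing
# a trailing slash, then .zarr, then .ome if present.
plate_name() {
  name=${1%/}
  name=${name%.zarr}
  name=${name%.ome}
  printf '%s\n' "$name"
}

# Hypothetical usage inside the import loop:
#   --name="$(plate_name "$dir")"
```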

Update symlinks

Mount the s3 bucket on IDR server machine: (idr0125-pilot or idr0138-pilot)

sudo mkdir /idr0010 && sudo /opt/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other idr0010 /idr0010

See https://github.com/IDR/idr-utils/pull/54. Check out that branch of idr-utils (if it is not merged yet).

We need to specify the container (e.g. Screen, Plate, Dataset, Image or Fileset) and the path where the data is mounted. If the path to the data in each Fileset is e.g. filesetPrefix/plate1.zarr/.. and the path to each mounted plate is e.g. /path/to/plates/plate1.zarr, we can run the command below to create 1 symlink for each plate, from /ManagedRepository/filesetPrefix/plate1.zarr to /path/to/plates/plate1.zarr.

The script also renders a single Image from each Fileset before updating symlinks, which avoids subsequent ResourceErrors. The script can be run repeatedly on the same data without issue, e.g. if it fails part-way through and needs a re-run to complete.

The --repo option defaults to /data/OMERO/ManagedRepository. You can also use the --dry-run and --report options:

$ sudo -u omero-server -s
$ source /opt/omero/server/venv3/bin/activate
$ omero login demo@localhost
$ python idr-utils/scripts/managed_repo_symlinks.py Screen:123 /path/to/plates/ --report

Fileset: 5286929 /data/OMERO/ManagedRepository/demo_2/Blitz-0-Ice.ThreadPool.Server-6/2023-04/25/13-53-43.777/
fs_contents ['10-34.ome.zarr']
Link from /data/OMERO/ManagedRepository/demo_2/Blitz-0-Ice.ThreadPool.Server-6/2023-04/25/13-53-43.777/10-34.ome.zarr to /idr0010/zarr/10-34.ome.zarr
...
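For a single Fileset, the link that the report describes amounts to something like the sketch below. This is not how managed_repo_symlinks.py is implemented (it also queries OMERO and renders an Image first); link_plate is a hypothetical helper that only illustrates the direction of the link and a re-run that does nothing when the link already exists.

```shell
# Sketch only: create a symlink inside a Fileset directory that points at a mounted plate.
#   $1 = Fileset directory in the ManagedRepository
#   $2 = path to the mounted plate, e.g. /idr0010/zarr/10-34.ome.zarr
link_plate() {
  link="$1/$(basename -- "$2")"
  if [ -L "$link" ]; then
    echo "already linked: $link"
  elif [ -e "$link" ]; then
    # Assumption: anything that is not already a symlink is left untouched
    echo "refusing to replace existing path: $link" >&2
    return 1
  else
    ln -s "$2" "$link" && echo "Link from $link to $2"
  fi
}
```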

Swap Filesets

See https://github.com/IDR/idr-utils/pull/53. Check out that branch of idr-utils (if it is not merged yet).

The first Object (Screen, Plate, Image, Fileset) is the original data that we want to update to use NGFF Fileset, and the second is the NGFF data we imported above. In the case of Screens, Filesets are swapped between pairs of Plates matched by name (you should check that Plate names match before running this script). The 3rd required argument is a file where you can write the sql commands that are required to update Pixels objects (we can't yet update these via the OMERO API). The script supports --dry-run and --report flags.

$ source /opt/omero/server/venv3/bin/activate
$ omero login demo@localhost
$ python idr-utils/scripts/swap_filesets.py Screen:1202 Screen:3204 /tmp/idr0012_filesetswap.sql --report

This will write a psql command for each Fileset that we then need to execute...

$ export OMERODIR=/opt/omero/server/OMERO.server
$ omero config get --show-password

# Use the password, host etc to run the sql file generated above...
$ PGPASSWORD=****** psql -U omero -d idr -h 192.168.10.102 -f /tmp/idr0012_filesetswap.sql

psql commands are 1 per Fileset and are like:

UPDATE pixels SET name = '.zattrs', path = 'demo_2/Blitz-0-Ice.ThreadPool.Server-16/2023-04/12/10-20-20.483/10x_images_plate_2.ome.zarr' where image in (select id from Image where fileset = 5286921);
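To make the shape of that statement explicit, here is a sketch that prints it for a given Fileset ID and zarr path. It is not the code in swap_filesets.py; pixels_update_sql is a hypothetical helper, and the statement text is copied from the example above.

```shell
# Sketch only: print the UPDATE statement for one Fileset.
#   $1 = Fileset ID (digits only)
#   $2 = path of the zarr relative to the ManagedRepository
pixels_update_sql() {
  fileset_id=$1
  zarr_path=$2
  case $fileset_id in
    ''|*[!0-9]*) echo "fileset id must be an integer" >&2; return 1 ;;
  esac
  # Double any single quotes so the path stays a valid SQL string literal
  escaped=$(printf '%s' "$zarr_path" | sed "s/'/''/g")
  printf "UPDATE pixels SET name = '.zattrs', path = '%s' where image in (select id from Image where fileset = %s);\n" "$escaped" "$fileset_id"
}
```

Reviewing the generated file before running it with psql is still advisable, since the update is applied directly to the database.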

You can then view Images from the original data which is now using an NGFF Fileset!

Cleanup

We can now delete the uk1s3 data and buckets created above for testing. The original Filesets will remain as "orphans".

will-moore commented 1 year ago

@joshmoore - Could you give me an idea of how much this workflow above is likely to change with the Fileset creation scripts you're working on?

Is the Swap Filesets section above going to remain the same (with the need to run a PSQL command as it is now)? Will your scripts use PSQL to generate the Filesets, or do everything via the OMERO API?

As discussed in IDR meeting today: if the Swap Filesets is the only point that we need PSQL, then there's more incentive to look for a fix for that, so that we can do everything via the OMERO API.

joshmoore commented 1 year ago

Sorry for missing the meeting (and I'll miss the next as well). Can you summarize the trade-offs that were discussed?

The Fileset creation I was working on would need the following steps (outside of the OMERO API):

  for each fileset:
    find the managed repo directory ("X")
    copy the converted data to "{X}_zarr" (or similar)
    run an SQL snippet for files, entries, etc.

The SQL could also be collected and run in one go.

At that point, you would have the equivalent of all your imports. Whether or not you perform the swap in the loop is probably a question for what stage we're at (next, production, etc.)

will-moore commented 1 year ago

@joshmoore Based on my investigation at https://github.com/IDR/idr-metadata/issues/660, it's not going to be feasible to remove the use of the Pixels table for path/name lookup in OMERO, so we will need to set the Pixels path and name when we do the Fileset swap. Either that has to happen via a new method added to the OMERO API (probably not anytime soon) or we use SQL directly, which will allow us to progress just now...

Your summary sounds kinda similar to what I was trying originally, except that I was using the Python API and a script to create the Filesets instead of SQL. The problem I had was the creation of symlinks in the Managed Repo. See https://github.com/IDR/idr-metadata/issues/652#issuecomment-1497287633. That's why I abandoned that approach and simply imported data to create Filesets.

Are you not having those kind of issues?

joshmoore commented 1 year ago

so we will need to set the Pixels path and name when we do the Fileset swap

:+1:

Are you not having those kind of issues?

Since everything is done outside of OMERO, I wouldn't think so.

will-moore commented 1 year ago

But having applied the SQL updates to OMERO, you need to have symlinks in place?

will-moore commented 1 year ago

@josh As a "clean" server to work on this, I've mounted idr0054 bucket on idr0138-pilot (where idr0054 images still have the original pattern file Fileset).

$ ssh -A -o 'ProxyCommand ssh idr-pilot.openmicroscopy.org -W %h:%p' idr0138-omeroreadwrite -L 1080:localhost:80
$ sudo mkdir /idr0054 && sudo /opt/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other idr0054 /idr0054
$ ls /idr0054/zarr/
Tonsil 1.ome.zarr  Tonsil 2.ome.zarr  Tonsil 3.ome.zarr