Open will-moore opened 1 year ago
@joshmoore - Could you give me an idea of how much this workflow above is likely to change with the Fileset creation scripts you're working on?
Is the Swap Filesets section above going to remain the same (with the need to run a PSQL command as it is now)?
Will your scripts use PSQL to generate the Filesets, or do everything via the OMERO API?
As discussed in the IDR meeting today: if the Swap Filesets step is the only point where we need PSQL, then there's more incentive to look for a fix for that, so that we can do everything via the OMERO API.
Sorry for missing the meeting (and I'll miss the next as well). Can you summarize the trade-offs that were discussed?
The Fileset creation I was working on would need the following steps (outside of the OMERO API):
for each fileset:
find the managed repo directory ("X")
copy the converted data to "{X}_zarr" (or similar)
run an SQL snippet for files, entries, etc.
The SQL could also be collected and run in one go.
At that point, you would have the equivalent of all your imports. Whether or not you perform the swap in the loop is probably a question for what stage we're at (next, production, etc.)
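As a rough illustration of the per-fileset steps listed above, here is a minimal Python sketch. It is not the actual script: the helper names, the choice to keep the zarr's own name inside "{X}_zarr", and the placeholder SQL comment are all assumptions made for the example. The SQL is only collected, so it could be run in one go afterwards.

```python
import shutil
import tempfile
from pathlib import Path


def zarr_copy_target(managed_dir: Path) -> Path:
    """Return the sibling directory "{X}_zarr" for a managed repo directory X."""
    return managed_dir.with_name(managed_dir.name + "_zarr")


def stage_fileset(managed_dir: Path, converted: Path, sql_snippet: str, collected_sql: list) -> Path:
    """Copy converted data next to the managed repo directory and queue its SQL."""
    target = zarr_copy_target(managed_dir)
    # Assumption: the converted zarr keeps its own name inside "{X}_zarr".
    shutil.copytree(converted, target / converted.name)
    # Collect the SQL rather than running it, so it can be applied in one go.
    collected_sql.append(sql_snippet)
    return target


# Demo in a throwaway directory standing in for the managed repository.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    managed = root / "ManagedRepository" / "demo_fileset"  # hypothetical "X"
    converted = root / "converted" / "image.ome.zarr"
    managed.mkdir(parents=True)
    converted.mkdir(parents=True)
    (converted / ".zattrs").write_text("{}")

    collected = []
    target = stage_fileset(managed, converted, "-- SQL for files, entries, etc. goes here", collected)
    staged_ok = (target / "image.ome.zarr" / ".zattrs").is_file()
    target_name = target.name
    sql_script = "\n".join(collected)
```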
@joshmoore Based on my investigation at https://github.com/IDR/idr-metadata/issues/660, it's not going to be feasible to remove the use of the Pixels table for path/name lookup in OMERO, so we will need to set the Pixels path and name when we do the Fileset swap. Either that has to happen via a new method added to the OMERO API (probably not anytime soon) or we use SQL directly, which will allow us to progress just now...
Your summary sounds kinda similar to what I was trying originally, except that I was using the Python API and a script to create the Filesets instead of SQL. The problem I had was the creation of symlinks in the Managed Repo (see https://github.com/IDR/idr-metadata/issues/652#issuecomment-1497287633). That's why I abandoned that approach and simply imported data to create Filesets.
Are you not having those kinds of issues?
so we will need to set the Pixels path and name when we do the Fileset swap
:+1:
Are you not having those kinds of issues?
Since everything is done outside of OMERO, I wouldn't think so.
But having applied the SQL updates to OMERO, you need to have symlinks in place?
@josh As a "clean" server to work on this, I've mounted the idr0054 bucket on idr0138-pilot (where idr0054 images still have the original pattern file Fileset).
$ ssh -A -o 'ProxyCommand ssh idr-pilot.openmicroscopy.org -W %h:%p' idr0138-omeroreadwrite -L 1080:localhost:80
$ sudo mkdir /idr0054 && sudo /opt/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other idr0054 /idr0054
$ ls /idr0054/zarr/
Tonsil 1.ome.zarr Tonsil 2.ome.zarr Tonsil 3.ome.zarr
NGFF generation
Generation takes place on pilot-zarr1-dev or pilot-zarr2-dev machines.
We need to generate NGFF data with https://github.com/IDR/bioformats2raw/releases/tag/v0.6.0-24 which has ZarrReader fixes, including those required for .pattern file data.
Install bioformats2raw via conda:
This is actually just for getting the dependencies installed. Get the actual bioformats2raw from the link above and just unzip it into your home directory.
We need to generate NGFF Filesets under the /data volume. Create a directory for the idr project and memo files (if it's not already there), and change into the idr directory. For example for idr0051:
Find out where the pattern, screen or companion files are. For example:
/nfs/bioimage/drop/idr0051-fulton-tailbudlightsheet/patterns/
Then run the conversion (using the bioformats2raw from above) in a screen (long running):
NB: it may be useful to convert a single Fileset to zarr initially to determine its size on disk and to tell whether you have enough space to convert all the others at once. If not, you might have to convert a smaller number, then zip and upload them to BioStudies before deleting them to make space available.
NB: please make sure that the --memo-directory specified here is writable by you.
($i is the pattern file; ${i%.*}.ome.zarr strips the .pattern file extension and adds .ome.zarr; this should work for pattern, screen and also companion file extensions)
Upload to EBI s3 for testing
Upload 1 or 2 Plates or Images to EBI's s3, so we can validate that the data can be viewed and imported on s3.
Create a bucket from a local aws install:
Once aws is installed, just run aws configure and enter the Access key and Secret key; use defaults for the other options.
Then update the policy and CORS config as at https://github.com/IDR/deployment/blob/master/docs/object-store.md#policy (NB: replace idr0000 with e.g. idr0010 in the sample config etc)
Upload the data using mc, installed on dev servers where data is generated:
You should now be able to view and do some validation of the data with ome-ngff-validator and vizarr. E.g.
https://ome.github.io/ome-ngff-validator/?source=https://uk1s3.embassy.ebi.ac.uk/idr0025/zarr/10x+images+plate+3.ome.zarr
https://hms-dbmi.github.io/vizarr/?source=https://uk1s3.embassy.ebi.ac.uk/idr0025/zarr/10x+images+plate+3.ome.zarr
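When checking several plates, it can help to generate the validator and vizarr links rather than typing them. The sketch below simply mirrors the shape of the example links above; the function name is made up for illustration, and it assumes the data sits under `<bucket>/zarr/` and that spaces in names are written as "+", as in the idr0025 example.

```python
from urllib.parse import quote_plus

S3_ENDPOINT = "https://uk1s3.embassy.ebi.ac.uk"
VALIDATOR = "https://ome.github.io/ome-ngff-validator/"
VIZARR = "https://hms-dbmi.github.io/vizarr/"


def viewer_urls(bucket: str, zarr_name: str) -> dict:
    """Build validator and vizarr links for one zarr under <bucket>/zarr/."""
    # Spaces become "+", matching the example links in the text.
    source = f"{S3_ENDPOINT}/{bucket}/zarr/{quote_plus(zarr_name)}"
    return {
        "validator": f"{VALIDATOR}?source={source}",
        "vizarr": f"{VIZARR}?source={source}",
    }


urls = viewer_urls("idr0025", "10x images plate 3.ome.zarr")
for label, url in urls.items():
    print(label, url)
```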
Submission to BioStudies
Once the NGFF data has been validated to your satisfaction, we can upload to BioStudies.
We need to create a .zip file for each .ome.zarr Fileset. Where space is short, it can be useful to use -m to move files into the zip and delete the original.
For a single zarr, this looks like
$ zip -mr image.ome.zarr.zip image.ome.zarr
Convert all the zarr Filesets for a study. E.g.:
This will create zips in the same dir as the zarrs, but we want a directory that contains just the zips for upload...
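One way to end up with a directory that contains just the zips is to write them there directly. The following is a minimal Python sketch of that idea, not the command used in this workflow: the function name and directory layout are assumptions, and unlike `zip -m` it leaves the original zarrs in place, so it needs the extra disk space.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_zarrs(zarr_dir: Path, zip_dir: Path) -> list:
    """Write one "<name>.ome.zarr.zip" per zarr into zip_dir; originals are kept."""
    zip_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for zarr in sorted(zarr_dir.glob("*.ome.zarr")):
        zip_path = zip_dir / (zarr.name + ".zip")
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
            for path in sorted(zarr.rglob("*")):
                if path.is_file():
                    # Keep the top-level "<name>.ome.zarr/" folder inside the archive.
                    zf.write(path, path.relative_to(zarr_dir).as_posix())
        created.append(zip_path)
    return created


# Demo with two tiny fake zarrs in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    zarr_dir = Path(tmp) / "zarr"
    for name in ("Tonsil 1.ome.zarr", "Tonsil 2.ome.zarr"):
        (zarr_dir / name / "0").mkdir(parents=True)
        (zarr_dir / name / ".zattrs").write_text("{}")
        (zarr_dir / name / "0" / "0").write_bytes(b"chunk")

    zips = zip_zarrs(zarr_dir, Path(tmp) / "zips")
    zip_names = [z.name for z in zips]
    with zipfile.ZipFile(zips[0]) as zf:
        members = sorted(zf.namelist())
```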
Upload via Aspera, using the "secret directory". Log in to BioStudies with the IDR account. Click on the FTP/Aspera button at https://www.ebi.ac.uk/biostudies/submissions/files
Some JavaScript you can run in the browser console to get the file names in the submission table:
Create a tsv file that lists all the filesets for the submission, with the first column named Files. See https://www.ebi.ac.uk/bioimage-archive/help-file-list/
E.g. idr0054_files.tsv:
Upload this to the same location as above (via FTP or using the web UI). It specifies which files are to be used in the submission. You should be able to see all the uploaded files at https://www.ebi.ac.uk/biostudies/submissions/files
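The tsv file list described above can also be generated with a few lines of Python. This is only a sketch: the function name is hypothetical, and whether each entry needs a directory prefix depends on where the zips were uploaded, so the bare zip names used here are an assumption. Only the single Files column comes from the text.

```python
import csv
import io


def file_list_tsv(zip_names: list) -> str:
    """Return tsv text with a single "Files" column, one uploaded zip per row."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["Files"])  # the first column must be named Files
    for name in zip_names:
        writer.writerow([name])
    return buf.getvalue()


# Assumption: entries are bare zip names; adjust if the upload used subdirectories.
tsv_text = file_list_tsv(["Tonsil 1.ome.zarr.zip", "Tonsil 2.ome.zarr.zip"])
print(tsv_text, end="")
```

Writing `tsv_text` to e.g. idr0054_files.tsv then gives the file to upload alongside the zips.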
Create a new submission at https://www.ebi.ac.uk/biostudies/submissions/
idr00xx NGFF...
link_type: image data resource
The idr00XX_files.tsv file list created above can be added to the submission under the Study Component section, which is at the bottom of the submission form.
Once submitted, we need to ask EBI to process the submission, unzip each zarr and upload the data to s3.
BioStudies will assign a uuid to each. They will provide a mapping from each zip file to uuid.zarr as csv:
Spreadsheet for keeping track of the submissions status: https://docs.google.com/spreadsheets/d/1P3dn-uL9KzE9O7XAKhpL8fUMTG3LWedMgjzSdnfAjQ4/edit#gid=0
The mapping needs to be used to create the necessary symlinks below.
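To use the zip-to-uuid mapping programmatically, it first needs to be read into a lookup. The sketch below assumes a simple two-column csv (zip file name, then uuid.zarr) with no header row; the real layout provided by BioStudies may differ, and the uuid values shown are placeholders, not real identifiers.

```python
import csv
import io


def read_mapping(csv_text: str) -> dict:
    """Map each original zarr name to the uuid.zarr assigned by BioStudies."""
    mapping = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        zip_name, uuid_zarr = row[0].strip(), row[1].strip()
        # Drop the ".zip" suffix to recover the original zarr name.
        zarr_name = zip_name[:-len(".zip")] if zip_name.endswith(".zip") else zip_name
        mapping[zarr_name] = uuid_zarr
    return mapping


# Placeholder rows in the assumed layout: <zip file>, <uuid>.zarr
example_csv = (
    "Tonsil 1.ome.zarr.zip, 00000000-0000-0000-0000-000000000001.zarr\n"
    "Tonsil 2.ome.zarr.zip, 00000000-0000-0000-0000-000000000002.zarr\n"
)
mapping = read_mapping(example_csv)
```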
If not already done, mount the bia-integrator-data bucket on the server machine and check to see if files are available:
Make NGFF Filesets
Work in progress: use https://github.com/joshmoore/omero-mkngff to create filesets based on the mounted s3 NGFF Filesets.
See https://github.com/IDR/idr-utils/pull/56 for a script that generates the inputs required for omero-mkngff.
Validation
See https://github.com/IDR/idr-utils/pull/55. Check out that branch of idr-utils (if not merged yet etc).
The script there allows us to check the pixel data for the lowest resolution of each image in a study, validating that each plane is identical to the corresponding one in IDR.
This could take a while, so let's run it in a screen...
Archived workflow below
The sections below used a previous workflow (prior to the omero-mkngff approach).
Make a metadata-only copy of the data
Since we want to import NGFF data without chunks, we need to create a copy of the data without chunks for import. The easiest way to do this is to use aws to sync the data, ignoring chunks.
We want these to be owned by the omero-server user in a location they can access, so they can be imported. Location at import time isn't too important.
Import metadata-only data
We can now perform a regular import as usual. Instead of creating a bulk import config, use a for loop to iterate through each plate in the directory, using name (removing .ome.zarr, or .zarr for e.g. idr0036) so that data isn't named METADATA.ome.xml and Plate names match the original data. You could also add a target Screen or Dataset (not shown) or move the data into a container with the webclient UI after import:
Update symlinks
Mount the s3 bucket on IDR server machine: (idr0125-pilot or idr0138-pilot)
See https://github.com/IDR/idr-utils/pull/54. Check out that branch of idr-utils (if not merged yet etc).
We need to specify the container (e.g. Screen, Plate, Dataset, Image or Fileset) and the path where the data is mounted. If the path to the data in each Fileset is e.g. filesetPrefix/plate1.zarr/.. and the path to each mounted plate is e.g. /path/to/plates/plate1.zarr, we can run the following command to create 1 symlink for each plate from /ManagedRepository/filesetPrefix/plate1.zarr to /path/to/plates/plate1.zarr
The script also renders a single Image from each Fileset before updating symlinks, which avoids subsequent ResourceErrors. The script can be run repeatedly on the same data without issue, e.g. if it fails part-way through and needs a re-run to complete.
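The core of the symlink step, and why a re-run is harmless, can be sketched in Python as below. This is not the idr-utils script: the function name, the returned status strings and the choice to rename the imported directory aside with a ".orig" suffix are assumptions for illustration, and the rendering step is omitted entirely.

```python
import os
import tempfile
from pathlib import Path


def link_plate(repo: Path, fileset_prefix: str, mounted_plate: Path, dry_run: bool = False) -> str:
    """Point <repo>/<fileset_prefix>/<plate> at the mounted plate; safe to re-run."""
    link = repo / fileset_prefix / mounted_plate.name
    if link.is_symlink() and Path(os.readlink(link)) == mounted_plate:
        return "already linked"  # nothing to do on a repeated run
    if dry_run:
        return "would link"
    if link.is_symlink():
        link.unlink()  # stale link pointing somewhere else
    elif link.exists():
        # Assumption: keep the imported metadata-only copy under a ".orig" name.
        link.rename(link.with_name(link.name + ".orig"))
    link.parent.mkdir(parents=True, exist_ok=True)
    link.symlink_to(mounted_plate, target_is_directory=True)
    return "linked"


# Demo in a throwaway directory standing in for the repo and the s3 mount.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    repo = root / "ManagedRepository"
    imported = repo / "filesetPrefix" / "plate1.zarr"  # metadata-only import
    imported.mkdir(parents=True)
    mounted = root / "plates" / "plate1.zarr"  # stands in for the mounted data
    (mounted / "0").mkdir(parents=True)
    (mounted / "0" / "0").write_bytes(b"chunk")

    results = [
        link_plate(repo, "filesetPrefix", mounted, dry_run=True),
        link_plate(repo, "filesetPrefix", mounted),
        link_plate(repo, "filesetPrefix", mounted),
    ]
    # After linking, chunks are reachable through the repository path.
    chunk_visible = (imported / "0" / "0").is_file()
```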
The --repo option defaults to /data/OMERO/ManagedRepository. You can also use the --dry-run and --report options:
Swap Filesets
See https://github.com/IDR/idr-utils/pull/53. Check out that branch of idr-utils (if not merged yet etc).
The first Object (Screen, Plate, Image, Fileset) is the original data that we want to update to use the NGFF Fileset, and the second is the NGFF data we imported above. In the case of Screens, Filesets are swapped between pairs of Plates matched by name (you should check that Plate names match before running this script). The 3rd required argument is a file where you can write the sql commands that are required to update Pixels objects (we can't yet update these via the OMERO API). The script supports --dry-run and --report flags.
This will write a psql command for each Fileset that we then need to execute...
psql commands are 1 per Fileset and are like:
You can then view Images from the original data which is now using an NGFF Fileset!
Cleanup
We can now delete the uk1s3 data and buckets created above for testing. The original Filesets will remain as "orphans".